All the data is stored using DVC.
[Code4ML.dvc] represents the Code4ML corpus. It contains 4 files:
file name | shape | description | column names |
---|---|---|---|
blocks_all.csv | (2743211, 3) | code snippets | ['kernel_id', 'code_block_id', 'code_block'] |
competitions.csv | (1147, 8) | Kaggle competitions information | ['description', 'datatype', 'comp_name', 'comp_type', 'subtitle', 'EvaluationAlgorithmAbbreviation', 'data_sources', 'metrictype'] |
kernels_meta.csv | (23104, 6) | kernels information | ['kaggle_score', 'kaggle_comments', 'kaggle_upvotes', 'kernel_link', 'kernel_id', 'comp_name'] |
markup_data.csv | (10247, 7) | manually labeled code snippets | ['code_block', 'data_format', 'graph_vertex_id', 'errors', 'marks', 'kernel_id'] |
Snippet information (markup_data.csv, blocks_all.csv) can be joined with the kernel metadata via 'kernel_id'.
Kernel metadata is linked to the Kaggle competition information through 'comp_name'.
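
The two links above amount to a pair of key-based joins. A minimal sketch, assuming the rows have already been read as dictionaries (for instance with `csv.DictReader`); the helper `join_on` and the sample rows are hypothetical placeholders for illustration, not values from the corpus:

```python
from collections import defaultdict


def join_on(left_rows, right_rows, key):
    """Inner-join two lists of dict rows on a shared column name."""
    # Index the right-hand side by key so each left row is matched quickly.
    index = defaultdict(list)
    for row in right_rows:
        index[row[key]].append(row)
    joined = []
    for left in left_rows:
        for right in index.get(left[key], []):
            # Right-hand columns are added to a copy of the left row.
            joined.append({**left, **right})
    return joined


# Placeholder rows using the column names listed above (values are made up).
blocks = [
    {"kernel_id": "1", "code_block_id": "10", "code_block": "import pandas as pd"},
    {"kernel_id": "2", "code_block_id": "11", "code_block": "model.fit(X, y)"},
]
kernels = [
    {"kernel_id": "1", "comp_name": "example-comp", "kaggle_score": "0.9"},
]
competitions = [
    {"comp_name": "example-comp", "comp_type": "example-type"},
]

# Snippets -> kernels via 'kernel_id', then kernels -> competitions via 'comp_name'.
blocks_with_kernels = join_on(blocks, kernels, "kernel_id")
full = join_on(blocks_with_kernels, competitions, "comp_name")
```

On the real files, which are large, `pandas.merge(blocks, kernels, on='kernel_id')` performs the same kind of join.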
data_updated.dvc contains 15 files:
file name | shape | description | column names |
---|---|---|---|
dataset.csv | (1158, 20) | information about 1158 Kaggle competitions, including descriptions and metadata | ['description', 'metric', 'datatype', 'problem', 'problemtype', 'subject', 'comp_name', 'comp_type', 'Summarized','Summarized_flag','comp_url','ProblemType','number of columns (for tabular)','Data Format','Target Column(s) Name', 'Subtitle', 'ref', 'subtitle','EvaluationAlgorithmAbbreviation', 'data_sources'] |
actual_graph_2022-06-01.csv | (92, 2) | mapping of graph vertex id to graph vertex value and subclass (id == index) | ['graph_vertex', 'graph_vertex_subclass'] |
code_blocks_upto_21.csv | (2567648, 10) | code blocks up to 2021 and their metadata | ['kaggle_score', 'kaggle_comments', 'kaggle_upvotes', 'kaggle_link','kaggle_id', 'data_sources', 'code_block', 'code_block_id', 'subtitle','comp_name'] |
code_blocks_21.csv | (175967, 10) | code blocks from 2021 and their metadata | ['kaggle_score', 'kaggle_comments', 'kaggle_upvotes', 'kaggle_link','kaggle_id', 'data_sources', 'code_block', 'code_block_id', 'subtitle','comp_name'] |
markup_data_20220329.csv (2 versions) | (11721, 11) | information about labeled code blocks (the ones with defined graph vertex id) from 54 competitions (incl competition #0) | ['code_block', 'data_format', 'graph_vertex_id', 'errors', 'marks', 'kaggle_id', 'competition_id', 'comp_name', 'username', 'created_on'] |
updated_labeled_data.csv | (9593, 25) | information about labeled code blocks (the ones with defined graph vertex id) including competitions information | ['ref_link', 'comp_name', 'comp_type', 'description', 'metric', 'comp_name_m', 'comp_type_m', 'description_m', 'metric_m', 'datatype', 'problem', 'problemtype', 'subject', 'graph_vertex_id', 'graph_vertex', 'graph_vertex_subclass', 'code_block_id', 'code_block', 'data_format', 'errors', 'marks', 'kaggle_id', 'competition_id', 'username', 'created_on'] |
piplines_20220415.csv (3 versions) | (428, 6) | ML pipelines created for 58 competitions | ['kaggle_id', 'graph_vertex_subclass', 'competition_id', 'comp_name', 'code_block_id', 'len'] |
name2embid.csv | (937, 2) | Mapping competition name - embedding id | ['comp_name', 'embid'] |
metrics_mapping.csv | (62, 2) | old metric - new metric | ['Metric_name', 'Class_name'] |
embs_summarized_description.pt | | description embeddings | |
data_w_subtitles.csv | (512, 19) | 512 competitions and their subtitles and other data (old dataset) | |