
readme.md

All the data is stored using DVC.

[Code4ML.dvc] represents the Code4ML corpus. It contains 4 files:

| file name | shape | description | column names |
| --- | --- | --- | --- |
| blocks_all.csv | (2743211, 3) | code snippets | ['kernel_id', 'code_block_id', 'code_block'] |
| competitions.csv | (1147, 8) | Kaggle competitions information | ['description', 'datatype', 'comp_name', 'comp_type', 'subtitle', 'EvaluationAlgorithmAbbreviation', 'data_sources', 'metrictype'] |
| kernels_meta.csv | (23104, 6) | kernels information | ['kaggle_score', 'kaggle_comments', 'kaggle_upvotes', 'kernel_link', 'kernel_id', 'comp_name'] |
| markup_data.csv | (10247, 7) | manually labeled code snippets | ['code_block', 'data_format', 'graph_vertex_id', 'errors', 'marks', 'kernel_id'] |

Snippet information (markup_data.csv, blocks_all.csv) can be joined to the kernel metadata via 'kernel_id'.

Kernel metadata is linked to the Kaggle competition information through 'comp_name'.
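
The two join keys described above can be sketched with pandas. This is only a minimal illustration: it assumes the CSV files have already been fetched from DVC storage and would normally be loaded with `pd.read_csv`. The function name `join_code4ml` and the tiny sample frames (including their values) are hypothetical; only the column names come from the table above.

```python
import pandas as pd


def join_code4ml(blocks: pd.DataFrame,
                 kernels: pd.DataFrame,
                 competitions: pd.DataFrame) -> pd.DataFrame:
    """Attach kernel and competition metadata to each code block (hypothetical helper)."""
    # Snippets -> kernels metadata via 'kernel_id'
    with_kernels = blocks.merge(kernels, on="kernel_id", how="left")
    # Kernels metadata -> competitions via 'comp_name'
    return with_kernels.merge(competitions, on="comp_name", how="left")


# Made-up stand-ins for blocks_all.csv, kernels_meta.csv and competitions.csv
blocks = pd.DataFrame({
    "kernel_id": [1, 1, 2],
    "code_block_id": [0, 1, 0],
    "code_block": ["import pandas as pd", "df = pd.read_csv('train.csv')", "print('hi')"],
})
kernels = pd.DataFrame({
    "kernel_id": [1, 2],
    "comp_name": ["titanic", "digits"],
    "kaggle_upvotes": [10, 3],
})
competitions = pd.DataFrame({
    "comp_name": ["titanic", "digits"],
    "comp_type": ["type_a", "type_b"],  # placeholder values, not real data
})

joined = join_code4ml(blocks, kernels, competitions)
print(joined[["kernel_id", "code_block_id", "comp_name", "comp_type"]])
```

A left join keeps every code block even when its kernel or competition has no metadata row; an inner join would be the choice if only fully described blocks are wanted.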

data_updated.dvc contains 15 files:

| file name | shape | description | column names |
| --- | --- | --- | --- |
| dataset.csv | (1158, 20) | information about 1158 Kaggle competitions, including descriptions and metadata | ['description', 'metric', 'datatype', 'problem', 'problemtype', 'subject', 'comp_name', 'comp_type', 'Summarized', 'Summarized_flag', 'comp_url', 'ProblemType', 'number of columns (for tabular)', 'Data Format', 'Target Column(s) Name', 'Subtitle', 'ref', 'subtitle', 'EvaluationAlgorithmAbbreviation', 'data_sources'] |
| actual_graph_2022-06-01.csv | (92, 2) | mapping of graph vertex id to graph vertex value and subclass (id == index) | ['graph_vertex', 'graph_vertex_subclass'] |
| code_blocks_upto_21.csv | (2567648, 10) | code blocks up to 2021 and their metadata | ['kaggle_score', 'kaggle_comments', 'kaggle_upvotes', 'kaggle_link', 'kaggle_id', 'data_sources', 'code_block', 'code_block_id', 'subtitle', 'comp_name'] |
| code_blocks_21.csv | (175967, 10) | code blocks from 2021 and their metadata | ['kaggle_score', 'kaggle_comments', 'kaggle_upvotes', 'kaggle_link', 'kaggle_id', 'data_sources', 'code_block', 'code_block_id', 'subtitle', 'comp_name'] |
| markup_data_20220329.csv (2 versions) | (11721, 11) | information about labeled code blocks (the ones with a defined graph vertex id) from 54 competitions (incl. competition #0) | ['code_block', 'data_format', 'graph_vertex_id', 'errors', 'marks', 'kaggle_id', 'competition_id', 'comp_name', 'username', 'created_on'] |
| updated_labeled_data.csv | (9593, 25) | information about labeled code blocks (the ones with a defined graph vertex id), including competition information | ['ref_link', 'comp_name', 'comp_type', 'description', 'metric', 'comp_name_m', 'comp_type_m', 'description_m', 'metric_m', 'datatype', 'problem', 'problemtype', 'subject', 'graph_vertex_id', 'graph_vertex', 'graph_vertex_subclass', 'code_block_id', 'code_block', 'data_format', 'errors', 'marks', 'kaggle_id', 'competition_id', 'username', 'created_on'] |
| piplines_20220415.csv (3 versions) | (428, 6) | ML pipelines created for 58 competitions | ['kaggle_id', 'graph_vertex_subclass', 'competition_id', 'comp_name', 'code_block_id', 'len'] |
| name2embid.csv | (937, 2) | mapping of competition name to embedding id | ['comp_name', 'embid'] |
| metrics_mapping.csv | (62, 2) | mapping of old metric to new metric | ['Metric_name', 'Class_name'] |
| embs_summarized_description.pt | | description embeddings | |
| data_w_subtitles.csv | (512, 19) | 512 competitions with their subtitles and other data (old dataset) | |
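
The note "id == index" for actual_graph_2022-06-01.csv suggests that a labeled block's 'graph_vertex_id' can be resolved by looking up the row index of the graph table. The sketch below shows one way to do that with pandas. It assumes the graph table's DataFrame index holds the vertex ids (for a real file this may require something like `index_col=0` in `pd.read_csv`, which is an assumption). The helper name `attach_graph_labels` and all sample values are made up for illustration.

```python
import pandas as pd


def attach_graph_labels(markup: pd.DataFrame, graph: pd.DataFrame) -> pd.DataFrame:
    """Add 'graph_vertex' and 'graph_vertex_subclass' to labeled blocks (hypothetical helper)."""
    # The vertex id is matched against the graph table's index, not a column
    return markup.merge(graph, left_on="graph_vertex_id", right_index=True, how="left")


# Made-up stand-in for actual_graph_2022-06-01.csv; the index plays the role of the vertex id
graph = pd.DataFrame({
    "graph_vertex": ["vertex_a", "vertex_a", "vertex_b"],
    "graph_vertex_subclass": ["sub_0", "sub_1", "sub_2"],
})

# Made-up stand-in for a few rows of markup_data_20220329.csv
markup = pd.DataFrame({
    "code_block": ["x = 1", "y = 2", "z = 3"],
    "graph_vertex_id": [2, 0, 2],
})

labeled = attach_graph_labels(markup, graph)
print(labeled[["code_block", "graph_vertex_id", "graph_vertex", "graph_vertex_subclass"]])
```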