nl2ml

Alexander Levin 023461db4d dvc data updated 39 minutes ago
.dvc 7c97b2a9a3 dvc data fix 41 minutes ago
.ipynb_checkpoints 82b810b974 last checkpoints added 1 week ago
code2vec 97c5b91e60 code2vec folder added 2 weeks ago
data
graph 78e469d2ff Logreg trained on the new regex (graph_v2) 3 days ago
.gitattributes bdb584c714 Get rid of csv in Git LFS 3 weeks ago
.gitignore 62d9460b5b asd 1 hour ago
Comments vs commented code.ipynb a4b1957697 in-code comments classification added 2 weeks ago
README.md a1740b38d7 Update README.md 2 weeks ago
bert_classifier.ipynb 45a0bcfcdb changed names 3 weeks ago
bert_distances.ipynb 45a0bcfcdb changed names 3 weeks ago
data.dvc 023461db4d dvc data updated 39 minutes ago
kaggle.sh fc9e9f9c28 ramazyant files added 2 weeks ago
kaggle_parser.ipynb fc9e9f9c28 ramazyant files added 2 weeks ago
logreg_classifier.ipynb 4b167adea8 Titles added 2 days ago
metrics.csv bcd8b3a0ae LogReg Validation, graph_v2, chunk_size == 40 2 days ago
nl2ml_notebook_parser.py 8ba024a8c5 no message 4 months ago
params.yml bcd8b3a0ae LogReg Validation, graph_v2, chunk_size == 40 2 days ago
predict_tag.ipynb e034e4df1f TODO cleaned 2 days ago
regex.ipynb 78e469d2ff Logreg trained on the new regex (graph_v2) 3 days ago
svm_classifier.ipynb 65746108fc pipelines optimized 1 week ago
svm_train.py 6a5ca0927d print(metrics) 2 days ago

README.md

Source Code Classification

This is the old repository of the NL2ML project of the Laboratory of Big Data Analysis of Higher School of Economics (HSE LAMBDA).

Project page: https://www.notion.so/NL2ML-Corpus-1ed964c08eb049b383c73b9728c3a231

The repo is currently being migrated to the HSE LAMBDA GitLab: https://gitlab.com/lambda-hse/nl2ml

Project Goals:

The current short-term goal is to build a model that classifies a source code chunk and specifies exactly where in the chunk the detected class occurs (tag segmentation).

The long-term goal is to build a model that generates code from an English-language description of the task.

Contents:

nl2ml_notebook_parser.py - script for parsing Kaggle notebooks and converting them to JSON/CSV/Pandas.

bert_distances.ipynb - notebook with experiments on what the distance between BERT embeddings means when the input tokens are tokenized source code chunks.

bert_classifier.ipynb - notebook with a preprocessing and training pipeline.

regex.ipynb - notebook that creates labels for code chunks using regex.

logreg_classifier.ipynb - notebook that builds a logistic regression model on the regex labels using TF-IDF features.
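
Example: regex labeling of a code chunk

The idea behind regex-based labeling can be pictured with a minimal sketch like the one below. It is only an illustration: the tag names and patterns in `TAG_PATTERNS` and the helper `label_chunk` are hypothetical and are not taken from regex.ipynb. Besides the label itself, the sketch keeps the character span of each match, which is roughly the kind of information tag segmentation asks for.

```python
import re

# Hypothetical tag -> pattern mapping, made up for this example.
# The patterns actually used in regex.ipynb are not reproduced here.
TAG_PATTERNS = {
    "import": re.compile(r"^\s*(?:import|from)\s+\w+", re.MULTILINE),
    "data_load": re.compile(r"\bread_csv\s*\("),
    "model_fit": re.compile(r"\.fit\s*\("),
}


def label_chunk(chunk):
    """Return {tag: [(start, end), ...]} for every pattern that matches the chunk."""
    labels = {}
    for tag, pattern in TAG_PATTERNS.items():
        # finditer yields every non-overlapping match; span() gives its offsets.
        spans = [match.span() for match in pattern.finditer(chunk)]
        if spans:
            labels[tag] = spans
    return labels


chunk = "import pandas as pd\ndf = pd.read_csv('train.csv')\n"
labels = label_chunk(chunk)

# Print each detected tag with its position and the matched text.
for tag, spans in labels.items():
    for start, end in spans:
        print(tag, (start, end), repr(chunk[start:end]))
```

Labels produced this way could then serve as targets for a classifier such as the logistic regression on TF-IDF features mentioned above; the spans show where in the chunk each tag was detected.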