

Source Code Classification

This is the repository of the NL2ML project of the Laboratory of Methods for Big Data Analysis at the Higher School of Economics (HSE LAMBDA). The repo is a mirror of the HSE LAMBDA GitLab repository (https://gitlab.com/lambda-hse/nl2ml). The project page is at https://www.notion.so/NL2ML-Corpus-1ed964c08eb049b383c73b9728c3a231

Project Goals:

The current short-term goal is to build a model that classifies a source code chunk and pinpoints where in the chunk the detected class occurs (tag segmentation).
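
To make the short-term goal concrete, the sketch below shows the kind of output that tag segmentation implies: a class label together with the character span where it was detected. It uses hand-written regular expressions rather than a trained model, and the tag names and patterns (`TAG_PATTERNS`, `import`, `data_load`, `model_fit`) as well as the helper `segment_tags` are hypothetical, chosen only for illustration. They are not the project's label set or method.

```python
import re

# Hypothetical tags and patterns, for illustration only.
TAG_PATTERNS = {
    "import": re.compile(r"^\s*(?:import|from)\s+\w+", re.MULTILINE),
    "data_load": re.compile(r"\bread_csv\s*\("),
    "model_fit": re.compile(r"\.fit\s*\("),
}


def segment_tags(chunk):
    """Return (tag, start, end) for every pattern match in a code chunk."""
    found = []
    for tag, pattern in TAG_PATTERNS.items():
        for match in pattern.finditer(chunk):
            found.append((tag, match.start(), match.end()))
    # Sort by start offset so the result reads in source order.
    return sorted(found, key=lambda item: item[1])


chunk = "import pandas as pd\ndf = pd.read_csv('train.csv')\nmodel.fit(df)\n"
for tag, start, end in segment_tags(chunk):
    # Print the tag, its character span, and the matched text.
    print(tag, (start, end), repr(chunk[start:end]))
```

Each tuple pairs a class with the exact location in the chunk, which is the information a segmentation model would be expected to predict.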

The main goal is to build a model that generates code from a task description written in English.

Contents:

nl2ml_notebook_parser.py - a script that parses Kaggle notebooks and converts them to JSON, CSV, or pandas.

bert_distances.ipynb - a notebook with BERT experiments on whether distances between BERT embeddings are meaningful when the inputs are tokenized source code chunks.

bert_classifier.ipynb - a notebook with the preprocessing and training steps of the BERT pipeline.

regex.ipynb - a notebook that creates labels for code chunks using regular expressions.

logreg_classifier.ipynb.ipynb - a notebook that trains a logistic regression model on the regex labels with TF-IDF features and analyzes the outputs.

Comments vs commented code.ipynb - a notebook with a model that distinguishes natural-language comments from commented-out source code.

github_dataset.ipynb - a notebook that opens github_dataset.

predict_tag.ipynb - a notebook that predicts a class label (tag) with any model.

svm_classifier.ipynb - a notebook that trains an SVM (training has been replaced by svm_train.py) and analyzes the SVM outputs.

svm_train.py - a script for training the SVM model.
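
As an illustration of the first step in the list above, here is a minimal sketch of turning a notebook into code chunks and writing them as CSV. It relies only on the standard Jupyter notebook JSON layout (a `cells` list whose entries have a `cell_type` and a `source`). The function names `extract_code_chunks` and `chunks_to_csv` and the column names `chunk_id` and `code` are assumptions made for this example; the actual logic and output schema of nl2ml_notebook_parser.py are not described here.

```python
import csv
import io
import json


def extract_code_chunks(notebook_json):
    """Return the source of every non-empty code cell in a Jupyter notebook."""
    notebook = json.loads(notebook_json)
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # The notebook format allows "source" to be a string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        if source.strip():
            chunks.append(source)
    return chunks


def chunks_to_csv(chunks):
    """Serialize chunks as CSV text with hypothetical columns chunk_id, code."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["chunk_id", "code"])
    for chunk_id, code in enumerate(chunks):
        writer.writerow([chunk_id, code])
    return buffer.getvalue()


# A tiny in-memory notebook standing in for a downloaded Kaggle notebook.
example_notebook = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Load data"]},
        {"cell_type": "code", "metadata": {}, "outputs": [],
         "execution_count": None,
         "source": ["import pandas as pd\n", "df = pd.read_csv('train.csv')"]},
        {"cell_type": "code", "metadata": {}, "outputs": [],
         "execution_count": None, "source": "df.head()"},
    ],
})

chunks = extract_code_chunks(example_notebook)
print(chunks_to_csv(chunks))
```

Markdown cells are skipped, so only code reaches the output; the CSV writer quotes multi-line chunks so that each chunk stays in a single record.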


About

This is a repo of the Natural Language to Machine Learning (NL2ML) project of the Laboratory of Methods for Big Data Analysis at Higher School of Economics (HSE LAMBDA).

https://www.notion.so/NL2ML-Corpus-1ed964c08eb049b383c73b9728c3a231