This is my version of code2vec for Python 3. For now it works only with the Keras implementation. Some basic changes have been made:
The rest of the README is largely the same as the original code2vec README, with some changes reflecting my implementation. Note that the original work offers many more options (including models already trained on Java), so I really recommend working with it. Some file and folder names here still depend on the original, but anyone can work through them.
A neural network for learning distributed representations of code. This is made on top of the implementation of the model described in:
Uri Alon, Meital Zilberstein, Omer Levy and Eran Yahav, "code2vec: Learning Distributed Representations of Code", POPL'2019 [PDF]
October 2018 - The paper was accepted to POPL'2019!
April 2019 - The talk video is available here.
July 2019 - tf.keras model implementation added.
An online demo is available at https://code2vec.org/.
On Ubuntu:
python3 --version
python3 -c 'import tensorflow as tf; print(tf.__version__)'
nvcc --version
cat /usr/include/cudnn.h | grep CUDNN_MAJOR -A 2
git clone https://github.com/Kirili4ik/code2vec
cd code2vec
To have a preprocessed dataset to train a network on, you should create a new dataset of your own. It consists of three folders: train, test, and validation.
In order to create and preprocess a new dataset (for example, to compare code2vec to another model on another dataset):
source preprocess.sh
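Before running the script, the raw data should already be split into the three folders mentioned above. A minimal sketch of setting up that layout (the dataset name `my_dataset` and the example file are assumptions for illustration; adjust the paths configured inside preprocess.sh accordingly):

```shell
# Create the expected top-level layout (dataset name is an assumption).
mkdir -p my_dataset/train my_dataset/test my_dataset/validation
# Each folder should contain raw Python source files to preprocess, e.g.:
echo 'def add(a, b): return a + b' > my_dataset/train/example.py
ls my_dataset
```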
You should train a new model using a preprocessed dataset.
To train a model from scratch:
source train.sh
Once the score on the validation set stops improving, you can stop the training process (by killing it) and pick the iteration that performed best on the validation set. Supposing that iteration #8 is the chosen model, run:
python3 code2vec.py --framework keras --load models/my_first_model/saved_model --test data/my_dataset/my_dataset.test.c2v
To manually examine a trained model, run:
source my_predict.sh
After the model loads, follow the instructions: edit the file Input.py, enter a Python method or code snippet, and examine the model's predictions and attention scores.
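Any short Python method works as input. For example, a snippet like the following could be pasted into Input.py (this particular method is purely illustrative, not part of the project):

```python
def count_lines(path):
    """Count the number of lines in a text file."""
    with open(path) as f:
        return sum(1 for _ in f)
```

The model would then predict a name for this method (ideally something close to `count|lines`) and show which path-contexts its attention focused on.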
Follow Step 4, and the embedding for your snippet will be written to the file EMBEDDINGS.txt.
Run the command:
python3 my_find_synonim.py --label 'linear|algebra'
or use any other tag, and look at the labels closest to it.
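The idea behind finding the closest tags can be sketched with plain cosine similarity over the exported label embeddings. This is a hedged illustration, not the script's actual code; `closest_labels` and the toy `embeddings` dict are stand-ins for the vectors exported to targets.txt:

```python
import numpy as np

def closest_labels(query, embeddings, k=3):
    """Return the k labels whose vectors are most cosine-similar to `query`.
    `embeddings` maps label -> 1-D numpy vector (a stand-in for targets.txt)."""
    q = embeddings[query]

    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    scored = [(label, cos(q, v)) for label, v in embeddings.items() if label != query]
    return sorted(scored, key=lambda t: -t[1])[:k]
```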
Changing hyper-parameters is possible by editing the file config.py.
Here are some of the parameters and their description:
The maximum number of epochs to train the model. Stopping earlier must be done manually (by killing the process).
After how many training iterations a model should be saved.
Batch size in training.
Batch size in evaluating. Affects only the evaluation speed and memory consumption, does not affect the results.
Number of words with the highest scores in the prediction distribution (ŷ) to consider during prediction and evaluation.
Number of batches (during training / evaluating) to complete between two progress-logging records.
Number of training batches to complete between model evaluations on the test set.
The number of threads enqueuing examples to the reader queue.
Size of the buffer in which the reader shuffles examples during training. A bigger buffer provides better randomness, but requires more memory and may harm training throughput.
The buffer size (in bytes) of the CSV dataset reader.
The number of contexts to use in each example.
The max size of the token vocabulary (default: 1301136).
The max size of the target words vocabulary.
The max size of the path vocabulary.
Default embedding size to be used for tokens and paths if not specified otherwise.
Embedding size for tokens.
Embedding size for paths.
Size of code vectors.
Embedding size for target words.
Keep this number of newest trained versions during training.
Dropout rate used during training.
Whether to treat <OOV> and <PAD> as two different special tokens whenever possible.
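To make the list above concrete, a config.py edit might look like the following sketch. The attribute names and default values here are assumptions for illustration; check the actual names in your copy of config.py before editing:

```python
# Illustrative only: names and values are assumptions, not the project's defaults.
class Config:
    NUM_TRAIN_EPOCHS = 20          # max epochs; stop earlier by killing the process
    SAVE_EVERY_EPOCHS = 1          # how often a checkpoint is written
    TRAIN_BATCH_SIZE = 1024        # batch size in training
    TEST_BATCH_SIZE = 1024         # batch size in evaluation (speed/memory only)
    MAX_CONTEXTS = 200             # contexts used in each example
    DEFAULT_EMBEDDINGS_SIZE = 128  # token/path embedding size unless overridden
    DROPOUT_KEEP_RATE = 0.75       # dropout applied during training
```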
Code2vec supports the following features:
If you wish to keep a trained model for inference only (without the ability to continue training it) you can release the model using:
python3 code2vec.py --load models/my_first_model/saved_model --release
This will save a copy of the trained model with the '.release' suffix. A "released" model usually takes 3x less disk space.
Exported embeddings are saved without subtoken delimiters ("toLower" is saved as "tolower").
In order to export embeddings from a trained model, use:
source my_get_embeddings.sh
This creates two files: tokens.txt and targets.txt.
This saves the tokens/targets embedding matrices in word2vec text format to the specified files, in which:
- the first line is: <vocab_size> <dimension>
- each of the following lines contains: <word> <float_1> <float_2> ... <float_dimension>
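Parsing the format manually is straightforward. A minimal sketch of a reader for the layout described above (the function name is ours, not part of the project):

```python
import numpy as np

def load_word2vec_text(path):
    """Parse a word2vec-format text file: a '<vocab_size> <dim>' header,
    then one '<word> <f1> ... <fdim>' line per word."""
    vectors = {}
    with open(path) as f:
        vocab_size, dim = map(int, f.readline().split())
        for line in f:
            parts = line.rstrip().split(' ')
            assert len(parts) - 1 == dim  # each row must match the header dimension
            vectors[parts[0]] = np.array(parts[1:], dtype=float)
    assert len(vectors) == vocab_size
    return vectors
```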
These word2vec files can be manually parsed or easily loaded and inspected using the gensim python package:
python3
>>> from gensim.models import KeyedVectors as word2vec
>>> vectors_text_path = 'models/java14_model/targets.txt'  # or: 'models/java14_model/tokens.txt'
>>> model = word2vec.load_word2vec_format(vectors_text_path, binary=False)
>>> model.most_similar(positive=['equals', 'to|lower']) # or: 'tolower', if using the downloaded embeddings
>>> model.most_similar(positive=['download', 'send'], negative=['receive'])
code2vec: Learning Distributed Representations of Code
@article{alon2019code2vec,
author = {Alon, Uri and Zilberstein, Meital and Levy, Omer and Yahav, Eran},
title = {Code2Vec: Learning Distributed Representations of Code},
journal = {Proc. ACM Program. Lang.},
issue_date = {January 2019},
volume = {3},
number = {POPL},
month = jan,
year = {2019},
issn = {2475-1421},
pages = {40:1--40:29},
articleno = {40},
numpages = {29},
url = {http://doi.acm.org/10.1145/3290353},
doi = {10.1145/3290353},
acmid = {3290353},
publisher = {ACM},
address = {New York, NY, USA},
keywords = {Big Code, Distributed Representations, Machine Learning},
}