In this workshop, we will learn about word embeddings, train our own word embeddings, and then use the word embeddings for some cool applications.
From the root of the repository:
$ pip install -r requirements.txt ; dvc pull -r origin
Open notebook.ipynb and press the 'Open in Colab' button; you should then be able to run the cells.
Once you have the environment set up, change your directory to src/. You can run all your code from here.
Word2Vec can use contextual information to pre-process the dataset in two ways: continuous bag-of-words (CBOW) and skip-gram.
You can play around with either pre-processing technique by running:
$ python -m data.cbow # or
$ python -m data.skipgram
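To make the difference between the two techniques concrete, here is a minimal sketch of how training pairs could be built from a tokenized sentence. It is not the code in data.cbow or data.skipgram; the function names, the `window` parameter, and the sample sentence are assumptions chosen for illustration. CBOW pairs a window of surrounding words with the center word it should predict, while skip-gram pairs the center word with each surrounding word individually.

```python
def cbow_pairs(tokens, window=2):
    """Build (context_words, target_word) pairs: the context predicts the center word."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:  # skip tokens with no neighbours (single-word input)
            pairs.append((context, target))
    return pairs


def skipgram_pairs(tokens, window=2):
    """Build (center_word, context_word) pairs: the center word predicts each neighbour."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical sample sentence, not taken from the workshop dataset.
tokens = "the quick brown fox jumps".split()
cbow = cbow_pairs(tokens, window=1)
skipgram = skipgram_pairs(tokens, window=1)

print(cbow[1])       # context and target for "quick"
print(skipgram[:3])  # first few (center, context) pairs
```

With the same window size, skip-gram produces more (and smaller) training examples than CBOW, which is one reason the two variants behave differently on small datasets.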
Although the objective is not exactly the same, we use a feedforward autoencoder for the model. We are interested in the encoding weights of this architecture, since these are the actual word vectors that we desire.
You can play around with the untrained model by running:
$ python -m w2v.arch
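To show why the encoding weights are the word vectors, here is a small pure-Python sketch of an untrained feedforward model of this shape. It is an illustration under assumptions, not the architecture in w2v.arch: the names `w_enc`, `w_dec`, `forward`, the toy vocabulary, and the softmax output layer are all hypothetical. Because the input is a one-hot vector, multiplying it by the encoding matrix simply selects one row, and that row is the embedding of the word.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Create a rows x cols matrix of small random weights."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def softmax(scores):
    """Turn raw scores into a probability distribution (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(word_index, w_enc, w_dec):
    """Encode one word, then decode to a distribution over the vocabulary."""
    # A one-hot input times w_enc is just a row lookup: this row is the word vector.
    hidden = w_enc[word_index]
    vocab_size = len(w_dec[0])
    scores = [
        sum(hidden[d] * w_dec[d][v] for d in range(len(hidden)))
        for v in range(vocab_size)
    ]
    return hidden, softmax(scores)


rng = random.Random(0)                      # fixed seed so runs are repeatable
vocab = ["the", "quick", "brown", "fox"]    # hypothetical toy vocabulary
dims = 2                                    # embedding size (assumed for illustration)
w_enc = make_matrix(len(vocab), dims, rng)  # encoding weights: one row per word
w_dec = make_matrix(dims, len(vocab), rng)  # decoding weights: back to vocabulary size

hidden, probs = forward(vocab.index("fox"), w_enc, w_dec)
print(hidden)  # the (untrained) vector for "fox"
print(probs)   # predicted distribution over the vocabulary
```

Training would adjust both matrices so that the output distribution matches the observed context words; after training, only the rows of the encoding matrix need to be kept.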
Finally, you can execute the entire training sequence by running:
$ python -m w2v.train
You can modify hyperparameters by updating src/const.py. You can get an (ugly) graph of the embeddings if you set const.VECTOR_DIMENSIONS = 2 and run model.visualize(dataset.words).
Gensim is a brilliant library that allows us to abstract implementation details for Word2Vec, with optimizations for both training and inference.
You can run training code for Gensim using:
$ python -m gs.train
However, even this depends on good data, so we can skip training altogether by using Google's pre-trained weights:
$ python -m gs.pretrained
Note: These pre-trained weights aren't that great either. Word embeddings generally aren't SoTA anymore. I recommend doing some research and finding the best consumer-grade solution available for your task, especially if you want to use embeddings in a production setting (as of 15 Sep 2023, Sentence Transformers is a nice option).