

Introduction

FAIR Sequence-to-Sequence Toolkit (PyTorch)

This is a PyTorch version of fairseq, a sequence-to-sequence learning toolkit from Facebook AI Research. The original authors of this reimplementation are (in no particular order) Sergey Edunov, Myle Ott, and Sam Gross. The toolkit implements the fully convolutional model described in Convolutional Sequence to Sequence Learning and features multi-GPU training on a single machine as well as fast beam search generation on both CPU and GPU. We provide pre-trained models for English to French and English to German translation.

Citation

If you use the code in your paper, then please cite it as:

@inproceedings{gehring2017convs2s,
  author    = {Gehring, Jonas and Auli, Michael and Grangier, David and Yarats, Denis and Dauphin, Yann N},
  title     = "{Convolutional Sequence to Sequence Learning}",
  booktitle = {Proc. of ICML},
  year      = 2017,
}

Requirements and Installation

  • A computer running macOS or Linux
  • For training new models, you'll also need an NVIDIA GPU and NCCL
  • Python version 3.6
  • A PyTorch installation

Currently fairseq-py requires PyTorch version >= 0.3.0. Please follow the instructions here: https://github.com/pytorch/pytorch#installation.

If you use Docker, make sure to increase the shared memory size by passing either --ipc=host or --shm-size as a command-line option to nvidia-docker run.

After PyTorch is installed, you can install fairseq-py with:

pip install -r requirements.txt
python setup.py build
python setup.py develop

Quick Start

The following command-line tools are available:

  • python preprocess.py: Data pre-processing: build vocabularies and binarize training data
  • python train.py: Train a new model on one or multiple GPUs
  • python generate.py: Translate pre-processed data with a trained model
  • python interactive.py: Translate raw text with a trained model
  • python score.py: BLEU scoring of generated translations against reference translations

Evaluating Pre-trained Models

First, download a pre-trained model along with its vocabularies:

$ curl https://s3.amazonaws.com/fairseq-py/models/wmt14.v2.en-fr.fconv-py.tar.bz2 | tar xvjf -

This model uses a Byte Pair Encoding (BPE) vocabulary, so the encoding must be applied to the source text before it can be translated. This can be done with the apply_bpe.py script using the wmt14.en-fr.fconv-py/bpecodes file. @@ is used as a continuation marker, and the original text can be recovered with e.g. sed 's/@@ //g' or by passing the --remove-bpe flag to generate.py. Prior to BPE, input text needs to be tokenized using tokenizer.perl from mosesdecoder.
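
For post-processing in Python rather than with sed, the marker removal can be sketched as below. This is a minimal illustration of the same substitution the sed expression performs; the function name remove_bpe is hypothetical and is not part of fairseq-py.

```python
def remove_bpe(line, marker="@@ "):
    """Join BPE sub-word units by deleting every continuation marker.

    Mirrors the substitution s/@@ //g: each "@@ " is replaced by nothing,
    so "mam@@ mal" becomes "mammal".
    """
    return line.replace(marker, "")


if __name__ == "__main__":
    encoded = "Why is it rare to discover new marine mam@@ mal species ?"
    print(remove_bpe(encoded))
```

Note that this only undoes the BPE segmentation; the text is still tokenized in the Moses style (for example, with a space before the question mark).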

Let's use python interactive.py to generate translations interactively. Here, we use a beam size of 5:

$ MODEL_DIR=wmt14.en-fr.fconv-py
$ python interactive.py \
 --path $MODEL_DIR/model.pt $MODEL_DIR \
 --beam 5
| loading model(s) from wmt14.en-fr.fconv-py/model.pt
| [en] dictionary: 44206 types
| [fr] dictionary: 44463 types
| Type the input sentence and press return:
> Why is it rare to discover new marine mam@@ mal species ?
O       Why is it rare to discover new marine mam@@ mal species ?
H       -0.06429661810398102    Pourquoi est-il rare de découvrir de nouvelles espèces de mammifères marins ?
A       0 1 3 3 5 6 6 8 8 8 7 11 12

This generation script produces four types of output lines: a line prefixed with S shows the supplied source sentence after applying the vocabulary; O is a copy of the original source sentence; H is the hypothesis along with its average log-likelihood; and A lists the attention maxima for each word in the hypothesis, including the end-of-sentence marker, which is omitted from the text. The sample above shows only the O, H, and A lines.
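
If you want to consume this output programmatically, the prefixed lines can be split apart as in the sketch below. It assumes the prefix, score, and text are separated by whitespace (tabs or spaces); parse_output and the dictionary keys are hypothetical names chosen for illustration.

```python
def parse_output(lines):
    """Collect the O, H and A lines of one translation into a dict."""
    result = {}
    for line in lines:
        fields = line.split(None, 1)  # prefix, then the rest of the line
        if len(fields) < 2:
            continue
        tag, rest = fields
        if tag == "O":
            result["source"] = rest.strip()
        elif tag == "H":
            parts = rest.split(None, 1)  # score, then the hypothesis text
            result["score"] = float(parts[0])
            result["hypothesis"] = parts[1].strip() if len(parts) > 1 else ""
        elif tag == "A":
            result["attention"] = [int(i) for i in rest.split()]
    return result


sample = [
    "O\tWhy is it rare to discover new marine mam@@ mal species ?",
    "H\t-0.06429661810398102\tPourquoi est-il rare de découvrir de "
    "nouvelles espèces de mammifères marins ?",
    "A\t0 1 3 3 5 6 6 8 8 8 7 11 12",
]
parsed = parse_output(sample)
# One attention index per hypothesis token, plus one for end-of-sentence.
assert len(parsed["attention"]) == len(parsed["hypothesis"].split()) + 1
```

The final assertion reflects the description above: the hypothesis has 12 tokens and the A line has 13 entries, the extra one belonging to the end-of-sentence marker.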

Check below for a full list of pre-trained models available.

Training a New Model

Data Pre-processing

The fairseq-py source distribution contains an example pre-processing script for the IWSLT 2014 German-English corpus. Pre-process and binarize the data as follows:

$ cd data/
$ bash prepare-iwslt14.sh
$ cd ..
$ TEXT=data/iwslt14.tokenized.de-en
$ python preprocess.py --source-lang de --target-lang en \
  --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
  --destdir data-bin/iwslt14.tokenized.de-en

This will write binarized data that can be used for model training to data-bin/iwslt14.tokenized.de-en.
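
Conceptually, "building vocabularies and binarizing" means mapping each token to an integer index and storing sentences as index sequences. The sketch below illustrates that idea on toy data only; it is not the preprocess.py implementation or its on-disk format, and the names build_vocab, binarize, and the <unk> symbol are assumptions made for the example.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical placeholder for out-of-vocabulary tokens


def build_vocab(sentences, min_count=1):
    """Assign an index to every token seen at least min_count times."""
    counts = Counter(tok for s in sentences for tok in s.split())
    vocab = {UNK: 0}
    # Most frequent tokens first; ties broken alphabetically for determinism.
    for tok, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if n >= min_count:
            vocab[tok] = len(vocab)
    return vocab


def binarize(sentence, vocab):
    """Replace each whitespace-separated token by its vocabulary index."""
    return [vocab.get(tok, vocab[UNK]) for tok in sentence.split()]


# Toy tokenized training data (made up for illustration).
train = ["danke .", "danke schön ."]
vocab = build_vocab(train)
ids = binarize("danke sehr .", vocab)  # "sehr" is unseen and maps to <unk>
```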

Training

Use python train.py to train a new model. Here are a few example settings that work well for the IWSLT 2014 dataset:

$ mkdir -p checkpoints/fconv
$ CUDA_VISIBLE_DEVICES=0 python train.py data-bin/iwslt14.tokenized.de-en \
  --lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 4000 \
  --arch fconv_iwslt_de_en --save-dir checkpoints/fconv

By default, python train.py will use all available GPUs on your machine. Use the CUDA_VISIBLE_DEVICES environment variable to select specific GPUs and/or to change the number of GPU devices that will be used.

Also note that the batch size is specified in terms of the maximum number of tokens per batch (--max-tokens). You may need to use a smaller value depending on the available GPU memory on your system.
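
To build intuition for a token-based batch size, the sketch below groups sentences so that the padded size of each batch (number of sentences times the longest sentence) stays within a token budget. It is an illustrative assumption about how such batching can work, not the batching code used by train.py; batch_by_tokens is a hypothetical name.

```python
def batch_by_tokens(lengths, max_tokens):
    """Group sentence indices so every padded batch fits in max_tokens.

    lengths[i] is the token count of sentence i. Sentences are visited in
    order of increasing length so that similar lengths share a batch.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches, batch, longest = [], [], 0
    for idx in order:
        n = lengths[idx]
        if n > max_tokens:
            raise ValueError(f"sentence {idx} exceeds max_tokens")
        new_longest = max(longest, n)
        # Padded size if this sentence were added to the current batch.
        if batch and new_longest * (len(batch) + 1) > max_tokens:
            batches.append(batch)
            batch, new_longest = [], n
        batch.append(idx)
        longest = new_longest
    if batch:
        batches.append(batch)
    return batches


batches = batch_by_tokens([5, 3, 8, 2, 7], max_tokens=16)
```

With a budget of 16 tokens, the three shortest sentences (lengths 2, 3, 5) form one batch of padded size 15, and the two longest (lengths 7, 8) form another of padded size 16. Lowering the budget yields more, smaller batches, which is why a smaller --max-tokens reduces GPU memory use.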

Generation

Once your model is trained, you can generate translations using python generate.py (for binarized data) or python interactive.py (for raw text):

$ python generate.py data-bin/iwslt14.tokenized.de-en \
  --path checkpoints/fconv/checkpoint_best.pt \
  --batch-size 128 --beam 5
  | [de] dictionary: 35475 types
  | [en] dictionary: 24739 types
  | data-bin/iwslt14.tokenized.de-en test 6750 examples
  | model fconv
  | loaded checkpoint checkpoints/fconv/checkpoint_best.pt
  S-721   danke .
  T-721   thank you .
  ...

To generate translations with only a CPU, use the --cpu flag. BPE continuation markers can be removed with the --remove-bpe flag.

Pre-trained Models

We provide the following pre-trained fully convolutional sequence-to-sequence models:

In addition, we provide pre-processed and binarized test sets for the models above:

Generation with the binarized test sets can be run in batch mode as follows, e.g. for English-French on a GTX-1080ti:

$ curl https://s3.amazonaws.com/fairseq-py/models/wmt14.v2.en-fr.fconv-py.tar.bz2 | tar xvjf - -C data-bin
$ curl https://s3.amazonaws.com/fairseq-py/data/wmt14.v2.en-fr.newstest2014.tar.bz2 | tar xvjf - -C data-bin
$ python generate.py data-bin/wmt14.en-fr.newstest2014  \
  --path data-bin/wmt14.en-fr.fconv-py/model.pt \
  --beam 5 --batch-size 128 --remove-bpe | tee /tmp/gen.out
...
| Translated 3003 sentences (96311 tokens) in 166.0s (580.04 tokens/s)
| Generate test with beam=5: BLEU4 = 40.83, 67.5/46.9/34.4/25.5 (BP=1.000, ratio=1.006, syslen=83262, reflen=82787)

# Scoring with score.py:
$ grep ^H /tmp/gen.out | cut -f3- > /tmp/gen.out.sys
$ grep ^T /tmp/gen.out | cut -f2- > /tmp/gen.out.ref
$ python score.py --sys /tmp/gen.out.sys --ref /tmp/gen.out.ref
BLEU4 = 40.83, 67.5/46.9/34.4/25.5 (BP=1.000, ratio=1.006, syslen=83262, reflen=82787)
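
The fields of the BLEU line can be sanity-checked by hand. Under the standard BLEU definition, the score is the brevity penalty (BP) times the geometric mean of the 1- to 4-gram precisions, and BP is 1 whenever the system output is at least as long as the reference. The sketch below recomputes the score from the numbers printed above; the function name bleu is hypothetical and this is not the score.py implementation.

```python
import math


def bleu(precisions, sys_len, ref_len):
    """BLEU from n-gram precisions (in percent) and corpus lengths."""
    # Brevity penalty: only penalize output shorter than the reference.
    bp = 1.0 if sys_len > ref_len else math.exp(1.0 - ref_len / sys_len)
    # Geometric mean computed in log space.
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return bp * math.exp(log_mean)


score = bleu([67.5, 46.9, 34.4, 25.5], sys_len=83262, ref_len=82787)
ratio = 83262 / 82787  # syslen / reflen, reported as ratio=1.006
print(f"BLEU4 ~ {score:.2f}, ratio = {ratio:.3f}")
```

Because the printed precisions are rounded to one decimal place, the recomputed value agrees with the reported 40.83 only to within a few hundredths.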

Join the fairseq community

License

fairseq-py is BSD-licensed. The license applies to the pre-trained models as well. We also provide an additional patent grant.

About

A fork of fairseq, migrated to DVC and used for NLP research.
