
Pay Less Attention with Lightweight and Dynamic Convolutions (Wu et al., 2019)

This page contains pointers to pre-trained models, as well as instructions on how to train new models, for our paper.

Citation:

@inproceedings{wu2018pay,
  title = {Pay Less Attention with Lightweight and Dynamic Convolutions},
  author = {Felix Wu and Angela Fan and Alexei Baevski and Yann Dauphin and Michael Auli},
  booktitle = {International Conference on Learning Representations},
  year = {2019},
  url = {https://openreview.net/forum?id=SkVhlh09tX},
}

Translation

Pre-trained models

For some datasets we also release models without GLUs, which are faster at inference.

Description	Dataset	Model	Test set(s)
LightConv (without GLUs)	IWSLT14 German-English	download (.tar.bz2)	IWSLT14 test: download (.tar.bz2)
DynamicConv (without GLUs)	IWSLT14 German-English	download (.tar.bz2)	IWSLT14 test: download (.tar.bz2)
LightConv (without GLUs)	WMT16 English-German	download (.tar.bz2)	newstest2014 (shared vocab): download (.tar.bz2)
DynamicConv (without GLUs)	WMT16 English-German	download (.tar.bz2)	newstest2014 (shared vocab): download (.tar.bz2)
LightConv	WMT16 English-German	download (.tar.bz2)	newstest2014 (shared vocab): download (.tar.bz2)
DynamicConv	WMT16 English-German	download (.tar.bz2)	newstest2014 (shared vocab): download (.tar.bz2)
LightConv	WMT14 English-French	download (.tar.bz2)	newstest2014: download (.tar.bz2)
DynamicConv	WMT14 English-French	download (.tar.bz2)	newstest2014: download (.tar.bz2)
LightConv	WMT17 Chinese-English	download (.tar.bz2)	newstest2017: download (.tar.bz2)
DynamicConv	WMT17 Chinese-English	download (.tar.bz2)	newstest2017: download (.tar.bz2)
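
The archives in the table are bzip2-compressed tarballs. A minimal sketch of unpacking one is shown here; the archive name `model.tar.bz2`, the `pretrained` directory, and the `extract_archive` helper are placeholders chosen for illustration, not names taken from the release.

```shell
# Sketch: unpack a downloaded model archive into its own directory.
# The archive name below is a placeholder, not a real release file name.
extract_archive() {
    # usage: extract_archive ARCHIVE DEST_DIR
    mkdir -p "$2" && tar -xjf "$1" -C "$2"
}

ARCHIVE="${ARCHIVE:-model.tar.bz2}"   # hypothetical local path
if [ -f "$ARCHIVE" ]; then
    tar -tjf "$ARCHIVE"               # list the contents before unpacking
    extract_archive "$ARCHIVE" pretrained
fi
```

Listing the contents first makes it easy to see which file to pass to --path when evaluating.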

Preprocessing the training datasets

Please follow the instructions in examples/translation/README.md to preprocess the data.

Training and evaluation options:

To use the model without GLU, set --encoder-glu 0 --decoder-glu 0. For LightConv, set --encoder-conv-type lightweight --decoder-conv-type lightweight; otherwise the default is DynamicConv. For best BLEU results, lenpen may need to be tuned manually.
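
The four combinations of convolution type and GLU setting can be captured in a small helper. This is only a sketch: `conv_flags` is a hypothetical function, and it emits no conv-type flag for DynamicConv because the text above says that is the default.

```shell
# Sketch: print the architecture flags for one conv type / GLU combination.
# conv_flags is a hypothetical helper; the flags themselves come from the text above.
conv_flags() {
    # usage: conv_flags lightweight|dynamic 0|1
    local flags=""
    if [ "$1" = "lightweight" ]; then
        flags="--encoder-conv-type lightweight --decoder-conv-type lightweight "
    fi
    # DynamicConv is the default, so no conv-type flag is added for it.
    echo "${flags}--encoder-glu $2 --decoder-glu $2"
}

conv_flags lightweight 0   # LightConv without GLU
conv_flags dynamic 1       # DynamicConv with GLU
```

The output can be spliced into a training command with `$(conv_flags lightweight 0)` in place of the trailing GLU flags.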

IWSLT14 De-En

Training and evaluating DynamicConv (without GLU) on a GPU:

# Training
SAVE="save/dynamic_conv_iwslt"
mkdir -p $SAVE 
CUDA_VISIBLE_DEVICES=0 python train.py data-bin/iwslt14.tokenized.de-en \
    --clip-norm 0 --optimizer adam --lr 0.0005 \
    --source-lang de --target-lang en --max-tokens 4000 --no-progress-bar \
    --log-interval 100 --min-lr '1e-09' --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --lr-scheduler inverse_sqrt \
    --ddp-backend=no_c10d \
    --max-update 50000 --warmup-updates 4000 --warmup-init-lr '1e-07' \
    --adam-betas '(0.9, 0.98)' --keep-last-epochs 10 \
    -a lightconv_iwslt_de_en --save-dir $SAVE \
    --dropout 0.3 --attention-dropout 0.1 --weight-dropout 0.1 \
    --encoder-glu 0 --decoder-glu 0
python scripts/average_checkpoints.py --inputs $SAVE \
    --num-epoch-checkpoints 10 --output "${SAVE}/checkpoint_last10_avg.pt"

# Evaluation
CUDA_VISIBLE_DEVICES=0 python generate.py data-bin/iwslt14.tokenized.de-en \
    --path "${SAVE}/checkpoint_last10_avg.pt" --batch-size 128 --beam 4 \
    --remove-bpe --lenpen 1 --gen-subset test --quiet
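
Since lenpen may need manual tuning, one way to organize a sweep is to generate one evaluation command per candidate value. The sketch below only prints the commands (a dry run). The candidate values, the `lenpen_sweep_commands` name, and the choice to tune on `--gen-subset valid` (which assumes a validation split was binarized during preprocessing) are assumptions, not settings from the paper.

```shell
# Sketch: print one generate.py command per length-penalty value (dry run).
# The candidate values and the function name are illustrative assumptions.
SAVE="${SAVE:-save/dynamic_conv_iwslt}"

lenpen_sweep_commands() {
    # usage: lenpen_sweep_commands LENPEN [LENPEN ...]
    local lenpen
    for lenpen in "$@"; do
        printf 'python generate.py data-bin/iwslt14.tokenized.de-en --path %s --batch-size 128 --beam 4 --remove-bpe --lenpen %s --gen-subset valid --quiet\n' \
            "${SAVE}/checkpoint_last10_avg.pt" "$lenpen"
    done
}

lenpen_sweep_commands 0.6 0.8 1 1.2
```

Piping the output to `sh` would run the commands in sequence; the best value on the validation split can then be used once on the test split.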

WMT16 En-De

Training and evaluating DynamicConv (with GLU) on WMT16 En-De using the cosine scheduler on one machine with 8 V100 GPUs:

# Training
SAVE="save/dynamic_conv_wmt16en2de"
mkdir -p $SAVE
python -m torch.distributed.launch --nproc_per_node 8 train.py \
    data-bin/wmt16_en_de_bpe32k --fp16 --log-interval 100 --no-progress-bar \
    --max-update 30000 --share-all-embeddings --optimizer adam \
    --adam-betas '(0.9, 0.98)' \
    --clip-norm 0.0 --weight-decay 0.0 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --update-freq 16 --keep-last-epochs 10 \
    --ddp-backend=no_c10d --max-tokens 3584 \
    --lr-scheduler cosine --warmup-init-lr 1e-7 --warmup-updates 10000 \
    --lr-shrink 1 --max-lr 0.001 --lr 1e-7 --min-lr 1e-9 \
    --t-mult 1 --lr-period-updates 20000 \
    --arch lightconv_wmt_en_de_big --save-dir $SAVE \
    --dropout 0.3 --attention-dropout 0.1 --weight-dropout 0.1 \
    --encoder-glu 1 --decoder-glu 1

# Evaluation
CUDA_VISIBLE_DEVICES=0 python generate.py data-bin/wmt16.en-de.joined-dict.newstest2014 \
    --path "${SAVE}/checkpoint_best.pt" --batch-size 128 --beam 5 --remove-bpe \
    --lenpen 0.5 --gen-subset test > wmt16_gen.txt
bash scripts/compound_split_bleu.sh wmt16_gen.txt
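
In the training command above, 10000 warmup updates plus one 20000-update cosine period add up to the 30000-update budget. The sketch below evaluates the textbook warmup-plus-cosine formula at a few update counts. It assumes that --lr is the low end of the cosine and --max-lr the high end, and that the rate after warmup is low + 0.5 * (high - low) * (1 + cos(pi * t / period)); fairseq's scheduler may differ in detail, and `cosine_lr` is a hypothetical helper.

```shell
# Sketch: learning rate at a given update under linear warmup followed by
# cosine annealing, using the values from the WMT16 En-De command.
# Assumption: after warmup the rate follows
#   lr = low + 0.5 * (high - low) * (1 + cos(pi * t / period))
# with t counted from the end of warmup; fairseq may differ in detail.
cosine_lr() {
    # usage: cosine_lr UPDATE
    awk -v n="$1" 'BEGIN {
        init = 1e-7; high = 0.001; low = 1e-7   # --warmup-init-lr, --max-lr, --lr
        warmup = 10000; period = 20000          # --warmup-updates, --lr-period-updates
        pi = atan2(0, -1)
        if (n < warmup) {
            # linear ramp from init to high
            lr = init + (high - init) * n / warmup
        } else {
            # position inside the current cosine period
            t = (n - warmup) % period
            lr = low + 0.5 * (high - low) * (1 + cos(pi * t / period))
        }
        printf "%.8f\n", lr
    }'
}

for n in 0 5000 10000 20000 29999; do
    echo "update $n: lr $(cosine_lr "$n")"
done
```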

WMT14 En-Fr

Training and evaluating DynamicConv (with GLU) on WMT14 En-Fr using the cosine scheduler on one machine with 8 V100 GPUs:

# Training
SAVE="save/dynamic_conv_wmt14en2fr"
mkdir -p $SAVE
python -m torch.distributed.launch --nproc_per_node 8 train.py \
    data-bin/wmt14_en_fr --fp16 --log-interval 100 --no-progress-bar \
    --max-update 30000 --share-all-embeddings --optimizer adam \
    --adam-betas '(0.9, 0.98)' \
    --clip-norm 0.0 --weight-decay 0.0 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --update-freq 16 --keep-last-epochs 10 \
    --ddp-backend=no_c10d --max-tokens 3584 \
    --lr-scheduler cosine --warmup-init-lr 1e-7 --warmup-updates 10000 \
    --lr-shrink 1 --max-lr 0.001 --lr 1e-7 --min-lr 1e-9 \
    --t-mult 1 --lr-period-updates 70000 \
    --arch lightconv_wmt_en_fr_big --save-dir $SAVE \
    --dropout 0.1 --attention-dropout 0.1 --weight-dropout 0.1 \
    --encoder-glu 1 --decoder-glu 1

# Evaluation
CUDA_VISIBLE_DEVICES=0 python generate.py data-bin/wmt14.en-fr.joined-dict.newstest2014 \
    --path "${SAVE}/checkpoint_best.pt" --batch-size 128 --beam 5 --remove-bpe \
    --lenpen 0.9 --gen-subset test
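
Both WMT training commands above use 8 GPUs with --max-tokens 3584 and --update-freq 16. Assuming gradients are accumulated over --update-freq batches on every GPU, the product of the three numbers is an upper bound on tokens per optimizer step. The sketch below computes that bound and the --update-freq that would keep it unchanged on a machine with a different GPU count. Both helper names are hypothetical, and the 2-GPU machine is only an example.

```shell
# Sketch: upper bound on tokens per optimizer step for the WMT commands.
tokens_per_update() {
    # usage: tokens_per_update NUM_GPUS MAX_TOKENS UPDATE_FREQ
    echo $(( $1 * $2 * $3 ))
}

# --update-freq that keeps the same bound when the GPU count changes
# (assumes the GPU count divides 8 * 16 evenly).
update_freq_for_gpus() {
    # usage: update_freq_for_gpus NUM_GPUS
    echo $(( 8 * 16 / $1 ))
}

tokens_per_update 8 3584 16   # 8 GPUs, as in the commands above: 458752
update_freq_for_gpus 2        # hypothetical 2-GPU machine: 64
```

Actual batches are usually somewhat smaller than the bound, since --max-tokens is a cap per batch rather than an exact size.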
