Evaluating Pre-trained Models
=============================

First, download a pre-trained model along with its vocabularies:

.. code-block:: console

    > curl https://dl.fbaipublicfiles.com/fairseq/models/wmt14.v2.en-fr.fconv-py.tar.bz2 | tar xvjf -

This model uses a `Byte Pair Encoding (BPE)
vocabulary <https://arxiv.org/abs/1508.07909>`__, so we'll have to apply
the encoding to the source text before it can be translated. This can be
done with the
`apply\_bpe.py <https://github.com/rsennrich/subword-nmt/blob/master/apply_bpe.py>`__
script using the ``wmt14.en-fr.fconv-cuda/bpecodes`` file. ``@@`` is
used as a continuation marker and the original text can be easily
recovered with e.g. ``sed s/@@ //g`` or by passing the ``--remove-bpe``
flag to :ref:`generate.py`. Prior to BPE, input text needs to be tokenized
using ``tokenizer.perl`` from
`mosesdecoder <https://github.com/moses-smt/mosesdecoder>`__.
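
As a rough sketch, tokenization and BPE encoding of a raw English sentence can
be chained as follows; the script locations and the ``bpecodes`` path are
assumptions that depend on where you cloned ``mosesdecoder``/``subword-nmt``
and extracted the model archive:

.. code-block:: console

    > echo "Why is it rare to discover new marine mammal species?" \
        | perl mosesdecoder/scripts/tokenizer/tokenizer.perl -l en \
        | python subword-nmt/apply_bpe.py -c wmt14.en-fr.fconv-py/bpecodes \
        > source.bpe.en
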
Let's use :ref:`interactive.py` to generate translations
interactively. Here, we use a beam size of 5:

.. code-block:: console

    > MODEL_DIR=wmt14.en-fr.fconv-py
    > python interactive.py \
        --path $MODEL_DIR/model.pt $MODEL_DIR \
        --beam 5 --source-lang en --target-lang fr
    | loading model(s) from wmt14.en-fr.fconv-py/model.pt
    | [en] dictionary: 44206 types
    | [fr] dictionary: 44463 types
    | Type the input sentence and press return:
    > Why is it rare to discover new marine mam@@ mal species ?
    O   Why is it rare to discover new marine mam@@ mal species ?
    H   -0.1525060087442398   Pourquoi est @-@ il rare de découvrir de nouvelles espèces de mammifères marins ?
    P   -0.2221 -0.3122 -0.1289 -0.2673 -0.1711 -0.1930 -0.1101 -0.1660 -0.1003 -0.0740 -0.1101 -0.0814 -0.1238 -0.0985 -0.1288

This generation script produces three types of outputs: a line prefixed
with *O* is a copy of the original source sentence; *H* is the
hypothesis along with an average log-likelihood; and *P* is the
positional score per token position, including the
end-of-sentence marker which is omitted from the text.
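
As a quick sanity check, the ``sed`` expression mentioned above recovers the
original tokens from any BPE-encoded text:

.. code-block:: console

    > echo "Why is it rare to discover new marine mam@@ mal species ?" | sed 's/@@ //g'
    Why is it rare to discover new marine mammal species ?
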
See the `README <https://github.com/pytorch/fairseq#pre-trained-models>`__ for a
full list of pre-trained models available.

Training a New Model
====================

The following tutorial is for machine translation. For an example of how
to use Fairseq for other tasks, such as :ref:`language modeling`, please see the
``examples/`` directory.

Data Pre-processing
-------------------

Fairseq contains example pre-processing scripts for several translation
datasets: IWSLT 2014 (German-English), WMT 2014 (English-French) and WMT
2014 (English-German). To pre-process and binarize the IWSLT dataset:

.. code-block:: console

    > cd examples/translation/
    > bash prepare-iwslt14.sh
    > cd ../..
    > TEXT=examples/translation/iwslt14.tokenized.de-en
    > python preprocess.py --source-lang de --target-lang en \
        --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
        --destdir data-bin/iwslt14.tokenized.de-en

This will write binarized data that can be used for model training to
``data-bin/iwslt14.tokenized.de-en``.

Training
--------

Use :ref:`train.py` to train a new model. Here are a few example settings that
work well for the IWSLT 2014 dataset:

.. code-block:: console

    > mkdir -p checkpoints/fconv
    > CUDA_VISIBLE_DEVICES=0 python train.py data-bin/iwslt14.tokenized.de-en \
        --lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 4000 \
        --arch fconv_iwslt_de_en --save-dir checkpoints/fconv

By default, :ref:`train.py` will use all available GPUs on your machine. Use the
``CUDA_VISIBLE_DEVICES`` environment variable to select specific GPUs and/or to
change the number of GPU devices that will be used.

Also note that the batch size is specified in terms of the maximum
number of tokens per batch (``--max-tokens``). You may need to use a
smaller value depending on the available GPU memory on your system.
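
For example, restricting training to the first two GPUs and halving the
per-batch token budget could look like this (the values are illustrative,
not a tuned configuration):

.. code-block:: console

    > CUDA_VISIBLE_DEVICES=0,1 python train.py data-bin/iwslt14.tokenized.de-en \
        --lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 2000 \
        --arch fconv_iwslt_de_en --save-dir checkpoints/fconv
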
Generation
----------

Once your model is trained, you can generate translations using
:ref:`generate.py` **(for binarized data)** or
:ref:`interactive.py` **(for raw text)**:

.. code-block:: console

    > python generate.py data-bin/iwslt14.tokenized.de-en \
        --path checkpoints/fconv/checkpoint_best.pt \
        --batch-size 128 --beam 5
    | [de] dictionary: 35475 types
    | [en] dictionary: 24739 types
    | data-bin/iwslt14.tokenized.de-en test 6750 examples
    | model fconv
    | loaded checkpoint checkpoints/fconv/checkpoint_best.pt
    S-721   danke .
    T-721   thank you .
    ...

To generate translations with only a CPU, use the ``--cpu`` flag. BPE
continuation markers can be removed with the ``--remove-bpe`` flag.
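
As a sketch, decoding the same test set on the CPU with the BPE markers
stripped could be done as follows (the smaller batch size is an assumption
to keep memory use modest; CPU decoding will be much slower):

.. code-block:: console

    > python generate.py data-bin/iwslt14.tokenized.de-en \
        --path checkpoints/fconv/checkpoint_best.pt \
        --batch-size 32 --beam 5 --cpu --remove-bpe
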
Advanced Training Options
=========================

Large mini-batch training with delayed updates
----------------------------------------------

The ``--update-freq`` option can be used to accumulate gradients from
multiple mini-batches and delay updating, creating a larger effective
batch size. Delayed updates can also improve training speed by reducing
inter-GPU communication costs and by saving idle time caused by variance
in workload across GPUs. See `Ott et al.
(2018) <https://arxiv.org/abs/1806.00187>`__ for more details.

To train on a single GPU with an effective batch size that is equivalent
to training on 8 GPUs:

.. code-block:: console

    > CUDA_VISIBLE_DEVICES=0 python train.py --update-freq 8 (...)

Training with half precision floating point (FP16)
--------------------------------------------------

.. note::

    FP16 training requires a Volta GPU and CUDA 9.1 or greater

Recent GPUs enable efficient half precision floating point computation,
e.g., using `Nvidia Tensor Cores
<https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html>`__.
Fairseq supports FP16 training with the ``--fp16`` flag:

.. code-block:: console

    > python train.py --fp16 (...)

Lazily loading large training datasets
--------------------------------------

By default fairseq loads the entire training set into system memory. For large
datasets, the ``--lazy-load`` option can be used to instead load batches on-demand.
For optimal performance, use the ``--num-workers`` option to control the number
of background processes that will load batches.
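
A minimal sketch combining both flags with the IWSLT training command from
above (the worker count of 4 is an assumption; tune it for your machine):

.. code-block:: console

    > CUDA_VISIBLE_DEVICES=0 python train.py data-bin/iwslt14.tokenized.de-en \
        --lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 4000 \
        --arch fconv_iwslt_de_en --save-dir checkpoints/fconv \
        --lazy-load --num-workers 4
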
Distributed training
--------------------

Distributed training in fairseq is implemented on top of ``torch.distributed``.
The easiest way to launch jobs is with the `torch.distributed.launch
<https://pytorch.org/docs/stable/distributed.html#launch-utility>`__ tool.

For example, to train a large English-German Transformer model on 2 nodes each
with 8 GPUs (in total 16 GPUs), run the following command on each node,
replacing ``node_rank=0`` with ``node_rank=1`` on the second node:

.. code-block:: console

    > python -m torch.distributed.launch --nproc_per_node=8 \
        --nnodes=2 --node_rank=0 --master_addr="192.168.1.1" \
        --master_port=1234 \
        train.py data-bin/wmt16_en_de_bpe32k \
        --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
        --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
        --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
        --lr 0.0005 --min-lr 1e-09 \
        --dropout 0.3 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
        --max-tokens 3584 \
        --fp16