
Porting fairseq wmt19 translation system to transformers

Published November 3, 2020

A guest blog post by Stas Bekman



This article is an attempt to document how the fairseq wmt19 translation system was ported to transformers.

I was looking for an interesting project to work on and Sam Shleifer suggested I work on porting a high-quality translator.

I read the short paper Facebook FAIR's WMT19 News Translation Task Submission, which describes the original system, and decided to give it a try.

Initially, I had no idea how to approach this complex project and Sam helped me to break it down into smaller tasks, which was of great help.

I chose to work with the pre-trained en-ru / ru-en models during porting as I speak both languages. It'd have been much more difficult to work with de-en / en-de pairs as I don't speak German, and being able to evaluate the translation quality by just reading and making sense of the outputs at the advanced stages of the porting process saved me a lot of time.

Also, as I did the initial porting with the en-ru / ru-en models, I was totally unaware that the de-en / en-de models used a merged vocabulary, whereas the former used 2 separate vocabularies of different sizes. So once I did the more complicated work of supporting 2 separate vocabularies, it was trivial to get the merged vocabulary to work.

Let's cheat

The first step was to cheat, of course. Why make a big effort when one can make a little one? So I wrote a short notebook that in a few lines of code provided a proxy to fairseq and emulated the transformers API.
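
Here is a rough sketch of what such a proxy could look like (just an illustration of the idea, not the actual notebook code):

import torch

class FairseqProxyTranslator:
    # illustrative sketch only: wrap a fairseq hub model behind a tiny
    # transformers-like facade, delegating all the real work to fairseq
    def __init__(self, hub_name='transformer.wmt19.en-ru', checkpoint='model4.pt'):
        self.model = torch.hub.load('pytorch/fairseq', hub_name,
                                    checkpoint_file=checkpoint,
                                    tokenizer='moses', bpe='fastbpe')

    def generate(self, text):
        # fairseq's hub interface handles tokenization, beam search and detokenization
        return self.model.translate(text)

# translator = FairseqProxyTranslator()
# print(translator.generate("Machine learning is great"))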

If nothing but basic translation had been required, this would have been enough. But, of course, we wanted the full porting, so after this small victory I moved on to much harder things.

Preparations

For the sake of this article let's assume that we work under ~/porting, and therefore let's create this directory:

mkdir ~/porting
cd ~/porting

We need to install a few things for this work:

# install fairseq
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install -e .
# install mosesdecoder under fairseq
git clone https://github.com/moses-smt/mosesdecoder
# install fastBPE under fairseq
git clone git@github.com:glample/fastBPE.git
cd fastBPE; g++ -std=c++11 -pthread -O3 fastBPE/main.cc -IfastBPE -o fast; cd -
cd -

# install transformers
git clone https://github.com/huggingface/transformers/
cd transformers
pip install -e .[dev]

Files

As a quick overview, the following files needed to be created and written:

  • src/transformers/configuration_fsmt.py - a short configuration class.
  • src/transformers/convert_fsmt_original_pytorch_checkpoint_to_pytorch.py - a complex conversion script.
  • src/transformers/modeling_fsmt.py - this is where the model architecture is implemented.
  • src/transformers/tokenization_fsmt.py - the tokenizer code.
  • tests/test_modeling_fsmt.py - model tests.
  • tests/test_tokenization_fsmt.py - tokenizer tests.
  • docs/source/model_doc/fsmt.rst - a doc file.

There are other files that needed to be modified as well; we will talk about those towards the end.

Conversion

One of the most important parts of the porting process is to create a script that takes all the available source data provided by the original developer of the model (a checkpoint with pre-trained weights, the model and training configuration, dictionaries and tokenizer support files) and converts it into a new set of model files supported by transformers. You will find the final conversion script here: src/transformers/convert_fsmt_original_pytorch_checkpoint_to_pytorch.py

I started this process by copying one of the existing conversion scripts, src/transformers/convert_bart_original_pytorch_checkpoint_to_pytorch.py, gutting most of it out and then gradually adding parts to it as I progressed in the porting process.

During the development I was testing all my code against a local copy of the converted model files, and only at the very end, when everything was ready, did I upload the files to 🤗 s3 and then continue testing against the online version.

fairseq model and its support files

Let's first look at what data we get with the fairseq pre-trained model.

We are going to use the convenient torch.hub API, which makes it very easy to deploy models submitted to that hub:

import torch
torch.hub.load('pytorch/fairseq', 'transformer.wmt19.en-ru', checkpoint_file='model4.pt', 
               tokenizer='moses', bpe='fastbpe')

This code downloads the pre-trained model and its support files. I found this information at the page corresponding to fairseq on the pytorch hub.

To see what's inside the downloaded files, we have to first hunt down the right folder under ~/.cache .

ls -1 ~/.cache/torch/hub/pytorch_fairseq/

shows:

15bca559d0277eb5c17149cc7e808459c6e307e5dfbb296d0cf1cfe89bb665d7.ded47c1b3054e7b2d78c0b86297f36a170b7d2e7980d8c29003634eb58d973d9
15bca559d0277eb5c17149cc7e808459c6e307e5dfbb296d0cf1cfe89bb665d7.ded47c1b3054e7b2d78c0b86297f36a170b7d2e7980d8c29003634eb58d973d9.json

You may have more than one entry there if you have been using the hub for other models.

Let's make a symlink so that we can easily refer to that obscure cache folder name down the road:

ln -s ~/.cache/torch/hub/pytorch_fairseq/15bca559d0277eb5c17149cc7e808459c6e307e5dfbb296d0cf1cfe89bb665d7.ded47c1b3054e7b2d78c0b86297f36a170b7d2e7980d8c29003634eb58d973d9 \
~/porting/pytorch_fairseq_model

Note: the path could be different when you try it yourself, since the hash value of the model could change. You will find the right one in ~/.cache/torch/hub/pytorch_fairseq/

If we look inside that folder:

ls -l ~/porting/pytorch_fairseq_model/
total 13646584
-rw-rw-r-- 1 stas stas     532048 Sep  8 21:29 bpecodes
-rw-rw-r-- 1 stas stas     351706 Sep  8 21:29 dict.en.txt
-rw-rw-r-- 1 stas stas     515506 Sep  8 21:29 dict.ru.txt
-rw-rw-r-- 1 stas stas 3493170533 Sep  8 21:28 model1.pt
-rw-rw-r-- 1 stas stas 3493170532 Sep  8 21:28 model2.pt
-rw-rw-r-- 1 stas stas 3493170374 Sep  8 21:28 model3.pt
-rw-rw-r-- 1 stas stas 3493170386 Sep  8 21:29 model4.pt

we have:

  1. model*.pt - 4 checkpoints (pytorch state_dict with all the pre-trained weights, and various other things)
  2. dict.*.txt - source and target dictionaries
  3. bpecodes - special map file used by the tokenizer

We are going to investigate each of these files in the following sections.

How translation systems work

Here is a very brief introduction to how computers translate text nowadays.

Computers can't read text; they can only handle numbers. So when working with text we have to map one or more letters into numbers, and hand those to a computer program. When the program completes, it too returns numbers, which we need to convert back into text.

Let's start with two sentences in Russian and English and assign a unique number to each word:

я  люблю следовательно я  существую
10 11    12            10 13

I  love therefore I  am
20 21   22        20 23

The numbers starting with 10 map Russian words to unique numbers. The numbers starting with 20 do the same for English words. If you don't speak Russian, you can still see that the word я (meaning 'I') repeats twice in the sentence and gets the same number 10 associated with it. The same goes for I (20), which also repeats twice.

A translation system works in the following stages:

1. [я люблю следовательно я существую] # tokenize sentence into words
2. [10 11 12 10 13]                    # look up words in the input dictionary and convert to ids
3. [black box]                         # machine learning system magic
4. [20 21 22 20 23]                    # look up numbers in the output dictionary and convert to text
5. [I love therefore I am]             # detokenize the tokens back into a sentence

If we combine the first two and the last two steps we get 3 stages:

  1. Encode input: break input text into tokens, create a dictionary (vocab) of these tokens and remap each token into a unique id in that dictionary.
  2. Generate translation: take input numbers, run them through a pre-trained machine learning model which predicts the best translation, and return output numbers.
  3. Decode output: take output numbers, look them up in the target language dictionary, convert them back to text, and finally merge the converted tokens into the translated sentence.
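
To make these three stages concrete, here is a toy sketch (my own illustration, reusing the made-up dictionaries from above) of the whole round trip:

# toy dictionaries from the example above; a real system learns these
src_vocab = {"я": 10, "люблю": 11, "следовательно": 12, "существую": 13}
tgt_vocab = {20: "I", 21: "love", 22: "therefore", 23: "am"}

def encode(sentence):
    # stage 1: tokenize and look up each token's id in the input dictionary
    return [src_vocab[word] for word in sentence.split()]

def translate(input_ids):
    # stage 2: stand-in for the machine learning "black box"
    magic = {10: 20, 11: 21, 12: 22, 13: 23}
    return [magic[i] for i in input_ids]

def decode(output_ids):
    # stage 3: look up ids in the output dictionary and detokenize
    return " ".join(tgt_vocab[i] for i in output_ids)

ids = encode("я люблю следовательно я существую")  # [10, 11, 12, 10, 13]
print(decode(translate(ids)))                       # I love therefore I am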

The second stage may return one or several possible translations. In the latter case, the caller can then choose the most suitable outcome. In this article I will refer to the beam search algorithm, which is one of the ways multiple possible results are searched for, and the size of the beam refers to how many results are returned.

If only one result is requested, the model will choose the one with the highest probability. If multiple results are requested, it will return those results sorted by their probabilities.

Note that this same idea applies to the majority of NLP tasks, and not just translation.

Tokenization

Early systems tokenized sentences into words and punctuation marks. But since many languages have hundreds of thousands of words, it is very taxing to work with huge vocabularies, as it dramatically increases the compute resource requirements and the length of time to complete the task.

As of 2020 there are quite a few different tokenizing methods, but most of the recent ones are based on sub-word tokenization - that is, instead of breaking the input text down into words, these modern tokenizers break it down into word segments and letters, using some kind of training to obtain an optimal tokenization.

Let's see how this approach helps to reduce memory and computation requirements. If we have an input vocabulary of 6 common words: go, going, speak, speaking, sleep, sleeping - with word-level tokenization we end up with 6 tokens. However, if we break these down into: go, go-ing, speak, speak-ing, etc., then we have only 4 tokens in our vocabulary: go, speak, sleep, ing. This simple change made a 33% improvement! Except that sub-word tokenizers don't use grammar rules; they are trained on massive text inputs to find such splits. In this example I used a simple grammar rule as it's easy to understand.

Another important advantage of this approach is when dealing with input text words that aren't in our vocabulary. For example, let's say our system encounters the word grokking (*), which can't be found in its vocabulary. If we split it into 'grokk' + 'ing', then the machine learning model might not know what to do with the first part of the word, but it gets a useful insight that 'ing' indicates a continuous tense, so it'll be able to produce a better translation. In such a situation the tokenizer will split the unknown segments into segments it knows, in the worst case reducing them to individual letters.

  • footnote: to grok was coined in 1961 by Robert A. Heinlein in "Stranger in a Strange Land": to understand (something) intuitively or by empathy.

There are many other nuances to why the modern tokenization approach is superior to simple word tokenization, which won't be covered in the scope of this article. Most of these systems are much more complex in how they do the tokenization than the simple example of splitting off ing endings that was just demonstrated, but the principle is similar.

Tokenizer porting

The first step was to port the encoder part of the tokenizer, where text is converted to ids. The decoder part won't be needed until the very end.

fairseq's tokenizer workings

Let's understand how fairseq's tokenizer works.

fairseq (*) uses the Byte Pair Encoding (BPE) algorithm for tokenization.

  • footnote: from here on when I refer to fairseq, I refer to this specific model implementation - the fairseq project itself has dozens of different implementations of different models.

Let's see what BPE does:

import torch
sentence = "Machine Learning is great"
checkpoint_file='model4.pt'
model = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.en-ru', checkpoint_file=checkpoint_file, tokenizer='moses', bpe='fastbpe')

# encode step by step
tokens = model.tokenize(sentence)
print("tokenize ", tokens)

bpe = model.apply_bpe(tokens)
print("apply_bpe: ", bpe)

bin = model.binarize(bpe)
print("binarize: ", len(bin), bin)

# compare to model.encode - should give us the same output
expected = model.encode(sentence)
print("encode:   ", len(expected), expected)

gives us:

('tokenize ', 'Machine Learning is great')
('apply_bpe: ', 'Mach@@ ine Lear@@ ning is great')
('binarize: ', 7, tensor([10217,  1419,     3,  2515,    21,  1054,     2]))
('encode:   ', 7, tensor([10217,  1419,     3,  2515,    21,  1054,     2]))

You can see that model.encode does tokenize+apply_bpe+binarize - as we get the same output.

The steps were:

  1. tokenize: normally it'd escape apostrophes and do other pre-processing; in this example it just returned the input sentence without any changes
  2. apply_bpe: BPE splits the input into words and sub-words according to its bpecodes file supplied by the tokenizer - we get 6 BPE chunks
  3. binarize: this simply remaps the BPE chunks from the previous step into their corresponding ids in the vocabulary (which is also downloaded with the model)

You can refer to this notebook to see more details.

This is a good time to look inside the bpecodes file. Here is the top of the file:

$ head -15 ~/porting/pytorch_fairseq_model/bpecodes
e n</w> 1423551864
e r 1300703664
e r</w> 1142368899
i n 1130674201
c h 933581741
a n 845658658
t h 811639783
e n 780050874
u n 661783167
s t 592856434
e i 579569900
a r 494774817
a l 444331573
o r 439176406
th e</w> 432025210
[...]

The top entries of this file include very frequent short 1-letter sequences. As we will see in a moment the bottom includes the most common multi-letter sub-words and even full long words.

A special token </w> indicates the end of the word. So in several lines quoted above we find:

e n</w> 1423551864
e r</w> 1142368899
th e</w> 432025210

If the second column doesn't include </w>, it means that this segment is found in the middle of the word and not at the end of it.

The last column declares the number of times this BPE code has been encountered while being trained. The bpecodes file is sorted by this column - so the most common BPE codes are on top.

By looking at the counts we now know that when this tokenizer was trained it encountered 1,423,551,864 words ending in en , 1,142,368,899 words ending in er and 432,025,210 words ending in the . For the latter it most likely means the actual word the , but it would also include words like lathe , loathe , tithe , etc.

These huge numbers also indicate to us that this tokenizer was trained on an enormous amount of text!

If we look at the bottom of the same file:

$ tail -10 ~/porting/pytorch_fairseq_model/bpecodes
4 x 109019
F ische</w> 109018
sal aries</w> 109012
e kt 108978
ver gewal 108978
Sten cils</w> 108977
Freiwilli ge</w> 108969
doub les</w> 108965
po ckets</w> 108953
Gö tz</w> 108943

we see complex combinations of sub-words which are still pretty frequent, e.g. sal aries occurred 109,012 times! So it got its own dedicated entry in the bpecodes map file.

How does apply_bpe do its work? By looking up the various combinations of letters in the bpecodes map file and using the longest fitting entry it finds.
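
Here is a minimal sketch (my own simplification, not fairseq's fastBPE code) of how BPE merges are typically applied to a single word. It assumes a merges table that maps a pair of symbols to a priority, where pairs that appear higher up in bpecodes get merged first:

def apply_bpe_to_word(word, merges):
    # start from individual characters, marking the end of the word with </w>
    symbols = list(word[:-1]) + [word[-1] + "</w>"]
    while len(symbols) > 1:
        # find the adjacent pair with the best (lowest) merge rank
        ranked = [(merges.get(pair, float("inf")), i)
                  for i, pair in enumerate(zip(symbols, symbols[1:]))]
        rank, i = min(ranked)
        if rank == float("inf"):  # no more known merges apply
            break
        symbols = symbols[:i] + [symbols[i] + symbols[i + 1]] + symbols[i + 2:]
    return symbols

# toy merges table in the spirit of the bpecodes entries shown above
merges = {("i", "n"): 0, ("in", "g</w>"): 1, ("n", "ing</w>"): 2,
          ("l", "e"): 3, ("le", "a"): 4, ("lea", "r"): 5}
print(apply_bpe_to_word("learning", merges))  # ['lear', 'ning</w>']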

Going back to our example, we saw that it split Machine into: Mach@@ + ine - let's check:

$ grep -i ^mach  ~/porting/pytorch_fairseq_model/bpecodes
mach ine</w> 463985
Mach t 376252
Mach ines</w> 374223
mach ines</w> 214050
Mach th 119438

You can see that it has mach ine</w>. We don't see Mach ine in there - so it must be handling lower-cased lookups when the normal case isn't matching.

Now let's check: Lear@@ + ning

$ grep -i ^lear  ~/porting/pytorch_fairseq_model/bpecodes
lear n</w> 675290
lear ned</w> 505087
lear ning</w> 417623

We find lear ning</w> is there (again the case is not the same).

Thinking more about it, the case probably doesn't matter much for tokenization, as long as there is a unique entry for Mach / Lear and mach / lear in the dictionary, where it is critical to have each case covered.

Hopefully, you can now see how this works.

One confusing thing is that if you remember the apply_bpe output was:

('apply_bpe: ', 6, ['Mach@@', 'ine', 'Lear@@', 'ning', 'is', 'great'])

Instead of marking the endings of words with </w>, it leaves those as is and instead marks segments that are not word endings with @@. This is probably because fairseq uses the fastBPE implementation and that's how fastBPE does things. I had to change this to fit the transformers implementation, which doesn't use fastBPE.

One last thing to check is the remapping of the BPE codes to vocabulary ids. To repeat, we had:

('apply_bpe: ', 'Mach@@ ine Lear@@ ning is great')
('binarize: ', 7, tensor([10217,  1419,     3,  2515,    21,  1054,     2]))

2 - the last token id is an eos (end of stream) token. It's used to indicate to the model the end of the input.

And then Mach@@ gets remapped to 10217, and ine to 1419.

Let's check that the dictionary file is in agreement:

$ grep ^Mach@@ ~/porting/pytorch_fairseq_model/dict.en.txt
Mach@@ 6410
$ grep "^ine " ~/porting/pytorch_fairseq_model/dict.en.txt
ine 88376

Wait a second - those aren't the ids that we got after binarize, which should be 10217 and 1419 respectively.

It took some investigating to find out that the vocab file ids aren't the ids used by the model and that internally it remaps them to new ids once the vocab file is loaded. Luckily, I didn't need to figure out how exactly it was done. Instead, I just used fairseq.data.dictionary.Dictionary.load to load the dictionary (*), which performed all the re-mappings, and I then saved the final dictionary. I found out about that Dictionary class by stepping through the fairseq code with a debugger.

  • footnote: the more I work on porting models and datasets, the more I realize that putting the original code to work for me, rather than trying to replicate it, is a huge time saver and, most importantly, that code has already been tested - it's too easy to miss something and down the road discover big problems! After all, in the end, none of this conversion code will matter, since only the data it generated will be used by transformers and its end users.

Here is the relevant part of the conversion script:

import json
import os
import re

from fairseq.data.dictionary import Dictionary

def rewrite_dict_keys(d):
    # (1) remove word breaking symbol
    # (2) add word ending symbol where the word is not broken up,
    # e.g.: d = {'le@@': 5, 'tt@@': 6, 'er': 7} => {'le': 5, 'tt': 6, 'er</w>': 7}
    d2 = dict((re.sub(r"@@$", "", k), v) if k.endswith("@@") else (re.sub(r"$", "</w>", k), v) for k, v in d.items())
    keep_keys = "<s> <pad> </s> <unk>".split()
    # restore the special tokens
    for k in keep_keys:
        del d2[f"{k}</w>"]
        d2[k] = d[k]  # restore
    return d2

src_dict_file = os.path.join(fsmt_folder_path, f"dict.{src_lang}.txt")
src_dict = Dictionary.load(src_dict_file)
src_vocab = rewrite_dict_keys(src_dict.indices)
src_vocab_size = len(src_vocab)
src_vocab_file = os.path.join(pytorch_dump_folder_path, "vocab-src.json")
print(f"Generating {src_vocab_file}")
with open(src_vocab_file, "w", encoding="utf-8") as f:
    f.write(json.dumps(src_vocab, ensure_ascii=False, indent=json_indent))
# we did the same for the target dict - omitted quoting it here
# and we also had to save `bpecodes`, it's called `merges.txt` in the transformers land

After running the conversion script, let's check the converted dictionary:

$ grep '"Mach"' /code/huggingface/transformers-fair-wmt/data/wmt19-en-ru/vocab-src.json
  "Mach": 10217,
$ grep '"ine</w>":' /code/huggingface/transformers-fair-wmt/data/wmt19-en-ru/vocab-src.json
  "ine</w>": 1419,

We have the correct ids in the transformers version of the vocab file.

As you can see I also had to re-write the vocabularies to match the transformers BPE implementation. We have to change:

['Mach@@', 'ine', 'Lear@@', 'ning', 'is', 'great']

to:

['Mach', 'ine</w>', 'Lear', 'ning</w>', 'is</w>', 'great</w>']

Instead of marking the chunks that are non-final segments of a word, we now mark the chunks (or whole words) that are the final segment of a word. One can easily go from one style of encoding to the other and back.
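
Here is a quick sketch (my own illustration, not the actual library code) of converting between the two styles:

def fairseq_to_transformers(tokens):
    # '@@' marks a non-final segment; its absence means the end of a word,
    # which the transformers style marks explicitly with '</w>'
    return [t[:-2] if t.endswith("@@") else t + "</w>" for t in tokens]

def transformers_to_fairseq(tokens):
    return [t[:-4] if t.endswith("</w>") else t + "@@" for t in tokens]

print(fairseq_to_transformers(['Mach@@', 'ine', 'Lear@@', 'ning', 'is', 'great']))
# ['Mach', 'ine</w>', 'Lear', 'ning</w>', 'is</w>', 'great</w>']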

This successfully completed the porting of the first part of the model files. You can see the final version of the code here.

If you're curious to look deeper, there are more tinkering bits in this notebook.

Porting tokenizer's encoder to transformers

transformers can't rely on fastBPE since the latter requires a C compiler, but luckily someone had already implemented a Python version of the same in tokenization_xlm.py.

So I just copied it to src/transformers/tokenization_fsmt.py and renamed the class names:

cp tokenization_xlm.py tokenization_fsmt.py
perl -pi -e 's|XLM|FSMT|g; s|xlm|fsmt|g;' tokenization_fsmt.py

and with very few changes I had a working encoder part of the tokenizer. There was a lot of code that didn't apply to the languages I needed to support, so I removed that code.

Since I needed 2 different vocabularies, instead of just one as in that tokenizer (and pretty much everywhere else in transformers), I had to change the code to support both. So for example I had to override the super-class' methods:

    def get_vocab(self) -> Dict[str, int]:
        return self.get_src_vocab()

    @property
    def vocab_size(self) -> int:
        return self.src_vocab_size
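
With the ported tokenizer, the two vocabularies end up exposed side by side. For example (assuming the released facebook/wmt19-en-ru checkpoint, and a tgt_vocab_size property mirroring the src_vocab_size one shown above):

from transformers import FSMTTokenizer

tokenizer = FSMTTokenizer.from_pretrained("facebook/wmt19-en-ru")
# the source (English) and target (Russian) vocabularies have different sizes
print(tokenizer.src_vocab_size, tokenizer.tgt_vocab_size)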

Since fairseq didn't use bos (beginning of stream) tokens, I also had to change the code to not include those (*):

-            return bos + token_ids_0 + sep
-        return bos + token_ids_0 + sep + token_ids_1 + sep
+            return token_ids_0 + sep
+        return token_ids_0 + sep + token_ids_1 + sep
  • footnote: this is the output of diff(1) which shows the difference between two chunks of code - lines starting with - show what was removed, and with + what was added.

fairseq was also escaping characters and performing an aggressive dash splitting, so I had to also change:

-        [...].tokenize(text, return_str=False, escape=False)
+        [...].tokenize(text, return_str=False, escape=True, aggressive_dash_splits=True)

If you're following along and would like to see all the changes I made to the original tokenization_xlm.py, you can do:

cp tokenization_xlm.py tokenization_orig.py
perl -pi -e 's|XLM|FSMT|g; s|xlm|fsmt|g;' tokenization_orig.py
diff -u tokenization_orig.py tokenization_fsmt.py  | less

Just make sure you're checking out the repository around the time fsmt was released, since the 2 files could have diverged since then.

The final stage was to run through a bunch of inputs and to ensure that the ported tokenizer produced the same ids as the original. You can see this done in this notebook, which I was running repeatedly while trying to figure out how to make the outputs match.

This is how most of the porting process went: I'd take a small feature, run it the fairseq way, get the outputs, do the same with the transformers code and try to make the outputs match - fiddling with the code until it did. Then I'd try a different kind of input, make sure it produced the same outputs, and so on, until all inputs produced outputs that matched.
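
Here is a sketch of the kind of parity check involved, for the tokenizer side (my own condensed illustration using the public APIs shown earlier; the real testing ran through a whole bunch of inputs):

import torch
from transformers import FSMTTokenizer

# the original fairseq model (downloads the checkpoint if not already cached)
fairseq_model = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.en-ru',
                               checkpoint_file='model4.pt',
                               tokenizer='moses', bpe='fastbpe')
# the ported tokenizer
fsmt_tokenizer = FSMTTokenizer.from_pretrained("facebook/wmt19-en-ru")

sentence = "Machine Learning is great"
fairseq_ids = fairseq_model.encode(sentence).tolist()
fsmt_ids = fsmt_tokenizer.encode(sentence)
assert fairseq_ids == fsmt_ids, f"mismatch: {fairseq_ids} vs {fsmt_ids}"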

Porting the core translation functionality

Having had a relatively quick success with porting the tokenizer (obviously, thanks to most of the code being there already), the next stage was much more complex. This is the generate() function, which takes input ids, runs them through the model and returns output ids.
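
To give a sense of where this is heading, here is roughly what using the fully ported model looks like once generate() works (a usage sketch assuming the released facebook/wmt19-en-ru checkpoint):

from transformers import FSMTForConditionalGeneration, FSMTTokenizer

mname = "facebook/wmt19-en-ru"
tokenizer = FSMTTokenizer.from_pretrained(mname)
model = FSMTForConditionalGeneration.from_pretrained(mname)

sentence = "Machine learning is great, isn't it?"
input_ids = tokenizer.encode(sentence, return_tensors="pt")
outputs = model.generate(input_ids)
decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded)  # e.g.: Машинное обучение - это здорово, не так ли?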
