Machine Translation with Transformer

In this notebook, we will show how to train the Transformer introduced in [1] and evaluate the pretrained model using GluonNLP. The model is both more accurate and lighter to train than previous seq2seq models. Together, we will go through:

  1. Use the state-of-the-art pretrained Transformer model: we will evaluate the pretrained SOTA Transformer model and translate a few sentences ourselves with the BeamSearchTranslator;

  2. Train the Transformer yourself: this includes loading and processing the dataset, defining the Transformer model, writing the training script, and evaluating the trained model. Note that obtaining state-of-the-art results on the WMT 2014 English-German dataset takes around one day of training. To let you run through the notebook quickly, we suggest starting with the TOY dataset sampled from the WMT dataset (the default in this notebook).

Preparation

Load MXNet and GluonNLP

import warnings
warnings.filterwarnings('ignore')

import random
import numpy as np
import mxnet as mx
from mxnet import gluon
import gluonnlp as nlp

Set Environment

np.random.seed(100)
random.seed(100)
mx.random.seed(10000)
ctx = mx.gpu(0)

Use the SOTA Pretrained Transformer model

In this subsection, we first load the SOTA Transformer model from the GluonNLP model zoo, then load the full WMT 2014 English-German test dataset, and finally evaluate the model.

Get the SOTA Transformer

Next, we load the pretrained SOTA Transformer using the model API in GluonNLP. In this way, we can easily access the SOTA machine translation model and use it in our own applications.

import nmt

wmt_model_name = 'transformer_en_de_512'

wmt_transformer_model, wmt_src_vocab, wmt_tgt_vocab = \
    nmt.transformer.get_model(wmt_model_name,
                              dataset_name='WMT2014',
                              pretrained=True,
                              ctx=ctx)

print(wmt_src_vocab)
print(wmt_tgt_vocab)
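
As a quick check, the vocabularies behave like standard GluonNLP Vocab objects, mapping tokens to integer ids and back (illustrative only; out-of-vocabulary tokens map to the unknown token):

# Map tokens to ids and back; tokens missing from the BPE vocabulary map to unk.
sample_ids = wmt_src_vocab[['hello', 'world']]
print(sample_ids)
print(wmt_src_vocab.to_tokens(sample_ids))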

The Transformer model architecture is shown below:

[Figure: Transformer model architecture]

print(wmt_transformer_model)

Load and Preprocess WMT 2014 Dataset

We then load the WMT 2014 English-German test dataset for evaluation purposes.

The following shows how to process the dataset and cache the processed dataset for future use. The processing steps include (a minimal sketch follows the list):

    1. clip the source and target sequences;

    2. split the string input into a list of tokens;

    3. map each string token to its index in the vocabulary;

    4. append the EOS token to the source sentence and add BOS and EOS tokens to the target sentence.
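
For intuition, here is a minimal sketch of these four steps on whitespace-tokenized text. It is illustrative only: the helper name preprocess_pair and the max_len value are made up for this example, and the notebook's real pipeline (dataprocessor.TrainValDataTransform, used below) additionally handles BPE tokenization.

def preprocess_pair(src_str, tgt_str, src_vocab, tgt_vocab, max_len=50):
    # 1. clip the source and target sequences
    src_tokens = src_str.split()[:max_len]
    tgt_tokens = tgt_str.split()[:max_len]
    # 2.-3. split into tokens (done above) and map each token to its vocabulary index
    src_ids = src_vocab[src_tokens]
    tgt_ids = tgt_vocab[tgt_tokens]
    # 4. append EOS to the source; surround the target with BOS and EOS
    src_ids = src_ids + [src_vocab[src_vocab.eos_token]]
    tgt_ids = [tgt_vocab[tgt_vocab.bos_token]] + tgt_ids + [tgt_vocab[tgt_vocab.eos_token]]
    return src_ids, tgt_ids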

Let’s first look at the WMT 2014 corpus.

import hyperparameters as hparams

wmt_data_test = nlp.data.WMT2014BPE('newstest2014',
                                    src_lang=hparams.src_lang,
                                    tgt_lang=hparams.tgt_lang,
                                    full=False)
print('Source language %s, Target language %s' % (hparams.src_lang, hparams.tgt_lang))

wmt_data_test[0]
wmt_test_text = nlp.data.WMT2014('newstest2014',
                                 src_lang=hparams.src_lang,
                                 tgt_lang=hparams.tgt_lang,
                                 full=False)
wmt_test_text[0]

We then generate the target gold translations.

wmt_test_tgt_sentences = list(wmt_test_text.transform(lambda src, tgt: tgt))
wmt_test_tgt_sentences[0]
import dataprocessor

print(dataprocessor.TrainValDataTransform.__doc__)
wmt_transform_fn = dataprocessor.TrainValDataTransform(wmt_src_vocab, wmt_tgt_vocab, -1, -1)
wmt_dataset_processed = wmt_data_test.transform(wmt_transform_fn, lazy=False)
print(*wmt_dataset_processed[0], sep='\n')

Create Sampler and DataLoader for WMT 2014 Dataset

wmt_data_test_with_len = gluon.data.SimpleDataset(
    [(ele[0], ele[1], len(ele[0]), len(ele[1]), i)
     for i, ele in enumerate(wmt_dataset_processed)])

Now that we have processed the test dataset, the next step is to construct the sampler and DataLoader. We first construct the batchify function, which pads and stacks sequences to form mini-batches.

wmt_test_batchify_fn = nlp.data.batchify.Tuple(
    nlp.data.batchify.Pad(),
    nlp.data.batchify.Pad(),
    nlp.data.batchify.Stack(dtype='float32'),
    nlp.data.batchify.Stack(dtype='float32'),
    nlp.data.batchify.Stack())
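
To make the batchify function concrete, here is a toy example that is independent of the WMT data (illustrative only):

# Pad pads variable-length samples to a common length; Stack stacks
# equal-shape samples into a single array.
toy_batchify_fn = nlp.data.batchify.Tuple(nlp.data.batchify.Pad(pad_val=0),
                                          nlp.data.batchify.Stack())
toy_samples = [([1, 2, 3], 0), ([4, 5], 1)]
padded, stacked = toy_batchify_fn(toy_samples)
print(padded)   # shape (2, 3); the second row is padded with 0
print(stacked)  # shape (2,)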

We can then construct a bucketing sampler, which generates batches by grouping sequences with similar lengths.

wmt_bucket_scheme = nlp.data.ExpWidthBucket(bucket_len_step=1.2)
wmt_test_batch_sampler = nlp.data.FixedBucketSampler(
    lengths=wmt_dataset_processed.transform(lambda src, tgt: len(tgt)),
    use_average_length=True,
    bucket_scheme=wmt_bucket_scheme,
    batch_size=256)
print(wmt_test_batch_sampler.stats())
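
Each element yielded by the sampler is a list of dataset indices drawn from the same bucket. A quick peek (illustrative; the batch composition depends on the bucket assignment):

# With use_average_length=True, the number of samples per batch varies.
for batch_indices in wmt_test_batch_sampler:
    print('first batch contains %d samples' % len(batch_indices))
    break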

Given the sampler, we can create a DataLoader, which is iterable.

wmt_test_data_loader = gluon.data.DataLoader(
    wmt_data_test_with_len,
    batch_sampler=wmt_test_batch_sampler,
    batchify_fn=wmt_test_batchify_fn,
    num_workers=8)
len(wmt_test_data_loader)
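
As a sanity check, we can iterate over a single mini-batch and inspect its shapes (illustrative; the exact shapes depend on the bucket the batch was drawn from):

# Each batch is a tuple following wmt_test_batchify_fn:
# (padded source ids, padded target ids, source lengths, target lengths, instance ids).
for src_seq, tgt_seq, src_valid_len, tgt_valid_len, inst_ids in wmt_test_data_loader:
    print(src_seq.shape, tgt_seq.shape, src_valid_len.shape, inst_ids.shape)
    break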

Evaluate Transformer

Next, we reproduce the SOTA results on the WMT test dataset. As the result below shows, we achieve the SOTA BLEU score of 27.35.

We first define the BeamSearchTranslator to generate the actual translations.

wmt_translator = nmt.translation.BeamSearchTranslator(
    model=wmt_transformer_model,
    beam_size=hparams.beam_size,
    scorer=nlp.model.BeamSearchScorer(alpha=hparams.lp_alpha, K=hparams.lp_k),
    max_length=200)
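
The scorer applies a GNMT-style length penalty so that longer candidates are not unfairly penalized. A small helper showing the normalization (illustrative only, not the nlp.model.BeamSearchScorer internals):

def length_normalized_score(log_prob, length, alpha=hparams.lp_alpha, K=hparams.lp_k):
    # divide the accumulated log-probability by ((K + length) / (K + 1)) ** alpha
    length_penalty = ((K + length) / (K + 1.0)) ** alpha
    return log_prob / length_penalty

print(length_normalized_score(-10.0, 5), length_normalized_score(-10.0, 20))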

Then we calculate the loss as well as the BLEU score on the WMT 2014 English-German test dataset. Note that the following evaluation process takes approximately 13 minutes to complete.

import time
import utils

eval_start_time = time.time()

wmt_test_loss_function = nmt.loss.SoftmaxCEMaskedLoss()
wmt_test_loss_function.hybridize()

wmt_detokenizer = nlp.data.SacreMosesDetokenizer()

wmt_test_loss, wmt_test_translation_out = utils.evaluate(wmt_transformer_model,
                                                         wmt_test_data_loader,
                                                         wmt_test_loss_function,
                                                         wmt_translator,
                                                         wmt_tgt_vocab,
                                                         wmt_detokenizer,
                                                         ctx)

wmt_test_bleu_score, _, _, _, _ = nmt.bleu.compute_bleu([wmt_test_tgt_sentences],
                                                        wmt_test_translation_out,
                                                        tokenized=False,
                                                        tokenizer=hparams.bleu,
                                                        split_compound_word=False,
                                                        bpe=False)

print('WMT14 EN-DE SOTA model test loss: %.2f; test bleu score: %.2f; time cost %.2fs'
      %(wmt_test_loss, wmt_test_bleu_score * 100, (time.time() - eval_start_time)))
print('Sample translations:')
num_pairs = 3

for i in range(num_pairs):
    print('EN:')
    print(wmt_test_text[i][0])
    print('DE-Candidate:')
    print(wmt_test_translation_out[i])
    print('DE-Reference:')
    print(wmt_test_tgt_sentences[i])
    print('========')

Translation Inference

Here we show an actual translation example (EN-DE), given a source sentence, using the SOTA Transformer model.

import utils

print('Translate the following English sentence into German:')

sample_src_seq = 'We love each other'

print('[\'' + sample_src_seq + '\']')

sample_tgt_seq = utils.translate(wmt_translator,
                                 sample_src_seq,
                                 wmt_src_vocab,
                                 wmt_tgt_vocab,
                                 wmt_detokenizer,
                                 ctx)

print('The German translation is:')
print(sample_tgt_seq)
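
For reference, such a helper roughly performs the following steps. This is a sketch under assumptions, not the actual utils.translate implementation: the name translate_sketch is made up, and it ignores the BPE tokenization that the real pipeline applies.

# Assumed inference steps: tokenize -> map to ids + EOS -> beam search ->
# strip BOS/EOS -> map ids back to tokens -> detokenize.
def translate_sketch(translator, src_str, src_vocab, tgt_vocab, detokenizer, ctx):
    src_tokens = src_str.split()
    src_ids = src_vocab[src_tokens] + [src_vocab[src_vocab.eos_token]]
    src_seq = mx.nd.array([src_ids], ctx=ctx)
    src_valid_length = mx.nd.array([len(src_ids)], ctx=ctx)
    # beam search returns candidate sequences, their scores and valid lengths
    samples, scores, sample_valid_length = translator.translate(src_seq, src_valid_length)
    best = samples[0][0].asnumpy().astype('int32').tolist()
    length = int(sample_valid_length[0][0].asscalar())
    # drop BOS and EOS, map ids back to tokens, then detokenize
    tgt_tokens = tgt_vocab.to_tokens(best[1:length - 1])
    return detokenizer(tgt_tokens, return_str=True)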

Train Your Own Transformer

In this subsection, we will go through the whole process of loading the translation dataset in a more unified way, creating the data sampler and loader, defining the Transformer model, and finally writing the training script to train the model yourself.

Load and Preprocess TOY Dataset

Note that we use demo mode (the TOY dataset) by default, since loading and training on the whole WMT 2014 English-German dataset WMT2014BPE is slow (about one day). If you really want to train to the SOTA result, set demo = False. To make the data processing blocks execute more efficiently, we package them (transform etc.) in the load_translation_data function used below. The function also returns the gold target sentences as well as the vocabularies.

demo = True
if demo:
    dataset = 'TOY'
else:
    dataset = 'WMT2014BPE'

data_train, data_val, data_test, val_tgt_sentences, test_tgt_sentences, src_vocab, tgt_vocab = \
    dataprocessor.load_translation_data(
        dataset=dataset,
        src_lang=hparams.src_lang,
        tgt_lang=hparams.tgt_lang)

data_train_lengths = dataprocessor.get_data_lengths(data_train)
data_val_lengths = dataprocessor.get_data_lengths(data_val)
data_test_lengths = dataprocessor.get_data_lengths(data_test)

data_train = data_train.transform(lambda src, tgt: (src, tgt, len(src), len(tgt)), lazy=False)
data_val = gluon.data.SimpleDataset(
    [(ele[0], ele[1], len(ele[0]), len(ele[1]), i) for i, ele in enumerate(data_val)])
data_test = gluon.data.SimpleDataset(
    [(ele[0], ele[1], len(ele[0]), len(ele[1]), i) for i, ele in enumerate(data_test)])

Create Sampler and DataLoader for TOY Dataset

Now, we have obtained data_train, data_val, and data_test. The next step is to construct the sampler and DataLoader. We first construct the batchify function, which pads and stacks sequences to form mini-batches.

train_batchify_fn = nlp.data.batchify.Tuple(
    nlp.data.batchify.Pad(),
    nlp.data.batchify.Pad(),
    nlp.data.batchify.Stack(dtype='float32'),
    nlp.data.batchify.Stack(dtype='float32'))
test_batchify_fn = nlp.data.batchify.Tuple(
    nlp.data.batchify.Pad(),
    nlp.data.batchify.Pad(),
    nlp.data.batchify.Stack(dtype='float32'),
    nlp.data.batchify.Stack(dtype='float32'),
    nlp.data.batchify.Stack())

target_val_lengths = list(map(lambda x: x[-1], data_val_lengths))
target_test_lengths = list(map(lambda x: x[-1], data_test_lengths))

We can then construct bucketing samplers, which generate batches by grouping sequences with similar lengths.

bucket_scheme = nlp.data.ExpWidthBucket(bucket_len_step=1.2)
train_batch_sampler = nlp.data.FixedBucketSampler(
    lengths=data_train_lengths,
    batch_size=hparams.batch_size,
    num_buckets=hparams.num_buckets,
    ratio=0.0,
    shuffle=True,
    use_average_length=True,
    num_shards=1,
    bucket_scheme=bucket_scheme)
print('Train Batch Sampler:')
print(train_batch_sampler.stats())


val_batch_sampler = nlp.data.FixedBucketSampler(
    lengths=target_val_lengths,
    batch_size=hparams.test_batch_size,
    num_buckets=hparams.num_buckets,
    ratio=0.0,
    shuffle=False,
    use_average_length=True,
    bucket_scheme=bucket_scheme)
print('Validation Batch Sampler:')
print(val_batch_sampler.stats())

test_batch_sampler = nlp.data.FixedBucketSampler(
    lengths=target_test_lengths,
    batch_size=hparams.test_batch_size,
    num_buckets=hparams.num_buckets,
    ratio=0.0,
    shuffle=False,
    use_average_length=True,
    bucket_scheme=bucket_scheme)
print('Test Batch Sampler:')
print(test_batch_sampler.stats())

Given the samplers, we can create the DataLoaders, which are iterable. Note that the data loaders for the validation and test datasets share the same batchify function test_batchify_fn.

train_data_loader = nlp.data.ShardedDataLoader(
    data_train,
    batch_sampler=train_batch_sampler,
    batchify_fn=train_batchify_fn,
    num_workers=8)
print('Length of train_data_loader: %d' % len(train_data_loader))
val_data_loader = gluon.data.DataLoader(
    data_val,
    batch_sampler=val_batch_sampler,
    batchify_fn=test_batchify_fn,
    num_workers=8)
print('Length of val_data_loader: %d' % len(val_data_loader))
test_data_loader = gluon.data.DataLoader(
    data_test,
    batch_sampler=test_batch_sampler,
    batchify_fn=test_batchify_fn,
    num_workers=8)
print('Length of test_data_loader: %d' % len(test_data_loader))

Define Transformer Model

After obtaining the DataLoaders, we define the Transformer. The encoder and decoder of the Transformer can be easily obtained by calling the get_transformer_encoder_decoder function. We then pass the encoder and decoder to NMTModel to construct the Transformer model. model.hybridize allows computation to be done using the symbolic backend. We also use label_smoothing.

encoder, decoder = nmt.transformer.get_transformer_encoder_decoder(
    units=hparams.num_units,
    hidden_size=hparams.hidden_size,
    dropout=hparams.dropout,
    num_layers=hparams.num_layers,
    num_heads=hparams.num_heads,
    max_src_length=530,
    max_tgt_length=549,
    scaled=hparams.scaled)
model = nmt.translation.NMTModel(src_vocab=src_vocab,
                                 tgt_vocab=tgt_vocab,
                                 encoder=encoder,
                                 decoder=decoder,
                                 share_embed=True,
                                 embed_size=hparams.num_units,
                                 tie_weights=True,
                                 embed_initializer=None,
                                 prefix='transformer_')
model.initialize(init=mx.init.Xavier(magnitude=3.0), ctx=ctx)
model.hybridize()

print(model)

label_smoothing = nmt.loss.LabelSmoothing(epsilon=hparams.epsilon, units=len(tgt_vocab))
label_smoothing.hybridize()

loss_function = nmt.loss.SoftmaxCEMaskedLoss(sparse_label=False)
loss_function.hybridize()

test_loss_function = nmt.loss.SoftmaxCEMaskedLoss()
test_loss_function.hybridize()

detokenizer = nlp.data.SacreMosesDetokenizer()
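
To see what label smoothing produces, we can feed it a couple of pretend gold token ids (illustrative only; the exact values depend on hparams.epsilon and the implementation in nmt.loss):

# label_smoothing turns hard token ids into soft distributions over the target
# vocabulary, keeping most of the mass on the gold id and spreading the
# remaining epsilon over the other ids; each position should sum to ~1.
example_labels = mx.nd.array([[2, 5]])      # pretend gold ids, shape (batch, seq_len)
smoothed = label_smoothing(example_labels)  # shape (batch, seq_len, len(tgt_vocab))
print(smoothed.shape, smoothed.sum(axis=-1))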

Here, we build the translator using beam search.

translator = nmt.translation.BeamSearchTranslator(model=model,
                                                  beam_size=hparams.beam_size,
                                                  scorer=nlp.model.BeamSearchScorer(alpha=hparams.lp_alpha,
                                                                                    K=hparams.lp_k),
                                                  max_length=200)
print('Use beam_size=%d, alpha=%.2f, K=%d' % (hparams.beam_size, hparams.lp_alpha, hparams.lp_k))

Training Loop

Before training, we need to create a trainer for updating the parameters. In the following example, we create a trainer that uses the Adam optimizer.

trainer = gluon.Trainer(model.collect_params(), hparams.optimizer,
                        {'learning_rate': hparams.lr, 'beta2': 0.98, 'epsilon': 1e-9})
print('Use learning_rate=%.2f'
      % (trainer.learning_rate))
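
The learning rate is not kept constant: as noted in the loop below, we follow the warmup schedule from [1], which increases the rate linearly for warmup_steps steps and then decays it with the inverse square root of the step number. Here is a sketch of that schedule (the actual update is assumed to happen inside utils.train_one_epoch via trainer.set_learning_rate):

def warmup_lr(step_num, base_lr=hparams.lr, d_model=hparams.num_units,
              warmup_steps=hparams.warmup_steps):
    # lr = base_lr / sqrt(d_model) * min(1 / sqrt(step), step * warmup_steps ** -1.5)
    step_num = max(step_num, 1)
    return base_lr / d_model ** 0.5 * min(step_num ** -0.5,
                                          step_num * warmup_steps ** -1.5)

print([warmup_lr(s) for s in (1, hparams.warmup_steps, 4 * hparams.warmup_steps)])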

We can then write the training loop. During training, we evaluate on the validation and test datasets every epoch and record the parameters that give the lowest loss on the validation dataset. Before performing the forward and backward passes, we first use the as_in_context function to copy the mini-batch to the GPU. The statement with mx.autograd.record() instructs the Gluon backend to compute the gradients for the operations inside the block. To quickly observe the convergence of the loss, we set epochs = 3. Note that to obtain the best BLEU score, we would need more epochs and larger warmup steps, following the original paper; the corresponding SOTA results are shown in the first subsection. In addition, we use Averaging SGD (parameter averaging) [2] to update the parameters, since it is more robust for the machine translation task.

best_valid_loss = float('Inf')
step_num = 0
#We use warmup steps as introduced in [1].
warmup_steps = hparams.warmup_steps
grad_interval = hparams.num_accumulated
model.setattr('grad_req', 'add')
#We use Averaging SGD [2] to update the parameters.
average_start = (len(train_data_loader) // grad_interval) * \
    (hparams.epochs - hparams.average_start)
average_param_dict = {k: mx.nd.array([0]) for k, v in
                      model.collect_params().items()}
update_average_param_dict = True
model.zero_grad()
for epoch_id in range(hparams.epochs):
    utils.train_one_epoch(epoch_id, model, train_data_loader, trainer,
                          label_smoothing, loss_function, grad_interval,
                          average_param_dict, update_average_param_dict,
                          step_num, ctx)
    mx.nd.waitall()
    # The `evaluate` function uses the beam search translator to generate
    # outputs for the validation and testing datasets.
    valid_loss, _ = utils.evaluate(model, val_data_loader,
                                   test_loss_function, translator,
                                   tgt_vocab, detokenizer, ctx)
    print('Epoch %d, valid Loss=%.4f, valid ppl=%.4f'
          % (epoch_id, valid_loss, np.exp(valid_loss)))
    test_loss, _ = utils.evaluate(model, test_data_loader,
                                  test_loss_function, translator,
                                  tgt_vocab, detokenizer, ctx)
    print('Epoch %d, test Loss=%.4f, test ppl=%.4f'
          % (epoch_id, test_loss, np.exp(test_loss)))
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        model.save_parameters('{}.{}'.format(hparams.save_dir, 'valid_best.params'))
    model.save_parameters('{}.epoch{:d}.params'.format(hparams.save_dir, epoch_id))
mx.nd.save('{}.{}'.format(hparams.save_dir, 'average.params'), average_param_dict)
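
The parameter averaging referenced above maintains a running mean of the weights over the final epochs. A tiny numeric check of the underlying recurrence (plain Python, illustrative only, not the training code):

# The recurrence avg <- avg + (x_t - avg) / t reproduces the arithmetic mean of
# x_1..x_t, which is what parameter averaging [2] relies on.
values = [0.9, 1.1, 1.0, 1.2]   # pretend parameter values over four steps
running_avg = 0.0
for t, x in enumerate(values, start=1):
    running_avg += (x - running_avg) / t
print(running_avg, sum(values) / len(values))   # both ~1.05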

if hparams.average_start > 0:
    for k, v in model.collect_params().items():
        v.set_data(average_param_dict[k])
else:
    model.load_parameters('{}.{}'.format(hparams.save_dir, 'valid_best.params'), ctx)
valid_loss, _ = utils.evaluate(model, val_data_loader,
                               test_loss_function, translator,
                               tgt_vocab, detokenizer, ctx)
print('Best model valid Loss=%.4f, valid ppl=%.4f'
      % (valid_loss, np.exp(valid_loss)))
test_loss, _ = utils.evaluate(model, test_data_loader,
                              test_loss_function, translator,
                              tgt_vocab, detokenizer, ctx)
print('Best model test Loss=%.4f, test ppl=%.4f'
      % (test_loss, np.exp(test_loss)))

Conclusion

  • As showcased with the Transformer, GluonNLP supports deep neural networks for seq2seq tasks and has already achieved SOTA results on the WMT 2014 English-German task.

  • The Gluon NLP Toolkit provides high-level APIs that can drastically simplify the development of models for NLP tasks sharing the encoder-decoder structure.

  • Low-level APIs in the NLP Toolkit enable easy customization.

Documentation can be found at https://gluon-nlp.mxnet.io/index.html

The code is available at https://github.com/dmlc/gluon-nlp

References

[1] Vaswani, Ashish, et al. “Attention is all you need.” Advances in Neural Information Processing Systems. 2017.

[2] Polyak, Boris T, and Anatoli B. Juditsky. “Acceleration of stochastic approximation by averaging.” SIAM Journal on Control and Optimization. 1992.