ACL 2021 Tutorial
Pre-training Methods for Neural Machine Translation

Mingxuan Wang
ByteDance AI Lab
Lei Li
University of California Santa Barbara
August 1, 2021

1 Tutorial Introduction

Pre-training is a dominant paradigm in Natural Language Processing (NLP) [28, 8, 20], Computer Vision (CV) [12, 34] and Automatic Speech Recognition (ASR) [3, 6, 24]. Typically, models are first pre-trained on large amounts of unlabeled data to capture rich representations of the input, and are then applied to downstream tasks either by providing context-aware representations of the input or by initializing the parameters of the downstream model for fine-tuning. Recently, the trend of self-supervised pre-training followed by task-specific fine-tuning has reached neural machine translation (NMT) [37, 35, 5].

Despite its success, introducing a universal pre-trained model into NMT is non-trivial and does not necessarily yield promising results, especially in the resource-rich setting. Unique challenges remain in several respects. First, the objectives of most pre-training methods differ from those of the downstream NMT tasks. For example, BERT [8], a popular pre-trained model, is designed for language understanding and uses only a Transformer encoder, while an NMT model usually consists of an encoder and a decoder that together perform cross-lingual generation. This gap makes it difficult to apply such pre-training directly to NMT [30]. Second, machine translation is naturally a multi-lingual problem, but general pre-training methods for NLP, such as BERT and GPT, mainly focus on English corpora. Given the success of transfer learning in multi-lingual machine translation, it is very appealing to introduce multi-lingual pre-training for NMT [7]. Finally, speech translation has attracted much attention recently, while most pre-training methods focus on text representations. How to leverage pre-training to improve speech translation is a new challenge.

This tutorial provides a comprehensive guide to making the most of pre-training for neural machine translation. First, we will briefly introduce the background of NMT and the pre-training methodology, and point out the main challenges in applying pre-training to NMT. We will then analyse the role of pre-training in enhancing NMT performance, how to design better pre-training models for specific NMT tasks, and how to better integrate a pre-trained model into an NMT system. In each part, we will provide examples, discuss training techniques and analyse what is transferred when pre-training is applied.

The first topic is monolingual pre-training for NMT, which is one of the most well-studied areas. Monolingual text representations such as ELMo, GPT, MASS and BERT significantly boost the performance of various natural language processing tasks [25, 8, 28, 30]. However, NMT has several distinct characteristics, such as the availability of large training data (10 million sentence pairs or more) and the high capacity of baseline NMT models, which call for careful design of pre-training. In this part, we will introduce different pre-training methods and analyse the best practices for applying them to different machine translation scenarios, such as unsupervised NMT, low-resource NMT and resource-rich NMT [37, 35]. We will cover techniques for fine-tuning pre-trained models with various strategies, such as knowledge distillation and adapters [4, 16].
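To make the idea of sequence-to-sequence pre-training on monolingual text more concrete, the following is a minimal sketch of how a MASS-style [30] training example could be built: a contiguous span of the sentence is masked on the encoder side, and the decoder is trained to predict that span. The function name `mass_mask`, the `[MASK]` symbol, the `span_ratio` parameter and the simplified decoder input are assumptions made for illustration, not the authors' implementation.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, chosen for this sketch


def mass_mask(tokens, span_ratio=0.5, rng=None):
    """Build one MASS-style example from a monolingual token list.

    Returns (encoder_input, decoder_input, target). This is a simplified
    illustration: real systems work on subword ids and batched tensors.
    """
    if not tokens:
        raise ValueError("tokens must be non-empty")
    rng = rng or random.Random()
    n = len(tokens)
    # Length of the contiguous span to mask, clamped to [1, n].
    span_len = min(n, max(1, round(n * span_ratio)))
    start = rng.randint(0, n - span_len)  # randint bounds are inclusive
    end = start + span_len
    # The encoder sees the sentence with the span replaced by mask symbols.
    encoder_input = tokens[:start] + [MASK] * span_len + tokens[end:]
    # The decoder must predict exactly the masked span.
    target = tokens[start:end]
    # Teacher forcing: the decoder input is the target shifted right by one.
    decoder_input = [MASK] + target[:-1]
    return encoder_input, decoder_input, target


if __name__ == "__main__":
    sentence = "we study pre-training for neural machine translation today".split()
    enc, dec, tgt = mass_mask(sentence, span_ratio=0.5, rng=random.Random(0))
    print("encoder input:", enc)
    print("decoder input:", dec)
    print("target:       ", tgt)
```

Because both the encoder and the decoder take part in this objective, it is closer to the encoder-decoder structure of an NMT model than an encoder-only objective is.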

The next topic is multi-lingual pre-training for NMT. Here, we aim to mitigate the English-centric bias and suggest that it is possible to build a universal representation for different languages to improve massively multi-lingual NMT. In this part, we will discuss general representations of different languages and analyse how knowledge transfers across languages. These insights will allow a better design for multi-lingual pre-training, in particular for zero-shot transfer to non-English language pairs [15, 27, 7, 26, 13, 17, 19, 23, 18].
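As a small illustration of how a single model can serve many language pairs, the sketch below follows the convention described in [15], where a token naming the target language (for example `<2es>`) is prepended to the source sentence. The helper name `add_target_tag` and the toy sentences are assumptions for this example only.

```python
def add_target_tag(source_tokens, target_lang):
    """Prepend a target-language token such as <2es> to the source tokens."""
    return [f"<2{target_lang}>"] + list(source_tokens)


if __name__ == "__main__":
    # One shared model can be trained on a mix of translation directions.
    examples = [
        (["Hello", ",", "how", "are", "you", "?"], "es"),
        (["Hola", ",", "¿", "cómo", "estás", "?"], "en"),
    ]
    for source, lang in examples:
        print(add_target_tag(source, lang))
```

At inference time, the same tag can request a direction that never appeared in training, which is the zero-shot setting mentioned above.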

The last technical part of this tutorial deals with pre-training for speech translation. In particular, we focus on leveraging weakly supervised or unsupervised training data to improve speech translation. In this part, we will discuss the possibility of building a general representation across speech and text, and show how text or audio pre-training can guide the text generation of NMT [33, 21, 3, 32, 1, 2, 14, 22, 10, 9, 11, 36].

We conclude the tutorial by pointing out best practices for applying pre-training to NMT. The topics cover a variety of pre-training methods for different NMT scenarios. After this tutorial, the audience will understand why pre-training for NMT differs from pre-training for other tasks and how to make the most of it. Importantly, we will give an in-depth analysis of how and why pre-training works in NMT, which we hope will inspire future work on designing pre-training paradigms specific to NMT.

2 Tutorial Outline

PART I: Introduction (15 min) [slides]

  • Background of NMT

  • General pre-training paradigm

  • Unique Challenges

    • Objective difference

    • Multi-lingual generation

    • Modality disparity

PART II: Monolingual Pre-training for NMT (60 min) [slides]

  • The early stage

    • NMT initialized with word2vec

    • NMT initialized with language model

  • BERT fusion in NMT

    • BERT Incorporating methods

    • BERT Tuning methods

  • Unified sequence-to-sequence pre-training

    • MASS, BART, etc.

PART III: Multi-lingual Pre-training for NMT (45 min) [slides]

  • Multilingual fused pre-training

    • Cross-lingual Language Model Pre-training

    • Alternating Language Modeling Pre-training

    • XLM-T: Cross-lingual Transformer Encoders

  • Multilingual sequence to sequence pre-training

    • mBART

    • CSP

    • mRASP

PART IV: Pre-training for Speech Translation (45 min) [slides]

  • MT pre-training

  • ASR pre-training

  • Audio pre-training

  • Raw text pre-training

  • Bi-modal pre-training

PART V: Conclusion and Future Directions (15 min)

3 Prerequisites

The tutorial is self-contained. We will cover the background, the technical details and the examples. Basic knowledge of neural networks is required, including word embeddings, attention, and encoder-decoder models. Prior NLP courses and familiarity with the machine translation task are preferred.

We recommend, but do not require, that the audience read the following papers before the tutorial:

  1. Basic MT model: Attention is all you need [31].

  2. Google’s multilingual neural machine translation system [15].

  3. Text pre-training with BERT [8] and GPT [28].

  4. Audio pre-training with Wav2vec and Wav2vec2.0 [29, 2].

  5. Pre-training multilingual NMT [17, 19].

4 Target Audience

This tutorial will be suitable for researchers and practitioners interested in pre-training applications and multilingual NLP, especially for machine translation.

To the best of our knowledge, this is the first tutorial that focuses on the pre-training methods and practice for NMT.

5 Tutorial Presenters

Mingxuan Wang

(ByteDance AI Lab)

Dr. Mingxuan Wang is a senior researcher at ByteDance AI Lab. He received his PhD degree from the Institute of Computing Technology, Chinese Academy of Sciences, in 2017. His research focuses on natural language processing and machine translation. He has published over 20 papers in leading NLP/AI journals and conferences such as ACL, AAAI and EMNLP. He has served on the Program Committees of ACL/EMNLP 2016-2020, AAAI/IJCAI 2018/2019, and NeurIPS 2020. He has achieved outstanding results in various machine translation evaluation competitions, including first place in Chinese-to-English translation at WMT 2018 and third place in Chinese-to-English translation at NIST 2015. Together with Dr. Lei Li, he is leading a team developing the VolcTrans machine translation system.

He gave a tutorial on machine translation at CCMT 2017 and was a guest lecturer on machine translation at the University of Chinese Academy of Sciences (UCAS) in 2016.

Lei Li

(University of California Santa Barbara)

Dr. Lei Li is an assistant professor in the Computer Science Department at the University of California Santa Barbara. His research interests lie in natural language processing, machine translation, and AI-powered drug discovery. He received his B.S. from Shanghai Jiao Tong University and his Ph.D. from Carnegie Mellon University. His dissertation work on fast algorithms for mining co-evolving time series was awarded ACM KDD best dissertation (runner up). His recent work on the AI writer Xiaomingbot received the 2nd-class award of the Wu Wen-tsün AI prize in 2017. He is a recipient of the ACL 2021 best paper award, the CCF Young Elite award in 2019, and CCF distinguished speaker in 2017. His team won first place in five translation directions at WMT 2020 and ranked best in the corpus filtering challenge. Previously, he worked at the EECS department of UC Berkeley, at Baidu’s Institute of Deep Learning in Silicon Valley, and at ByteDance as the founding director of AI Lab. He has served as an organizer and area chair/senior PC member for multiple conferences including KDD, ACL, EMNLP, NeurIPS, AAAI, IJCAI, and CIKM. He has published over 100 technical papers in ML, NLP and data mining and holds more than 10 patents. He started ByteDance’s machine translation system, VolcTrans, and many of his algorithms have been deployed in production.

He has delivered four tutorials, at EMNLP 2019, NLPCC 2019, NLPCC 2016, and KDD 2010. He was a lecturer at the 2014 Probabilistic Programming for Advancing Machine Learning summer school in Portland, USA.

6 Related Tutorials

Neural Machine Translation, presented by Thang Luong, Kyunghyun Cho, and Christopher Manning at ACL 2016. Our tutorial is related to, but different from, the ACL 2016 NMT tutorial: it focuses on pre-training methods for bilingual, multi-lingual, and multi-modal neural machine translation.

Unsupervised Cross-Lingual Representation Learning, presented by Sebastian Ruder, Anders Søgaard, and Ivan Vulić at ACL 2019. That tutorial is related in that it concerns multi-lingual NLP. However, it was on representation learning, while our tutorial is on neural machine translation.


  • [1] A. Baevski, S. Schneider, and M. Auli (2020) Vq-wav2vec: self-supervised learning of discrete speech representations. In Proc. of ICLR.
  • [2] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli (2020) Wav2vec 2.0: A framework for self-supervised learning of speech representations. In Proc. of NeurIPS, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.).
  • [3] S. Bansal, H. Kamper, K. Livescu, A. Lopez, and S. Goldwater (2019) Pre-training on high-resource speech recognition improves low-resource speech-to-text translation. In Proc. of NAACL-HLT, pp. 58–68.
  • [4] A. Bapna and O. Firat (2019) Simple, scalable adaptation for neural machine translation. In Proc. of EMNLP, pp. 1538–1548.
  • [5] Y. Chen, Z. Gan, Y. Cheng, J. Liu, and J. Liu (2020) Distilling knowledge learned in BERT for text generation. In Proc. of ACL, pp. 7893–7905.
  • [6] Y. Chuang, C. Liu, and H. Lee (2020) SpeechBERT: an audio-and-text jointly learned language model for end-to-end spoken question answering. In Proc. of INTERSPEECH.
  • [7] A. Conneau and G. Lample (2019) Cross-lingual language model pretraining. In Proc. of NeurIPS, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett (Eds.), pp. 7057–7067.
  • [8] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. of NAACL-HLT, pp. 4171–4186.
  • [9] Q. Dong, M. Wang, H. Zhou, S. Xu, B. Xu, and L. Li (2021) Consecutive decoding for speech-to-text translation. In Proc. of AAAI.
  • [10] Q. Dong, R. Ye, M. Wang, H. Zhou, S. Xu, B. Xu, and L. Li (2021) Listen, understand and translate: triple supervision decouples end-to-end speech-to-text translation. In Proc. of AAAI, Vol. 35, pp. 12749–12759.
  • [11] C. Han, M. Wang, H. Ji, and L. Li (2021) Learning shared semantic space for speech-to-text translation. In Proc. of ACL - Findings.
  • [12] K. He, R. B. Girshick, and P. Dollár (2019) Rethinking imagenet pre-training. In Proc. of ICCV, pp. 4917–4926.
  • [13] H. Huang, Y. Liang, N. Duan, M. Gong, L. Shou, D. Jiang, and M. Zhou (2019) Unicoder: a universal language encoder by pre-training with multiple cross-lingual tasks. In Proc. of EMNLP, pp. 2485–2494.
  • [14] H. Huang, L. Su, D. Qi, N. Duan, E. Cui, T. Bharti, L. Zhang, L. Wang, J. Gao, B. Liu, et al. (2021) M3P: learning universal representations via multitask multilingual multimodal pre-training. In Proc. of CVPR.
  • [15] M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. Viégas, M. Wattenberg, G. Corrado, M. Hughes, and J. Dean (2017) Google’s multilingual neural machine translation system: enabling zero-shot translation. TACL 5, pp. 339–351.
  • [16] J. Liang, C. Zhao, M. Wang, X. Qiu, and L. Li (2021) Finding sparse structure for domain specific neural machine translation. In Proc. of AAAI.
  • [17] Z. Lin, X. Pan, M. Wang, X. Qiu, J. Feng, H. Zhou, and L. Li (2020) Pre-training multilingual neural machine translation by leveraging alignment information. In Proc. of EMNLP, pp. 2649–2663.
  • [18] Z. Lin, L. Wu, M. Wang, and L. Li (2021) Learning language specific sub-network for multilingual machine translation. In Proc. of ACL.
  • [19] Y. Liu, J. Gu, N. Goyal, X. Li, S. Edunov, M. Ghazvininejad, M. Lewis, and L. Zettlemoyer (2020) Multilingual denoising pre-training for neural machine translation. TACL 8, pp. 726–742.
  • [20] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. S. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized bert pretraining approach. ArXiv abs/1907.11692.
  • [21] Y. Liu, H. Xiong, J. Zhang, Z. He, H. Wu, H. Wang, and C. Zong (2019) End-to-end speech translation with knowledge distillation. In Proc. of INTERSPEECH, G. Kubin and Z. Kacic (Eds.), pp. 1128–1132.
  • [22] Q. Long, M. Wang, and L. Li (2021) Generative imagination elevates machine translation. In Proc. of NAACL-HLT, pp. 5738–5748.
  • [23] X. Pan, L. Wu, M. Wang, and L. Li (2021) Contrastive learning for many-to-many multilingual neural machine translation. In Proc. of ACL.
  • [24] D. S. Park, W. Chan, Y. Zhang, C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le (2019) Specaugment: a simple data augmentation method for automatic speech recognition. In Proc. of INTERSPEECH.
  • [25] M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In Proc. of NAACL-HLT, pp. 2227–2237.
  • [26] T. Pires, E. Schlinger, and D. Garrette (2019) How multilingual is multilingual BERT? In Proc. of ACL, pp. 4996–5001.
  • [27] Y. Qi, D. Sachan, M. Felix, S. Padmanabhan, and G. Neubig (2018) When and why are pre-trained word embeddings useful for neural machine translation? In Proc. of NAACL-HLT, pp. 529–535.
  • [28] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI Blog 1 (8).
  • [29] S. Schneider, A. Baevski, R. Collobert, and M. Auli (2019) Wav2vec: unsupervised pre-training for speech recognition. In Proc. of INTERSPEECH.
  • [30] K. Song, X. Tan, T. Qin, J. Lu, and T. Liu (2019) MASS: masked sequence to sequence pre-training for language generation. In Proc. of ICML, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, pp. 5926–5936.
  • [31] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Proc. of NeurIPS, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett (Eds.), pp. 5998–6008.
  • [32] C. Wang, Y. Wu, S. Liu, M. Zhou, and Z. Yang (2020) Curriculum pre-training for end-to-end speech translation. In Proc. of ACL, pp. 3728–3738.
  • [33] X. Wang, J. Wu, J. Chen, L. Li, Y. Wang, and W. Y. Wang (2019) VaTeX: A large-scale, high-quality multilingual dataset for video-and-language research. In Proc. of ICCV, pp. 4580–4590.
  • [34] Q. Xie, M. Luong, E. H. Hovy, and Q. V. Le (2020) Self-training with noisy student improves imagenet classification. In Proc. of CVPR, pp. 10684–10695.
  • [35] J. Yang, M. Wang, H. Zhou, C. Zhao, W. Zhang, Y. Yu, and L. Li (2020) Towards making the most of BERT in neural machine translation. In Proc. of AAAI.
  • [36] R. Ye, M. Wang, and L. Li (2021) End-to-end speech translation via cross-modal progressive training. In Proc. of INTERSPEECH.
  • [37] J. Zhu, Y. Xia, L. Wu, D. He, T. Qin, W. Zhou, H. Li, and T. Liu (2020) Incorporating BERT into neural machine translation. In Proc. of ICLR.