Every Layer Counts: Multi-Layer Multi-Head Attention for Neural Machine Translation

Isaac Ampomah, Sally I McClean, Zhiwei Lin, Glenn Hawe

Research output: Contribution to journal › Article › peer-review


The neural framework employed for neural machine translation (NMT) usually consists of a stack of multiple encoding and decoding layers. However, the decoder subnetwork leverages only the source feature representation from the top-level encoder layer when generating the target sequence. As a result, these models do not fully exploit the useful source representations learned by the lower-level encoder layers. Furthermore, there is no guarantee that the top-level encoder layer encodes all the source information the decoder needs for target generation. Inspired by recent advances in deep representation learning, this paper proposes a Multi-Layer Multi-Head Attention (MLMHA) module to exploit the different source representations from the multi-layer encoder subnetwork. Specifically, the decoder is given more direct access to multiple encoder layers during target generation. This technique further improves the translation performance of the model. Exposing multiple encoder layers also enhances the flow of gradient information between the two subnetworks. Experimental results on two IWSLT language translation tasks (Spanish-English and English-Vietnamese) and WMT’14 English-German demonstrate the effectiveness of allowing the decoder access to representations from multiple encoder layers. Specifically, the MLMHA approaches explored in this paper achieve improvements of up to +0.71, +0.75 and +0.49 BLEU points over the Transformer baseline model on the English-German, Spanish-English, and English-Vietnamese translation tasks, respectively.
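The idea of letting the decoder attend to several encoder layers, rather than only the top one, can be illustrated with a minimal sketch. This is not the paper's MLMHA module: the abstract does not specify how the per-layer contexts are combined, so the averaging rule, the single attention head, the absence of learned projections, and all function names below are assumptions made purely for illustration. Plain Python lists are used so the example runs without external libraries.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, memory):
    """Scaled dot-product attention of one query vector over one encoder layer.

    `memory` is a list of source-position vectors. Keys and values are the
    same vectors here (no learned projections; a simplifying assumption).
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in memory]
    weights = softmax(scores)
    dim = len(memory[0])
    return [sum(w * vec[d] for w, vec in zip(weights, memory)) for d in range(dim)]

def multi_layer_attend(query, encoder_layers):
    """Attend to every encoder layer separately, then average the contexts.

    Averaging is a hypothetical combination rule chosen for illustration;
    it is not taken from the paper.
    """
    contexts = [attend(query, layer) for layer in encoder_layers]
    dim = len(contexts[0])
    return [sum(c[d] for c in contexts) / len(contexts) for d in range(dim)]

# Toy input: 2 encoder layers, 3 source positions, model dimension 2.
encoder_layers = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],  # lower encoder layer
    [[0.5, 0.5], [2.0, 0.0], [0.0, 2.0]],  # top encoder layer
]
decoder_query = [1.0, 0.0]

# Baseline behaviour: the decoder sees only the top encoder layer.
top_only = attend(decoder_query, encoder_layers[-1])
# Multi-layer behaviour: the context mixes information from every layer.
all_layers = multi_layer_attend(decoder_query, encoder_layers)
```

In this toy setting, `top_only` depends solely on the last encoder layer, while `all_layers` also reflects the lower layer's representations, which is the kind of access the abstract describes.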
Original language: English
Pages (from-to): 51-82
Number of pages: 32
Journal: The Prague Bulletin of Mathematical Linguistics
Publication status: Published (in print/issue) - 30 Oct 2020

