
How do we train LLMs for machine translation?

  • Writer: Marina Pantcheva
  • Apr 29
  • 4 min read

Before addressing the question, let's first look into how LLM training is different from training a traditional Neural Machine Translation (NMT) model.

The difference between NMT and LLM systems

Traditional NMT models are trained for the single task of translating. They use parallel datasets containing bilingual source-target pairs. The volume of source text in the training dataset thus equals the volume of target text.

NMT models are highly specialized: their training data, learning objectives, and architecture are all aligned to produce translations. NMT models take a source sentence and learn how to produce the target sentence, optimizing to minimize translation errors.

Large Language Models (like GPT, Llama, Gemini, Claude, and others) are trained in a much broader way. They are exposed to massive amounts of multilingual data (not necessarily aligned sentence pairs) and learn to predict the next word rather than to convert text from one language to another. Typically, they see far more English text than text in other languages, so the training dataset is not symmetrical across languages.
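
To make the contrast concrete, here is a minimal sketch in Python of what the two kinds of training examples look like. The sentences are invented toy data, shown only to illustrate the difference in shape.

    # NMT training data: aligned source-target pairs, one-to-one.
    nmt_examples = [
        {"source": "The dog sleeps.", "target": "El perro duerme."},
        {"source": "Good morning!", "target": "¡Buenos días!"},
    ]

    # LLM pre-training data: raw running text in many languages,
    # with no alignment and no balance between languages.
    llm_corpus = [
        "The dog sleeps on the porch while the cat watches.",
        "El perro duerme en el porche.",
        "Les chiens dorment beaucoup.",
    ]

    # The LLM objective is plain next-word prediction over this text:
    # given "The dog", predict "sleeps"; given "El perro", predict "duerme".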



Translation as an emergent capability

During training, a Large Language Model sees billions of sentences in many languages. From these, it develops a general, language-agnostic understanding of language: it learns universal concepts that are not tied to any specific language. For example, the model might learn the concept of a “dog” from many English sentences, and it can separately learn that the Spanish word “perro” appears in contexts with similar semantics. Deep in the network, the words “dog” and “perro” become linked. When prompted to translate “dog” into Spanish, the model can retrieve the concept and output the Spanish word “perro”. The model has effectively built an interlingua: a high-dimensional cross-lingual representation of concepts shared across languages. This concept of a “dog” is also cross-modal; for instance, it gets activated not only by the written token “dog” but also by images depicting dogs.
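
One way to glimpse this shared representation is to compare multilingual sentence embeddings. The sketch below uses the sentence-transformers library; the model name and the example sentences are assumptions chosen for illustration, not measurements made for this article.

    # A minimal sketch with the sentence-transformers library.
    # The model choice is an assumption; any multilingual embedding model works.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

    sentences = [
        "The dog is barking.",      # English
        "El perro está ladrando.",  # Spanish, same meaning
        "The stock market fell.",   # English, unrelated meaning
    ]
    embeddings = model.encode(sentences)

    # Cosine similarity: the English and Spanish "dog" sentences should land
    # much closer to each other than to the unrelated sentence.
    print(util.cos_sim(embeddings[0], embeddings[1]))  # expected: high
    print(util.cos_sim(embeddings[0], embeddings[2]))  # expected: low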

The translation capability of LLMs emerges thanks to this cross-lingual representation. Translation is thus an emergent capability: one the model was never directly trained for, yet learned to perform anyway.
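
Because the capability is already latent in the model, a plain prompt is enough to elicit it. The sketch below uses the Hugging Face transformers pipeline; the model name is a placeholder assumption, and small models will translate poorly, but the mechanism is the same.

    # Eliciting emergent translation with nothing but a prompt.
    # The model name is an assumed placeholder; substitute any
    # instruction-tuned causal LM you have access to.
    from transformers import pipeline

    generator = pipeline("text-generation",
                         model="meta-llama/Llama-3.1-8B-Instruct")

    prompt = ("Translate the following English sentence into Spanish:\n"
              "English: The dog sleeps.\n"
              "Spanish:")
    output = generator(prompt, max_new_tokens=20, do_sample=False)
    print(output[0]["generated_text"])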

How to specialize an LLM for translation

Translation, then, is an emergent capability: LLMs can translate, but they are not optimized for specific translation tasks. We can apply various techniques to a pre-trained model to get the best translation performance and to specialize it for particular domains and content types.

Supervised fine-tuning with parallel data

The most straightforward approach to improve translation quality is to fine-tune an LLM after its initial general training. Fine-tuning here means that we take a pre-trained LLM and train it further on bilingual pairs.

For example, we can take an open LLM like LLaMA and fine-tune it on English-Spanish sentence pairs. This will directly teach the model the mapping from English to Spanish, thus improving accuracy and reducing mistakes.
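
For illustration, such a fine-tuning run might look like the sketch below, built on the Hugging Face transformers and datasets libraries. The model name, prompt format, hyperparameters, and the two toy sentence pairs are all assumptions; a real run needs a large parallel corpus and serious compute.

    # A minimal supervised fine-tuning sketch: bilingual pairs are cast as
    # ordinary next-word-prediction examples for a causal LM.
    from datasets import Dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling,
                              Trainer, TrainingArguments)

    model_name = "meta-llama/Llama-3.1-8B"  # assumed placeholder
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token  # some tokenizers lack a pad token
    model = AutoModelForCausalLM.from_pretrained(model_name)

    pairs = [
        {"en": "The dog sleeps.", "es": "El perro duerme."},
        {"en": "Good morning!", "es": "¡Buenos días!"},
    ]

    def to_features(pair):
        # One training example = prompt plus reference translation.
        text = f"Translate English to Spanish:\n{pair['en']}\n{pair['es']}"
        return tokenizer(text, truncation=True, max_length=256)

    dataset = Dataset.from_list(pairs).map(to_features,
                                           remove_columns=["en", "es"])

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="sft-en-es", num_train_epochs=1),
        train_dataset=dataset,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()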

This combination of pre-training an LLM on a giant dataset followed by targeted fine-tuning gives excellent results. In practical terms, however, fine-tuning LLMs for translation is very resource-intensive. Moreover, when LLMs get heavily fine-tuned on translation, they may forget other skills they have. This phenomenon has the dramatic name of “catastrophic forgetting.”

Using LoRA (Low-Rank Adaptation)

One way to avoid overfitting and forgetting is to freeze the weights of the LLM and add only small changes. This technique is called Low-Rank Adaptation (LoRA). LoRA injects small trainable weight matrices into the model while the original weights stay frozen. During fine-tuning, only these small LoRA weights are updated. Since the original weights are frozen, the model preserves all its original knowledge and capabilities.

LoRA drastically reduces the number of parameters that need updating. And because the model gets specialized for translation only through the LoRA layers, it retains its general knowledge and avoids catastrophic forgetting. The result is similar to full fine-tuning, but the process is far more efficient.
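
With the peft library, LoRA amounts to a few lines on top of an ordinary fine-tuning setup. The rank, scaling factor, and target modules below are illustrative defaults for this sketch rather than a recommendation.

    # Wrapping a frozen base model with small trainable LoRA weights.
    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM

    base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

    config = LoraConfig(
        r=8,                                  # rank of the low-rank updates
        lora_alpha=16,                        # scaling factor for the updates
        target_modules=["q_proj", "v_proj"],  # attention projections to adapt
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base, config)

    # Only the injected LoRA weights are trainable; the base model stays
    # frozen, so this prints a tiny fraction of the total parameter count.
    model.print_trainable_parameters()

Training then proceeds exactly as in the fine-tuning sketch above, except that gradient updates touch only the LoRA weights.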

Augmenting LLMs with Translation Memory and Terminology

An alternative approach to fine-tuning a model’s weights is to augment the model’s capabilities through Retrieval-Augmented Generation (RAG). RAG means that when the LLM translates, it also checks a database of existing translations (a Translation Memory) and approved terminology (a Term Database) and uses those two references to provide the optimal translation.

RAG is thus a type of external memory of high-quality human translations that the LLM can reference. If the translation of a source sentence (or similar sentences) exists in the database, the LLM can consider that translation, ensuring consistency with existing work and alignment with style and terminology.

RAG is a way to combine the power of an LLM with the precision of a lookup system.
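
As a minimal sketch of the idea: embed the Translation Memory, retrieve the entries closest to a new source sentence, and place them in the prompt. The TM entries, embedding model, and prompt format are all invented for illustration.

    # Sketch of TM-augmented translation: retrieve similar past translations
    # and inject them into the prompt sent to the LLM.
    from sentence_transformers import SentenceTransformer, util

    embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

    translation_memory = [
        {"src": "The dog sleeps on the sofa.",
         "tgt": "El perro duerme en el sofá."},
        {"src": "Press the power button.",
         "tgt": "Pulse el botón de encendido."},
    ]
    tm_vectors = embedder.encode([e["src"] for e in translation_memory])

    def build_prompt(source, top_k=1):
        # Rank TM entries by cosine similarity to the new source sentence.
        scores = util.cos_sim(embedder.encode([source]), tm_vectors)[0]
        best = scores.argsort(descending=True)[:top_k]
        examples = "\n".join(
            f"English: {translation_memory[i]['src']}\n"
            f"Spanish: {translation_memory[i]['tgt']}"
            for i in best.tolist()
        )
        return ("Use these approved translations as reference:\n"
                f"{examples}\n\n"
                "Translate into Spanish:\n"
                f"English: {source}\nSpanish:")

    # The assembled prompt goes to the LLM, which translates with the
    # Translation Memory matches in view.
    print(build_prompt("The dog sleeps on the porch."))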

The answer in a nutshell

To conclude, training an LLM for translation differs from training an NMT model. In LLMs, translation is an emergent capability due to their broad multilingual training. LLMs benefit from further training and specialization to improve translation quality. Techniques like supervised fine-tuning, Low-Rank Adaptation (LoRA), and Retrieval-Augmented Generation (RAG) are used to adapt LLMs for specific translation tasks. Each of these methods offers a trade-off between efficiency, resource usage, and translation accuracy. The choice between them depends on various factors, such as model size, context length, task type, the nature of the retrieved content (in the case of RAG) and more. Importantly, none of the approaches is entirely error-proof, so it is extremely important to have robust validation and quality assessment processes.


About the Series:

As part of our "Ask the Think Tank" series, members answer readers' questions to help foster knowledge sharing and to serve as a resource when you don't know where to turn. To submit your own question, click here.
