
To quality or not to quality: That's not the question

  • Writer: Marta Nieto Cayuela
  • Mar 30
  • 6 min read

Updated: May 22

Redefining Localization Quality in the Age of AI Translation

According to the Oxford English Dictionary, “quality” is «the standard of something as measured against other things of a similar kind; the degree of excellence of something».

In the Localization industry, quality has traditionally been defined in terms of language correctness and message accuracy. It was straightforward to measure, relying on established language rules and fidelity to the source message. Baselines were easy to create, and scores were simple to calculate. But in today’s AI-powered localization ecosystem, this definition may be too narrow.

Yet this is hardly a new debate. Let me introduce you to a discussion as old as the industry itself, and more relevant than ever: how do we measure quality? And, in the age of AI, how do we measure LLM quality?

What's quality?

First things first. If we are to talk about quality we should establish and agree on a definition. Spoiler alert: there is no universal agreement on what that definition should be.

More than ten years ago, Geoffrey S. Koby, Paul Fields, Daryl Hague, Arle Lommel and Alan Melby (Defining Translation Quality, 2014) published an article, in fact a series of three articles, in which they tried to agree on the concept of quality. They never reached that agreement, but they were able to present two different definitions: the narrow and the broad definitions of quality.

Narrow definition of quality

«A high-quality translation is one in which the message embodied in the source text is transferred completely into the target text, including denotation, connotation, nuance, and style, and the target text is written in the target language using correct grammar and word order, to produce a culturally appropriate text that, in most cases, reads as if originally written by a native speaker of the target language for readers in the target culture.»

Here is a classic example: imagine a translation that accurately conveys the original message, adhering to all linguistic rules of grammar, syntax, and cultural nuances. It is a perfectly written text and resonates in a culturally appropriate manner for the target audience. Ok, clear.

Broad definition of quality

«A quality translation demonstrates accuracy and fluency required for the audience and purpose and complies with all other specifications negotiated between the requester and provider, taking into account end-user needs.»

Now, consider a case where a translation accurately renders the original meaning but fails to integrate specific keywords for SEO optimization, ignores character limitations, disregards the brand tone, does not appeal to a specific audience segment, or overlooks the localization preferences of a regional market or business. While the translation might meet the narrow definition, it does not meet the broader expectations. The broader definition of quality, therefore, includes these added factors that are negotiated between the requester and provider, shaping the translation’s effectiveness within its context and intended use.

While the authors did not concur on which one should be universally adopted, they did set the two definitions clearly apart. They argued that the broad definition was key to progress in the translation industry, and that the choice of one over the other would shape the framework for developing quality metrics and have consequences for how quality is measured.

Language quality vs translation quality

I admit I have never used the concepts of “broad” and “narrow”. Instead, I have been using “language quality” (LQ) to refer to the narrow definition and “translation quality” (TQ) for the broad interpretation. I may have even used them interchangeably —guilty as charged— depending on the crowd I was addressing, but I believe we are at a turning point in which we must differentiate clearly between the two when building translation quality management frameworks in the age of AI translation.

The contrast between the two definitions becomes evident when evaluating feedback from any stakeholder, customer or provider in a given organization. The conversation is no longer purely linguistic; the needle has moved toward assessing whether the target output is accepted by its audience, fulfills its purpose and serves its end-user. Quality today is required to meet the EXPERIENCE criteria.

In a world where near error-free content is at our fingertips, localization quality frameworks must evolve to assess not just language accuracy but also how humans engage with brands through their localized content, how it resonates and, ultimately, what that may mean in business terms (such as clicks, conversions, and sales).

By considering the experience dimension, we embrace a 360° translation quality approach, positioning us at the forefront of a new era in localization quality: the translation quality continuum.

What is a “good experience”? What is “good enough” in translation quality?

Well, the answer lies within the broad definition of quality: «[...]complies with all other specifications negotiated between the requester and provider[...]». Let me draw your attention to one particular word. Yes, it is “negotiated”, and I will add another layer of “expectations”: «negotiated expectations».

If quality was already a subjective topic, this does not make things any easier, but it does bring the flexibility to create something unique that serves its specific purpose. Until now, we have been refining frameworks and metrics to align with criteria based on content type or visibility. Today, however, the focus shifts to evaluating AI-generated translations, which introduces a new variable into the audience dimension: determining whether the content will be consumed by humans or by machines (e.g. SEO-optimized content, analysis reports generated from automated data such as weather or financials, corpora generation, web crawling, etc.).

Measuring AI-generated translation quality

❝There are no magic tricks here, only data collection.❞

With the increasing volume of content each year, continuous localization requests, shorter turnaround times, and limited budgets, organizations must understand raw MT and LLM translation quality (wink) to determine whether this new workflow fits into their content localization strategy. Key factors like visibility, utilization, risk, and business impact come into play. A common question I receive is: How confident are we in the output quality? Is it publishable?

There are no magic tricks here, only data collection. AI quality performance is a data-driven decision, supported by both automated scores and human evaluation.

Over more than a decade, the industry has reached MT maturity, and there are a number of ways to measure MT post-edited deliverables (BLEU, ROUGE, TER, Levenshtein, chrF3, etc.) that work hand in hand with metrics that measure the cognitive effort of the editing phase (Time to Edit). After analyzing these scores and the human edits, the machine’s performance can be improved through successive rounds of technical corrections. But that is not all: today we can also predict raw MT quality by producing quality estimation (QE) scores that approximate human quality judgments (MTQE, QPS) on a 0-1 scale.
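To make those scores tangible, here is a minimal sketch using the open-source sacreBLEU package for BLEU, chrF3 and TER, plus a standard-library edit-distance ratio as a crude proxy for post-editing effort; the example segments are placeholders, not real project data.

```python
# Minimal sketch: scoring MT output against human references with sacreBLEU
# (pip install sacrebleu). The segments below are placeholders.
from sacrebleu.metrics import BLEU, CHRF, TER

hypotheses = [
    "The contract enters in force on 1 January 2025.",
    "Click the button to save your changes.",
]
references = [[  # one reference set; sacreBLEU accepts several
    "The contract comes into force on 1 January 2025.",
    "Click the button to save your changes.",
]]

print(BLEU().corpus_score(hypotheses, references))
print(CHRF(beta=3).corpus_score(hypotheses, references))  # chrF3-style weighting
print(TER().corpus_score(hypotheses, references))         # lower TER = fewer edits

# A rough edit-distance ratio between raw MT and its post-edited version,
# as a crude proxy for post-editing effort (standard library only).
from difflib import SequenceMatcher
raw_mt = "The contract enters in force on 1 January 2025."
post_edited = "The contract comes into force on 1 January 2025."
print(f"similarity: {SequenceMatcher(None, raw_mt, post_edited).ratio():.2f}")
```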

Localization has pioneered the adoption and customization of automated and AI solutions to evaluate MT-generated content, but what about LLMs?

How do we measure LLM translation quality?

Given how quickly LLMs have entered Localization as translation engines, there is not yet much research on metrics designed specifically to evaluate LLM translation quality. For now, the industry leverages NMT metrics adapted to LLM output, with the support of human evaluation, adapted quality systems and data correlation. To name a few, the industry is experimenting with GEMBA, COMET, BERTScore, BLEURT, MetricX and xCOMET, among others.
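As an illustration of the neural metrics in that list, here is a minimal sketch of scoring LLM translations with COMET via the open-source unbabel-comet package and its Unbabel/wmt22-comet-da checkpoint; the segments are placeholders, and the checkpoint is simply one common choice, not a recommendation.

```python
# Minimal sketch: segment- and system-level COMET scores for LLM translations.
# Assumes `pip install unbabel-comet`; segments below are placeholders.
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-comet-da")  # reference-based COMET
model = load_from_checkpoint(model_path)

data = [
    {
        "src": "El contrato entra en vigor el 1 de enero de 2025.",
        "mt": "The contract enters into force on January 1, 2025.",
        "ref": "The contract comes into force on 1 January 2025.",
    },
]
output = model.predict(data, batch_size=8, gpus=0)  # set gpus=1 if a GPU is available
print(output.scores)        # one score per segment
print(output.system_score)  # corpus-level average
```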

But should LLMs be evaluated using different translation quality scores? The differences between NMT and LLMs do, in fact, demand customized evaluation methodologies. New challenge areas have emerged, such as hallucinations, factual accuracy, and appropriateness (including bias and empathy), and these are only linguistic considerations, not accounting for intent or prompt design.

What about comparing models and workflows? Would a standardized scoring methodology be required to build a meaningful benchmark? Can we find hybrid approaches to evaluate and compare NMT and LLM translations effectively? The reality is that, despite fundamental differences in how they work and how they are built, both technologies serve the same purpose: producing translations. For now, leveraging NMT scores for LLM evaluation is acceptable — at least, it is this month.
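To make the GEMBA idea concrete, here is a minimal sketch of a prompt-based direct assessment in that spirit: an LLM is asked to score a translation from 0 to 100. The prompt wording, the openai client and the model name are illustrative assumptions rather than the official GEMBA recipe, and a production setup would add retries and validation of the returned number.

```python
# Sketch of a GEMBA-style direct assessment: an LLM judges a translation 0-100.
# The openai client, model name and prompt wording are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

PROMPT = (
    'Score the following translation from {src_lang} to {tgt_lang} on a '
    'continuous scale from 0 to 100, where 0 means "no meaning preserved" and '
    '100 means "perfect meaning and grammar". Respond with the number only.\n\n'
    '{src_lang} source: "{source}"\n'
    '{tgt_lang} translation: "{target}"\n'
    "Score:"
)

def llm_da_score(source: str, target: str,
                 src_lang: str = "Spanish", tgt_lang: str = "English") -> float:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model; swap in the one you evaluate with
        temperature=0,
        messages=[{"role": "user", "content": PROMPT.format(
            src_lang=src_lang, tgt_lang=tgt_lang, source=source, target=target)}],
    )
    return float(response.choices[0].message.content.strip())

print(llm_da_score("El contrato entra en vigor el 1 de enero de 2025.",
                   "The contract enters in force on 1 January 2025."))
```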

Selecting the best score and metric for you will depend on those requirements you have internally negotiated. I know… I know you came here to know about the scores. No need for your noble mind to suffer the slings and arrows of uncertainty; we shall arm you with a wealth of knowledge in forthcoming articles. In the meantime, now that you have adopted a quality definition and have established what is “good enough” in your organization, let’s build the foundations of an LLM translation evaluation framework.

Practical application

The following step-by-step framework will help you determine the most appropriate evaluation criteria and approach for your specific LLM use case. Whether you are assessing fully LLM-generated content or evaluating AI-enhanced localization workflows, this guide will help you align your translation quality standards with your operational capabilities and business objectives.


Use case

  1. Define your LLM use case

Determine whether you are working with content created entirely from scratch via prompting, localizing original content through an LLM, or editing previously localized content. Ensure that your use case includes target audience and content type considerations to guide the evaluation focus.

Attributes

  2. Describe your negotiated translation quality standards

Identify the most important quality criteria for your content type or audience. Is it semantics, syntax, fluidity, fidelity, or something else?

For example, does your content prioritize semantic accuracy for legal documents, or are fluidity and tone more critical for marketing material?

Score output

  3. Interpret your data

Decide on the scale of scoring (0-1, binary, or granular categories) and how to categorize errors. Consider whether a 0-1 scale indicates a simple pass/fail, or whether it reflects degrees of severity.
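As one possible way to turn categorized errors into a 0-1 score, the sketch below applies a severity-weighted penalty in the spirit of MQM-style scorecards; the categories, weights, normalization and pass threshold are all placeholders to be replaced by whatever you have negotiated internally.

```python
# Illustrative severity-weighted scoring: categorized errors -> 0-1 quality score.
# Weights, categories and the pass threshold are placeholders, not a standard.
SEVERITY_PENALTY = {"minor": 1, "major": 5, "critical": 10}

def quality_score(errors: list[dict], word_count: int) -> float:
    """Normalize weighted error penalties per 1,000 words into a 0-1 score."""
    penalty = sum(SEVERITY_PENALTY[e["severity"]] for e in errors)
    per_thousand = penalty * 1000 / max(word_count, 1)
    return max(0.0, 1.0 - per_thousand / 100)  # 100 penalty points/1k words -> score 0

errors = [
    {"category": "accuracy", "severity": "major"},
    {"category": "style", "severity": "minor"},
]
score = quality_score(errors, word_count=500)
print(f"score: {score:.2f}, pass: {score >= 0.95}")  # the 0.95 threshold is illustrative
```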

Evaluation scope and scale

  4. Determine the evaluation scope

Evaluate your ability to perform an LLM assessment at scale. Do you have the resources to compare automated scores with human assessments, and if so, what scores and quality systems will you deploy? For example, how will you assess the statistical significance of the differences between automated scores and human evaluations?

If you will be incorporating a human/direct assessment step, how many auditors will take part? Will you work with at least two people for an objective and unbiased approach and produce inter-rater scores, or will you simply split the content and average the results?
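If you do run automated scores alongside human assessments, the sketch below (using scipy and scikit-learn, with placeholder numbers rather than real evaluation data) shows one way to check the rank correlation between the two and the inter-rater agreement between a pair of auditors.

```python
# Sketch: correlating automated scores with human judgments, and checking
# agreement between two auditors. All numbers are illustrative placeholders.
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

# Segment-level scores from an automated metric and from human direct assessment.
metric_scores = [0.91, 0.74, 0.83, 0.55, 0.96]
human_scores = [0.95, 0.70, 0.80, 0.40, 0.90]
rho, p_value = spearmanr(metric_scores, human_scores)
print(f"Spearman rho={rho:.2f}, p={p_value:.3f}")

# Pass/fail verdicts from two auditors on the same sample -> inter-rater agreement.
auditor_a = ["pass", "pass", "fail", "pass", "fail"]
auditor_b = ["pass", "fail", "fail", "pass", "fail"]
print(f"Cohen's kappa={cohen_kappa_score(auditor_a, auditor_b):.2f}")
```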

Languages

  5. Assess your language requirements

Consider whether the same framework can be applied to all languages. How well does your evaluation method work for high-resource languages, and what adjustments need to be made for low-resource or zero-resource languages?

Operations

  6. Study the operational feasibility of evaluation at scale

Assess the technical and cost feasibility of running the evaluation framework at scale. What are the hardware requirements (CPU vs. GPU), and how will these impact both the speed of evaluations and the cost per evaluation?
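A back-of-envelope calculation is often enough to frame this discussion; in the sketch below every latency and hourly-rate figure is purely illustrative and should be replaced with your own measurements.

```python
# Back-of-envelope feasibility check: evaluation run time and cost per evaluation.
# All figures are illustrative placeholders; measure your own CPU/GPU latency first.
segments = 250_000                          # segments to evaluate per release
seconds_per_segment = {"cpu": 0.80, "gpu": 0.05}
hourly_rate = {"cpu": 0.40, "gpu": 2.50}    # infrastructure cost per hour (USD)

for hardware, secs in seconds_per_segment.items():
    hours = segments * secs / 3600
    cost = hours * hourly_rate[hardware]
    print(f"{hardware}: {hours:.1f} h, ~${cost:.0f} per run, "
          f"${cost / segments:.5f} per evaluation")
```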

Risk Mitigation

  7. Manage risks

Reduce inconsistencies across multiple evaluations by using more than one score and by securing a team that performs random, blind and objective evaluations. Mitigate human evaluation subjectivity by incorporating more than one auditor into the process and producing inter-rater scores. And, always, keep 20% of your sample data for testing.
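For that last point, here is a minimal sketch of reserving a 20% held-out sample with scikit-learn, using placeholder segment IDs.

```python
# Sketch: keep 20% of the evaluation sample as a held-out test set.
from sklearn.model_selection import train_test_split

segments = [f"seg_{i}" for i in range(1000)]   # placeholder evaluation sample
working_set, held_out = train_test_split(segments, test_size=0.20, random_state=42)
print(len(working_set), len(held_out))  # 800 working segments, 200 kept for testing
```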

Conclusion

Traditional quality metrics and evaluation strategies are no longer sufficient in the age of AI. Adapting and adopting combined evaluation frameworks driven by automated metrics and human evaluation can lead us to effectively manage, monitor and govern quality at scale with greater confidence and consistency. While LLM translation quality evaluation is still a work in progress and its scores are yet to be settled, it is clear that quality frameworks must be broad, allowing for continuous iteration and evolution within the quality continuum.

The question is not whether we should measure quality, but how we will measure it, and that answer will soon be revealed. In our next article, we will hold up the mirror to today’s scores and ask which is the fairest of them all when it comes to assessing LLM-generated translations.
