Correlating Quality Evaluation Metrics
- Marta Nieto Cayuela
- Jun 5
- 2 min read
"Dear team, I've read, and enjoyed!, Marta Nieto's article on how to evaluate LLM translation quality ('Mirror, mirror, which is...'), and I have a question related to section 'Correlating metrics for scalable quality'.
Could you please share some materials on how to correlate metrics using Pearson, Spearman or Kendall methods? I haven't been able to find any related to translation quality assessment. Any example or use case would be more than welcome. Thanks a lot in advance for your help!"
—JD
Thanks so much for your question! We are glad to hear our last Quality article was both useful and enjoyable. We are planning to explore this topic further, as we believe it would make a great third article in the series.
Let’s imagine you have completed an assessment and collected both automated scores and human evaluation scores. You may want to validate those automated scores so you can confidently monitor quality at scale, especially when human reviewers can only cover a representative sample on a recurring basis.
To correlate human scores with automated metrics (BLEU, COMET, etc.) using Pearson, Spearman, or Kendall, you will need a dataset where both types of scores are available for the same set of translations: first collect the human evaluations, then compute the automated metric scores for the same segments.
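As a rough sketch of that pairing step, the snippet below uses the open-source sacrebleu library to compute a segment-level BLEU score for each translation, so every segment ends up with one human score and one automated score. The segments, references, and human ratings are invented purely for illustration, and any other segment-level metric (COMET, chrF, etc.) could be slotted in the same way.

from sacrebleu.metrics import BLEU

# Hypothetical example data: machine translations, their reference
# translations, and the human ratings collected for the same segments.
translations = ["The cat sits on the mat.", "He go to school yesterday."]
references = ["The cat is sitting on the mat.", "He went to school yesterday."]
human_scores = [4, 2]

# effective_order=True is the usual setting for sentence-level BLEU,
# since short segments rarely contain higher-order n-gram matches.
bleu = BLEU(effective_order=True)
metric_scores = [
    bleu.sentence_score(hyp, [ref]).score  # one automated score per segment
    for hyp, ref in zip(translations, references)
]

Whichever metric you choose, the key point is that both lists stay aligned segment by segment.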
If there is a Data team in your organization, this is their moment to support Localization. If you are running this yourself, once you have both sets, you can use Python’s open-source SciPy library to calculate the correlation coefficients. Generally, the higher the correlation (values closer to 1), the better the metric aligns with human judgement. This is a well-studied area, reported annually in the WMT (Conference on Machine Translation) metrics shared tasks.
Here’s a quick example:
from scipy.stats import kendalltau

# Human scores (e.g. 1-5 ratings) and automated metric scores
# for the same five translated segments
human_scores = [4, 3, 5, 2, 4]
metric_scores = [0.81, 0.76, 0.90, 0.68, 0.79]

# kendalltau returns the coefficient and a p-value
tau, p_value = kendalltau(human_scores, metric_scores)
print(f"Kendall's tau: {tau}")

Output: Kendall's tau: 0.9486832980505138
A few notes:
Pearson works best for linear relationships. The coefficient itself is not affected by the scale of each variable (a 1-5 human scale and a 0-1 metric scale can be correlated directly), but it assumes the relationship is roughly linear and it is sensitive to outliers.
Spearman and Kendall are rank-based and don’t require normalization. They work well for ordinal or non-normally distributed data. If your dataset is small or has many tied ranks, Kendall is usually the better choice (see the sketch after these notes).
These methods are not designed for binary evaluations (e.g. pass/fail or yes/no). For those cases, different statistical approaches are recommended.
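To make those notes concrete, here is a small sketch that reuses the toy scores from the quick example and computes all three coefficients with SciPy. For the binary case it shows one possible option, the point-biserial correlation (scipy.stats.pointbiserialr), with invented pass/fail judgements purely for illustration.

from scipy.stats import pearsonr, spearmanr, kendalltau, pointbiserialr

# Same toy data as the quick example above.
human_scores = [4, 3, 5, 2, 4]
metric_scores = [0.81, 0.76, 0.90, 0.68, 0.79]

# The three coefficients side by side; each function also returns a p-value.
for name, func in [("Pearson", pearsonr), ("Spearman", spearmanr), ("Kendall", kendalltau)]:
    coefficient, p_value = func(human_scores, metric_scores)
    print(f"{name}: {coefficient:.3f} (p-value: {p_value:.3f})")

# For binary evaluations, one option is the point-biserial correlation
# between the binary human judgement and the continuous metric score.
# These pass/fail values are hypothetical.
pass_fail = [1, 1, 1, 0, 1]  # 1 = pass, 0 = fail
r_pb, p_pb = pointbiserialr(pass_fail, metric_scores)
print(f"Point-biserial: {r_pb:.3f} (p-value: {p_pb:.3f})")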
As for sample size, there is no strict rule, but for a meaningful correlation a good starting point is at least 100-200 human-scored segments, which helps reveal real trends rather than noise. If you are just after directional insights, Kendall's tau can work with smaller samples (as low as 50). For anything that needs to be statistically robust, aim for 300-500 segments.
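One simple way to sanity-check whether your sample is large enough (not covered in the original article, but a common statistical trick) is to bootstrap the correlation: resample your segment pairs with replacement many times, recompute Kendall's tau each time, and look at how wide the resulting interval is. A very wide interval suggests the estimate is still mostly noise. The ten paired scores below are invented for illustration; in practice you would run this on your full human-scored sample.

import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(42)

# Hypothetical paired scores for ten segments; replace with your real data.
human_scores = np.array([4, 3, 5, 2, 4, 3, 5, 4, 2, 3])
metric_scores = np.array([0.81, 0.76, 0.90, 0.68, 0.79, 0.72, 0.88, 0.80, 0.65, 0.74])

# Bootstrap: resample segment pairs with replacement and recompute tau.
n = len(human_scores)
taus = []
for _ in range(2000):
    idx = rng.integers(0, n, size=n)
    tau, _ = kendalltau(human_scores[idx], metric_scores[idx])
    taus.append(tau)

# nanpercentile guards against the rare resample where all scores are tied.
low, high = np.nanpercentile(taus, [2.5, 97.5])
print(f"Bootstrap 95% interval for Kendall's tau: [{low:.2f}, {high:.2f}]")

With only ten segments the interval will be very wide, which is exactly the point: the more human-scored segments you add, the tighter and more trustworthy the estimate becomes.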
We will cover all of this, and more, in an upcoming article in the Quality series.
About the Series:
As part of our "Ask the Think Tank" series, members answer reader's questions to help foster knowledge sharing and become a resource when you don't know where to turn. To submit your own question, click here.