A pilot study of multi-method evaluation of machine translation in Macedonian
Journal
Computer Science and Information Systems
Date Issued
2026
Author(s)
Kuzmanova, Jana
Saints Cyril and Methodius University of Skopje
DOI
10.2298/csis251020021k
Abstract
This pilot study offers a linguistic evaluation of six machine translation
systems (GPT-4o, GPT-5, Gemini 2.5 Flash, Google Translate, Microsoft Translator, and NLLB-600M) applied to the translation of a short excerpt of Orwell’s “1984”
into Macedonian. The analysis consisted of three interconnected experiments: manual annotation of translation errors and comparison with human output, evaluation
using eight popular MT metrics, and sentence-level similarity analysis via cosine
similarity, Jaccard similarity, and Levenshtein distance. Manual annotation revealed
that stylistic errors (48.47%) and linguistic errors (34.54%) were the most common.
The LLMs, particularly GPT-5, outperformed the other systems, while NLLB-600M performed poorly, often introducing incomprehensible sentences or non-existent words.
Metric-based evaluation showed that lexical metrics sometimes penalized fluent
and accurate translations that deviated from the reference. Sentence similarity analysis confirmed that accurate translations were more consistent, while wrong–wrong
sentence pairs were more divergent, especially in Levenshtein scores. The findings
underscore the importance of combining manual and metric-based evaluation to
fully understand MT quality, particularly in low-resource settings.
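The three sentence-level similarity measures named in the abstract can be illustrated with a minimal sketch. The abstract does not say how the study computed them, so everything here is an assumption made for illustration: whitespace tokenization, lowercasing, bag-of-words count vectors for cosine similarity (the study may well have used sentence embeddings instead), token sets for Jaccard similarity, and character-level edits for Levenshtein distance. The function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(sentence: str) -> list[str]:
    # Assumption: lowercase and split on whitespace; a real pipeline
    # would likely use a proper tokenizer for Macedonian.
    return sentence.lower().split()


def cosine_similarity(a: str, b: str) -> float:
    # Cosine of the angle between bag-of-words count vectors.
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for empty input
    return dot / (norm_a * norm_b)


def jaccard_similarity(a: str, b: str) -> float:
    # Size of the token-set intersection divided by the size of the union.
    sa, sb = set(tokenize(a)), set(tokenize(b))
    union = sa | sb
    if not union:
        return 0.0  # convention chosen here for empty input
    return len(sa & sb) / len(union)


def levenshtein_distance(a: str, b: str) -> int:
    # Minimum number of character insertions, deletions, and
    # substitutions needed to turn a into b (row-by-row dynamic programming).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]
```

Under these assumptions, two system outputs for the same source sentence would be passed pairwise to each function. Cosine and Jaccard return values in [0, 1], where higher means more similar, while Levenshtein returns an edit count, where higher means more divergent.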
