MuLER: Detailed and Scalable Reference-based Evaluation

Taelin Karidi, Leshem Choshen, Gal Patel, Omri Abend


Abstract
We propose a novel methodology (namely, MuLER) that transforms any reference-based evaluation metric for text generation, such as machine translation (MT), into a fine-grained analysis tool. Given a system and a metric, MuLER quantifies how much the chosen metric penalizes specific error types (e.g., errors in translating names of locations). MuLER thus enables a detailed error analysis, which can guide targeted improvement efforts for specific phenomena. We perform experiments in both synthetic and naturalistic settings to support MuLER’s validity and showcase its usability in MT evaluation as well as in other tasks, such as summarization. Analyzing all submissions to WMT in 2014–2020, we find consistent trends. For example, nouns and verbs are among the most frequent POS tags, yet they are among the hardest to translate. Performance on most POS tags improves with overall system performance, but a few do not correlate in this way (and their identity changes from language to language). Preliminary experiments with summarization reveal similar trends.
Anthology ID:
2023.conll-1.29
Volume:
Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL)
Month:
December
Year:
2023
Address:
Singapore
Editors:
Jing Jiang, David Reitter, Shumin Deng
Venue:
CoNLL
Publisher:
Association for Computational Linguistics
Pages:
436–455
URL:
https://aclanthology.org/2023.conll-1.29
DOI:
10.18653/v1/2023.conll-1.29
Cite (ACL):
Taelin Karidi, Leshem Choshen, Gal Patel, and Omri Abend. 2023. MuLER: Detailed and Scalable Reference-based Evaluation. In Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL), pages 436–455, Singapore. Association for Computational Linguistics.
Cite (Informal):
MuLER: Detailed and Scalable Reference-based Evaluation (Karidi et al., CoNLL 2023)
PDF:
https://aclanthology.org/2023.conll-1.29.pdf