Separating form and meaning: Using self-consistency to quantify task understanding across multiple senses

Xenia Ohmer, Elia Bruni, Dieuwke Hupkes


Abstract
At the staggering pace with which the capabilities of large language models (LLMs) are increasing, creating future-proof evaluation sets to assess their understanding becomes more and more challenging. In this paper, we propose a novel paradigm for evaluating LLMs which leverages the idea that correct world understanding should be consistent across different (Fregean) senses of the same meaning. Accordingly, we measure understanding not in terms of correctness but by evaluating consistency across multiple senses that are generated by the model itself. We showcase our approach by instantiating a test where the different senses are different languages, hence using multilingual self-consistency as a litmus test for the model’s understanding and simultaneously addressing the important topic of multilingualism. Taking one of the latest versions of ChatGPT as our object of study, we evaluate multilingual consistency for two different tasks across three different languages. We show that its multilingual consistency is still lacking, and that its task and world understanding are thus not language-independent. As our approach does not require any static evaluation corpora in languages other than English, it can easily and cheaply be extended to different languages and tasks and could become an integral part of future benchmarking efforts.
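The idea of scoring agreement rather than correctness can be illustrated with a minimal sketch. It is not the authors' implementation, and the abstract does not specify their exact metric. The names `consistency`, `ask`, `translate`, and `normalize` are hypothetical: `ask` stands in for a call to the model under evaluation, `translate` for a step that renders the question in another language (in the paper's setup the model itself produces the translation), and `normalize` assumes answers come from a closed, language-independent label set such as option letters.

```python
from typing import Callable, Sequence


def consistency(
    questions: Sequence[str],
    ask: Callable[[str], str],          # hypothetical: queries the model
    translate: Callable[[str], str],    # hypothetical: produces another "sense"
    normalize: Callable[[str], str] = lambda a: a.strip().lower(),
) -> float:
    """Fraction of questions answered identically in both senses.

    No gold labels are used: the score measures agreement between the
    model's own answers, not their correctness.
    """
    if not questions:
        raise ValueError("need at least one question")
    agree = 0
    for question in questions:
        answer_source = ask(question)
        answer_target = ask(translate(question))
        if normalize(answer_source) == normalize(answer_target):
            agree += 1
    return agree / len(questions)


# Toy stand-ins so the sketch runs without any model access.
def toy_translate(question: str) -> str:
    return "[de] " + question


def toy_ask(prompt: str) -> str:
    # Answers "A" everywhere except for one translated prompt.
    return "B" if prompt == "[de] q2" else "A"


score = consistency(["q1", "q2", "q3", "q4"], toy_ask, toy_translate)
print(score)
```

In this toy run the stub model changes its answer for one of four questions once the question is translated, so the score falls below 1 even though no reference answers were consulted.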
Anthology ID:
2023.gem-1.22
Volume:
Proceedings of the Third Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)
Month:
December
Year:
2023
Address:
Singapore
Editors:
Sebastian Gehrmann, Alex Wang, João Sedoc, Elizabeth Clark, Kaustubh Dhole, Khyathi Raghavi Chandu, Enrico Santus, Hooman Sedghamiz
Venues:
GEM | WS
Publisher:
Association for Computational Linguistics
Pages:
258–276
URL:
https://aclanthology.org/2023.gem-1.22
Cite (ACL):
Xenia Ohmer, Elia Bruni, and Dieuwke Hupkes. 2023. Separating form and meaning: Using self-consistency to quantify task understanding across multiple senses. In Proceedings of the Third Workshop on Natural Language Generation, Evaluation, and Metrics (GEM), pages 258–276, Singapore. Association for Computational Linguistics.
Cite (Informal):
Separating form and meaning: Using self-consistency to quantify task understanding across multiple senses (Ohmer et al., GEM-WS 2023)
PDF:
https://aclanthology.org/2023.gem-1.22.pdf