Human Ratings Do Not Reflect Downstream Utility: A Study of Free-Text Explanations for Model Predictions

Jenny Kunz, Martin Jirenius, Oskar Holmström, Marco Kuhlmann
Abstract
Models able to generate free-text rationales that explain their output have been proposed as an important step towards interpretable NLP for “reasoning” tasks such as natural language inference and commonsense question answering. However, the relative merits of different architectures and types of rationales are not well understood and hard to measure. In this paper, we contribute two insights to this line of research: First, we find that models trained on gold explanations learn to rely on them but, in the case of the more challenging question answering dataset we use, fail when given generated explanations at test time. However, additional fine-tuning on generated explanations teaches the model to distinguish between reliable and unreliable information in explanations. Second, we compare explanations by a generation-only model to those generated by a self-rationalizing model and find that, while the former score higher in terms of validity, factual correctness, and similarity to gold explanations, they are not more useful for downstream classification. We observe that the self-rationalizing model is prone to hallucination, which is penalized by most metrics but may add useful context for the classification step.
Anthology ID:
2022.blackboxnlp-1.14
Volume:
Proceedings of the Fifth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates (Hybrid)
Editors:
Jasmijn Bastings, Yonatan Belinkov, Yanai Elazar, Dieuwke Hupkes, Naomi Saphra, Sarah Wiegreffe
Venue:
BlackboxNLP
Publisher:
Association for Computational Linguistics
Pages:
164–177
URL:
https://aclanthology.org/2022.blackboxnlp-1.14
DOI:
10.18653/v1/2022.blackboxnlp-1.14
Bibkey:
Cite (ACL):
Jenny Kunz, Martin Jirenius, Oskar Holmström, and Marco Kuhlmann. 2022. Human Ratings Do Not Reflect Downstream Utility: A Study of Free-Text Explanations for Model Predictions. In Proceedings of the Fifth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 164–177, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
Cite (Informal):
Human Ratings Do Not Reflect Downstream Utility: A Study of Free-Text Explanations for Model Predictions (Kunz et al., BlackboxNLP 2022)
PDF:
https://aclanthology.org/2022.blackboxnlp-1.14.pdf
Video:
https://aclanthology.org/2022.blackboxnlp-1.14.mp4