Transformer-specific Interpretability

Hosein Mohebbi, Jaap Jumelet, Michael Hanna, Afra Alishahi, Willem Zuidema


Abstract
Transformers have emerged as dominant players in various scientific fields, especially NLP. However, their inner workings, like those of many other neural networks, remain opaque. Despite the widespread use of model-agnostic interpretability techniques, including gradient-based and occlusion-based methods, their shortcomings are becoming increasingly apparent when applied to Transformers, making interpretability a more demanding field today. In this tutorial, we present Transformer-specific interpretability methods, a trending new approach that makes use of specific features of the Transformer architecture and is deemed more promising for understanding Transformer-based models. We start by discussing the potential pitfalls and misleading results that model-agnostic approaches may produce when interpreting Transformers. Next, we discuss Transformer-specific methods, including those designed to quantify context-mixing interactions among all input pairs (the fundamental property of the Transformer architecture) and those that combine causal methods with low-level Transformer analysis to identify the particular subnetworks within a model that are responsible for specific tasks. By the end of the tutorial, we hope participants will understand the advantages (as well as the current limitations) of Transformer-specific interpretability methods, and how these can be applied to their own research.
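To make the idea of quantifying context mixing concrete, here is a minimal sketch (not taken from the tutorial materials) of attention rollout (Abnar & Zuidema, 2020), one Transformer-specific method of the kind the tutorial covers. The model name and example sentence are illustrative assumptions; the half-weighting of the residual connection follows the original paper's simplification.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative model choice; any encoder that returns attentions would do.
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

inputs = tokenizer("Transformers mix context across all token pairs.",
                   return_tensors="pt")
with torch.no_grad():
    # Tuple of per-layer attention tensors, each (batch, heads, seq, seq).
    attentions = model(**inputs).attentions

seq_len = attentions[0].shape[-1]
rollout = torch.eye(seq_len)
for layer_attn in attentions:
    avg = layer_attn[0].mean(dim=0)             # average over heads
    avg = 0.5 * avg + 0.5 * torch.eye(seq_len)  # account for the residual stream
    avg = avg / avg.sum(dim=-1, keepdim=True)   # re-normalize rows
    rollout = avg @ rollout                     # compose mixing across layers

# rollout[i, j] estimates how much input token j contributes to
# token i's final-layer representation.
```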
Anthology ID:
2024.eacl-tutorials.4
Volume:
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Tutorial Abstracts
Month:
March
Year:
2024
Address:
St. Julian’s, Malta
Editors:
Mohsen Mesgar, Sharid Loáiciga
Venue:
EACL
Publisher:
Association for Computational Linguistics
Pages:
21–26
URL:
https://aclanthology.org/2024.eacl-tutorials.4
Cite (ACL):
Hosein Mohebbi, Jaap Jumelet, Michael Hanna, Afra Alishahi, and Willem Zuidema. 2024. Transformer-specific Interpretability. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Tutorial Abstracts, pages 21–26, St. Julian’s, Malta. Association for Computational Linguistics.
Cite (Informal):
Transformer-specific Interpretability (Mohebbi et al., EACL 2024)
PDF:
https://aclanthology.org/2024.eacl-tutorials.4.pdf