The Hidden Folk: Linguistic Properties Encoded in Multilingual Contextual Character Representations

Manex Agirrezabal, Sidsel Boldsen, Nora Hollenstein


Abstract
To gain a better understanding of the linguistic information encoded in character-based language models, we probe the multilingual contextual CANINE model. We design a range of phonetic probing tasks in six Nordic languages, including Faroese as an additional zero-shot instance. We observe that some phonetic information is indeed encoded in the character representations, as consonants and vowels can be well distinguished using a linear classifier. Furthermore, results on the consonant/vowel distinction seem to be worse for Danish and Norwegian than for the other languages. The information encoded in these representations can also be learned in a zero-shot scenario, as Faroese shows reasonably good performance on the same vowel/consonant distinction task.
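The probing setup mentioned in the abstract (a linear classifier separating vowels from consonants on top of character representations) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' code: `toy_embedding` is a hypothetical stand-in that generates synthetic vectors instead of reading hidden states from CANINE, the `VOWELS` letter set is hand-picked, the example words are arbitrary, and the probe is a plain logistic regression trained with stochastic gradient descent.

```python
import math
import random

# Assumption: a hand-picked set of vowel letters, not taken from the paper.
VOWELS = set("aeiouyáéíóúýæøåäö")


def toy_embedding(ch, rng, dim=8):
    """Hypothetical stand-in for a contextual character vector.

    A real probe would read hidden states from the character model.
    Here the class-dependent mean makes the data separable by design.
    """
    mean = 1.0 if ch in VOWELS else -1.0
    return [rng.gauss(mean, 0.5) for _ in range(dim)]


def build_dataset(words, rng):
    """Return (vector, label) pairs, with label 1 for vowels, 0 otherwise."""
    data = []
    for word in words:
        for ch in word:
            if ch.isalpha():
                data.append((toy_embedding(ch, rng), 1 if ch in VOWELS else 0))
    return data


def score(w, b, x):
    """Linear score w.x + b; the probe has no hidden layers."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def train_linear_probe(data, epochs=50, lr=0.1):
    """Fit logistic regression with per-example gradient steps."""
    w = [0.0] * len(data[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            grad = sigmoid(score(w, b, x)) - y
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b


def accuracy(w, b, data):
    correct = sum((score(w, b, x) > 0) == (y == 1) for x, y in data)
    return correct / len(data)


rng = random.Random(0)
# Arbitrary example words; the held-out list mimics a zero-shot split.
train = build_dataset(["dansk", "norsk", "svenska", "íslenska", "suomi"], rng)
held_out = build_dataset(["huldufólk", "føroyskt"], rng)
w, b = train_linear_probe(train)
probe_accuracy = accuracy(w, b, held_out)
print(f"held-out probe accuracy: {probe_accuracy:.2f}")
```

Because the synthetic vectors are separable by construction, the accuracy printed here says nothing about CANINE itself. The sketch only shows the shape of the experiment: freeze the representations, fit a linear model on some languages, and evaluate on characters from a language that was held out.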
Anthology ID:
2023.cawl-1.2
Volume:
Proceedings of the Workshop on Computation and Written Language (CAWL 2023)
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Kyle Gorman, Richard Sproat, Brian Roark
Venue:
CAWL
Publisher:
Association for Computational Linguistics
Pages:
6–13
URL:
https://aclanthology.org/2023.cawl-1.2
DOI:
10.18653/v1/2023.cawl-1.2
Cite (ACL):
Manex Agirrezabal, Sidsel Boldsen, and Nora Hollenstein. 2023. The Hidden Folk: Linguistic Properties Encoded in Multilingual Contextual Character Representations. In Proceedings of the Workshop on Computation and Written Language (CAWL 2023), pages 6–13, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
The Hidden Folk: Linguistic Properties Encoded in Multilingual Contextual Character Representations (Agirrezabal et al., CAWL 2023)
PDF:
https://aclanthology.org/2023.cawl-1.2.pdf