Entropy-guided Vocabulary Augmentation of Multilingual Language Models for Low-resource Tasks

Arijit Nag; Bidisha Samanta; Animesh Mukherjee; Niloy Ganguly; Soumen Chakrabarti

doi:10.18653/v1/2023.findings-acl.548

Abstract

Multilingual language models (MLLMs) like mBERT promise to extend the benefits of NLP research to low-resource languages (LRLs). However, LRL words are under-represented in the wordpiece/subword vocabularies of MLLMs. As a result, many LRL words are replaced by UNK or assembled from morphologically unrelated wordpieces, which lowers task accuracy. (Pre)-training MLLMs after including LRL documents is resource-intensive in terms of both human inputs and computational resources. In response, we propose EVALM (entropy-based vocabulary augmented language model), which uses a new task-cognizant measurement to detect the most vulnerable LRL words, that is, those whose wordpiece segmentations are undesirable. EVALM then provides reasonable initializations of their embeddings, followed by limited fine-tuning using the small LRL task corpus. Our experiments show significant performance improvements, and also some surprising limits to such vocabulary augmentation strategies, in various classification tasks for multiple diverse LRLs as well as code-mixed texts. We will release the code and data to enable further research.
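
To make the idea of a task-cognizant, entropy-based measurement more concrete, here is a minimal sketch in Python. It is not the scoring rule from the paper, which the abstract does not spell out. It assumes that a word counts as vulnerable when its wordpieces are, on average, less informative about the task label than the whole word is, and that a new word's embedding is initialized as the mean of its wordpiece embeddings. The names `label_entropy`, `vulnerable_words`, `init_embedding`, and the `margin` threshold are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def label_entropy(counts):
    """Shannon entropy (in bits) of a label-count distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c)


def vulnerable_words(corpus, segment, margin=0.5):
    """Rank words whose wordpieces say less about the label than the word does.

    corpus:  list of (tokens, label) pairs from the task data.
    segment: function mapping a word to its list of wordpieces.
    The scoring rule and `margin` are illustrative assumptions.
    """
    word_labels = defaultdict(Counter)
    piece_labels = defaultdict(Counter)
    for tokens, label in corpus:
        for word in tokens:
            word_labels[word][label] += 1
            for piece in segment(word):
                piece_labels[piece][label] += 1

    scored = []
    for word, counts in word_labels.items():
        pieces = segment(word)
        if len(pieces) < 2:
            continue  # the word is already a single vocabulary item
        mean_piece_entropy = sum(
            label_entropy(piece_labels[p]) for p in pieces
        ) / len(pieces)
        gap = mean_piece_entropy - label_entropy(counts)
        if gap >= margin:
            scored.append((gap, word))
    # Largest entropy gap first.
    return [word for _, word in sorted(scored, reverse=True)]


def init_embedding(pieces, embeddings):
    """Assumed initialization: mean of the word's wordpiece vectors."""
    dim = len(embeddings[pieces[0]])
    return [sum(embeddings[p][i] for p in pieces) / len(pieces) for i in range(dim)]
```

In this sketch, a word that always appears with one label but shares a wordpiece with words of other labels gets a large entropy gap, so it is a candidate for being added to the vocabulary as a whole token before fine-tuning.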

Anthology ID: 2023.findings-acl.548
Volume: Findings of the Association for Computational Linguistics: ACL 2023
Month: July
Year: 2023
Address: Toronto, Canada
Editors: Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 8619–8629
URL: https://aclanthology.org/2023.findings-acl.548
DOI: 10.18653/v1/2023.findings-acl.548
Cite (ACL): Arijit Nag, Bidisha Samanta, Animesh Mukherjee, Niloy Ganguly, and Soumen Chakrabarti. 2023. Entropy-guided Vocabulary Augmentation of Multilingual Language Models for Low-resource Tasks. In Findings of the Association for Computational Linguistics: ACL 2023, pages 8619–8629, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal): Entropy-guided Vocabulary Augmentation of Multilingual Language Models for Low-resource Tasks (Nag et al., Findings 2023)
PDF: https://aclanthology.org/2023.findings-acl.548.pdf
