Bridging the Gap between Subword and Character Segmentation in Pretrained Language Models

Shun Kiyono, Sho Takase, Shengzhe Li, Toshinori Sato


Abstract
Pretrained language models require consistent segmentation (e.g., subword- or character-level segmentation) between pretraining and finetuning. In NLP, many tasks are better modeled with subword-level segmentation than with character-level segmentation. However, because of their format, several tasks require character-level segmentation. Thus, to tackle both types of NLP tasks, language models must be pretrained independently for subword- and character-level segmentation, which is an inefficient and costly procedure. Instead, this paper proposes a method for training a language model with unified segmentation, so that the trained model can be finetuned on both subword- and character-level segmentation. The principle of the method is to apply the subword regularization technique to generate a mixture of subword- and character-level segmentation. Through experiments on BERT models, we demonstrate that our method can halve the computational cost of pretraining.
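The core idea above, mixing character-level and sampled subword-level segmentation during pretraining, can be illustrated with a minimal sketch. The snippet below is not the authors' released code; it assumes a trained SentencePiece unigram model (the path "spm.model" and the mixing probability char_prob are hypothetical) and uses SentencePiece's sampling mode to realize subword regularization.

import random
import sentencepiece as spm

# Hypothetical path to a trained SentencePiece unigram model (not from the paper).
sp = spm.SentencePieceProcessor(model_file="spm.model")

def mixed_segment(text: str, char_prob: float = 0.5, alpha: float = 0.1):
    """Return either a character-level or a sampled subword segmentation.

    With probability `char_prob`, fall back to character-level pieces;
    otherwise draw one segmentation via subword regularization, i.e.,
    sampling from the unigram LM lattice with smoothing parameter `alpha`.
    """
    if random.random() < char_prob:
        # Character-level segmentation (whitespace mapped to the meta symbol).
        return list(text.replace(" ", "\u2581"))
    # Subword regularization: sample a segmentation instead of taking the
    # single best one (nbest_size=-1 samples over all candidates).
    return sp.encode(text, out_type=str, enable_sampling=True,
                     nbest_size=-1, alpha=alpha)

print(mixed_segment("pretrained language models"))

In this sketch, varying the segmentation seen during pretraining is what lets a single model later be finetuned with either subword- or character-level input; the exact mixing schedule used in the paper is described in the full text.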
Anthology ID:
2023.ranlp-1.62
Volume:
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing
Month:
September
Year:
2023
Address:
Varna, Bulgaria
Editors:
Ruslan Mitkov, Galia Angelova
Venue:
RANLP
Publisher:
INCOMA Ltd., Shoumen, Bulgaria
Pages:
568–577
URL:
https://aclanthology.org/2023.ranlp-1.62
Cite (ACL):
Shun Kiyono, Sho Takase, Shengzhe Li, and Toshinori Sato. 2023. Bridging the Gap between Subword and Character Segmentation in Pretrained Language Models. In Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing, pages 568–577, Varna, Bulgaria. INCOMA Ltd., Shoumen, Bulgaria.
Cite (Informal):
Bridging the Gap between Subword and Character Segmentation in Pretrained Language Models (Kiyono et al., RANLP 2023)
PDF:
https://aclanthology.org/2023.ranlp-1.62.pdf