Where are we Still Split on Tokenization?

Rob van der Goot


Abstract
Many Natural Language Processing (NLP) tasks are labeled on the token level; for these tasks, the first step is to identify the tokens (tokenization). Because this step is often considered to be a solved problem, gold tokenization is commonly assumed. In this paper, we propose an efficient method for tokenization with subword-based language models, and reflect on the status of performance on the tokenization task by evaluating on 122 languages in 20 different scripts. We show that our proposed model performs on par with the state-of-the-art, and that tokenization performance is mainly dependent on the amount and consistency of annotated data. We conclude that, besides inconsistencies in the data and exceptional cases, the task can be considered solved for Latin-script languages in in-dataset settings (>99.5 F1). However, performance is 0.75 F1 points lower on average for datasets in other scripts, and performance deteriorates in cross-dataset setups.
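The abstract reports tokenization quality as F1 but does not spell out the scoring. As a rough, hedged illustration only (the function names and the scoring below are assumptions for exposition, not the paper's actual evaluation script), tokenization F1 can be computed as span-level precision/recall over the character offsets of predicted versus gold tokens:

# Hedged sketch: span-level tokenization F1 over character offsets.
# Assumes both tokenizations cover the same underlying character sequence;
# this is illustrative and not taken from the paper.

def token_spans(tokens):
    """Map a token sequence to (start, end) character offsets over the
    whitespace-free concatenation of the tokens."""
    spans, pos = [], 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok)
    return set(spans)

def tokenization_f1(gold_tokens, pred_tokens):
    """F1 over exactly matching token spans."""
    gold, pred = token_spans(gold_tokens), token_spans(pred_tokens)
    if not gold or not pred:
        return 0.0
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)

# Example: predicting one token where the gold standard splits on hyphens.
gold = ["state", "-", "of", "-", "the", "-", "art"]
pred = ["state-of-the-art"]
print(tokenization_f1(gold, pred))  # 0.0: no predicted span matches a gold span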
Anthology ID:
2024.findings-eacl.9
Volume:
Findings of the Association for Computational Linguistics: EACL 2024
Month:
March
Year:
2024
Address:
St. Julian’s, Malta
Editors:
Yvette Graham, Matthew Purver
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
118–137
URL:
https://aclanthology.org/2024.findings-eacl.9
Cite (ACL):
Rob van der Goot. 2024. Where are we Still Split on Tokenization?. In Findings of the Association for Computational Linguistics: EACL 2024, pages 118–137, St. Julian’s, Malta. Association for Computational Linguistics.
Cite (Informal):
Where are we Still Split on Tokenization? (van der Goot, Findings 2024)
PDF:
https://aclanthology.org/2024.findings-eacl.9.pdf
Software:
 2024.findings-eacl.9.software.tgz