Analyzing Cognitive Plausibility of Subword Tokenization

Lisa Beinborn, Yuval Pinter


Abstract
Subword tokenization has become the de facto standard for tokenization, although comparative evaluations of tokenizer quality across languages are scarce. Existing evaluation studies focus on the effect of a tokenization algorithm on performance in downstream tasks, or on engineering criteria such as the compression rate. We present a new evaluation paradigm that focuses on the cognitive plausibility of subword tokenization. We analyze the correlation of the tokenizer output with the reading time and accuracy of human responses on a lexical decision task. We compare three tokenization algorithms across several languages and vocabulary sizes. Our results indicate that the Unigram algorithm yields less cognitively plausible tokenization behavior and worse coverage of derivational morphemes, in contrast with prior work.
Anthology ID:
2023.emnlp-main.272
Volume:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
4478–4486
URL:
https://aclanthology.org/2023.emnlp-main.272
DOI:
10.18653/v1/2023.emnlp-main.272
Cite (ACL):
Lisa Beinborn and Yuval Pinter. 2023. Analyzing Cognitive Plausibility of Subword Tokenization. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4478–4486, Singapore. Association for Computational Linguistics.
Cite (Informal):
Analyzing Cognitive Plausibility of Subword Tokenization (Beinborn & Pinter, EMNLP 2023)
PDF:
https://aclanthology.org/2023.emnlp-main.272.pdf
Video:
https://aclanthology.org/2023.emnlp-main.272.mp4