An Evaluation of Subword Segmentation Strategies for Neural Machine Translation of Morphologically Rich Languages

Aquia Richburg, Ramy Eskander, Smaranda Muresan, Marine Carpuat


Abstract
Byte-Pair Encoding (BPE) (Sennrich et al., 2016) has become a standard pre-processing step when building neural machine translation systems. However, it is not clear whether this is an optimal strategy in all settings. We conduct a controlled comparison of subword segmentation strategies for translating two low-resource morphologically rich languages (Swahili and Turkish) into English. We show that segmentations based on a unigram language model (Kudo, 2018) yield comparable BLEU and better recall for translating rare source words than BPE.
Anthology ID:
2020.winlp-1.40
Volume:
Proceedings of the Fourth Widening Natural Language Processing Workshop
Month:
July
Year:
2020
Address:
Seattle, USA
Editors:
Rossana Cunha, Samira Shaikh, Erika Varis, Ryan Georgi, Alicia Tsai, Antonios Anastasopoulos, Khyathi Raghavi Chandu
Venue:
WiNLP
Publisher:
Association for Computational Linguistics
Pages:
151–155
URL:
https://aclanthology.org/2020.winlp-1.40
DOI:
10.18653/v1/2020.winlp-1.40
Cite (ACL):
Aquia Richburg, Ramy Eskander, Smaranda Muresan, and Marine Carpuat. 2020. An Evaluation of Subword Segmentation Strategies for Neural Machine Translation of Morphologically Rich Languages. In Proceedings of the Fourth Widening Natural Language Processing Workshop, pages 151–155, Seattle, USA. Association for Computational Linguistics.
Cite (Informal):
An Evaluation of Subword Segmentation Strategies for Neural Machine Translation of Morphologically Rich Languages (Richburg et al., WiNLP 2020)
Video:
http://slideslive.com/38929580