SOTASTREAM: A Streaming Approach to Machine Translation Training

Matt Post, Thamme Gowda, Roman Grundkiewicz, Huda Khayrallah, Rohit Jain, Marcin Junczys-Dowmunt


Abstract
Many machine translation toolkits make use of a data preparation step wherein raw data is transformed into a tensor format that can be used directly by the trainer. This preparation step is increasingly at odds with modern research and development practices because this process produces a static, unchangeable version of the training data, making common training-time needs difficult (e.g., subword sampling), time-consuming (preprocessing with large data can take days), expensive (e.g., disk space), and cumbersome (managing experiment combinatorics). We propose an alternative approach that separates the generation of data from the consumption of that data. In this approach, there is no separate pre-processing step; data generation produces an infinite stream of permutations of the raw training data, which the trainer tensorizes and batches as it is consumed. Additionally, this data stream can be manipulated by a set of user-definable operators that provide on-the-fly modifications, such as data normalization, augmentation or filtering. We release an open-source toolkit, SOTASTREAM, that implements this approach: https://github.com/marian-nmt/sotastream. We show that it cuts training time, adds flexibility, reduces experiment management complexity, and reduces disk space, all without affecting the accuracy of the trained models.
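The generator/consumer split described in the abstract can be illustrated with plain Python generators. The following is a minimal sketch, not the SOTASTREAM API: every name in it (`infinite_permutations`, `lowercase_source`, `drop_long`, `to_tsv`) is hypothetical and chosen for illustration only. It assumes the raw training data fits in memory as a list of (source, target) string pairs, and that the trainer reads tab-separated lines and does its own tensorizing and batching.

```python
import random
from itertools import islice
from typing import Iterator, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def infinite_permutations(data: List[Pair], seed: int = 0) -> Iterator[Pair]:
    """Yield an endless stream: one fresh shuffle of the raw data per pass."""
    rng = random.Random(seed)
    while True:
        epoch = list(data)   # copy so the caller's list is left untouched
        rng.shuffle(epoch)
        yield from epoch


def lowercase_source(stream: Iterator[Pair]) -> Iterator[Pair]:
    """Hypothetical normalization operator: lowercase the source side."""
    for src, tgt in stream:
        yield src.lower(), tgt


def drop_long(stream: Iterator[Pair], max_tokens: int = 100) -> Iterator[Pair]:
    """Hypothetical filtering operator: skip pairs with an overly long side."""
    for src, tgt in stream:
        if len(src.split()) <= max_tokens and len(tgt.split()) <= max_tokens:
            yield src, tgt


def to_tsv(stream: Iterator[Pair]) -> Iterator[str]:
    """Serialize each pair as one tab-separated line for the consumer."""
    for src, tgt in stream:
        yield f"{src}\t{tgt}"


if __name__ == "__main__":
    raw = [("Hello World", "Hallo Welt"), ("Good Morning", "Guten Morgen")]
    pipeline = to_tsv(drop_long(lowercase_source(infinite_permutations(raw))))
    # The stream never ends, so the consumer decides how much to read.
    for line in islice(pipeline, 4):
        print(line)
```

Because each operator is applied lazily as the stream is consumed, changing a normalization or filtering step in this sketch only requires editing the pipeline, with no preprocessed copy of the data to regenerate.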
Anthology ID:
2023.nlposs-1.13
Volume:
Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)
Month:
December
Year:
2023
Address:
Singapore
Editors:
Liling Tan, Dmitrijs Milajevs, Geeticka Chauhan, Jeremy Gwinnup, Elijah Rippeth
Venues:
NLPOSS | WS
Publisher:
Association for Computational Linguistics
Pages:
110–119
URL:
https://aclanthology.org/2023.nlposs-1.13
DOI:
10.18653/v1/2023.nlposs-1.13
Cite (ACL):
Matt Post, Thamme Gowda, Roman Grundkiewicz, Huda Khayrallah, Rohit Jain, and Marcin Junczys-Dowmunt. 2023. SOTASTREAM: A Streaming Approach to Machine Translation Training. In Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023), pages 110–119, Singapore. Association for Computational Linguistics.
Cite (Informal):
SOTASTREAM: A Streaming Approach to Machine Translation Training (Post et al., NLPOSS-WS 2023)
PDF:
https://aclanthology.org/2023.nlposs-1.13.pdf
Video:
https://aclanthology.org/2023.nlposs-1.13.mp4