Synthetic Pre-Training Tasks for Neural Machine Translation

Zexue He, Graeme Blackwood, Rameswar Panda, Julian McAuley, Rogerio Feris


Abstract
Pre-training models with large crawled corpora can lead to issues such as toxicity and bias, as well as copyright and privacy concerns. A promising way of alleviating such concerns is to conduct pre-training with synthetic tasks and data, since no real-world information is ingested by the model. Our goal in this paper is to understand the factors that contribute to the effectiveness of pre-training models when using synthetic resources, particularly in the context of neural machine translation. We propose several novel approaches to pre-training translation models that involve different levels of lexical and structural knowledge, including: 1) generating obfuscated data from a large parallel corpus, 2) concatenating phrase pairs extracted from a small word-aligned corpus, and 3) generating synthetic parallel data without real human language corpora. Our experiments on multiple language pairs reveal that pre-training benefits can be realized even with high levels of obfuscation or purely synthetic parallel data. We hope the findings from our comprehensive empirical analysis will shed light on what matters for NMT pre-training, and pave the way for the development of more efficient and less toxic models.
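To make the first approach concrete, the sketch below illustrates one plausible form of obfuscated pre-training data: every word type in one side of a parallel corpus is consistently replaced by a nonce token, so lexical content is hidden while sentence structure and token co-occurrence patterns are preserved. This is a minimal illustration under our own assumptions (the function names and token format are hypothetical), not the authors' exact procedure.

```python
import random

def build_obfuscation_map(corpus_lines, seed=0):
    """Assign each token type a consistent nonce symbol, e.g. <tok_00017>.

    Illustrative only: the paper's obfuscation procedure may differ.
    """
    vocab = sorted({tok for line in corpus_lines for tok in line.split()})
    rng = random.Random(seed)
    ids = list(range(len(vocab)))
    rng.shuffle(ids)
    return {tok: f"<tok_{i:05d}>" for tok, i in zip(vocab, ids)}

def obfuscate(corpus_lines, mapping):
    """Replace every token by its nonce symbol, keeping sentence structure."""
    return [" ".join(mapping[tok] for tok in line.split()) for line in corpus_lines]

# Toy example: one side of a two-sentence "parallel corpus".
src = ["the cat sat on the mat", "the dog sat on the rug"]
src_map = build_obfuscation_map(src)
print(obfuscate(src, src_map))
# Word identities are hidden, but repeated-token structure is preserved,
# which is the kind of signal a pre-trained translation model can still exploit.
```

Applying the same idea independently to the source and target sides of a word-aligned corpus would yield a fully obfuscated parallel dataset for pre-training.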
Anthology ID:
2023.findings-acl.512
Volume:
Findings of the Association for Computational Linguistics: ACL 2023
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
8080–8098
URL:
https://aclanthology.org/2023.findings-acl.512
DOI:
10.18653/v1/2023.findings-acl.512
Cite (ACL):
Zexue He, Graeme Blackwood, Rameswar Panda, Julian McAuley, and Rogerio Feris. 2023. Synthetic Pre-Training Tasks for Neural Machine Translation. In Findings of the Association for Computational Linguistics: ACL 2023, pages 8080–8098, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
Synthetic Pre-Training Tasks for Neural Machine Translation (He et al., Findings 2023)
PDF:
https://aclanthology.org/2023.findings-acl.512.pdf
Video:
https://aclanthology.org/2023.findings-acl.512.mp4