A Fast Method to Filter Noisy Parallel Data WMT2023 Shared Task on Parallel Data Curation

Nguyen-Hoang Minh-Cong, Nguyen Van Vinh, Nguyen Le-Minh


Abstract
The effectiveness of a machine translation (MT) system is intricately linked to the quality of its training dataset. In an era when websites offer extensive repositories of translations such as movie subtitles, stories, and TED Talks, the fundamental challenge lies in pinpointing the sentence pairs or documents that are accurate translations of each other. This paper presents the results of our submission to the WMT2023 shared task (Sloto et al., 2023), which aimed to evaluate parallel data curation methods for improving MT systems. The task involved aligning and filtering data to create high-quality parallel corpora for training and evaluating MT models. Our approach leverages a combination of dictionary- and rule-based methods to ensure data quality and consistency. We achieved an improvement of up to 1.6 BLEU points over the baseline system. Notably, our approach showed consistent improvements across all test sets, suggesting its effectiveness.
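The abstract mentions combining dictionary- and rule-based filtering of candidate sentence pairs. The Python sketch below is only an illustration of that general idea, not the authors' actual method: the helper names (dict_coverage, keep_pair) and the thresholds (min_dict_coverage, max_length_ratio) are assumptions introduced here for clarity.

def dict_coverage(src_tokens, tgt_tokens, bilingual_dict):
    """Fraction of source tokens with at least one dictionary translation
    appearing in the target sentence (illustrative scoring, not the paper's)."""
    if not src_tokens:
        return 0.0
    tgt_set = set(tgt_tokens)
    hits = sum(
        1 for tok in src_tokens
        if any(t in tgt_set for t in bilingual_dict.get(tok, ()))
    )
    return hits / len(src_tokens)

def keep_pair(src, tgt, bilingual_dict,
              min_dict_coverage=0.3,   # assumed threshold
              max_length_ratio=2.0):   # assumed threshold
    src_tokens, tgt_tokens = src.split(), tgt.split()
    # Rule: discard empty or extremely length-mismatched pairs.
    if not src_tokens or not tgt_tokens:
        return False
    ratio = max(len(src_tokens), len(tgt_tokens)) / min(len(src_tokens), len(tgt_tokens))
    if ratio > max_length_ratio:
        return False
    # Rule: discard pairs where source and target are identical (likely untranslated copies).
    if src.strip() == tgt.strip():
        return False
    # Dictionary signal: require minimal lexical overlap via the bilingual dictionary.
    return dict_coverage(src_tokens, tgt_tokens, bilingual_dict) >= min_dict_coverage

For example, keep_pair("the cat", "le chat", {"the": ["le", "la"], "cat": ["chat"]}) returns True under these assumed thresholds, while a pair with a large length mismatch or no dictionary overlap would be filtered out.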
Anthology ID:
2023.wmt-1.37
Volume:
Proceedings of the Eighth Conference on Machine Translation
Month:
December
Year:
2023
Address:
Singapore
Editors:
Philipp Koehn, Barry Haddow, Tom Kocmi, Christof Monz
Venue:
WMT
SIG:
SIGMT
Publisher:
Association for Computational Linguistics
Pages:
359–365
URL:
https://aclanthology.org/2023.wmt-1.37
DOI:
10.18653/v1/2023.wmt-1.37
Cite (ACL):
Nguyen-Hoang Minh-Cong, Nguyen Van Vinh, and Nguyen Le-Minh. 2023. A Fast Method to Filter Noisy Parallel Data WMT2023 Shared Task on Parallel Data Curation. In Proceedings of the Eighth Conference on Machine Translation, pages 359–365, Singapore. Association for Computational Linguistics.
Cite (Informal):
A Fast Method to Filter Noisy Parallel Data WMT2023 Shared Task on Parallel Data Curation (Minh-Cong et al., WMT 2023)
PDF:
https://aclanthology.org/2023.wmt-1.37.pdf