The AST Submission for the CoCo4MT 2023 Shared Task on Corpus Construction for Low-Resource Machine Translation

Steinþór Steingrímsson


Abstract
We describe the AST submission for the CoCo4MT 2023 shared task. The task is to identify the best candidates for translation in a source data set, so that the resulting parallel data can be used to fine-tune the mBART-50 model. We experiment with three methods: scoring sentences based on n-gram coverage; using LaBSE to estimate semantic similarity; and identifying misalignments and mistranslations by comparing machine-translated source sentences to the corresponding manually translated segments in high-resource languages. We obtain the best results by combining all three methods, using LaBSE and machine translation for filtering and one of our n-gram scoring approaches for ordering sentences.
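The abstract does not spell out how the n-gram scoring works, so the following is only a minimal sketch of one common way to order sentences by n-gram coverage: greedily pick the sentence that adds the most n-grams not yet covered. The function names (`ngrams`, `order_by_ngram_coverage`), the whitespace tokenization, the lowercasing, and the choice of unigrams plus bigrams are all assumptions made for illustration, not the scoring used in the submission. The LaBSE and machine translation filtering steps are not shown.

```python
from typing import List, Set, Tuple


def ngrams(tokens: List[str], max_n: int = 2) -> Set[Tuple[str, ...]]:
    """Return the set of all n-grams of length 1..max_n in a token list."""
    return {
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def order_by_ngram_coverage(sentences: List[str], max_n: int = 2) -> List[str]:
    """Greedily order sentences by how many unseen n-grams each one adds.

    Hypothetical helper: whitespace tokenization and lowercasing are
    simplifying assumptions, not taken from the paper.
    """
    remaining = [(s, ngrams(s.lower().split(), max_n)) for s in sentences]
    covered: Set[Tuple[str, ...]] = set()
    ordered: List[str] = []
    while remaining:
        # Pick the sentence contributing the most new n-grams;
        # ties are resolved in favour of the earlier sentence.
        best = max(
            range(len(remaining)),
            key=lambda i: len(remaining[i][1] - covered),
        )
        sentence, grams = remaining.pop(best)
        covered |= grams
        ordered.append(sentence)
    return ordered
```

In a pipeline like the one described, such an ordering would be applied to the sentences that survive filtering, and the top-ranked sentences would be the candidates sent for translation. The greedy loop is quadratic in the number of sentences, which is acceptable for a sketch but would need a more efficient data structure for a large corpus.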
Anthology ID: 2023.mtsummit-coco4mt.5
Volume: Proceedings of the Second Workshop on Corpus Generation and Corpus Augmentation for Machine Translation
Month: September
Year: 2023
Address: Macau SAR, China
Venue: MTSummit
Publisher: Asia-Pacific Association for Machine Translation
Pages: 33–38
URL: https://aclanthology.org/2023.mtsummit-coco4mt.5
Cite (ACL): Steinþór Steingrímsson. 2023. The AST Submission for the CoCo4MT 2023 Shared Task on Corpus Construction for Low-Resource Machine Translation. In Proceedings of the Second Workshop on Corpus Generation and Corpus Augmentation for Machine Translation, pages 33–38, Macau SAR, China. Asia-Pacific Association for Machine Translation.
Cite (Informal): The AST Submission for the CoCo4MT 2023 Shared Task on Corpus Construction for Low-Resource Machine Translation (Steingrímsson, MTSummit 2023)
PDF: https://aclanthology.org/2023.mtsummit-coco4mt.5.pdf