Development of Urdu-English Religious Domain Parallel Corpus

Sadaf Abdul Rauf, Noor e Hira


Abstract
Despite the abundance of monolingual corpora accessible online, domain-specific parallel corpora remain scarce. This scarcity poses a challenge for the development of robust translation systems tailored to such specialized domains. To address this gap, we have developed a religious-domain parallel corpus for Urdu-English. The corpus consists of 18,426 parallel sentences from Sunan Daud, carefully curated to capture the unique linguistic and contextual aspects of religious texts. The corpus is then used to train Urdu-English religious-domain Neural Machine Translation (NMT) systems; the best system scored 27.9 BLEU points.
Anthology ID:
2023.mtsummit-coco4mt.2
Volume:
Proceedings of the Second Workshop on Corpus Generation and Corpus Augmentation for Machine Translation
Month:
September
Year:
2023
Address:
Macau SAR, China
Venue:
MTSummit
Publisher:
Asia-Pacific Association for Machine Translation
Pages:
14–21
URL:
https://aclanthology.org/2023.mtsummit-coco4mt.2
Cite (ACL):
Sadaf Abdul Rauf and Noor e Hira. 2023. Development of Urdu-English Religious Domain Parallel Corpus. In Proceedings of the Second Workshop on Corpus Generation and Corpus Augmentation for Machine Translation, pages 14–21, Macau SAR, China. Asia-Pacific Association for Machine Translation.
Cite (Informal):
Development of Urdu-English Religious Domain Parallel Corpus (Abdul Rauf & Hira, MTSummit 2023)
PDF:
https://aclanthology.org/2023.mtsummit-coco4mt.2.pdf