Advancing Bangla Punctuation Restoration by a Monolingual Transformer-Based Method and a Large-Scale Corpus

Mehedi Hasan Bijoy, Mir Fatema Afroz Faria, Mahbub E Sobhani, Tanzid Ferdoush, Swakkhar Shatabda


Abstract
Punctuation restoration is the endeavor of reinstating and rectifying missing or improper punctuation marks within a text, thereby eradicating ambiguity in written discourse. The Bangla punctuation restoration task has received little attention and exploration, despitethe rising popularity of textual communication in the language. The primary hindrances in the advancement of the task revolve aroundthe utilization of transformer-based methods and an openly accessible extensive corpus, challenges that we discovered remainedunresolved in earlier efforts. In this study, we propose a baseline by introducing a mono-lingual transformer-based method named Jatikarok, where the effectiveness of transfer learning has been meticulously scrutinized, and a large-scale corpus containing 1.48M source-target pairs to resolve the previous issues. The Jatikarok attains accuracy rates of 95.2%, 85.13%, and 91.36% on the BanglaPRCorpus, Prothom-Alo Balanced, and BanglaOPUS corpora, thereby establishing itself as the state-of-the-art method through its superior performance compared to BanglaT5 and T5-Small. Jatikarok and BanglaPRCorpus are publicly available at: https://github.com/mehedihasanbijoy/Jatikarok-and-BanglaPRCorpus
Anthology ID:
2023.banglalp-1.3
Volume:
Proceedings of the First Workshop on Bangla Language Processing (BLP-2023)
Month:
December
Year:
2023
Address:
Singapore
Editors:
Firoj Alam, Sudipta Kar, Shammur Absar Chowdhury, Farig Sadeque, Ruhul Amin
Venue:
BanglaLP
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
18–25
Language:
URL:
https://aclanthology.org/2023.banglalp-1.3
DOI:
10.18653/v1/2023.banglalp-1.3
Bibkey:
Cite (ACL):
Mehedi Hasan Bijoy, Mir Fatema Afroz Faria, Mahbub E Sobhani, Tanzid Ferdoush, and Swakkhar Shatabda. 2023. Advancing Bangla Punctuation Restoration by a Monolingual Transformer-Based Method and a Large-Scale Corpus. In Proceedings of the First Workshop on Bangla Language Processing (BLP-2023), pages 18–25, Singapore. Association for Computational Linguistics.
Cite (Informal):
Advancing Bangla Punctuation Restoration by a Monolingual Transformer-Based Method and a Large-Scale Corpus (Bijoy et al., BanglaLP 2023)
Copy Citation:
PDF:
https://aclanthology.org/2023.banglalp-1.3.pdf
Video:
 https://aclanthology.org/2023.banglalp-1.3.mp4