Overview of CoLI-Kanglish: Word Level Language Identification in Code-mixed Kannada-English Texts at ICON 2022

F. Balouchzahi, S. Butt, A. Hegde, N. Ashraf, H.l. Shashirekha, Grigori Sidorov, Alexander Gelbukh


Abstract
The task of Language Identification (LI) in text processing refers to automatically identifying the languages used in a text document. LI task is usually been studied at the document level and often in high-resource languages while giving less importance to low-resource languages. However, with the recent advance- ment in technologies, in a multilingual country like India, many low-resource language users post their comments using English and one or more language(s) in the form of code-mixed texts. Combination of Kannada and English is one such code-mixed text of mixing Kannada and English languages at various levels. To address the word level LI in code-mixed text, in CoLI-Kanglish shared task, we have focused on open-sourcing a Kannada-English code-mixed dataset for word level LI of Kannada, English and mixed-language words written in Roman script. The task includes classifying each word in the given text into one of six predefined categories, namely: Kannada (kn), English (en), Kannada-English (kn-en), Name (name), Lo-cation (location), and Other (other). Among the models submitted by all the participants, the best performing model obtained averaged-weighted and averaged-macro F1 scores of 0.86 and 0.62 respectively.
Anthology ID:
2022.icon-wlli.8
Volume:
Proceedings of the 19th International Conference on Natural Language Processing (ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts
Month:
December
Year:
2022
Address:
IIIT Delhi, New Delhi, India
Editors:
Bharathi Raja Chakravarthi, Abirami Murugappan, Dhivya Chinnappa, Adeep Hane, Prasanna Kumar Kumeresan, Rahul Ponnusamy
Venue:
ICON
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
38–45
Language:
URL:
https://aclanthology.org/2022.icon-wlli.8
DOI:
Bibkey:
Cite (ACL):
F. Balouchzahi, S. Butt, A. Hegde, N. Ashraf, H.l. Shashirekha, Grigori Sidorov, and Alexander Gelbukh. 2022. Overview of CoLI-Kanglish: Word Level Language Identification in Code-mixed Kannada-English Texts at ICON 2022. In Proceedings of the 19th International Conference on Natural Language Processing (ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts, pages 38–45, IIIT Delhi, New Delhi, India. Association for Computational Linguistics.
Cite (Informal):
Overview of CoLI-Kanglish: Word Level Language Identification in Code-mixed Kannada-English Texts at ICON 2022 (Balouchzahi et al., ICON 2022)
Copy Citation:
PDF:
https://aclanthology.org/2022.icon-wlli.8.pdf