Deep learning and low-resource languages: How much data is enough? A case study of three linguistically distinct South African languages

Roald Eiselen, Tanja Gaustad


Abstract
In this paper we present a case study of three under-resourced, linguistically distinct South African languages (Afrikaans, isiZulu, and Sesotho sa Leboa) to investigate the influence of data size and of the linguistic nature of a language on the performance of different embedding types. Our experimental setup consists of training embeddings on increasing amounts of data and then evaluating the impact of data size on the downstream task of part-of-speech tagging. We find that relatively little data can produce useful representations for this specific task for all three languages. Our analysis also shows that the influence of linguistic and orthographic differences between languages should not be underestimated: morphologically complex, conjunctively written languages (isiZulu in our case) need substantially more data to achieve good results, while disjunctively written languages require substantially less. This holds not only for the data used to train the embedding model, but also for the annotated training material for the task at hand. It is therefore imperative to know the characteristics of the language you are working on in order to make linguistically informed choices about the amount of data and the type of embeddings to use.
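The experimental setup described in the abstract — train embeddings on increasing slices of a corpus, then score a downstream part-of-speech tagger for each slice — can be sketched as a simple ablation loop. The trainer and evaluator below are stand-in stubs, not the authors' actual models; the saturating accuracy curve is a placeholder assumption, used only to illustrate the shape of the experiment.

```python
def train_embeddings(corpus_slice):
    """Stub: stand-in for training an embedding model on the given sentences."""
    return {"tokens_seen": sum(len(sent) for sent in corpus_slice)}

def evaluate_pos_tagging(embeddings):
    """Stub: placeholder score that grows, with diminishing returns, as
    more training tokens are seen (a saturating curve, not real results)."""
    n = embeddings["tokens_seen"]
    return n / (n + 5000)

def ablation_curve(corpus, fractions):
    """Train and evaluate on growing prefixes of the corpus,
    returning (fraction, downstream score) pairs."""
    results = []
    for frac in fractions:
        n_sents = max(1, int(len(corpus) * frac))
        emb = train_embeddings(corpus[:n_sents])
        results.append((frac, evaluate_pos_tagging(emb)))
    return results

# Toy corpus of 1,000 ten-token sentences; real experiments would use
# per-language corpora and a trained tagger.
corpus = [["word"] * 10 for _ in range(1000)]
curve = ablation_curve(corpus, [0.1, 0.25, 0.5, 1.0])
```

Plotting such a curve per language is what reveals the paper's core finding: conjunctively written, morphologically complex languages need a larger corpus fraction before the score plateaus.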
Anthology ID:
2023.rail-1.6
Volume:
Proceedings of the Fourth workshop on Resources for African Indigenous Languages (RAIL 2023)
Month:
May
Year:
2023
Address:
Dubrovnik, Croatia
Editors:
Rooweither Mabuya, Don Mthobela, Mmasibidi Setaka, Menno van Zaanen
Venue:
RAIL
Publisher:
Association for Computational Linguistics
Pages:
42–53
URL:
https://aclanthology.org/2023.rail-1.6
DOI:
10.18653/v1/2023.rail-1.6
Cite (ACL):
Roald Eiselen and Tanja Gaustad. 2023. Deep learning and low-resource languages: How much data is enough? A case study of three linguistically distinct South African languages. In Proceedings of the Fourth workshop on Resources for African Indigenous Languages (RAIL 2023), pages 42–53, Dubrovnik, Croatia. Association for Computational Linguistics.
Cite (Informal):
Deep learning and low-resource languages: How much data is enough? A case study of three linguistically distinct South African languages (Eiselen & Gaustad, RAIL 2023)
PDF:
https://aclanthology.org/2023.rail-1.6.pdf
Video:
https://aclanthology.org/2023.rail-1.6.mp4