Tanja Gaustad


2023

Deep learning and low-resource languages: How much data is enough? A case study of three linguistically distinct South African languages
Roald Eiselen | Tanja Gaustad
Proceedings of the Fourth workshop on Resources for African Indigenous Languages (RAIL 2023)

In this paper we present a case study for three under-resourced linguistically distinct South African languages (Afrikaans, isiZulu, and Sesotho sa Leboa) to investigate the influence of data size and linguistic nature of a language on the performance of different embedding types. Our experimental setup consists of training embeddings on increasing amounts of data and then evaluating the impact of data size for the downstream task of part of speech tagging. We find that relatively little data can produce useful representations for this specific task for all three languages. Our analysis also shows that the influence of linguistic and orthographic differences between languages should not be underestimated: morphologically complex, conjunctively written languages (isiZulu in our case) need substantially more data to achieve good results, while disjunctively written languages require substantially less data. This is not only the case with regard to the data for training the embedding model, but also annotated training material for the task at hand. It is therefore imperative to know the characteristics of the language you are working on to make linguistically informed choices about the amount of data and the type of embeddings to use.

2007

TAT: An Author Profiling Tool with Application to Arabic Emails
Dominique Estival | Tanja Gaustad | Son Bao Pham | Will Radford | Ben Hutchinson
Proceedings of the Australasian Language Technology Workshop 2007

2004

A Lemma-Based Approach to a Maximum Entropy Word Sense Disambiguation System for Dutch
Tanja Gaustad
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

2003

The importance of high-quality input for WSD: an application-oriented comparison of part-of-speech taggers
Tanja Gaustad
Proceedings of the Australasian Language Technology Workshop 2003