Kenneth Steimel


2023

Towards a Swahili Universal Dependency Treebank: Leveraging the Annotations of the Helsinki Corpus of Swahili
Kenneth Steimel | Sandra Kübler
Proceedings of the Fourth workshop on Resources for African Indigenous Languages (RAIL 2023)

Dependency annotation can be a laborious process for under-resourced languages. However, in some cases, other resources are available. We investigate whether we can leverage such resources in the case of Swahili: We use the Helsinki Corpus of Swahili for creating a Universal Dependencies treebank for Swahili. The Helsinki Corpus of Swahili provides word-level annotations for part of speech tags, morphological features, and functional syntactic tags. We train neural taggers for these types of annotations, then use those models to annotate our target corpus, the Swahili portion of the OPUS Global Voices Corpus. Based on those annotations, we then manually create constraint grammar rules to annotate the target corpus for Universal Dependencies. In this paper, we describe the process, discuss the annotation decisions we had to make, and evaluate the approach.

Beyond the Repo: A Case Study on Open Source Integration with GECToR
Sanjna Kashyap | Zhaoyang Xie | Kenneth Steimel | Nitin Madnani
Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)

We present a case study describing our efforts to integrate the open source GECToR code and models into our production NLP pipeline that powers many of Educational Testing Service’s products and prototypes. The paper’s contributions include a discussion of the issues we encountered during integration and our solutions, the overarching lessons we learned about integrating open source projects, and, last but not least, the open source contributions we made as part of the journey.

2021

Morphology Matters: A Multilingual Language Modeling Analysis
Hyunji Hayley Park | Katherine J. Zhang | Coleman Haley | Kenneth Steimel | Han Liu | Lane Schwartz
Transactions of the Association for Computational Linguistics, Volume 9

Prior studies in multilingual language modeling (e.g., Cotterell et al., 2018; Mielke et al., 2019) disagree on whether or not inflectional morphology makes languages harder to model. We attempt to resolve the disagreement and extend those studies. We compile a larger corpus of 145 Bible translations in 92 languages and a larger number of typological features. We fill in missing typological data for several languages and consider corpus-based measures of morphological complexity in addition to expert-produced typological features. We find that several morphological measures are significantly associated with higher surprisal when LSTM models are trained with BPE-segmented data. We also investigate linguistically motivated subword segmentation strategies like Morfessor and Finite-State Transducers (FSTs) and find that these segmentation strategies yield better performance and reduce the impact of a language’s morphology on language modeling.

2020

Fine-Grained Morpho-Syntactic Analysis for the Under-Resourced Language Chaghatay
Kenneth Steimel | Akbar Amat | Arienne Dwyer | Sandra Kübler
Proceedings of the 19th International Workshop on Treebanks and Linguistic Theories

2019

Investigating Multilingual Abusive Language Detection: A Cautionary Tale
Kenneth Steimel | Daniel Dakota | Yue Chen | Sandra Kübler
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Abusive language detection has received much attention in recent years, and recent approaches perform the task in a number of different languages. We investigate which factors have an effect on multilingual settings, focusing on the compatibility of data and annotations. In the current paper, we focus on English and German. Our findings show large differences in performance between the two languages. We find that the best performance is achieved by different classification algorithms. Sampling to address class imbalance issues is detrimental for German and beneficial for English. The only similarity that we find is that neither data set shows clear topics when we compare the results of topic modeling to the gold standard. Based on our findings, we can conclude that a multilingual optimization of classifiers is not possible even in settings where comparable data sets are used.

2018

Part of Speech Tagging in Luyia: A Bantu Macrolanguage
Kenneth Steimel
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)

Luyia is a macrolanguage in central Kenya. The Luyia languages, like other Bantu languages, have a complex morphological system. This system can be leveraged to aid in part of speech tagging. Bag-of-characters taggers trained on a source Luyia language can be applied directly to another Luyia language with some degree of success. In addition, mixing data from the target language with data from the source language does produce more accurate predictive models compared to models trained on just the target language data when the training set size is small. However, for both of these tagging tasks, models involving the more distantly related language, Tiriki, are better at predicting part of speech tags for Wanga data. The models incorporating Bukusu data are not as successful despite the closer relationship between Bukusu and Wanga. Overlapping vocabulary between the Wanga and Tiriki corpora as well as a bias towards open class words help Tiriki outperform Bukusu.

2016

IUCL at SemEval-2016 Task 6: An Ensemble Model for Stance Detection in Twitter
Can Liu | Wen Li | Bradford Demarest | Yue Chen | Sara Couture | Daniel Dakota | Nikita Haduong | Noah Kaufman | Andrew Lamont | Manan Pancholi | Kenneth Steimel | Sandra Kübler
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)