Timm Lehmberg


2022

Bringing Together Version Control and Quality Assurance of Language Data with LAMA
Aleksandr Riaposov | Elena Lazarenko | Timm Lehmberg
Proceedings of the Workshop on Resources and Technologies for Indigenous, Endangered and Lesser-resourced Languages in Eurasia within the 13th Language Resources and Evaluation Conference

This contribution reports on work in progress on project-specific software and digital infrastructure components used in corpus curation workflows within the framework of the long-term language documentation project INEL. By bringing together scientists with different levels of technical affinity in a highly interdisciplinary working environment, the project is confronted with numerous workflow-related issues. Many of them result from collaborative (remote) work on digital corpora, which includes, among other things, annotation and glossing as well as quality and consistency control. In this context, several steps were taken to bridge the gap between usability and the requirements of complex data curation workflows. Components of the latter, such as a versioning system and semi-automated data validators, on the one side meet user demands for simplicity and minimalism on the other. By embedding a simple shell script in an interactive graphical user interface, we augment the efficacy of data versioning and the integration of Java-based quality control and validation tools.

2020

Towards Flexible Cross-Resource Exploitation of Heterogeneous Language Documentation Data
Daniel Jettka | Timm Lehmberg
Proceedings of the Twelfth Language Resources and Evaluation Conference

This paper reports on challenges and solution approaches in the development of methods for cross-resource data analysis in the field of language documentation. It is based on the successful outcomes of the initial phase of an 18-year long-term project on lesser-resourced and mostly endangered indigenous languages of the Northern Eurasian area, which included the finalization and publication of multiple language corpora and additional language resources. While aiming at comprehensive cross-resource data analysis, the project is at the same time confronted with a dynamic and complex resource landscape, resulting especially from a vast amount of multi-layered information stored in the form of analogue primary data in different, widely scattered archives on the territory of the Russian Federation. The methods described aim at resolving the tension between the unification of data sets and vocabularies on the one hand and maximum openness for the integration of future resources and the adaptation of external information on the other.

2018

Introducing the CLARIN Knowledge Centre for Linguistic Diversity and Language Documentation
Hanna Hedeland | Timm Lehmberg | Felix Rau | Sophie Salffner | Mandana Seyfeddinipur | Andreas Witt
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2008

The Metadata-Database of a Next Generation Sustainability Web-Platform for Language Resources
Georg Rehm | Oliver Schonefeld | Andreas Witt | Timm Lehmberg | Christian Chiarcos | Hanan Bechara | Florian Eishold | Kilian Evang | Magdalena Leshtanska | Aleksandar Savkov | Matthias Stark
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Our goal is to provide a web-based platform for the long-term preservation and distribution of a heterogeneous collection of linguistic resources. We discuss the corpus preprocessing and normalisation phase that results in sets of multi-rooted trees. At the same time we transform the original metadata records, just like the corpora annotated using different annotation approaches and exhibiting different levels of granularity, into the all-encompassing and highly flexible format eTEI for which we present editing and parsing tools. We also discuss the architecture of the sustainability platform. Its primary components are an XML database that contains corpus and metadata files and an SQL database that contains user accounts and access control lists. A staging area, whose structure, contents, and consistency can be checked using tools, is used to make sure that new resources about to be imported into the platform have the correct structure.