Jan Strunk


2014

Untrained Forced Alignment of Transcriptions and Audio for Language Documentation Corpora using WebMAUS
Jan Strunk | Florian Schiel | Frank Seifart
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Language documentation projects supported by recent funding initiatives have created a large number of multimedia corpora of typologically diverse languages. Most of these corpora provide a manual alignment of transcription and audio data at the level of larger units, such as sentences or intonation units. Their usefulness both for corpus-linguistic and psycholinguistic research and for the development of tools and teaching materials could, however, be increased by achieving a more fine-grained alignment of transcription and audio at the word or even phoneme level. Since most language documentation corpora contain data on small languages, there usually do not exist any speech recognizers or acoustic models specifically trained on these languages. We therefore investigate the feasibility of untrained forced alignment for such corpora. We report on an evaluation of the tool (Web)MAUS (Kisler, 2012) on several language documentation corpora and discuss practical issues in the application of forced alignment. Our evaluation shows that (Web)MAUS with its existing acoustic models combined with simple grapheme-to-phoneme conversion can be successfully used for word-level forced alignment of a diverse set of languages without additional training, especially if a manual prealignment of larger annotation units is already available.

2010

An Annotation Schema for Preposition Senses in German
Antje Müller | Olaf Hülscher | Claudia Roch | Katja Keßelmeier | Tobias Stadtfeld | Jan Strunk | Tibor Kiss
Proceedings of the Fourth Linguistic Annotation Workshop

A Logistic Regression Model of Determiner Omission in PPs
Tibor Kiss | Katja Keßelmeier | Antje Müller | Claudia Roch | Tobias Stadtfeld | Jan Strunk
Coling 2010: Posters

Enriching a Treebank to Investigate Relative Clause Extraposition in German
Jan Strunk
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

I describe the construction of a corpus for research on relative clause extraposition in German based on the treebank TüBa-D/Z. I also define an annotation scheme for the relations between relative clauses and their antecedents which is added as a second annotation level to the syntactic trees. This additional annotation level allows for a direct representation of the relevant parts of the relative construction and also serves as a locus for the annotation of additional features which are partly automatically derived from the underlying treebank and partly added manually. Finally, I also report on the results of two pilot studies using this enriched treebank. The first study tests claims made in the theoretical literature on relative clause extraposition with regard to syntactic locality, definiteness, and restrictiveness. It shows that although the theoretical claims often go in the right direction, they go too far by positing categorical constraints that are not supported by the corpus data and thus underestimate the complexity of the data. The second pilot study goes one step in the direction of taking this complexity into account by demonstrating the potential of the enriched treebank for building a multivariate model of relative clause extraposition as a syntactic alternation.

2006

Unsupervised Multilingual Sentence Boundary Detection
Tibor Kiss | Jan Strunk
Computational Linguistics, Volume 32, Number 4, December 2006

2002

Scaled Log Likelihood Ratios for the Detection of Abbreviations in Text Corpora
Tibor Kiss | Jan Strunk
COLING 2002: The 17th International Conference on Computational Linguistics: Project Notes