Jan Strunk

2014

pdf bib abs
Untrained Forced Alignment of Transcriptions and Audio for Language Documentation Corpora using WebMAUS
Jan Strunk | Florian Schiel | Frank Seifart
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Language documentation projects supported by recent funding intiatives have created a large number of multimedia corpora of typologically diverse languages. Most of these corpora provide a manual alignment of transcription and audio data at the level of larger units, such as sentences or intonation units. Their usefulness both for corpus-linguistic and psycholinguistic research and for the development of tools and teaching materials could, however, be increased by achieving a more fine-grained alignment of transcription and audio at the word or even phoneme level. Since most language documentation corpora contain data on small languages, there usually do not exist any speech recognizers or acoustic models specifically trained on these languages. We therefore investigate the feasibility of untrained forced alignment for such corpora. We report on an evaluation of the tool (Web)MAUS (Kisler, 2012) on several language documentation corpora and discuss practical issues in the application of forced alignment. Our evaluation shows that (Web)MAUS with its existing acoustic models combined with simple grapheme-to-phoneme conversion can be successfully used for word-level forced alignment of a diverse set of languages without additional training, especially if a manual prealignment of larger annotation units is already avaible.

2010

pdf bib abs
Enriching a Treebank to Investigate Relative Clause Extraposition in German
Jan Strunk
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

I describe the construction of a corpus for research on relative clause extraposition in German based on the treebank TüBa-D/Z. I also define an annotation scheme for the relations between relative clauses and their antecedents which is added as a second annotation level to the syntactic trees. This additional annotation level allows for a direct representation of the relevant parts of the relative construction and also serves as a locus for the annotation of additional features which are partly automatically derived from the underlying treebank and partly added manually. Finally, I also report on the results of two pilot studies using this enriched treebank. The first study tests claims made in the theoretical literature on relative clause extraposition with regard to syntactic locality, definiteness, and restrictiveness. It shows that although the theoretical claims often go in the right direction, they go too far by positing categorical constraints that are not supported by the corpus data and thus underestimate the complexity of the data. The second pilot study goes one step in the direction of taking this complexity into account by demonstrating the potential of the enriched treebank for building a multivariate model of relative clause extraposition as a syntactic alternation.

Jan Strunk

2014

2010

2006

2002

Co-authors

Venues