Maria Sukhareva


2021

pdf bib
Towards Precise Lexicon Integration in Neural Machine Translation
Ogün Öz | Maria Sukhareva
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Terminological consistency is an essential requirement for industrial translation. High-quality, hand-crafted terminologies contain entries in their nominal forms. Integrating such a terminology into machine translation is not a trivial task. The MT system must be able to disambiguate homographs on the source side and choose the correct wordform on the target side. In this work, we propose a simple but effective method for homograph disambiguation and a method of wordform selection by introducing multi-choice lexical constraints. We also propose a metric to measure the terminological consistency of the translation. Our results have a significant improvement over the current SOTA in terms of terminological consistency without any loss of the BLEU score. All the code used in this work will be published as open-source.

2020

pdf bib
Context-Aware Text Normalisation for Historical Dialects
Maria Sukhareva
Proceedings of the 28th International Conference on Computational Linguistics

Context-aware historical text normalisation is a severely under-researched area. To fill the gap we propose a context-aware normalisation approach that relies on the state-of-the-art methods in neural machine translation and transfer learning. We propose a multidialect normaliser with a context-aware reranking of the candidates. The reranker relies on a word-level n-gram language model that is applied to the five best normalisation candidates. The results are evaluated on the historical multidialect datasets of German, Spanish, Portuguese and Slovene. We show that incorporating dialectal information into the training leads to an accuracy improvement on all the datasets. The context-aware reranking gives further improvement over the baseline. For three out of six datasets, we reach a significantly higher accuracy than reported in the previous studies. The other three results are comparable with the current state-of-the-art. The code for the reranker is published as open-source.

2019

pdf bib
A Streamlined Method for Sourcing Discourse-level Argumentation Annotations from the Crowd
Tristan Miller | Maria Sukhareva | Iryna Gurevych
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

The study of argumentation and the development of argument mining tools depends on the availability of annotated data, which is challenging to obtain in sufficient quantity and quality. We present a method that breaks down a popular but relatively complex discourse-level argument annotation scheme into a simpler, iterative procedure that can be applied even by untrained annotators. We apply this method in a crowdsourcing setup and report on the reliability of the annotations obtained. The source code for a tool implementing our annotation method, as well as the sample data we obtained (4909 gold-standard annotations across 982 documents), are freely released to the research community. These are intended to serve the needs of qualitative research into argumentation, as well as of data-driven approaches to argument mining.

2018

pdf bib
Analyzing Middle High German Syntax with RDF and SPARQL
Christian Chiarcos | Benjamin Kosmehl | Christian Fäth | Maria Sukhareva
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

pdf bib
Machine Translation and Automated Analysis of the Sumerian Language
Émilie Pagé-Perron | Maria Sukhareva | Ilya Khait | Christian Chiarcos
Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

This paper presents a newly funded international project for machine translation and automated analysis of ancient cuneiform languages where NLP specialists and Assyriologists collaborate to create an information retrieval system for Sumerian. This research is conceived in response to the need to translate large numbers of administrative texts that are only available in transcription, in order to make them accessible to a wider audience. The methodology includes creation of a specialized NLP pipeline and also the use of linguistic linked open data to increase access to the results.

pdf bib
Distantly Supervised POS Tagging of Low-Resource Languages under Extreme Data Sparsity: The Case of Hittite
Maria Sukhareva | Francesco Fuscagni | Johannes Daxenberger | Susanne Görke | Doris Prechel | Iryna Gurevych
Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

This paper presents a statistical approach to automatic morphosyntactic annotation of Hittite transcripts. Hittite is an extinct Indo-European language using the cuneiform script. There are currently no morphosyntactic annotations available for Hittite, so we explored methods of distant supervision. The annotations were projected from parallel German translations of the Hittite texts. In order to reduce data sparsity, we applied stemming of German and Hittite texts. As there is no off-the-shelf Hittite stemmer, a stemmer for Hittite was developed for this purpose. The resulting annotation projections were used to train a POS tagger, achieving an accuracy of 69% on a test sample. To our knowledge, this is the first attempt of statistical POS tagging of a cuneiform language.

2016

pdf bib
Combining Ontologies and Neural Networks for Analyzing Historical Language Varieties. A Case Study in Middle Low German
Maria Sukhareva | Christian Chiarcos
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper, we describe experiments on the morphosyntactic annotation of historical language varieties for the example of Middle Low German (MLG), the official language of the German Hanse during the Middle Ages and a dominant language around the Baltic Sea by the time. To our best knowledge, this is the first experiment in automatically producing morphosyntactic annotations for Middle Low German, and accordingly, no part-of-speech (POS) tagset is currently agreed upon. In our experiment, we illustrate how ontology-based specifications of projected annotations can be employed to circumvent this issue: Instead of training and evaluating against a given tagset, we decomponse it into independent features which are predicted independently by a neural network. Using consistency constraints (axioms) from an ontology, then, the predicted feature probabilities are decoded into a sound ontological representation. Using these representations, we can finally bootstrap a POS tagset capturing only morphosyntactic features which could be reliably predicted. In this way, our approach is capable to optimize precision and recall of morphosyntactic annotations simultaneously with bootstrapping a tagset rather than performing iterative cycles.

pdf bib
Crowdsourcing a Large Dataset of Domain-Specific Context-Sensitive Semantic Verb Relations
Maria Sukhareva | Judith Eckle-Kohler | Ivan Habernal | Iryna Gurevych
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present a new large dataset of 12403 context-sensitive verb relations manually annotated via crowdsourcing. These relations capture fine-grained semantic information between verb-centric propositions, such as temporal or entailment relations. We propose a novel semantic verb relation scheme and design a multi-step annotation approach for scaling-up the annotations using crowdsourcing. We employ several quality measures and report on agreement scores. The resulting dataset is available under a permissive CreativeCommons license at www.ukp.tu-darmstadt.de/data/verb-relations/. It represents a valuable resource for various applications, such as automatic information consolidation or automatic summarization.

2015

pdf bib
An Ontology-based Approach To Automatic Part-of-Speech Tagging Using Heterogeneously Annotated Corpora
Maria Sukhareva | Christian Chiarcos
Proceedings of the Second Workshop on Natural Language Processing and Linked Open Data

pdf bib
Towards the Unsupervised Acquisition of Implicit Semantic Roles
Niko Schenk | Christian Chiarcos | Maria Sukhareva
Proceedings of the International Conference Recent Advances in Natural Language Processing

2014

pdf bib
New Technologies for Old Germanic. Resources and Research on Parallel Bibles in Older Continental Western Germanic
Christian Chiarcos | Maria Sukhareva | Roland Mittmann | Timothy Price | Gaye Detmold | Jan Chobotsky
Proceedings of the 8th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH)

pdf bib
Diachronic proximity vs. data sparsity in cross-lingual parser projection. A case study on Germanic
Maria Sukhareva | Christian Chiarcos
Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects