Dag Haug

2023

pdf bib abs
Rules and neural nets for morphological tagging of Norwegian - Results and challenges
Dag Haug | Ahmet Yildirim | Kristin Hagen | Anders Nøklestad
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)

This paper reports on efforts to improve the Oslo-Bergen Tagger for Norwegian morphological tagging. We train two deep neural network-based taggers using the recently introduced Norwegian pre-trained encoder (a BERT model for Norwegian). The first network is a sequence-to-sequence encoder-decoder and the second is a sequence classifier. We test both these configurations in a hybrid system where they combine with the existing rule-based system, and on their own. The sequence-to-sequence system performs better in the hybrid configuration, but the classifier system performs so well that combining it with the rules is actually slightly detrimental to performance.

pdf bib abs
The long and the short of it: DRASTIC, a semantically annotated dataset containing sentences of more natural length
Dag Haug | Jamie Yates Findlay | Ahmet Yildirim
Proceedings of the Fourth International Workshop on Designing Meaning Representations

This paper presents a new dataset with Discourse Representation Structures (DRSs) annotated over naturally-occurring sentences. Importantly, these sentences are more varied in length and on average longer than those in the existing gold-standard DRS dataset, the Parallel Meaning Bank, and we show that they are therefore much harder for parsers. We argue, though, that this provides a more realistic assessment of the difficulties of DRS parsing.

pdf bib abs
Experiments in training transformer sequence-to-sequence DRS parsers
Ahmet Yildirim | Dag Haug
Proceedings of the 15th International Conference on Computational Semantics

This work experiments with various configurations of transformer-based sequence-to-sequence neural networks in training a Discourse Representation Structure (DRS) parser, and presents the results along with the code to reproduce our experiments for use by the community working on DRS parsing. These are configurations that have not been tested in prior work on this task. The Parallel Meaning Bank (PMB) English data sets are used to train the models. The results are evaluated on the PMB test sets using Counter, the standard Evaluation tool for DRSs. We show that the performance improves upon the previous state of the art by 0.5 (F1 %) for PMB 2.2.0 and 1.02 (F1 %) for PMB 3.0.0 test sets. We also present results on PMB 4.0.0, which has not been evaluated using Counter in previous research.

2022

We present the Norwegian Anaphora Resolution Corpus (NARC), the first publicly available corpus annotated with anaphoric relations between noun phrases for Norwegian. The paper describes the annotated data for 326 documents in Norwegian Bokmål, together with inter-annotator agreement and discussions of relevant statistics. We also present preliminary modelling results which are comparable to existing corpora for other languages, and discuss relevant problems in relation to both modelling and the annotations themselves.

2018

Although treebanks annotated according to the guidelines of Universal Dependencies (UD) now exist for many languages, the goal of annotating the same phenomena in a cross-linguistically consistent fashion is not always met. In this paper, we investigate one phenomenon where we believe such consistency is lacking, namely expletive elements. Such elements occupy a position that is structurally associated with a core argument (or sometimes an oblique dependent), yet are non-referential and semantically void. Many UD treebanks identify at least some elements as expletive, but the range of phenomena differs between treebanks, even for closely related languages, and sometimes even for different treebanks for the same language. In this paper, we present criteria for identifying expletives that are applicable across languages and compatible with the goals of UD, give an overview of expletives as found in current UD treebanks, and present recommendations for the annotation of expletives so that more consistent annotation can be achieved in future releases.

2010

pdf bib abs
Porting an Ancient Greek and Latin Treebank
John Lee | Dag Haug
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We have recently converted a dependency treebank, consisting of ancient Greek and Latin texts, from one annotation scheme to another that was independently designed. This paper makes two observations about this conversion process. First, we show that, despite significant surface differences between the two treebanks, a number of straightforward transformation rules yield a substantial level of compatibility between them, giving evidence for their sound design and high quality of annotation. Second, we analyze some linguistic annotations that require further disambiguation, proposing some simple yet effective machine learning methods.