The DA-ELEXIS Corpus - a Sense-Annotated Corpus for Danish with Parallel Annotations for Nine European Languages

Bolette Pedersen, Sanni Nimb, Sussi Olsen, Thomas Troelsgård, Ida Flörke, Jonas Jensen, Henrik Lorentzen


Abstract
In this paper, we present the newly compiled DA-ELEXIS Corpus, which is one of the largest sense-annotated corpora available for Danish and the first to be annotated with the Danish wordnet, DanNet. The corpus is part of a European initiative, the ELEXIS project, and has corresponding parallel annotations in nine other European languages. As such, it functions as a cross-lingual evaluative benchmark for a series of low- and medium-resourced European languages. We focus here on the Danish annotation process, i.e. on the annotation scheme, including the annotation guidelines, the primary sense inventory constituted by DanNet, and the fall-back sense inventory, namely The Danish Dictionary (DDO). We analyse and discuss issues such as out-of-vocabulary (OOV) problems, problems with sense granularity and missing senses (in particular for verbs), and how to semantically tag multiword expressions (MWEs), which prove to occur very frequently in the Danish corpus. Finally, we calculate the inter-annotator agreement (IAA) and show how IAA improved during the annotation process. The openly available corpus contains 32,524 tokens, with sense annotations given for all content words, amounting to 7,322 nouns, 3,099 verbs, 2,626 adjectives, and 1,677 adverbs.
Anthology ID:
2023.resourceful-1.2
Volume:
Proceedings of the Second Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2023)
Month:
May
Year:
2023
Address:
Tórshavn, the Faroe Islands
Editors:
Nikolai Ilinykh, Felix Morger, Dana Dannélls, Simon Dobnik, Beáta Megyesi, Joakim Nivre
Venue:
RESOURCEFUL
Publisher:
Association for Computational Linguistics
Pages:
11–18
URL:
https://aclanthology.org/2023.resourceful-1.2
Cite (ACL):
Bolette Pedersen, Sanni Nimb, Sussi Olsen, Thomas Troelsgård, Ida Flörke, Jonas Jensen, and Henrik Lorentzen. 2023. The DA-ELEXIS Corpus - a Sense-Annotated Corpus for Danish with Parallel Annotations for Nine European Languages. In Proceedings of the Second Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2023), pages 11–18, Tórshavn, the Faroe Islands. Association for Computational Linguistics.
Cite (Informal):
The DA-ELEXIS Corpus - a Sense-Annotated Corpus for Danish with Parallel Annotations for Nine European Languages (Pedersen et al., RESOURCEFUL 2023)
PDF:
https://aclanthology.org/2023.resourceful-1.2.pdf