Categorisation of Bulgarian Legislative Documents

Nikola Obreshkov, Martin Yalamov, Svetla Koeva


Abstract
The paper presents the categorisation of the Bulgarian MARCELL corpus into top-level EuroVoc domains. The Bulgarian MARCELL corpus is part of a recently developed multilingual corpus representing the national legislation of seven European countries. We performed several experiments: with the JEX Indexer, with neural networks, and with a basic method that measures the domain-specific terms in documents annotated in advance with IATE terms and EuroVoc descriptors (combined with grouping of a primary document and its satellites, term extraction, and parsing of the document titles). The evaluation shows a slight advantage for the basic method. This makes it a suitable choice, since the categorisation is to be a module of an NLP pipeline for Bulgarian that continuously feeds the Bulgarian MARCELL corpus with newly issued legislative documents and annotates them.
Anthology ID:
2020.clib-1.6
Volume:
Proceedings of the 4th International Conference on Computational Linguistics in Bulgaria (CLIB 2020)
Month:
September
Year:
2020
Address:
Sofia, Bulgaria
Venue:
CLIB
Publisher:
Department of Computational Linguistics, IBL -- BAS
Pages:
53–62
URL:
https://aclanthology.org/2020.clib-1.6
Cite (ACL):
Nikola Obreshkov, Martin Yalamov, and Svetla Koeva. 2020. Categorisation of Bulgarian Legislative Documents. In Proceedings of the 4th International Conference on Computational Linguistics in Bulgaria (CLIB 2020), pages 53–62, Sofia, Bulgaria. Department of Computational Linguistics, IBL -- BAS.
Cite (Informal):
Categorisation of Bulgarian Legislative Documents (Obreshkov et al., CLIB 2020)
PDF:
https://aclanthology.org/2020.clib-1.6.pdf