Sven Laur


2020

EstNLTK 1.6: Remastered Estonian NLP Pipeline
Sven Laur | Siim Orasmaa | Dage Särg | Paul Tammo
Proceedings of the Twelfth Language Resources and Evaluation Conference

The goal of the EstNLTK Python library is to provide a unified programming interface for natural language processing in Estonian. As such, previous versions of the library have been immensely successful in both academic and industrial circles. However, they also contained serious structural limitations: it was hard to add new components, and there was a lack of the fine-grained control needed for back-end programming. These issues have been explicitly addressed in the EstNLTK library while preserving the intuitive interface for novices. We have remastered the basic NLP pipeline by adding many data cleaning steps that are necessary for analyzing real-life texts, and state-of-the-art components for morphological analysis and fact extraction. Our evaluation on unlabelled data shows that the remastered basic NLP pipeline outperforms both the previous version of the toolkit and the neural models of StanfordNLP. In addition, EstNLTK contains a new interface for storing, processing and querying text objects in a Postgres database, which greatly simplifies the processing of large text collections. EstNLTK is freely available under the GNU GPL version 2 license, which is standard for academic software.

2017

Linear Ensembles of Word Embedding Models
Avo Muromägi | Kairit Sirts | Sven Laur
Proceedings of the 21st Nordic Conference on Computational Linguistics

2016

EstNLTK - NLP Toolkit for Estonian
Siim Orasmaa | Timo Petmanson | Alexander Tkachenko | Sven Laur | Heiki-Jaan Kaalep
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Although there are many tools for natural language processing tasks in Estonian, these tools are very loosely interoperable, and it is not easy to build practical applications on top of them. In this paper, we introduce a new Python library for natural language processing in Estonian, which provides a unified programming interface for various NLP components. The EstNLTK toolkit provides utilities for basic NLP tasks, including tokenization, morphological analysis, lemmatisation and named entity recognition, and also offers more advanced features such as clause segmentation, temporal expression extraction and normalization, verb chain detection, Estonian Wordnet integration and rule-based information extraction. Accompanied by detailed API documentation and comprehensive tutorials, EstNLTK is suitable for a wide audience. We believe EstNLTK is mature enough to be used for developing NLP-backed systems both in industry and research. EstNLTK is freely available under the GNU GPL version 2+ license, which is standard for academic software.

2013

Named Entity Recognition in Estonian
Alexander Tkachenko | Timo Petmanson | Sven Laur
Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing