Maxim Ionov


2023

Beyond Concatenative Morphology: Applying OntoLex-Morph to Maltese
Maxim Ionov | Mike Rosner
Proceedings of the 4th Conference on Language, Data and Knowledge

2022

Querying a Dozen Corpora and a Thousand Years with Fintan
Christian Chiarcos | Christian Fäth | Maxim Ionov
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Large-scale diachronic corpus studies covering longer time periods are difficult if more than one corpus is to be consulted and, as a result, different formats and annotation schemas need to be processed and queried in a uniform, comparable and replicable manner. We describe the application of the Flexible Integrated Transformation and Annotation eNgineering (Fintan) platform for studying word order in German using syntactically annotated corpora that represent its entire written history. Focusing on nominal dative and accusative arguments, this study hints at two major phases in the development of scrambling in modern German. Against more recent assumptions, it supports the traditional view that word order flexibility decreased over time, but it also indicates that this was a relatively sharp transition in Early New High German. The successful case study demonstrates the potential of Fintan and the underlying LLOD technology for historical linguistics, linguistic typology and corpus linguistics. The technological contribution of this paper is to demonstrate the applicability of Fintan for querying across heterogeneously annotated corpora, as previously, it had only been applied for transformation tasks. With its focus on quantitative analysis, Fintan is a natural complement for existing multi-layer technologies that focus on query and exploration.

Unifying Morphology Resources with OntoLex-Morph. A Case Study in German
Christian Chiarcos | Christian Fäth | Maxim Ionov
Proceedings of the Thirteenth Language Resources and Evaluation Conference

The OntoLex vocabulary has become a widely used community standard for machine-readable lexical resources on the web. The primary motivation to use OntoLex instead of tool- or application-specific formalisms is to facilitate interoperability and information integration across different resources. One of its extensions that is currently being developed is a module for representing morphology, OntoLex-Morph. In this paper, we show how OntoLex-Morph can be used for the encoding and integration of different types of morphological resources on a unified basis. With German as the example, we demonstrate it for (a) a full-form dictionary with inflection information (Unimorph), (b) a dictionary of base forms and their derivations (UDer), (c) a dictionary of compounds (from GermaNet), and (d) lexicon and inflection rules of a finite-state parser/generator (SMOR/Morphisto). These data are converted to OntoLex-Morph, their linguistic information is consolidated and corresponding lexical entries are linked with each other.

Modelling Collocations in OntoLex-FrAC
Christian Chiarcos | Katerina Gkirtzou | Maxim Ionov | Besim Kabashi | Fahad Khan | Ciprian-Octavian Truică
Proceedings of Globalex Workshop on Linked Lexicography within the 13th Language Resources and Evaluation Conference

Following presentations of frequency and attestations, and embeddings and distributional similarity, this paper introduces the third cornerstone of the emerging OntoLex module for Frequency, Attestation and Corpus-based Information, OntoLex-FrAC. We provide an RDF vocabulary for collocations, established as a consensus over contributions from five different institutions and numerous data sets, with the goal of eliciting feedback from reviewers, workshop audience and the scientific community in preparation for the final consolidation of the OntoLex-FrAC module, whose publication as a W3C community report is foreseen for the end of this year. The novel collocation component of OntoLex-FrAC is described in application to a lexicographic resource and corpus-based collocation scores available from the web. Finally, we demonstrate the capability and genericity of the model by showing how to retrieve and aggregate collocation information by means of SPARQL, and how to export it to a tabular format, so that it can be easily processed in downstream applications.

Proceedings of the 8th Workshop on Linked Data in Linguistics within the 13th Language Resources and Evaluation Conference
Thierry Declerck | John P. McCrae | Elena Montiel | Christian Chiarcos | Maxim Ionov
Proceedings of the 8th Workshop on Linked Data in Linguistics within the 13th Language Resources and Evaluation Conference

2021

Embeddings for the Lexicon: Modelling and Representation
Christian Chiarcos | Thierry Declerck | Maxim Ionov
Proceedings of the 6th Workshop on Semantic Deep Learning (SemDeep-6)

2020

The ACoLi Dictionary Graph
Christian Chiarcos | Christian Fäth | Maxim Ionov
Proceedings of the Twelfth Language Resources and Evaluation Conference

In this paper, we report the release of the ACoLi Dictionary Graph, a large-scale collection of multilingual open source dictionaries available in two machine-readable formats: a graph representation in RDF, using the OntoLex-Lemon vocabulary, and a simple tabular data format to facilitate their use in NLP tasks, such as translation inference across dictionaries. We describe the mapping and harmonization of the underlying data structures into a unified representation, its serialization in RDF and TSV, and the release of a massive and coherent amount of lexical data under open licenses.

Fintan - Flexible, Integrated Transformation and Annotation eNgineering
Christian Fäth | Christian Chiarcos | Björn Ebbrecht | Maxim Ionov
Proceedings of the Twelfth Language Resources and Evaluation Conference

We introduce the Flexible and Integrated Transformation and Annotation eNgineering (Fintan) platform for converting heterogeneous linguistic resources to RDF. With its modular architecture, workflow management and visualization features, Fintan facilitates the development of complex transformation pipelines by integrating generic RDF converters and augmenting them with extended graph processing capabilities: Existing converters can be easily deployed to the system by means of an ontological data structure which renders their properties and the dependencies between transformation steps. Development of subsequent graph transformation steps for resource transformation, annotation engineering or entity linking is further facilitated by a novel visual rendering of SPARQL queries. A graphical workflow manager allows users to easily manage the converter modules and combine them into new transformation pipelines. Employing the stream-based graph processing approach first implemented with CoNLL-RDF, we address common challenges and scalability issues when transforming resources and showcase the performance of Fintan by means of a purely graph-based transformation of the Universal Morphology data to RDF.

Proceedings of the 7th Workshop on Linked Data in Linguistics (LDL-2020)
Maxim Ionov | John P. McCrae | Christian Chiarcos | Thierry Declerck | Julia Bosque-Gil | Jorge Gracia
Proceedings of the 7th Workshop on Linked Data in Linguistics (LDL-2020)

Modelling Frequency and Attestations for OntoLex-Lemon
Christian Chiarcos | Maxim Ionov | Jesse de Does | Katrien Depuydt | Anas Fahad Khan | Sander Stolk | Thierry Declerck | John Philip McCrae
Proceedings of the 2020 Globalex Workshop on Linked Lexicography

The OntoLex vocabulary enjoys increasing popularity as a means of publishing lexical resources with RDF and as Linked Data. The recent publication of a new OntoLex module for lexicography, lexicog, reflects its increasing importance for digital lexicography. However, not all aspects of digital lexicography have been covered to the same extent. In particular, supplementary information drawn from corpora, such as frequency information, links to attestations, and collocation data, was considered to be beyond the scope of lexicog. Therefore, the OntoLex community has put forward a proposal for a novel module for frequency, attestation and corpus information (FrAC) that not only covers the requirements of digital lexicography, but also accommodates essential data structures for lexical information in natural language processing. This paper introduces the current state of the OntoLex-FrAC vocabulary and describes its structure, some selected use cases, elementary concepts and fundamental definitions, with a focus on frequency and attestations.

2018

Universal Morphologies for the Caucasus region
Christian Chiarcos | Kathrin Donandt | Maxim Ionov | Monika Rind-Pawlowski | Hasmik Sargsian | Jesse Wichers Schreur | Frank Abromeit | Christian Fäth
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2015

Expanding the horizons: adding a new language to the news personalization system
Andrey Fedorovsky | Maxim Ionov | Varvara Litvinova | Tatyana Olenina | Darya Trofimova
Proceedings of the First Workshop on Computing News Storylines

2012

RU-EVAL-2012: Evaluating Dependency Parsers for Russian
Anastasia Gareyshina | Maxim Ionov | Olga Lyashevskaya | Dmitry Privoznov | Elena Sokolova | Svetlana Toldova
Proceedings of COLING 2012: Posters