LexiDB: Patterns & Methods for Corpus Linguistic Database Management

Matthew Coole; Paul Rayson; John Mariani

Abstract

LexiDB is a tool for storing, managing and querying corpus data. Unlike general-purpose database management systems (DBMSs), it is designed specifically for text corpora. It improves on other corpus management systems (CMSs) in that data can be added to and deleted from corpora on the fly, including live data added to existing corpora. LexiDB thus sits between DBMSs and CMSs: it is more specialised to language data than a general-purpose DBMS but more flexible than a traditional static corpus management system. Previous work has demonstrated the scalability of LexiDB in response to the growing need to scale out for ever-growing corpus datasets. Here, we present the patterns and methods developed in LexiDB for storage, retrieval and querying of multi-level annotated corpus data. These techniques are evaluated and compared to an existing CMS (Corpus Workbench, CWB/CQP) and an indexer (Lucene). We find that LexiDB consistently outperforms existing tools for corpus queries. This is particularly apparent with large corpora and when handling queries with large result sets.

Anthology ID: 2020.lrec-1.383
Volume: Proceedings of the Twelfth Language Resources and Evaluation Conference
Month: May
Year: 2020
Address: Marseille, France
Editors: Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association
Pages: 3128–3135
Language: English
URL: https://aclanthology.org/2020.lrec-1.383
Cite (ACL): Matthew Coole, Paul Rayson, and John Mariani. 2020. LexiDB: Patterns & Methods for Corpus Linguistic Database Management. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 3128–3135, Marseille, France. European Language Resources Association.
Cite (Informal): LexiDB: Patterns & Methods for Corpus Linguistic Database Management (Coole et al., LREC 2020)
PDF: https://aclanthology.org/2020.lrec-1.383.pdf
