Egoitz Laparra


2022

pdf bib
Taxonomy Builder: a Data-driven and User-centric Tool for Streamlining Taxonomy Construction
Mihai Surdeanu | John Hungerford | Yee Seng Chan | Jessica MacBride | Benjamin Gyori | Andrew Zupon | Zheng Tang | Haoling Qiu | Bonan Min | Yan Zverev | Caitlin Hilverman | Max Thomas | Walter Andrews | Keith Alcock | Zeyu Zhang | Michael Reynolds | Steven Bethard | Rebecca Sharp | Egoitz Laparra
Proceedings of the Second Workshop on Bridging Human–Computer Interaction and Natural Language Processing

An existing domain taxonomy for normalizing content is often assumed when discussing approaches to information extraction, yet in real-world scenarios there is frequently none. When one does exist, it must be continually extended as information needs shift. This is a slow and tedious task, and one that does not scale well. Here we propose an interactive tool that allows a taxonomy to be built or extended rapidly, with a human in the loop to control precision. We apply insights from text summarization and information extraction to reduce the search space dramatically, then leverage modern pretrained language models to perform contextualized clustering of the remaining concepts to yield candidate nodes for the user to review. We show that this allows a user to consider as many as 200 taxonomy concept candidates per hour, making it possible to quickly build or extend a taxonomy to better fit information needs.
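As a rough illustration of the contextualized-clustering step described above, the sketch below embeds candidate concepts in their sentence contexts and groups them so a user reviews clusters rather than individual mentions. The paper does not specify the embedding model or clustering algorithm; sentence-transformers, the "all-MiniLM-L6-v2" encoder, scikit-learn, and the example sentences are stand-ins chosen for illustration only.

```python
# Illustrative sketch only: the embedding model, clustering algorithm, and
# example data below are assumptions, not the tool's actual components.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

# Candidate concepts kept in their sentence contexts so the embedding
# reflects how each concept is actually used (hypothetical data).
candidates_in_context = [
    "Farmers reported crop failure after the prolonged drought.",
    "The drought reduced water availability across the region.",
    "Food prices rose sharply following the poor harvest.",
    "Market speculation drove further increases in food prices.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder choice
embeddings = model.encode(candidates_in_context)

# Group similar candidates so the user reviews one cluster at a time.
clusters = AgglomerativeClustering(n_clusters=2).fit_predict(embeddings)
for label, text in zip(clusters, candidates_in_context):
    print(label, text)
```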

2021

pdf bib
SemEval-2021 Task 10: Source-Free Domain Adaptation for Semantic Processing
Egoitz Laparra | Xin Su | Yiyun Zhao | Özlem Uzuner | Timothy Miller | Steven Bethard
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

This paper presents the Source-Free Domain Adaptation shared task held within SemEval-2021. The aim of the task was to explore adaptation of machine-learning models in the face of data sharing constraints. Specifically, we consider the scenario where annotations exist for a domain but cannot be shared. Instead, participants are provided with models trained on that (source) data. Participants also receive some labeled data from a new (development) domain on which to explore domain adaptation algorithms. Participants are then tested on data representing a new (target) domain. We explored this scenario with two different semantic tasks: negation detection (a text classification task) and time expression recognition (a sequence tagging task).

pdf bib
Domain adaptation in practice: Lessons from a real-world information extraction pipeline
Timothy Miller | Egoitz Laparra | Steven Bethard
Proceedings of the Second Workshop on Domain Adaptation for NLP

Advances in transfer learning and domain adaptation have raised hopes that once-challenging NLP tasks are ready to be put to use for sophisticated information extraction needs. In this work, we describe an effort to do just that – combining state-of-the-art neural methods for negation detection, document time relation extraction, and aspectual link prediction, with the eventual goal of extracting drug timelines from electronic health record text. We train on the THYME colon cancer corpus and test on both the THYME brain cancer corpus and an internal corpus, and show that performance of the combined systems is unacceptable despite good performance of individual systems. Although domain adaptation shows improvements on each individual system, the model selection problem is a barrier to improving overall pipeline performance.

2020

pdf bib
A Dataset and Evaluation Framework for Complex Geographical Description Parsing
Egoitz Laparra | Steven Bethard
Proceedings of the 28th International Conference on Computational Linguistics

Much previous work on geoparsing has focused on identifying and resolving individual toponyms in text like Adrano, S.Maria di Licodia or Catania. However, geographical locations occur not only as individual toponyms, but also as compositions of reference geolocations joined and modified by connectives, e.g., “... between the towns of Adrano and S.Maria di Licodia, 32 kilometres northwest of Catania”. Ideally, a geoparser should be able to take such text, and the geographical shapes of the toponyms referenced within it, and parse these into a geographical shape, formed by a set of coordinates, that represents the location described. But creating a dataset for this complex geoparsing task is difficult and, if done manually, would require a huge amount of effort to annotate the geographical shapes of not only the geolocation described but also the reference toponyms. We present an approach that automates most of the process by combining Wikipedia and OpenStreetMap. As a result, we have gathered a collection of 360,187 uncurated complex geolocation descriptions, from which we have manually curated 1,000 examples intended to be used as a test set. To accompany the data, we define a new geoparsing evaluation framework along with a scoring methodology and a set of baselines.
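To make the compositional nature of such descriptions concrete, the sketch below works out a single distance/direction modifier ("32 kilometres northwest of Catania") as a geodesic offset from a reference point. This is not the paper's system, only a hedged illustration; the geopy library and the approximate Catania coordinates are assumptions.

```python
# Worked illustration (not the paper's parser) of composing a reference
# geolocation with a distance/direction modifier.
from geopy.distance import geodesic

catania = (37.5079, 15.0830)  # approximate latitude/longitude of Catania

# Bearings are measured clockwise from north, so northwest = 315 degrees.
target = geodesic(kilometers=32).destination(point=catania, bearing=315)
print(round(target.latitude, 4), round(target.longitude, 4))
```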

2019

pdf bib
Pre-trained Contextualized Character Embeddings Lead to Major Improvements in Time Normalization: a Detailed Analysis
Dongfang Xu | Egoitz Laparra | Steven Bethard
Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019)

Recent studies have shown that pre-trained contextual word embeddings, which assign the same word different vectors in different contexts, improve performance in many tasks. But while contextual embeddings can also be trained at the character level, the effectiveness of such embeddings has not been studied. We derive character-level contextual embeddings from Flair (Akbik et al., 2018), and apply them to a time normalization task, yielding major performance improvements over the previous state-of-the-art: 51% error reduction in news and 33% in clinical notes. We analyze the sources of these improvements, and find that pre-trained contextual character embeddings are more robust to term variations, infrequent terms, and cross-domain changes. We also quantify the size of context that pre-trained contextual character embeddings take advantage of, and show that such embeddings capture features like part-of-speech and capitalization.
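For readers unfamiliar with Flair, the minimal sketch below shows how its pre-trained forward and backward character-level language models are typically queried. Note a caveat: Flair's standard API pools the character LM states into one vector per token, whereas the paper works with per-character representations; the model names and example sentence here are just common defaults, not the paper's exact configuration.

```python
# Minimal sketch of contextual embeddings from Flair's character-level LMs
# (token-level pooling; per-character extraction as in the paper not shown).
from flair.data import Sentence
from flair.embeddings import FlairEmbeddings, StackedEmbeddings

embeddings = StackedEmbeddings([
    FlairEmbeddings("news-forward"),   # forward character LM
    FlairEmbeddings("news-backward"),  # backward character LM
])

sentence = Sentence("The scan was repeated three weeks after discharge .")
embeddings.embed(sentence)

for token in sentence:
    print(token.text, tuple(token.embedding.shape))
```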

pdf bib
University of Arizona at SemEval-2019 Task 12: Deep-Affix Named Entity Recognition of Geolocation Entities
Vikas Yadav | Egoitz Laparra | Ti-Tai Wang | Mihai Surdeanu | Steven Bethard
Proceedings of the 13th International Workshop on Semantic Evaluation

We present the Named Entity Recognition (NER) and disambiguation model used by the University of Arizona team (UArizona) for SemEval 2019 Task 12. We achieved fourth place on tasks 1 and 3. For task 1, we implemented a deep-affix based LSTM-CRF NER model, which uses only character, word, prefix and suffix information to identify geolocation entities. Despite using just the training data provided by the task organizers and no lexicon features, we achieved a 78.85% strict micro-F1 score on task 1. For task 3, we used unsupervised population heuristics and achieved a 52.99% strict micro-F1 score.

pdf bib
Eidos, INDRA, & Delphi: From Free Text to Executable Causal Models
Rebecca Sharp | Adarsh Pyarelal | Benjamin Gyori | Keith Alcock | Egoitz Laparra | Marco A. Valenzuela-Escárcega | Ajay Nagesh | Vikas Yadav | John Bachman | Zheng Tang | Heather Lent | Fan Luo | Mithun Paul | Steven Bethard | Kobus Barnard | Clayton Morrison | Mihai Surdeanu
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations)

Building causal models of complicated phenomena such as food insecurity is currently a slow and labor-intensive manual process. In this paper, we introduce an approach that builds executable probabilistic models from raw, free text. The proposed approach is implemented through three systems: Eidos, INDRA, and Delphi. Eidos is an open-domain machine reading system designed to extract causal relations from natural language. It is rule-based, allowing for rapid domain transfer, customizability, and interpretability. INDRA aggregates multiple sources of causal information and performs assembly to create a coherent knowledge base and assess its reliability. This assembled knowledge serves as the starting point for modeling. Delphi is a modeling framework that assembles quantified causal fragments and their contexts into executable probabilistic models that respect the semantics of the original text, and can be used to support decision making.

pdf bib
Inferring missing metadata from environmental policy texts
Steven Bethard | Egoitz Laparra | Sophia Wang | Yiyun Zhao | Ragheb Al-Ghezi | Aaron Lien | Laura López-Hoffman
Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

The National Environmental Policy Act (NEPA) provides a trove of data on how environmental policy decisions have been made in the United States over the last 50 years. Unfortunately, there is no central database for this information and it is too voluminous to assess manually. We describe our efforts to enable systematic research over US environmental policy by extracting and organizing metadata from the text of NEPA documents. Our contributions include collecting more than 40,000 NEPA-related documents, and evaluating rule-based baselines that establish the difficulty of three important tasks: identifying lead agencies, aligning document versions, and detecting reused text.

2018

pdf bib
From Characters to Time Intervals: New Paradigms for Evaluation and Neural Parsing of Time Normalizations
Egoitz Laparra | Dongfang Xu | Steven Bethard
Transactions of the Association for Computational Linguistics, Volume 6

This paper presents the first model for time normalization trained on the SCATE corpus. In the SCATE schema, time expressions are annotated as a semantic composition of time entities. This novel schema favors machine learning approaches, as it can be viewed as a semantic parsing task. In this work, we propose a character-level multi-output neural network that outperforms the previous state of the art built on the TimeML schema. To compare predictions of systems that follow both SCATE and TimeML, we present a new scoring metric for time intervals. We also apply this new metric to carry out a comparative analysis of the annotations of both schemes on the same corpus.
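The sketch below illustrates one plausible form of interval-based scoring: precision and recall computed over the total length of overlap between predicted and gold intervals. The exact metric is defined in the paper and may differ in detail; the helper functions and the example dates are hypothetical.

```python
# Simplified sketch of overlap-based scoring for time intervals, assuming
# precision/recall over total overlap duration (the paper's metric may differ).
from datetime import datetime, timedelta

def total_length(intervals):
    return sum((end - start for start, end in intervals), timedelta())

def total_overlap(predicted, gold):
    overlap = timedelta()
    for p_start, p_end in predicted:
        for g_start, g_end in gold:
            start, end = max(p_start, g_start), min(p_end, g_end)
            if start < end:
                overlap += end - start
    return overlap

# Hypothetical example: gold is the first week of March 2018,
# the system predicts March 2 through March 10.
gold = [(datetime(2018, 3, 1), datetime(2018, 3, 8))]
predicted = [(datetime(2018, 3, 2), datetime(2018, 3, 10))]

overlap = total_overlap(predicted, gold)
precision = overlap / total_length(predicted)  # timedelta division yields a float
recall = overlap / total_length(gold)
print(f"P={precision:.2f} R={recall:.2f}")
```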

pdf bib
Detecting Diabetes Risk from Social Media Activity
Dane Bell | Egoitz Laparra | Aditya Kousik | Terron Ishihara | Mihai Surdeanu | Stephen Kobourov
Proceedings of the Ninth International Workshop on Health Text Mining and Information Analysis

This work explores the detection of individuals’ risk of type 2 diabetes mellitus (T2DM) directly from their social media (Twitter) activity. Our approach extends a deep learning architecture with several contributions: following previous observations that language use differs by gender, it captures and uses gender information through domain adaptation; it captures recency of posts under the hypothesis that more recent posts are more representative of an individual’s current risk status; and, lastly, it demonstrates that in this scenario, where activity factors are sparsely represented in the data, a bag-of-words neural network model using custom dictionaries of food and activity words performs better than other neural sequence models. Our best model, which incorporates all these contributions, achieves a risk-detection F1 of 41.9, considerably higher than the baseline rate (36.9).

pdf bib
SemEval 2018 Task 6: Parsing Time Normalizations
Egoitz Laparra | Dongfang Xu | Ahmed Elsayed | Steven Bethard | Martha Palmer
Proceedings of the 12th International Workshop on Semantic Evaluation

This paper presents the outcomes of the Parsing Time Normalization shared task held within SemEval-2018. The aim of the task is to parse time expressions into the compositional semantic graphs of the Semantically Compositional Annotation of Time Expressions (SCATE) schema, which allows the representation of a wider variety of time expressions than previous approaches. Two tracks were included, one to evaluate the parsing of individual components of the produced graphs, in a classic information extraction way, and another one to evaluate the quality of the time intervals resulting from the interpretation of those graphs. Though 40 participants registered for the task, only one team submitted output, achieving 0.55 F1 in Track 1 (parsing) and 0.70 F1 in Track 2 (intervals).

2016

pdf bib
The Event and Implied Situation Ontology (ESO): Application and Evaluation
Roxane Segers | Marco Rospocher | Piek Vossen | Egoitz Laparra | German Rigau | Anne-Lyse Minard
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents the Event and Implied Situation Ontology (ESO), a manually constructed resource that formalizes the pre- and post-situations of events and the roles of the entities affected by an event. The ontology is built on top of existing resources such as WordNet, SUMO and FrameNet, and is injected into the Predicate Matrix, a resource that integrates predicate and role information from, amongst others, FrameNet, VerbNet, PropBank, NomBank and WordNet. We illustrate how these resources are used on large document collections to detect information that otherwise would have remained implicit. The ontology is evaluated on two aspects: first, recall and precision against a manually annotated corpus and, second, the quality of the knowledge inferred by the situation assertions in the ontology. The evaluation shows that 50% of the events typed and enriched with ESO assertions are correct.

pdf bib
A Multilingual Predicate Matrix
Maddalen Lopez de Lacalle | Egoitz Laparra | Itziar Aldabe | German Rigau
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents the Predicate Matrix 1.3, a lexical resource resulting from the integration of multiple sources of predicate information including FrameNet, VerbNet, PropBank and WordNet. This new version of the Predicate Matrix has been extended to cover nominal predicates by adding mappings to NomBank. Similarly, we have integrated resources in Spanish, Catalan and Basque. As a result, the Predicate Matrix 1.3 provides a multilingual lexicon to allow interoperable semantic analysis in multiple languages.

pdf bib
The Predicate Matrix and the Event and Implied Situation Ontology: Making More of Events
Roxane Segers | Egoitz Laparra | Marco Rospocher | Piek Vossen | German Rigau | Filip Ilievski
Proceedings of the 8th Global WordNet Conference (GWC)

This paper presents the Event and Implied Situation Ontology (ESO), a resource that formalizes the pre- and post-situations of events and the roles of the entities affected by an event. The ontology reuses and maps across existing resources such as WordNet, SUMO, VerbNet, PropBank and FrameNet. We describe how ESO is injected into a new version of the Predicate Matrix and illustrate how these resources are used to detect information in large document collections that would otherwise have remained implicit. The model targets interpretations of situations rather than the semantics of verbs per se. Each event is interpreted as a situation, represented in RDF, taking all event components into account. Hence, the ontology and the linked resources need to be considered from the perspective of this interpretation model.

2015

pdf bib
Document Level Time-anchoring for TimeLine Extraction
Egoitz Laparra | Itziar Aldabe | German Rigau
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

pdf bib
Semantic Interoperability for Cross-lingual and cross-document Event Detection
Piek Vossen | Egoitz Laparra | German Rigau | Itziar Aldabe
Proceedings of the 3rd Workshop on EVENTS: Definition, Detection, Coreference, and Representation

pdf bib
From TimeLines to StoryLines: A preliminary proposal for evaluating narratives
Egoitz Laparra | Itziar Aldabe | German Rigau
Proceedings of the First Workshop on Computing News Storylines

2014

pdf bib
Predicate Matrix: extending SemLink through WordNet mappings
Maddalen Lopez de Lacalle | Egoitz Laparra | German Rigau
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper presents the Predicate Matrix v1.1, a new lexical resource resulting from the integration of multiple sources of predicate information, including FrameNet, VerbNet, PropBank and WordNet. We start from the basis of SemLink. First, we use advanced graph-based algorithms to further extend the mapping coverage of SemLink. Second, we exploit the current content of SemLink to infer new role mappings among the different predicate schemas. As a result, we have obtained a new version of the Predicate Matrix that largely extends the current coverage of SemLink and the previous version of the Predicate Matrix.

pdf bib
First steps towards a Predicate Matrix
Maddalen López de Lacalle | Egoitz Laparra | German Rigau
Proceedings of the Seventh Global Wordnet Conference

2013

pdf bib
Sources of Evidence for Implicit Argument Resolution
Egoitz Laparra | German Rigau
Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013) – Long Papers

pdf bib
ImpAr: A Deterministic Algorithm for Implicit Semantic Role Labelling
Egoitz Laparra | German Rigau
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2012

pdf bib
Multilingual Central Repository version 3.0
Aitor Gonzalez-Agirre | Egoitz Laparra | German Rigau
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper describes the upgrading process of the Multilingual Central Repository (MCR). The new MCR uses WordNet 3.0 as its Interlingual Index (ILI). The current version of the MCR integrates, in the same EuroWordNet framework, wordnets from five different languages: English, Spanish, Catalan, Basque and Galician. In order to provide ontological coherence to all the integrated wordnets, the MCR has also been enriched with a disparate set of ontologies: Base Concepts, Top Ontology, WordNet Domains and the Suggested Upper Merged Ontology. The whole content of the MCR is freely available.

pdf bib
Mapping WordNet to the Kyoto ontology
Egoitz Laparra | German Rigau | Piek Vossen
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper describes the connection of WordNet to a generic ontology based on DOLCE. We developed a complete set of heuristics for mapping all WordNet nouns, verbs and adjectives to the ontology. Moreover, the mapping allows predicates to be represented in a uniform and interoperable way, regardless of how, and in which language, they are expressed in the text. Together with the ontology, the WordNet mappings provide an extremely rich and powerful basis for semantic processing of text in any domain. In particular, the mapping has been used in a knowledge-rich event-mining system developed for the Asian-European project KYOTO.

2010

pdf bib
Integrating a Large Domain Ontology of Species into WordNet
Montse Cuadros | Egoitz Laparra | German Rigau | Piek Vossen | Wauter Bosma
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

With the proliferation of applications sharing information represented in multiple ontologies, the development of automatic methods for robust and accurate ontology matching will be crucial to their success. Connecting and merging already existing semantic networks is perhaps one of the most challenging tasks related to knowledge engineering. This paper presents a new approach for automatically aligning a very large domain ontology of Species to WordNet in the framework of the KYOTO project. The approach relies on a knowledge-based Word Sense Disambiguation algorithm that accurately assigns WordNet synsets to the concepts represented in Species 2000.

pdf bib
eXtended WordFrameNet
Egoitz Laparra | German Rigau
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper presents a novel automatic approach to partially integrate FrameNet and WordNet. In that way we expect to extend FrameNet coverage, to enrich WordNet with frame semantic information and possibly to extend FrameNet to languages other than English. The method uses a knowledge-based Word Sense Disambiguation algorithm for matching the FrameNet lexical units to WordNet synsets. Specifically, we exploit a graph-based Word Sense Disambiguation algorithm that uses a large-scale knowledge-base derived from existing semantic resources. We have developed and tested additional versions of this algorithm showing substantial improvements over state-of-the-art results. Finally, we show some examples and figures of the resulting semantic resource.

2009

pdf bib
Integrating WordNet and FrameNet using a Knowledge-based Word Sense Disambiguation Algorithm
Egoitz Laparra | German Rigau
Proceedings of the International Conference RANLP-2009

2008

pdf bib
Complete and Consistent Annotation of WordNet using the Top Concept Ontology
Javier Álvez | Jordi Atserias | Jordi Carrera | Salvador Climent | Egoitz Laparra | Antoni Oliver | German Rigau
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper presents the complete and consistent ontological annotation of the nominal part of WordNet. The annotation has been carried out using the semantic features defined in the EuroWordNet Top Concept Ontology and is made available to the NLP community. Up to now, only an initial core set of 1,024 synsets, the so-called Base Concepts, had been ontologized in this way. The work has been achieved by following a methodology based on an iterative and incremental expansion of the initial labeling through the hierarchy while setting inheritance blockage points. Since this labeling has been set on EuroWordNet’s Interlingual Index (ILI), it can also be used to populate any other wordnet linked to it through a simple porting process. This feature-annotated WordNet is intended to be useful for a large number of semantic NLP tasks and for testing componential analysis for the first time in real environments. Moreover, the quantitative analysis of the work shows that more than 40% of the nominal part of WordNet is involved in structural errors or inadequacies.