Marco Idiart


2022

SemEval-2022 Task 2: Multilingual Idiomaticity Detection and Sentence Embedding
Harish Tayyar Madabushi | Edward Gow-Smith | Marcos Garcia | Carolina Scarton | Marco Idiart | Aline Villavicencio
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

This paper presents the shared task on Multilingual Idiomaticity Detection and Sentence Embedding, which consists of two subtasks: (a) a binary classification task aimed at identifying whether a sentence contains an idiomatic expression, and (b) a task based on semantic text similarity which requires the model to adequately represent potentially idiomatic expressions in context. Each subtask includes different settings regarding the amount of training data. Besides the task description, this paper introduces the datasets in English, Portuguese, and Galician and their annotation procedure, the evaluation metrics, and a summary of the participant systems and their results. The task had close to 100 registered participants organised into twenty-five teams, making over 650 and 150 submissions in the practice and evaluation phases, respectively.

2021

Probing for idiomaticity in vector space models
Marcos Garcia | Tiago Kramer Vieira | Carolina Scarton | Marco Idiart | Aline Villavicencio
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Contextualised word representation models have been successfully used for capturing different word usages and they may be an attractive alternative for representing idiomaticity in language. In this paper, we propose probing measures to assess if some of the expected linguistic properties of noun compounds, especially those related to idiomatic meanings, and their dependence on context and sensitivity to lexical choice, are readily available in some standard and widely used representations. For that, we constructed the Noun Compound Senses Dataset, which contains noun compounds and their paraphrases, in context-neutral and context-informative naturalistic sentences, in two languages: English and Portuguese. Results obtained using four types of probing measures with models like ELMo, BERT and some of its variants indicate that idiomaticity is not yet accurately represented by contextualised models.

Assessing the Representations of Idiomaticity in Vector Models with a Noun Compound Dataset Labeled at Type and Token Levels
Marcos Garcia | Tiago Kramer Vieira | Carolina Scarton | Marco Idiart | Aline Villavicencio
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Accurate assessment of the ability of embedding models to capture idiomaticity may require evaluation at token rather than type level, to account for degrees of idiomaticity and possible ambiguity between literal and idiomatic usages. However, most existing resources with annotation of idiomaticity include ratings only at type level. This paper presents the Noun Compound Type and Token Idiomaticity (NCTTI) dataset, with human annotations for 280 noun compounds in English and 180 in Portuguese at both type and token level. We compiled 8,725 and 5,091 token level annotations for English and Portuguese, respectively, which are strongly correlated with the corresponding scores obtained at type level. The NCTTI dataset is used to explore how vector space models reflect the variability of idiomaticity across sentences. Several experiments using state-of-the-art contextualised models suggest that their representations do not capture the idiomaticity of noun compounds as human annotators do. This new multilingual resource also contains suggestions for paraphrases of the noun compounds both at type and token levels, with uses for lexical substitution or disambiguation in context.

2019

Unsupervised Compositionality Prediction of Nominal Compounds
Silvio Cordeiro | Aline Villavicencio | Marco Idiart | Carlos Ramisch
Computational Linguistics, Volume 45, Issue 1 - March 2019

Nominal compounds such as red wine and nut case display a continuum of compositionality, with varying contributions from the components of the compound to its semantics. This article proposes a framework for compound compositionality prediction using distributional semantic models, evaluating to what extent they capture idiomaticity compared to human judgments. For evaluation, we introduce data sets containing human judgments in three languages: English, French, and Portuguese. The results obtained reveal a high agreement between the models and human predictions, suggesting that they are able to incorporate information about idiomaticity. We also present an in-depth evaluation of various factors that can affect prediction, such as model and corpus parameters and compositionality operations. General crosslingual analyses reveal the impact of morphological variation and corpus size on the ability of the model to predict compositionality, and of a uniform combination of the components for best results.

2018

Proceedings of the Eighth Workshop on Cognitive Aspects of Computational Language Learning and Processing
Marco Idiart | Alessandro Lenci | Thierry Poibeau | Aline Villavicencio
Proceedings of the Eighth Workshop on Cognitive Aspects of Computational Language Learning and Processing

The brWaC Corpus: A New Open Resource for Brazilian Portuguese
Jorge A. Wagner Filho | Rodrigo Wilkens | Marco Idiart | Aline Villavicencio
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Similarity Measures for the Detection of Clinical Conditions with Verbal Fluency Tasks
Felipe Paula | Rodrigo Wilkens | Marco Idiart | Aline Villavicencio
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

Semantic Verbal Fluency tests have been used in the detection of certain clinical conditions, such as dementia. In particular, given a sequence of semantically related words, a large number of switches from one semantic class to another has been linked to clinical conditions. In this work, we investigate three similarity measures for automatically identifying switches in semantic chains: semantic similarity from a manually constructed resource, and word association strength and semantic relatedness, both calculated from corpora. This information is used for building classifiers to distinguish healthy controls from clinical cases with early stages of Alzheimer’s Disease and Mild Cognitive Deficits. The overall results indicate that for clinical conditions the classifiers that use these similarity measures outperform those that use a gold standard taxonomy.

2017

LexSubNC: A Dataset of Lexical Substitution for Nominal Compounds
Rodrigo Wilkens | Leonardo Zilio | Silvio Ricardo Cordeiro | Felipe Paula | Carlos Ramisch | Marco Idiart | Aline Villavicencio
Proceedings of the 12th International Conference on Computational Semantics (IWCS) — Short papers

2016

Multiword Expressions in Child Language
Rodrigo Wilkens | Marco Idiart | Aline Villavicencio
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The goal of this work is to introduce CHILDES-MWE, which contains English CHILDES corpora automatically annotated with Multiword Expressions (MWEs) information. The result is a resource with almost 350,000 sentences annotated with more than 70,000 distinct MWEs of various types from both longitudinal and latitudinal corpora. This resource can be used for large scale language acquisition studies of how MWEs feature in child language. Focusing on compound nouns (CN), we then verify in a longitudinal study if there are differences in the distribution and compositionality of CNs in child-directed and child-produced sentences across ages. Moreover, using additional latitudinal data, we investigate if there are further differences in CN usage and in compositionality preferences. The results obtained for the child-produced sentences reflect CN distribution and compositionality in child-directed sentences.

Predicting the Compositionality of Nominal Compounds: Giving Word Embeddings a Hard Time
Silvio Cordeiro | Carlos Ramisch | Marco Idiart | Aline Villavicencio
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

How Naked is the Naked Truth? A Multilingual Lexicon of Nominal Compound Compositionality
Carlos Ramisch | Silvio Cordeiro | Leonardo Zilio | Marco Idiart | Aline Villavicencio
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Matrix Factorization using Window Sampling and Negative Sampling for Improved Word Representations
Alexandre Salle | Aline Villavicencio | Marco Idiart
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2014

Comparing Similarity Measures for Distributional Thesauri
Muntsa Padró | Marco Idiart | Aline Villavicencio | Carlos Ramisch
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Distributional thesauri have been applied to a variety of tasks involving semantic relatedness. In this paper, we investigate the impact of three parameters: similarity measures, frequency thresholds and association scores. We focus on the robustness and stability of the resulting thesauri, measuring inter-thesaurus agreement when testing different parameter values. The results obtained show that low-frequency thresholds affect thesaurus quality more than similarity measures, with more agreement found for increasing thresholds. These results indicate the sensitivity of distributional thesauri to frequency. Nonetheless, the observed differences do not carry over to extrinsic evaluation using TOEFL-like questions. While this may be specific to the task, we argue that a careful examination of the stability of distributional resources prior to application is needed.

Nothing like Good Old Frequency: Studying Context Filters for Distributional Thesauri
Muntsa Padró | Marco Idiart | Aline Villavicencio | Carlos Ramisch
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

2013

Language Acquisition and Probabilistic Models: keeping it simple
Aline Villavicencio | Marco Idiart | Robert Berwick | Igor Malioutov
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2012

An annotated English child language database
Aline Villavicencio | Beracah Yankama | Rodrigo Wilkens | Marco Idiart | Robert Berwick
Proceedings of the Workshop on Computational Models of Language Acquisition and Loss

Get out but don’t fall down: verb-particle constructions in child language
Aline Villavicencio | Marco Idiart | Carlos Ramisch | Vítor Araújo | Beracah Yankama | Robert Berwick
Proceedings of the Workshop on Computational Models of Language Acquisition and Loss

A large scale annotated child language construction database
Aline Villavicencio | Beracah Yankama | Marco Idiart | Robert Berwick
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Large scale annotated corpora of child language can be of great value in assessing theoretical proposals regarding language acquisition models. For example, they can help determine whether the type and amount of data required by a proposed language acquisition model can actually be found in a naturalistic data sample. To this end, several recent efforts have augmented the CHILDES child language corpora with POS tagging and parsing information for languages such as English. With the increasing availability of robust NLP systems and electronic resources, these corpora can be further annotated with more detailed information about the properties of words, verb argument structure, and sentences. This paper describes such an initiative for combining information from various sources to extend the annotation of the English CHILDES corpora with linguistic, psycholinguistic and distributional information, along with an example illustrating an application of this approach to the extraction of verb alternation information. The end result, the English CHILDES Verb Construction Database, is an integrated resource containing information such as grammatical relations, verb semantic classes, and age of acquisition, enabling more targeted complex searches involving different levels of annotation that can facilitate a more detailed analysis of the linguistic input available to children.

2008

Picking them up and Figuring them out: Verb-Particle Constructions, Noise and Idiomaticity
Carlos Ramisch | Aline Villavicencio | Leonardo Moura | Marco Idiart
CoNLL 2008: Proceedings of the Twelfth Conference on Computational Natural Language Learning

2007

Validation and Evaluation of Automatically Acquired Multiword Expressions for Grammar Engineering
Aline Villavicencio | Valia Kordoni | Yi Zhang | Marco Idiart | Carlos Ramisch
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

2006

Automated Multiword Expression Prediction for Grammar Engineering
Yi Zhang | Valia Kordoni | Aline Villavicencio | Marco Idiart
Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties