Gemma Bel-Enguix

Also published as: Gemma Bel Enguix, Gemma Bel Enguix, Gemma Bel-enguix


2023

pdf bib
HOMO-MEX: A Mexican Spanish Annotated Corpus for LGBT+phobia Detection on Twitter
Juan Vásquez | Scott Andersen | Gemma Bel-enguix | Helena Gómez-adorno | Sergio-luis Ojeda-trueba
The 7th Workshop on Online Abuse and Harms (WOAH)

In the past few years, the NLP community has actively worked on detecting LGBT+Phobia in online spaces, using textual data publicly available Most of these are for the English language and its variants since it is the most studied language by the NLP community. Nevertheless, efforts towards creating corpora in other languages are active worldwide. Despite this, the Spanish language is an understudied language regarding digital LGBT+Phobia. The only corpus we found in the literature was for the Peninsular Spanish dialects, which use LGBT+phobic terms different than those in the Mexican dialect. For this reason, we present Homo-MEX, a novel corpus for detecting LGBT+Phobia in Mexican Spanish. In this paper, we describe our data-gathering and annotation process. Also, we present a classification benchmark using various traditional machine learning algorithms and two pre-trained deep learning models to showcase our corpus classification potential.

2022

pdf bib
HeteroCorpus: A Corpus for Heteronormative Language Detection
Juan Vásquez | Gemma Bel-Enguix | Scott Thomas Andersen | Sergio-Luis Ojeda-Trueba
Proceedings of the 4th Workshop on Gender Bias in Natural Language Processing (GeBNLP)

In recent years, plenty of work has been done by the NLP community regarding gender bias detection and mitigation in language systems. Yet, to our knowledge, no one has focused on the difficult task of heteronormative language detection and mitigation. We consider this an urgent issue, since language technologies are growing increasingly present in the world and, as it has been proven by various studies, NLP systems with biases can create real-life adverse consequences for women, gender minorities and racial minorities and queer people. For these reasons, we propose and evaluate HeteroCorpus; a corpus created specifically for studying heterononormative language in English. Additionally, we propose a baseline set of classification experiments on our corpus, in order to show the performance of our corpus in classification tasks.

2020

pdf bib
MineriaUNAM at SemEval-2020 Task 3: Predicting Contextual WordSimilarity Using a Centroid Based Approach and Word Embeddings
Helena Gomez-Adorno | Gemma Bel-Enguix | Jorge Reyes-Magaña | Benjamín Moreno | Ramón Casillas | Daniel Vargas
Proceedings of the Fourteenth Workshop on Semantic Evaluation

This paper presents our systems to solve Task 3 of Semeval-2020, which aims to predict the effect that context has on human perception of similarity of words. The task consists of two subtasks in English, Croatian, Finnish, and Slovenian: (1) predicting the change of similarity and (2) predicting the human scores of similarity, both of them for a pair of words within two different contexts. We tackled the problem by developing two systems, the first one uses a centroid approach and word vectors. The second one uses the ELMo language model, which is trained for each pair of words with the given context. Our approach achieved the highest score in subtask 2 for the English language.

pdf bib
CPLM, a Parallel Corpus for Mexican Languages: Development and Interface
Gerardo Sierra Martínez | Cynthia Montaño | Gemma Bel-Enguix | Diego Córdova | Margarita Mota Montoya
Proceedings of the Twelfth Language Resources and Evaluation Conference

Mexico is a Spanish speaking country that has a great language diversity, with 68 linguistic groups and 364 varieties. As they face a lack of representation in education, government, public services and media, they present high levels of endangerment. Due to the lack of data available on social media and the internet, few technologies have been developed for these languages. To analyze different linguistic phenomena in the country, the Language Engineering Group developed the Corpus Paralelo de Lenguas Mexicanas (CPLM) [The Mexican Languages Parallel Corpus], a collaborative parallel corpus for the low-resourced languages of Mexico. The CPLM aligns Spanish with six indigenous languages: Maya, Ch’ol, Mazatec, Mixtec, Otomi, and Nahuatl. First, this paper describes the process of building the CPLM: text searching, digitalization and alignment process. Furthermore, we present some difficulties regarding dialectal and orthographic variations. Second, we present the interface and types of searching as well as the use of filters.

pdf bib
Enhancing Job Searches in Mexico City with Language Technologies
Gerardo Sierra Martínez | Gemma Bel-Enguix | Helena Gómez-Adorno | Juan Manuel Torres Moreno | Tonatiuh Hernández-García | Julio V Guadarrama-Olvera | Jesús-Germán Ortiz-Barajas | Ángela María Rojas | Tomas Damerau | Soledad Aragón Martínez
Proceedings of the 1st Workshop on Language Technologies for Government and Public Administration (LT4Gov)

In this paper, we show the enhancing of the Demanded Skills Diagnosis (DiCoDe: Diagnóstico de Competencias Demandadas), a system developed by Mexico City’s Ministry of Labor and Employment Promotion (STyFE: Secretaría de Trabajo y Fomento del Empleo de la Ciudad de México) that seeks to reduce information asymmetries between job seekers and employers. The project uses webscraping techniques to retrieve job vacancies posted on private job portals on a daily basis and with the purpose of informing training and individual case management policies as well as labor market monitoring. For this purpose, a collaboration project between STyFE and the Language Engineering Group (GIL: Grupo de Ingeniería Lingüística) was established in order to enhance DiCoDe by applying NLP models and semantic analysis. By this collaboration, DiCoDe’s job vacancies system’s macro-structure and its geographic referencing at the city hall (municipality) level were improved. More specifically, dictionaries were created to identify demanded competencies, skills and abilities (CSA) and algorithms were developed for dynamic classifying of vacancies and identifying terms for searches on free text, in order to improve the results and processing time of queries.

pdf bib
Automatic Word Association Norms (AWAN)
Jorge Reyes-Magaña | Gerardo Sierra Martínez | Gemma Bel-Enguix | Helena Gomez-Adorno
Proceedings of the Workshop on the Cognitive Aspects of the Lexicon

Word Association Norms (WAN) are collections that present stimuli words and the set of their associated responses. The corpus is widely used in diverse areas of expertise. In order to reduce the effort to have a good quality resource that can be reproduced in many languages with minimum sources, a methodology to build Automatic Word Association Norms is proposed (AWAN). The methodology has an input of two simple elements: a) dictionary, and b) pre-processed Word Embeddings. This new kind of WAN is evaluated in two ways: i) learning word embeddings based on the node2vec algorithm and comparing them with human annotated benchmarks, and ii) performing a lexical search for a reverse dictionary. Both evaluations are done in a weighted graph with the AWAN lexical elements. The results showed that the methodology produces good quality AWANs.

pdf bib
Temporal Relations Annotation and Extrapolation Based on Semi-intervals and Boundig Relations
Alejandro Pimentel | Gemma Bel Enguix | Gerardo Sierra Martínez | Azucena Montes
Proceedings of the 28th International Conference on Computational Linguistics

The computational treatment of temporal relations is based on the work of Allen, who establishes 13 different types, and Freksa, who designs a cognitive procedure to manage them. Freksa’s notation is not widely used because, although it has cognitive and expressive advantages, it is too complex from the computational perspective. This paper proposes a system for the annotation and management of temporal relations that combines the richness and expressiveness of Freksa’s approach with the simplicity of Allen’s notation. Our method is summarized in the application of bounding relations, thanks to which it is possible to obtain the temporary representation of complete neighborhoods capable of representing vague temporal relations such as those that can be frequently found in a text. Such advantages are obtained without the need to greatly increase the complexity of the labeling process since the markup language is almost the same as TimeML, to which only a second temporary “relType”’ type label relationship is added. Our experiments show that the temporal relationships that present vagueness are in fact much more common than those in which a single relationship can be established precisely. For these reasons, our new labeling system achieves a more agreeable representation of temporal relations.

2019

pdf bib
MineriaUNAM at SemEval-2019 Task 5: Detecting Hate Speech in Twitter using Multiple Features in a Combinatorial Framework
Luis Enrique Argota Vega | Jorge Carlos Reyes-Magaña | Helena Gómez-Adorno | Gemma Bel-Enguix
Proceedings of the 13th International Workshop on Semantic Evaluation

This paper presents our approach to the Task 5 of Semeval-2019, which aims at detecting hate speech against immigrants and women in Twitter. The task consists of two sub-tasks, in Spanish and English: (A) detection of hate speech and (B) classification of hateful tweets as aggressive or not, and identification of the target harassed as individual or group. We used linguistically motivated features and several types of n-grams (words, characters, functional words, punctuation symbols, POS, among others). For task A, we trained a Support Vector Machine using a combinatorial framework, whereas for task B we followed a multi-labeled approach using the Random Forest classifier. Our approach achieved the highest F1-score in sub-task A for the Spanish language.

bib
A Parallel Corpus Mixtec-Spanish
Cynthia Montaño | Gerardo Sierra Martínez | Gemma Bel-Enguix | Helena Gomez
Proceedings of the 2019 Workshop on Widening NLP

This work is about the compilation process of parallel documents Spanish-Mixtec. There are not many Spanish-Mixec parallel texts and most of the sources are non-digital books. Due to this, we need to face the errors when digitizing the sources and difficulties in sentence alignment, as well as the fact that does not exist a standard orthography. Our parallel corpus consists of sixty texts coming from books and digital repositories. These documents belong to different domains: history, traditional stories, didactic material, recipes, ethnographical descriptions of each town and instruction manuals for disease prevention. We have classified this material in five major categories: didactic (6 texts), educative (6 texts), interpretative (7 texts), narrative (39 texts), and poetic (2 texts). The final total of tokens is 49,814 Spanish words and 47,774 Mixtec words. The texts belong to the states of Oaxaca (48 texts), Guerrero (9 texts) and Puebla (3 texts). According to this data, we see that the corpus is unbalanced in what refers to the representation of the different territories. While 55% of speakers are in Oaxaca, 80% of texts come from this region. Guerrero has the 30% of speakers and the 15% of texts and Puebla, with the 15% of the speakers has a representation of the 5% in the corpus.

2018

pdf bib
Sociolinguistic Corpus of WhatsApp Chats in Spanish among College Students
Alejandro Dorantes | Gerardo Sierra | Tlauhlia Yamín Donohue Pérez | Gemma Bel-Enguix | Mónica Jasso Rosales
Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media

This work presents the Sociolinguistic Corpus of WhatsApp Chats in Spanish among College Students, a corpus of raw data for general use. Its purpose is to offer data for the study of of language and interactions via Instant Messaging (IM) among bachelors. Our paper consists of an overview of both the corpus’s content and demographic metadata. Furthermore, it presents the current research being conducted with it —namely parenthetical expressions, orality traits, and code-switching. This work also includes a brief outline of similar corpora and recent studies in the field of IM.

pdf bib
Textual Features Indicative of Writing Proficiency in Elementary School Spanish Documents
Gemma Bel-Enguix | Diana Dueñas Chávez | Arturo Curiel Díaz
Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications

Childhood acquisition of written language is not straightforward. Writing skills evolve differently depending on external factors, such as the conditions in which children practice their productions and the quality of their instructors’ guidance. This can be challenging in low-income areas, where schools may struggle to ensure ideal acquisition conditions. Developing computational tools to support the learning process may counterweight negative environmental influences; however, few work exists on the use of information technologies to improve childhood literacy. This work centers around the computational study of Spanish word and syllable structure in documents written by 2nd and 3rd year elementary school students. The studied texts were compared against a corpus of short stories aimed at the same age group, so as to observe whether the children tend to produce similar written patterns as the ones they are expected to interpret at their literacy level. The obtained results show some significant differences between the two kinds of texts, pointing towards possible strategies for the implementation of new education software in support of written language acquisition.

pdf bib
Word-word Relations in Dementia and Typical Aging
Natalia Arias-Trejo | Aline Minto-García | Diana I. Luna-Umanzor | Alma E. Ríos-Ponce | Balderas-Pliego Mariana | Gemma Bel-Enguix
Proceedings of the First International Workshop on Language Cognition and Computational Models

Older adults tend to suffer a decline in some of their cognitive capabilities, being language one of least affected processes. Word association norms (WAN) also known as free word associations reflect word-word relations, the participant reads or hears a word and is asked to write or say the first word that comes to mind. Free word associations show how the organization of semantic memory remains almost unchanged with age. We have performed a WAN task with very small samples of older adults with Alzheimer’s disease (AD), vascular dementia (VaD) and mixed dementia (MxD), and also with a control group of typical aging adults, matched by age, sex and education. All of them are native speakers of Mexican Spanish. The results show, as expected, that Alzheimer disease has a very important impact in lexical retrieval, unlike vascular and mixed dementia. This suggests that linguistic tests elaborated from WAN can be also used for detecting AD at early stages.

2014

pdf bib
A Graph-Based Approach for Computing Free Word Associations
Gemma Bel Enguix | Reinhard Rapp | Michael Zock
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

A graph-based algorithm is used to analyze the co-occurrences of words in the British National Corpus. It is shown that the statistical regularities detected can be exploited to predict human word associations. The corpus-derived associations are evaluated using a large test set comprising several thousand stimulus/response pairs as collected from humans. The finding is that there is a high agreement between the two types of data. The considerable size of the test set allows us to split the stimulus words into a number of classes relating to particular word properties. For example, we construct six saliency classes, and for the words in each of these classes we compare the simulation results with the human data. It turns out that for each class there is a close relationship between the performance of our system and human performance. This is also the case for classes based on two other properties of words, namely syntactic and semantic word ambiguity. We interpret these findings as evidence for the claim that human association acquisition must be based on the statistical analysis of perceived language and that when producing associations the detected statistical regularities are replicated.

pdf bib
How well can a corpus-derived co-occurrence network simulate human associative behavior?
Gemma Bel Enguix | Reinhard Rapp | Michael Zock
Proceedings of the 5th Workshop on Cognitive Aspects of Computational Language Learning (CogACLL)

pdf bib
Retrieving Word Associations with a Simple Neighborhood Algorithm in a Graph-based Resource
Gemma Bel-Enguix
Proceedings of the 4th Workshop on Cognitive Aspects of the Lexicon (CogALex)

pdf bib
TALN-RECITAL 2014 Workshop RLTLN 2014 : Réseaux Lexicaux pour le TAL (RLTLN 2014 : Lexical Networks for NLP)
Michael Zock | Gemma Bel-Enguix | Reinhard Rapp
TALN-RECITAL 2014 Workshop RLTLN 2014 : Réseaux Lexicaux pour le TAL (RLTLN 2014 : Lexical Networks for NLP)

pdf bib
Linguistic Convergence in Societies with Asymmetrically Distributed Reputation
Gemma Bel-Enguix
TALN-RECITAL 2014 Workshop RLTLN 2014 : Réseaux Lexicaux pour le TAL (RLTLN 2014 : Lexical Networks for NLP)

2013

pdf bib
Lexical access via a simple co-occurrence network (Trouver les mots dans un simple réseau de co-occurrences) [in French]
Gemma Bel-Enguix | Michael Zock
Proceedings of TALN 2013 (Volume 2: Short Papers)

2006

pdf bib
Ambiguous Turn-Taking Games in Conversations
Gemma Bel-Enguix | Maria Dolores Jiménez-López
Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles. Posters

Human-computer interfaces require models of dialogue structure that capture the variability and unpredictability within dialogue. Semantic and pragmatic context are continuously evolving during conversation, especially by the distribution of turns that have a direct effect in dialogue exchanges. In this paper we use a formal language paradigm for modelling multi-agent system conversations. Our computational model combines pragmatic minimal units –speech acts– for constructing dialogues. In this framework, we show how turn-taking distribution can be ambiguous and propose an algorithm for solving it, considering turn coherence, trajectories and turn pairing. Finally, we suggest overlapping as one of the possible phenomena emerging from an unresolved turn-taking.