Arturo Montejo-Ráez


2021

Complex words identification using word-level features for SemEval-2020 Task 1
Jenny A. Ortiz-Zambrano | Arturo Montejo-Ráez
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

This article describes a system to predict the complexity of words for the Lexical Complexity Prediction (LCP) shared task hosted at SemEval 2021 (Task 1), which provided a new English dataset annotated on a Likert scale. Located in the Lexical Semantics track, the task consisted of predicting the complexity value of words in context. We followed a machine learning approach based on word frequency and several additional word-level features. A supervised random forest regression algorithm was trained over these features. Several runs were performed with different values to observe the performance of the algorithm. In the evaluation, our best results were an MAE of 0.07347, an MSE of 0.00938, and an RMSE of 0.096871. Our experiments showed that the accuracy of the predictions increases with a greater number of features.
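The three error measures quoted in this abstract can be computed as in the minimal sketch below. It is not the authors' evaluation code: the function name `regression_errors` and the sample scores are hypothetical and chosen only to illustrate the definitions (RMSE is the square root of MSE).

```python
import math

def regression_errors(y_true, y_pred):
    """Return (MAE, MSE, RMSE) for two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    # Signed prediction errors, one per item.
    diffs = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(d) for d in diffs) / len(diffs)   # mean absolute error
    mse = sum(d * d for d in diffs) / len(diffs)    # mean squared error
    return mae, mse, math.sqrt(mse)                 # RMSE = sqrt(MSE)

# Hypothetical complexity scores in [0, 1]; not taken from the task data.
gold = [0.25, 0.50, 0.75, 0.00]
pred = [0.50, 0.50, 0.50, 0.25]
mae, mse, rmse = regression_errors(gold, pred)
```

Lower values are better for all three measures. Because MSE squares each error, it penalizes large individual errors more heavily than MAE does.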

CLexIS2: A New Corpus for Complex Word Identification Research in Computing Studies
Jenny A. Ortiz Zambrano | Arturo Montejo-Ráez
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Reading is a complex process not only because of the words or sections that are difficult for the reader to understand. Complex word identification (CWI) is the task of detecting the words in a document that are difficult or complex for people of a certain group to understand. Annotated corpora for English learners are widely available, while they are less common for the Spanish language. In this article, we present CLexIS2, a new corpus in Spanish that contributes to research in the area of Lexical Simplification, specifically the identification and prediction of complex words in computing studies. Several metrics used to evaluate the complexity of texts in Spanish were applied, such as LC, LDI, ILFW, SSR, SCI, ASL, and CS. Furthermore, as a baseline of the primer, two experiments were performed to predict the complexity of words: one using a supervised learning approach and the other using an unsupervised solution based on the frequency of words in a general corpus.

OffendES: A New Corpus in Spanish for Offensive Language Research
Flor Miriam Plaza-del-Arco | Arturo Montejo-Ráez | L. Alfonso Ureña-López | María-Teresa Martín-Valdivia
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Offensive language detection and analysis has become a major area of research in Natural Language Processing. The freedom of participation in social media has exposed online users to posts designed to denigrate, insult or hurt them because of gender, race, religion, ideology, or other personal characteristics. Focusing on young influencers from the well-known social platforms Twitter, Instagram, and YouTube, we have collected a corpus of 47,128 Spanish comments manually labeled with pre-defined offensive categories. A subset of the corpus attaches a degree of confidence to each label, so both multi-class classification and multi-output regression studies are possible. In this paper, we introduce the corpus and discuss its building process, its novelties, and some preliminary experiments that serve as a baseline for the research community.

2019

SINAI-DL at SemEval-2019 Task 5: Recurrent networks and data augmentation by paraphrasing
Arturo Montejo-Ráez | Salud María Jiménez-Zafra | Miguel A. García-Cumbreras | Manuel Carlos Díaz-Galiano
Proceedings of the 13th International Workshop on Semantic Evaluation

This paper describes the participation of the SINAI-DL team in Task 5 of SemEval 2019, called HatEval. We have applied some classic neural network layers, such as word embeddings and LSTM, to build a neural classifier for both proposed tasks. Because the amount of training data provided is small compared to what is expected for an adequate learning stage in deep architectures, we explore the use of paraphrasing tools as a source for data augmentation. Our results show that this method is promising, as some improvement has been found over non-augmented training sets.

SINAI-DL at SemEval-2019 Task 7: Data Augmentation and Temporal Expressions
Miguel A. García-Cumbreras | Salud María Jiménez-Zafra | Arturo Montejo-Ráez | Manuel Carlos Díaz-Galiano | Estela Saquete
Proceedings of the 13th International Workshop on Semantic Evaluation

This paper describes the participation of the SINAI-DL team in RumourEval (Task 7 in SemEval 2019, subtask A: SDQC). SDQC addresses the challenge of rumour stance classification as an indirect way of identifying potential rumours. Given a tweet with several replies, our system classifies each reply as supporting, denying, questioning or commenting on the underlying rumour. We have applied data augmentation, temporal expression labelling and transfer learning with a four-layer neural classifier. Our official run achieves an accuracy of 0.715 over reply tweets.

2017

SINAI at SemEval-2017 Task 4: User based classification
Salud María Jiménez-Zafra | Arturo Montejo-Ráez | Maite Martin | L. Alfonso Ureña-López
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

This document describes our participation in SemEval-2017 Task 4: Sentiment Analysis in Twitter. We have only reported results for subtask B (English), which consists of determining the polarity towards a topic on a two-point scale (positive or negative sentiment). Our main contribution is the integration of user information in the classification process. An SVM model is trained with Word2Vec vectors from the user's tweets, extracted from their timeline. The results show that user-specific classifiers trained on tweets from the user's timeline can introduce noise, because those tweets are themselves labeled by an imperfect system and are therefore error-prone. This encourages us to explore further integration of user information for author-based Sentiment Analysis.

2016

Pictogrammar: an AAC device based on a semantic grammar
Fernando Martínez-Santiago | Miguel Ángel García-Cumbreras | Arturo Montejo-Ráez | Manuel Carlos Díaz-Galiano
Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications

2013

SINAI: Machine Learning and Emotion of the Crowd for Sentiment Analysis in Microblogs
Eugenio Martínez-Cámara | Arturo Montejo-Ráez | M. Teresa Martín-Valdivia | L. Alfonso Ureña-López
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)

2012

Random Walk Weighting over SentiWordNet for Sentiment Polarity Detection on Twitter
Arturo Montejo-Ráez | Eugenio Martínez-Cámara | M. Teresa Martín-Valdivia | L. Alfonso Ureña-López
Proceedings of the 3rd Workshop in Computational Approaches to Subjectivity and Sentiment Analysis

2007

Combining Lexical-Syntactic Information with Machine Learning for Recognizing Textual Entailment
Arturo Montejo-Ráez | Jose Manuel Perea | Fernando Martínez-Santiago | Miguel Ángel García-Cumbreras | Maite Martín-Valdivia | Alfonso Ureña-López
Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing