Taha Tobaili


2020

Lexical Induction of Morphological and Orthographic Forms for Low-Resourced Languages
Taha Tobaili
Proceedings of the Third Workshop on Multilingual Surface Realisation

In this work we address the high degree of lexical sparsity in non-standard languages when the available resources are too small to train recent powerful language models. We propose a new rule-based approach and utilise word embeddings to connect words with their inflectional and orthographic forms in a given corpus. Our case example is the low-resourced Lebanese dialect Arabizi. Arabizi is the name given to a new social transcription of spoken Arabic in Latin script. The term is a portmanteau of Araby (Arabic) and Englizi (English). It is an informal written language in which Arabs transcribe their dialectal mother tongue using Latin alphanumeric characters instead of Arabic script. For example, حبيبي (Ḥabībī, "my love") could be transcribed as 7abibi in Arabizi. We induced 175K forms from a list of 1.7K sentiment words. We evaluated this induction extrinsically on a sentiment-annotated dataset, increasing its coverage by 13% over the previous version. We named the new lexicon SenZi-Large and released it publicly.

2019

SenZi: A Sentiment Analysis Lexicon for the Latinised Arabic (Arabizi)
Taha Tobaili | Miriam Fernandez | Harith Alani | Sanaa Sharafeddine | Hazem Hajj | Goran Glavaš
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Arabizi is an informal written form of dialectal Arabic transcribed in Latin alphanumeric characters. Its popularity on chat platforms and social media is well established, yet it suffers from a severe lack of natural language processing (NLP) resources. As such, texts written in Arabizi are often disregarded in sentiment analysis tasks for Arabic. In this paper we describe the creation of a sentiment lexicon for Arabizi that was enriched with word embeddings. The result is a new Arabizi lexicon consisting of 11.3K positive and 13.3K negative words. We evaluated this lexicon by classifying the sentiment of Arabizi tweets, achieving an F1-score of 0.72. We provide a detailed error analysis to present the challenges that impact the sentiment analysis of Arabizi.

2016

Arabizi Identification in Twitter Data
Taha Tobaili
Proceedings of the ACL 2016 Student Research Workshop