Proceedings of the Third Workshop on Speech and Language Technologies for Dravidian Languages

Bharathi R. Chakravarthi, Ruba Priyadharshini, Anand Kumar M, Sajeetha Thavareesan, Elizabeth Sherly (Editors)

Anthology ID: 2023.dravidianlangtech-1
Month: September
Year: 2023
Address: Varna, Bulgaria
Venues: DravidianLangTech | WS
Publisher: INCOMA Ltd., Shoumen, Bulgaria
URL: https://aclanthology.org/2023.dravidianlangtech-1
PDF: https://aclanthology.org/2023.dravidianlangtech-1.pdf

On the Errors in Code-Mixed Tamil-English Offensive Span Identification
Manikandan Ravikiran | Bharathi Raja Chakravarthi

In recent times, offensive span identification in code-mixed Tamil-English language has seen traction with the release of datasets, shared tasks, and the development of multiple methods. However, the details of various errors shown by these methods are currently unclear. This paper presents a detailed analysis of various errors in state-of-the-art Tamil-English offensive span identification methods. Our study reveals the strengths and weaknesses of the widely used sequence labeling and zero-shot models for offensive span identification. In the process, we identify data-related errors, improve data annotation, and release additional diagnostic data to evaluate models’ quality and stability. Disclaimer: This paper contains examples that may be considered profane, vulgar, or offensive. The examples do not represent the views of the authors or their employers/graduate schools towards any person(s), group(s), practice(s), or entity/entities. Instead, they emphasize the complexity of various errors and linguistic research challenges.

Hate and Offensive Keyword Extraction from CodeMix Malayalam Social Media Text Using Contextual Embedding
Mariya Raphel | Premjith B | Sreelakshmi K | Bharathi Raja Chakravarthi

This paper focuses on identifying hate and offensive keywords in code-mixed Malayalam social media text. As part of this work, a dataset for hate and offensive keyword extraction in code-mixed Malayalam was created. Two methods were tested for extracting Hate and Offensive Language (HOL) keywords from social media text. In the first method, intrinsic evaluation was performed on the dataset to identify the hate and offensive keywords. Three approaches (unigram, bigram, and trigram) were used to extract HOL keywords, sequences of HOL words, and sequences that convey HOL meaning even in the absence of a HOL word. Five transformer models were used in each of the approaches to extract embeddings for the n-grams. HOL keywords were then extracted based on cosine similarity scores. Of the five transformer models, multilingual BERT gave the best results. In the second method, the multilingual BERT model was fine-tuned on the dataset to develop a HOL keyword tagger. This work is a first step toward HOL keyword identification in the Dravidian language Malayalam.
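
The n-gram and cosine-similarity step described in this abstract can be illustrated with a minimal sketch. The abstract does not say what the n-gram embeddings are compared against, so the reference vector, the threshold, and the toy embedding table below are all assumptions made for illustration; a real system would obtain embeddings from a transformer encoder such as multilingual BERT.

```python
import math

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_candidates(candidates, embed, reference_vector, threshold):
    """Keep candidates whose embedding is similar enough to the reference, best first."""
    scored = [(c, cosine_similarity(embed(c), reference_vector)) for c in candidates]
    kept = [(c, s) for c, s in scored if s >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)

# Hypothetical toy embeddings standing in for a transformer encoder.
TOY_VECTORS = {
    ("w1",): [1.0, 0.0],
    ("w2",): [0.0, 1.0],
    ("w1", "w2"): [1.0, 1.0],
}

def toy_embed(ngram):
    """Look up a toy vector; unknown n-grams map to the zero vector."""
    return TOY_VECTORS.get(ngram, [0.0, 0.0])

tokens = ["w1", "w2"]
candidates = ngrams(tokens, 1) + ngrams(tokens, 2)
# Assumed reference vector representing "offensive" meaning in the toy space.
ranked = rank_candidates(candidates, toy_embed, [1.0, 0.0], threshold=0.5)
```

Here the unigram `("w1",)` and the bigram `("w1", "w2")` pass the assumed threshold, while `("w2",)` is dropped because its similarity to the reference is zero.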

Acoustic Analysis of the Fifth Liquid in Malayalam
Punnoose A K

This paper investigates the claim of rhoticity of the fifth liquid in Malayalam using various acoustic characteristics. The Malayalam liquid phonemes are analyzed in terms of the smoothness of the pitch window, formants, formant bandwidth, the effect on surrounding vowels, duration, and classification patterns by an unrelated classifier. We report, for the fifth liquid, a slight similarity in terms of pitch smoothness with one of the laterals, similarity with the laterals in terms of F1 for males, and similarity with the laterals and one of the rhotics in terms of F1 for females. The similarity in terms of formant bandwidth between the fifth liquid and the other liquids is inconclusive. Similarly, the effect of the fifth liquid on the surrounding vowels is inconclusive. No similarity is observed between the fifth liquid and the other liquids in phoneme duration. Classification of the fifth liquid section implies higher order signal level similarity with both laterals and rhotics.

This paper addresses the challenges faced by Indian languages in leveraging deep learning for natural language processing (NLP) due to limited resources, annotated datasets, and Transformer-based architectures. We specifically focus on Telugu and aim to construct a Telugu morph analyzer dataset comprising 10,000 sentences. Furthermore, we assess the performance of established multi-lingual Transformer models (m-Bert, XLM-R, IndicBERT) and mono-lingual Transformer models trained from scratch on an extensive Telugu corpus comprising 80,15,588 sentences (BERT-Te). Our findings demonstrate the efficacy of Transformer-based representations pretrained on Telugu data in improving the performance of the Telugu morph analyzer, surpassing existing multi-lingual approaches. This highlights the necessity of developing dedicated corpora, annotated datasets, and machine learning models in a mono-lingual setting. We present benchmark results for the Telugu morph analyzer achieved through simple fine-tuning on our dataset.

Reinforcement learning (RL) agents have achieved remarkable success in various domains, such as game-playing and protein structure prediction. However, most RL agents rely on exploration to find optimal solutions without explicit guidance. This paper proposes a methodology for training RL agents using text-based instructions in Dravidian Languages, including Telugu, Tamil, and Malayalam along with using the English language. The agents are trained in a modified Lunar Lander environment, where they must follow specific paths to successfully land the lander. The methodology involves collecting a dataset of human demonstrations and textual instructions, encoding the instructions into numerical representations using text-based embeddings, and training RL agents using state-of-the-art algorithms. The results demonstrate that the trained Soft Actor-Critic (SAC) agent can effectively understand and generalize instructions in different languages, outperforming other RL algorithms such as Proximal Policy Optimization (PPO) and Deep Deterministic Policy Gradient (DDPG).

Social Media Data Analysis for Malayalam YouTube Comments: Sentiment Analysis and Emotion Detection using ML and DL Models
Abeera V P | Dr. Sachin Kumar | Dr. Soman K P

In this paper, we present a study on social media data analysis of Malayalam YouTube comments, specifically focusing on sentiment analysis and emotion detection. Our research aims to investigate the effectiveness of various machine learning (ML) and deep learning (DL) models in addressing these two tasks. For sentiment analysis, we collected a dataset consisting of 3064 comments, while for two-class emotion detection, we used a dataset of 817 comments. In the sentiment analysis phase, we explored multiple ML and DL models, including traditional algorithms such as Support Vector Machines (SVM), Naïve Bayes, K-Nearest Neighbors (KNN), MLP Classifier, Decision Tree, and Random Forests. Additionally, we utilized DL models such as Recurrent Neural Networks (RNN), LSTM, and GRU. To enhance the performance of these models, we preprocessed the Malayalam YouTube comments by tokenizing and removing stop words. Experimental results revealed that DL models achieved higher accuracy compared to ML models, indicating their ability to capture the complex patterns and nuances in the Malayalam language. Furthermore, we extended our analysis to emotion detection, which involved dealing with limited annotated data. This task is closely related to social media data analysis. For emotion detection, we employed the same ML models used in the sentiment analysis phase. Our dataset of 817 comments was annotated with two emotions: Happy and Sad. We trained the models to classify the comments into these emotion classes and analyzed the accuracy of the different models.

Findings of the Second Shared Task on Offensive Span Identification from Code-Mixed Tamil-English Comments
Manikandan Ravikiran | Ananth Ganesh | Anand Kumar M | R Rajalakshmi | Bharathi Raja Chakravarthi

Maintaining effective control over offensive content is essential on social media platforms to foster constructive online discussions. Yet, when it comes to code-mixed Dravidian languages, the current prevalence of offensive content moderation is restricted to categorizing entire comments, failing to identify specific portions that contribute to the offensiveness. Such limitation is primarily due to the lack of annotated data and open source systems for offensive spans. To alleviate this issue, in this shared task, we offer a collection of Tamil-English code-mixed social comments that include offensive comments. This paper provides an overview of the released dataset, the algorithms employed, and the outcomes achieved by the systems submitted for this task.

In recent years, there has been a growing focus on Sentiment Analysis (SA) of code-mixed Dravidian languages. However, the majority of social media text in these languages is code-mixed, presenting a unique challenge. Despite this, there is currently a lack of research on SA specifically tailored for code-mixed Dravidian languages, highlighting the need for further exploration and development in this domain. In this view, the “Sentiment Analysis in Tamil and Tulu - DravidianLangTech” shared task at Recent Advances in Natural Language Processing (RANLP) 2023 was organized. This shared task consists of two language tracks, code-mixed Tamil and Tulu; Tulu text is explored in the public domain for SA for the first time. We describe the task, its organization, and the submitted systems, followed by the results. 57 research teams registered for the shared task, and we received 27 systems each for code-mixed Tamil and Tulu texts. The performance of the participants' systems was evaluated in terms of macro-average F1 score. The top systems for code-mixed Tamil and Tulu texts achieved macro-average F1 scores of 0.32 and 0.542, respectively. The high quality and substantial quantity of submissions demonstrate a significant interest and attention in the analysis of code-mixed Dravidian languages. However, the current state of the art in this domain indicates the need for further advancements and improvements to effectively address the challenges posed by code-mixed Dravidian language SA.
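
Since systems in this task are ranked by macro-average F1, a minimal sketch of that metric may help. It uses the standard definition (per-class F1, averaged with equal weight per class); the label names and the toy data are illustrative assumptions, not taken from the shared-task data.

```python
def macro_f1(gold, predicted):
    """Unweighted mean of per-class F1 over all labels seen in gold or predicted."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and the same length")
    labels = sorted(set(gold) | set(predicted))
    f1_scores = []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)

# Toy example: a classifier that always predicts the majority class.
gold = ["positive", "positive", "positive", "negative"]
predicted = ["positive", "positive", "positive", "positive"]
score = macro_f1(gold, predicted)
```

In this toy case accuracy is 0.75, but the macro-average F1 is only 3/7 (about 0.43) because the minority class contributes an F1 of zero. This is why the metric suits imbalanced sentiment data.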

This paper summarizes the shared task on multimodal abusive language detection and sentiment analysis in Dravidian languages as part of the third Workshop on Speech and Language Technologies for Dravidian Languages at RANLP 2023. This shared task provides a platform for researchers worldwide to submit their models on two crucial social media data analysis problems in Dravidian languages - abusive language detection and sentiment analysis. Abusive language detection identifies social media content with abusive information, whereas sentiment analysis refers to the problem of determining the sentiments expressed in a text. This task aims to build models for detecting abusive content and analyzing fine-grained sentiment from multimodal data in Tamil and Malayalam. The multimodal data consists of three modalities - video, audio and text. The datasets for both tasks were prepared by collecting videos from YouTube. Sixty teams participated in both tasks. However, only two teams submitted their results. The submissions were evaluated using macro F1-score.

This paper discusses the submissions to the shared task on abusive comment detection in Tamil and Telugu code-mixed social media text conducted as part of the third Workshop on Speech and Language Technologies for Dravidian Languages at RANLP 2023. The task encourages researchers to develop models to detect content containing abusive information in Tamil and Telugu code-mixed social media text. The task has three subtasks: abusive comment detection in Tamil, Tamil-English, and Telugu-English. The dataset for all the tasks was developed by collecting comments from YouTube. The submitted models were evaluated using macro F1-score, and the rank list was prepared accordingly.

CoPara: The First Dravidian Paragraph-level n-way Aligned Corpus
Nikhil E | Mukund Choudhary | Radhika Mamidi

We present CoPara, the first publicly available paragraph-level (n-way aligned) multilingual parallel corpus for Dravidian languages. The collection contains 2856 paragraph/passage pairs between English and four Dravidian languages. We source the parallel paragraphs from the New India Samachar magazine and align them with English as a pivot language. We do human and artificial evaluations to validate the high-quality alignment and richness of the parallel paragraphs of a range of lengths. To show one of the many ways this dataset can be wielded, we fine-tuned IndicBART, a seq2seq NMT model, on all XX-En pairs of languages in CoPara; the resulting models perform better than existing sentence-level models on standard benchmarks (like BLEU) on sentence-level translations and on longer text too. We show how this dataset can enrich a model trained for a task like this, with more contextual cues and beyond-sentence understanding even in low-resource settings like that of Dravidian languages. Finally, the dataset and models are made available publicly at CoPara to help advance research in Dravidian NLP, parallel multilingual, and beyond sentence-level tasks like NMT, etc.

ChatGPT Powered Tourist Aid Applications: Proficient in Hindi, Yet To Master Telugu and Kannada
Sanjana Kolar | Rohit Kumar

This research investigates the effectiveness of ChatGPT, an AI language model by OpenAI, in translating English into Hindi, Telugu, and Kannada languages, aimed at assisting tourists in India’s linguistically diverse environment. To measure the translation quality, a test set of 50 questions from diverse fields such as general knowledge, food, and travel was used. These were assessed by five volunteers for accuracy and fluency, and the scores were subsequently converted into a BLEU score. The BLEU score evaluates the closeness of a machine-generated translation to a human translation, with a higher score indicating better translation quality. The Hindi translations outperformed others, showcasing superior accuracy and fluency, whereas Telugu translations lagged behind. Human evaluators rated both the accuracy and fluency of translations, offering a comprehensive perspective on the language model’s performance.

As one of the most extensively used languages in India, Telugu has a sizable audience and a huge library of news articles. Predicting the categories of Telugu news items not only helps with efficient organization but also makes it possible to do trend research, advertise in a certain demographic, and provide individualized recommendations. In order to identify the most effective method for accurate Telugu news category prediction, this study compares and contrasts various machine learning (ML) techniques, including support vector machines (SVM), random forests, and naive Bayes. Accuracy, precision, recall, and F1-score will be utilized as performance indicators to gauge how well these algorithms perform. The outcomes of this comparative analysis will address the particular difficulties and complexities of the Telugu language and add to the body of knowledge on news category prediction. For Telugu-speaking consumers, the study intends to improve news organization and recommendation systems, giving them more relevant and customized news consumption experiences. Our results emphasize that, although other models can be taken into account for further research and comparison, W2Vec-skip gram with polynomial SVM is the best performing combination.

Automatic Speech Recognition (ASR) and its applications are rising in popularity, with reasonable inference results. Recent state-of-the-art approaches often employ very large-scale models to show high accuracy for ASR as a whole, but often do not include a detailed analysis of performance in low-resource language applications. In this preliminary work, we propose to revisit ASR in the context of Connected Number Recognition (CNR). More specifically, we (i) present a new dataset, HCNR, collected to understand various errors of ASR models for CNR, (ii) establish a preliminary benchmark and baseline model for CNR, and (iii) explore error mitigation strategies and their after-effects on CNR. In the process, we also compare with end-to-end large-scale ASR models for reference, to show its effectiveness.

Poorvi@DravidianLangTech: Sentiment Analysis on Code-Mixed Tulu and Tamil Corpus
Poorvi Shetty

Sentiment analysis in code-mixed languages poses significant challenges, particularly for highly under-resourced languages such as Tulu and Tamil. Existing corpora, primarily sourced from YouTube comments, suffer from class imbalance across sentiment categories. Moreover, the limited number of samples in these corpora hampers effective sentiment classification. This study introduces a new corpus tailored for sentiment analysis in Tulu code-mixed texts. The research applies standard pre-processing techniques to ensure data quality and consistency and handle class imbalance. Subsequently, multiple classifiers are employed to analyze the sentiment of the code-mixed texts, yielding promising results. By leveraging the new corpus, the study contributes to advancing sentiment analysis techniques in under-resourced code-mixed languages. This work serves as a stepping stone towards better understanding and addressing the challenges posed by sentiment analysis in highly under-resourced languages.

NLP_SSN_CSE@DravidianLangTech: Fake News Detection in Dravidian Languages using Transformer Models
Varsha Balaji | Shahul Hameed T | Bharathi B

The proposed system follows a systematic workflow for fake news identification, using machine learning classification to distinguish between real and made-up news. Using the Natural Language Toolkit (NLTK), the procedure starts with data preprocessing, which includes operations like text cleaning, tokenization, and stemming. This ensures that the data is converted into an analysis-ready format. The preprocessed data is subsequently supplied to transformer models such as M-BERT, ALBERT, XLNet, and BERT. These transformer models draw on their extensive pretraining on large datasets to capture contextual information and to identify the complex patterns and significant traits that discriminate between authentic and false news pieces. The most successful model among those used is M-BERT, which achieves an F1 score of 0.74, outperforming the other models on fake news identification. Because of its grasp of contextual nuance, the system can draw more precise conclusions and more effectively counteract the spread of false information. Organizations and platforms can strengthen their fake news detection systems and their attempts to stop the spread of false information by utilizing M-BERT’s capabilities.

AbhiPaw@DravidianLangTech: Multimodal Abusive Language Detection and Sentiment Analysis
Abhinaba Bala | Parameswari Krishnamurthy

Detecting abusive language in multimodal videos has become a pressing need in ensuring a safe and inclusive online environment. This paper focuses on addressing this challenge through the development of a novel approach for multimodal abusive language detection in Tamil videos and sentiment analysis for Tamil/Malayalam videos. By leveraging state-of-the-art models such as Multiscale Vision Transformers (MViT) for video analysis, OpenL3 for audio analysis, and the bert-base-multilingual-cased model for textual analysis, our proposed framework integrates visual, auditory, and textual features. Through extensive experiments and evaluations, we demonstrate the effectiveness of our model in accurately detecting abusive content and predicting sentiment categories. The limited availability of effective tools for performing these tasks in Dravidian Languages has prompted a new avenue of research in these domains.

Athena@DravidianLangTech: Abusive Comment Detection in Code-Mixed Languages using Machine Learning Techniques
Hema M | Anza Prem | Rajalakshmi Sivanaiah | Angel Deborah S

The amount of digital material that is disseminated through various social media platforms has significantly increased in recent years. Online networks have gained popularity and have established themselves as go-to resources for news, information, and entertainment. Nevertheless, despite the many advantages of using online networks, mounting evidence indicates that an increasing number of malicious actors are taking advantage of these networks to spread poison and hurt other people. This work aims to detect abusive content in YouTube comments written in Tamil, Tamil-English (code-mixed), and Telugu-English (code-mixed). This work was undertaken as part of the “DravidianLangTech@RANLP 2023” shared task. The macro F1 values for the Tamil, Tamil-English, and Telugu-English datasets were 0.28, 0.37, and 0.6137, securing 5th, 7th, and 8th rank, respectively.

AlphaBrains@DravidianLangTech: Sentiment Analysis of Code-Mixed Tamil and Tulu by Training Contextualized ELMo Word Representations
Toqeer Ehsan | Amina Tehseen | Kengatharaiyer Sarveswaran | Amjad Ali

Sentiment analysis in natural language processing (NLP) endeavors to computationally identify and extract subjective information from textual data. In code-mixed text, sentiment analysis presents a unique challenge due to the mixing of languages within a single textual context. For low-resourced languages such as Tamil and Tulu, predicting sentiment becomes a challenging task due to the presence of text comprising various scripts. In this research, we present the sentiment analysis of code-mixed Tamil and Tulu YouTube comments. We have developed Bidirectional Long Short-Term Memory (BiLSTM) network-based models for both languages, which use contextualized word embeddings at the input layers. For that purpose, ELMo embeddings have been trained on larger unannotated corpora of code-mixed-like text. Our models achieved macro-average F1-scores of 0.2877 and 0.5133 on the Tamil and Tulu code-mixed datasets, respectively.

HARMONY@DravidianLangTech: Transformer-based Ensemble Learning for Abusive Comment Detection
Amrish Raaj P | Abirami Murugappan | Lysa Packiam R S | Deivamani M

Millions of posts and comments are created every minute as a result of the widespread use of social media and easy access to the internet. It is essential to create an inclusive environment and forbid the use of abusive language against any individual or group of individuals. This paper describes the approach of team HARMONY for the “Abusive Comment Detection” shared task at the Third Workshop on Speech and Language Technologies for Dravidian Languages. A Transformer-based ensemble learning approach is proposed for detecting abusive comments in code-mixed (Tamil-English) language and Tamil language. The proposed architecture achieved rank 2 in the Tamil text classification subtask and rank 3 in the code-mixed text classification subtask, with macro-F1 scores of 0.41 for Tamil and 0.50 for code-mixed data.

Avalanche at DravidianLangTech: Abusive Comment Detection in Code Mixed Data Using Machine Learning Techniques with Under Sampling
Rajalakshmi Sivanaiah | Rajasekar S | Srilakshmisai K | Angel Deborah S | Mirnalinee ThankaNadar

In recent years, the growth of online platforms and social media has given rise to a concerning increase in the presence of abusive content. This poses significant challenges for maintaining a safe and inclusive digital environment. To address this issue, this paper experiments with an approach for detecting abusive comments. We use a combination of pipelining and vectorization techniques, along with algorithms such as the stochastic gradient descent (SGD) classifier and the support vector machine (SVM) classifier. We conducted experiments on a Tamil-English code-mixed dataset to evaluate the performance of this approach. Using the SGD classifier, we achieved a weighted F1 score of 0.76 and a macro F1 score of 0.45 on the development dataset. With the SVM classifier, we obtained a weighted F1 score of 0.78 and a macro F1 score of 0.42 on the development dataset. On the test dataset, the SGD approach secured 5th rank with a macro F1 score of 0.44, while SVM ranked 8th with a macro F1 score of 0.35 in the shared task. The top-ranked team achieved a macro F1 score of 0.55.
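
The under-sampling named in this paper's title can be sketched as follows. The abstract does not specify the sampling scheme, so this example assumes plain random under-sampling of every class down to the size of the smallest class; the function name, the seed, and the toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict

def undersample(texts, labels, seed=0):
    """Randomly drop examples so every class has as many items as the smallest class."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)
    target = min(len(items) for items in by_label.values())
    balanced = []
    for label, items in by_label.items():
        for text in rng.sample(items, target):
            balanced.append((text, label))
    rng.shuffle(balanced)  # avoid leaving the examples grouped by class
    return balanced

# Toy imbalanced data: 8 non-abusive comments and 2 abusive ones.
texts = [f"comment {i}" for i in range(10)]
labels = ["none"] * 8 + ["abusive"] * 2
balanced = undersample(texts, labels)
counts = Counter(label for _, label in balanced)
```

After under-sampling, both classes hold two examples each. The balanced pairs could then be passed to a vectorizer and a linear classifier such as the SGD or SVM classifiers mentioned in the abstract.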

DeepBlueAI@DravidianLangTech-RANLP 2023
Zhipeng Luo | Jiahui Wang

This paper presents a study on the language understanding of the Dravidian languages. Three specific tasks related to text classification are focused on in this study, including abusive comment detection, sentiment analysis and fake news detection. The paper provides a detailed description of the tasks, including dataset information and task definitions, as well as the model architectures and training details used to tackle them. Finally, the competition results are presented, demonstrating the effectiveness of the proposed approach for handling these challenging NLP tasks in the context of the Dravidian languages.

Selam@DravidianLangTech: Sentiment Analysis of Code-Mixed Dravidian Texts using SVM Classification
Selam Kanta | Grigori Sidorov

This system description paper addresses sentiment analysis in code-mixed text written in Dravidian languages, specifically Tamil-English and Tulu-English, for the RANLP-2023 shared task. The goal of this shared task is to develop systems that accurately classify the sentiment polarity of code-mixed comments and posts. Participants are provided with development, training, and test data sets containing code-mixed text in Tamil-English and Tulu-English. The task involves message-level polarity classification: classifying YouTube comments as positive, negative, neutral, or mixed emotions. This code-mixed data was compiled by the RANLP-2023 organizers from posts on social media. We use an SVM classifier and achieve an F1 score of 0.147 for Tamil-English and 0.518 for Tulu-English.

LIDOMA@DravidianLangTech: Convolutional Neural Networks for Studying Correlation Between Lexical Features and Sentiment Polarity in Tamil and Tulu Languages
Moein Tash | Jesus Armenta-Segura | Zahra Ahani | Olga Kolesnikova | Grigori Sidorov | Alexander Gelbukh

With the prevalence of code-mixing among speakers of Dravidian languages, DravidianLangTech proposed the shared task on Sentiment Analysis in Tamil and Tulu at RANLP 2023. This paper presents the submission of LIDOMA, which proposes a methodology that combines lexical features and Convolutional Neural Networks (CNNs) to address the challenge. A fine-tuned 6-layered CNN model is employed, achieving macro F1 scores of 0.542 and 0.199 for Tulu and Tamil, respectively.

nlpt_malayalm@DravidianLangTech: Fake News Detection in Malayalam using Optimized XLM-RoBERTa Model
Eduri Raja | Badal Soni | Sami Kumar Borgohain

The paper demonstrates the submission of the team nlpt_malayalm to the Fake News Detection in Dravidian Languages-DravidianLangTech@LT-EDI-2023. The rapid dissemination of fake news and misinformation in today’s digital age poses significant societal challenges. This research paper addresses the issue of fake news detection in the Malayalam language by proposing a novel approach based on the XLM-RoBERTa base model. The objective is to develop an effective classification model that accurately differentiates between genuine and fake news articles in Malayalam. The XLM-RoBERTa base model, known for its multilingual capabilities, is fine-tuned using the prepared dataset to adapt it specifically to the nuances of the Malayalam language. A thorough analysis is also performed to identify any biases or limitations in the model’s performance. The results demonstrate that the proposed model achieves a remarkable macro-averaged F-Score of 87% in the Malayalam fake news dataset, ranking 2nd on the respective task. This indicates its high accuracy and reliability in distinguishing between real and fake news in Malayalam.

ML&AI_IIITRanchi@DravidianLangTech: Fine-Tuning IndicBERT for Exploring Language-specific Features for Sentiment Classification in Code-Mixed Dravidian Languages
Kirti Kumari | Shirish Shekhar Jha | Zarikunte Kunal Dayanand | Praneesh Sharma

Code-mixing presents challenges to sentiment analysis due to the limited availability of annotated data for low-resource languages such as Tulu. To address this issue, comprehensive work was done to create a gold-standard labeled corpus that incorporates both languages and supports accurate sentiment analysis. The work involved data collection, cleaning, preprocessing, and annotation, followed by experiments with fine-tuning IndicBERT and with TF-IDF and bag-of-words features. The outcome is a resource for developing models tailored to sentiment analysis of code-mixed Tamil and Tulu texts, giving focused insight into what makes up such expressions. The hybrid models yielded promising outcomes, achieving 10th rank for Tulu and 14th rank for Tamil, with macro F1 scores of 0.471 and 0.124, respectively.

ML&AI_IIITRanchi@DravidianLangTech: Leveraging Transfer Learning for the discernment of Fake News within the Linguistic Domain of Dravidian Language
Kirti Kumari | Shirish Shekhar Jha | Zarikunte Kunal Dayanand | Praneesh Sharma

The primary focus of this research endeavor lies in detecting and mitigating misinformation within the intricate framework of the Dravidian language. A notable feat was achieved by employing fine-tuning methodologies on the highly acclaimed Indic BERT model, securing a commendable fourth rank in a prestigious competition organized by DravidianLangTech 2023 while attaining a noteworthy macro F1-Score of 0.78. To facilitate this undertaking, a diverse and comprehensive dataset was meticulously gathered from prominent social media platforms, including but not limited to Facebook and Twitter. The overarching objective of this collaborative initiative was to proficiently discern and categorize news articles into either the realm of veracity or deceit through the astute application of advanced machine learning techniques, coupled with the astute exploitation of the distinctive linguistic idiosyncrasies inherent to the Dravidian language.

NITK-IT-NLP@DravidianLangTech: Impact of Focal Loss on Malayalam Fake News Detection using Transformers
Hariharan R L | Anand Kumar M

Fake News Detection in Dravidian Languages is a shared task on detecting fake news in YouTube comments written in the Malayalam language. In this work, we propose a transformer-based model, trained with cross-entropy loss and with focal loss, which classifies the comments as fake or authentic news. We applied different transformer-based models to the dataset with modifications in the experimental setup. Among these, the fine-tuned model based on MuRIL with focal loss achieved the best overall macro F1-score of 0.87, and we took second position on the final leaderboard.
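
As a rough illustration of why focal loss can help on uneven data, the sketch below compares it with cross-entropy for a single example. It uses the standard formulation FL = -alpha * (1 - p_t)^gamma * log(p_t); the values gamma = 2.0 and alpha = 1.0 are illustrative assumptions, not settings reported in the abstract, and the function names are hypothetical.

```python
import math


def cross_entropy(p_true: float) -> float:
    """Cross-entropy for one example, given the probability of its true class."""
    return -math.log(p_true)


def focal_loss(p_true: float, gamma: float = 2.0, alpha: float = 1.0) -> float:
    """Focal loss: cross-entropy scaled by (1 - p_true) ** gamma.

    gamma and alpha are illustrative defaults (assumptions), not the
    paper's hyperparameters. With gamma = 0 and alpha = 1 this reduces
    to plain cross-entropy.
    """
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)


if __name__ == "__main__":
    # Confidently correct examples are down-weighted far more than hard ones.
    for p in (0.9, 0.6, 0.1):
        print(f"p_true={p:.1f}  CE={cross_entropy(p):.4f}  FL={focal_loss(p):.4f}")
```

With these assumed settings, an example predicted at probability 0.9 keeps about 1% of its cross-entropy loss, while one predicted at 0.1 keeps about 81%, so training concentrates on the harder comments.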

pdf bib abs
VEL@DravidianLangTech: Sentiment Analysis of Tamil and Tulu
Kishore Kumar Ponnusamy | Charmathi Rajkumar | Prasanna Kumar Kumaresan | Elizabeth Sherly | Ruba Priyadharshini

We participated in the Sentiment Analysis in Tamil and Tulu - DravidianLangTech 2023-RANLP 2023 task under the team name VEL. This research focuses on detecting sentiment in code-mixed social media comments written in Tamil and Tulu. Code-mixed text in social media often deviates from strict grammar rules and incorporates non-native scripts, making sentiment identification a complex task. To tackle this issue, we employ pre-processing techniques to remove unnecessary content and develop a model specifically designed for sentiment detection. Additionally, we explore the effectiveness of traditional machine-learning models combined with feature extraction techniques. Our best logistic regression configurations achieve macro F1 scores of 0.43 on the Tamil test set and 0.51 on the Tulu test set, indicating promising results in detecting sentiment in code-mixed comments.

pdf bib abs
hate-alert@DravidianLangTech: Multimodal Abusive Language Detection and Sentiment Analysis in Dravidian Languages
Shubhankar Barman | Mithun Das

The use of abusive language on social media platforms is a prevalent issue that requires effective detection. Researchers actively engage in abusive language detection and sentiment analysis on social media platforms. However, most of the studies are in English. Hence, there is a need to develop models for low-resource languages. Further, the multimodal content in social media platforms is expanding rapidly. Our research aims to address this gap by developing multimodal abusive language detection and sentiment analysis models for Tamil and Malayalam, two under-resourced languages, based on the shared task “Multimodal Abusive Language Detection and Sentiment Analysis in Dravidian Languages: DravidianLangTech@RANLP 2023”. In our study, we conduct extensive experiments utilizing multiple deep-learning models to detect abusive language in Tamil and perform sentiment analysis in Tamil and Malayalam. For feature extraction, we use the mBERT transformer-based model for texts, the ViT model for images, and MFCC for audio. In the abusive language detection task, we achieved a weighted average F1 score of 0.5786, securing the first rank in this task. For sentiment analysis, we achieved a weighted average F1 score of 0.357 for Tamil and 0.233 for Malayalam, ranking first in this task.

pdf bib abs
Supernova@DravidianLangTech 2023@Abusive Comment Detection in Tamil and Telugu - (Tamil, Tamil-English, Telugu-English)
Ankitha Reddy | Pranav Moorthi | Ann Maria Thomas

This paper focuses on using Support Vector Machine (SVM) classifiers with TF-IDF feature extraction to classify whether a comment is abusive or not. The paper tries to identify abusive content in regional languages. The dataset analysis presents the distribution of target variables in the Tamil-English, Telugu-English, and Tamil datasets. The methodology section describes the preprocessing steps, including consistency, removal of special characters and emojis, removal of stop words, and stemming of data. Overall, the study contributes to the field of abusive comment detection in Tamil and Telugu languages.

pdf bib abs
AbhiPaw@ DravidianLangTech: Abusive Comment Detection in Tamil and Telugu using Logistic Regression
Abhinaba Bala | Parameswari Krishnamurthy

Abusive comments in online platforms have become a significant concern, necessitating the development of effective detection systems. However, limited work has been done in low resource languages, including Dravidian languages. This paper addresses this gap by focusing on abusive comment detection in a dataset containing Tamil, Tamil-English and Telugu-English code-mixed comments. Our methodology involves logistic regression and explores suitable embeddings to enhance the performance of the detection model. Through rigorous experimentation, we identify the most effective combination of logistic regression and embeddings. The results demonstrate the performance of our proposed model, which contributes to the development of robust abusive comment detection systems in low resource language settings. Keywords: Abusive comment detection, Dravidian languages, logistic regression, embeddings, low resource languages, code-mixed dataset.

pdf bib abs
AbhiPaw@ DravidianLangTech: Fake News Detection in Dravidian Languages using Multilingual BERT
Abhinaba Bala | Parameswari Krishnamurthy

This study addresses the challenge of detecting fake news in Dravidian languages by leveraging Google’s MuRIL (Multilingual Representations for Indian Languages) model. Drawing upon previous research, we investigate the intricacies involved in identifying fake news and explore the potential of transformer-based models for linguistic analysis and contextual understanding. Through supervised learning, we fine-tune the “muril-base-cased” variant of MuRIL using a carefully curated dataset of labeled comments and posts in Dravidian languages, enabling the model to discern between original and fake news. During the inference phase, the fine-tuned MuRIL model analyzes new textual content, extracting contextual and semantic features to predict the content’s classification. We evaluate the model’s performance using standard metrics, highlighting the effectiveness of MuRIL in detecting fake news in Dravidian languages and contributing to the establishment of a safer digital ecosystem. Keywords: fake news detection, Dravidian languages, MuRIL, transformer-based models, linguistic analysis, contextual understanding.

pdf bib abs
Habesha@DravidianLangTech: Utilizing Deep and Transfer Learning Approaches for Sentiment Analysis.
Mesay Gemeda Yigezu | Tadesse Kebede | Olga Kolesnikova | Grigori Sidorov | Alexander Gelbukh

This research paper focuses on sentiment analysis of Tamil and Tulu texts using a BERT model and an RNN model. The BERT model, which was pretrained, achieved satisfactory performance for the Tulu language, with a Macro F1 score of 0.352. On the other hand, the RNN model showed good performance for Tamil language sentiment analysis, obtaining a Macro F1 score of 0.208. As future work, the researchers aim to fine-tune the models to further improve their results after the training process.

pdf bib abs
Habesha@DravidianLangTech: Abusive Comment Detection using Deep Learning Approach
Mesay Gemeda Yigezu | Selam Kanta | Olga Kolesnikova | Grigori Sidorov | Alexander Gelbukh

This research focuses on identifying abusive language in comments. The study utilizes deep learning models, including Long Short-Term Memory (LSTM) and Recurrent Neural Networks (RNNs), to analyze linguistic patterns. Specifically, the LSTM model, a type of RNN, is used to understand the context by capturing long-term dependencies and intricate patterns in the input sequences. The LSTM model achieves better accuracy and is enhanced through the addition of a dropout layer and early stopping. For detecting abusive language in Telugu and Tamil-English, an LSTM model is employed, while in Tamil abusive language detection, a word-level RNN is developed to identify abusive words. These models process text sequentially, considering overall content and capturing contextual dependencies.

pdf bib abs
SADTech@DravidianLangTech: Multimodal Sentiment Analysis of Tamil and Malayalam
Abhinav Patil | Sam Briggs | Tara Wueger | Daniel D. O’Connell

We present several models for sentiment analysis of multimodal movie reviews in Tamil and Malayalam into 5 separate classes: highly negative, negative, neutral, positive, and highly positive, based on the shared task, “Multimodal Abusive Language Detection and Sentiment Analysis” at RANLP-2023. We use transformer language models to build text and audio embeddings and then compare the performance of multiple classifier models trained on these embeddings: a Multinomial Naive Bayes baseline, a Logistic Regression, a Random Forest, and an SVM. To account for class imbalance, we use both naive resampling and SMOTE. We found that without resampling, the baseline models have the same performance as a naive Majority Class Classifier. However, with resampling, logistic regression and random forest both demonstrate gains over the baseline.
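
The resampling step mentioned above can be pictured with a small sketch. It assumes that "naive resampling" means random oversampling with replacement until every class matches the largest one; the abstract does not spell this out, and the function name and toy data are hypothetical. SMOTE is not shown here, since it synthesizes new points by interpolating between neighbours in feature space rather than duplicating existing ones.

```python
import random
from collections import Counter, defaultdict


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until all classes are equal.

    Assumption: this is one plausible reading of "naive resampling";
    the original system may have resampled differently.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)

    # Every class is grown to the size of the largest class.
    target = max(len(items) for items in by_label.values())

    out_samples, out_labels = [], []
    for label, items in by_label.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


if __name__ == "__main__":
    # Toy, made-up data: three "positive" reviews and one "negative" review.
    xs = ["r1", "r2", "r3", "r4"]
    ys = ["positive", "positive", "positive", "negative"]
    new_xs, new_ys = random_oversample(xs, ys)
    print(Counter(new_ys))
```

Oversampling should be applied to the training split only; duplicating samples before splitting would leak copies of training items into the evaluation data.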

pdf bib abs
MUCS@DravidianLangTech2023: Sentiment Analysis in Code-mixed Tamil and Tulu Texts using fastText
Rachana K | Prajnashree M | Asha Hegde | H. L Shashirekha

Sentiment Analysis (SA) is a field of computational study that focuses on analyzing and understanding people’s opinions, attitudes, and emotions towards an entity. An entity could be an individual, an event, a topic, a product, etc., which is most likely to be covered by reviews, and such reviews can be found in abundance on social media platforms. The increase in the number of social media users and the growing amount of user-generated code-mixed content such as reviews, comments, posts, etc., on social media have resulted in a rising demand for efficient tools capable of analyzing such content to detect sentiments. However, SA of social media text is challenging due to the complex nature of code-mixed text. To tackle this issue, in this paper, we, team MUCS, describe the learning models submitted to “Sentiment Analysis in Tamil and Tulu” - DravidianLangTech@Recent Advances In Natural Language Processing (RANLP) 2023. Using fastText embeddings to train Machine Learning (ML) models to perform SA in code-mixed Tamil and Tulu texts, the proposed methodology obtained F1 scores of 0.14 and 0.204, securing 13th and 15th rank for Tamil and Tulu texts, respectively.

pdf bib abs
MUCS@DravidianLangTech2023: Leveraging Learning Models to Identify Abusive Comments in Code-mixed Dravidian Languages
Asha Hegde | Kavya G | Sharal Coelho | Hosahalli Lakshmaiah Shashirekha

Abusive language detection in user-generated online content has become a pressing concern due to its negative impact on users and the challenges it poses for policy makers. Online platforms are faced with the task of moderating abusive content to mitigate societal harm, adhere to legal requirements, and foster inclusivity. Despite numerous methods developed for automated detection of abusive language, the problem persists. This ongoing challenge necessitates further research and development to enhance the effectiveness of abusive content detection systems and implement proactive measures to create safer and more respectful online spaces. To address the automatic detection of abusive language on social media platforms, this paper describes the models submitted by our team, MUCS, to the shared task “Abusive Comment Detection in Tamil and Telugu” at DravidianLangTech, held at Recent Advances in Natural Language Processing (RANLP) 2023. This shared task addresses abusive comment detection in code-mixed Tamil, Telugu, and romanized Tamil (Tamil-English) texts. Two distinct models were submitted to the shared task for detecting abusive language in the given code-mixed texts: i) AbusiveML, a model implemented using the Linear Support Vector Classifier (LinearSVC) algorithm fed with n-grams of words and character sequences within word boundaries (char_wb) as features, and ii) AbusiveTL, a Transfer Learning (TL) model with three different Bidirectional Encoder Representations from Transformers (BERT) models along with random oversampling to deal with data imbalance. AbusiveTL performed better of the two models, with macro F1 scores of 0.46, 0.74, and 0.49 for code-mixed Tamil, Telugu, and Tamil-English texts, respectively.
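
To make the char_wb feature type concrete, here is a minimal sketch of character n-grams extracted within word boundaries. It is written in the spirit of scikit-learn's `analyzer="char_wb"` option but is a simplified stand-in, not that library's implementation and not the authors' code; the function name, the n-gram range, and the sample phrase are assumptions.

```python
def char_wb_ngrams(text: str, n_min: int = 2, n_max: int = 3) -> list[str]:
    """Character n-grams taken inside each word, never across words.

    Each word is padded with one space on both sides, so n-grams at the
    edges of a word carry a boundary marker. The range (2, 3) is an
    illustrative choice, not a setting from the abstract.
    """
    grams = []
    for word in text.split():
        padded = f" {word} "
        for n in range(n_min, n_max + 1):
            # If the padded word is shorter than n, this range is empty.
            for start in range(len(padded) - n + 1):
                grams.append(padded[start:start + n])
    return grams


if __name__ == "__main__":
    # Hypothetical romanized code-mixed phrase, used only as sample input.
    print(char_wb_ngrams("vera level"))
```

Because no n-gram spans two words, spelling variants of the same romanized word still share most of their features, which is one reason such features are popular for noisy code-mixed text.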

pdf bib abs
MUNLP@DravidianLangTech2023: Learning Approaches for Sentiment Analysis in Code-mixed Tamil and Tulu Text
Asha Hegde | Kavya G | Sharal Coelho | Pooja Lamani | Hosahalli Lakshmaiah Shashirekha

Sentiment Analysis (SA) examines the subjective content of a statement, such as opinions, assessments, feelings, or attitudes towards a subject, person, or thing. Though several models have been developed for SA in high-resource languages like English, Spanish, German, etc., under-resourced languages like the Dravidian languages are less explored. To address the challenges of SA in low-resource Dravidian languages, in this paper, we, team MUNLP, describe the models submitted to the “Sentiment Analysis in Tamil and Tulu - DravidianLangTech” shared task at Recent Advances in Natural Language Processing (RANLP)-2023. n-gramsSA, EmbeddingsSA, and BERTSA are the models proposed for the SA shared task. Among all the models, BERTSA obtained the highest macro F1 score of 0.26 for code-mixed Tamil texts, securing 2nd place in the shared task. EmbeddingsSA obtained the highest macro F1 score of 0.53, securing 2nd place for Tulu code-mixed texts.

pdf bib abs
MUCSD@DravidianLangTech2023: Predicting Sentiment in Social Media Text using Machine Learning Techniques
Sharal Coelho | Asha Hegde | Pooja Lamani | Kavya G | Hosahalli Lakshmaiah Shashirekha

User-generated social media texts are a blend of resource-rich languages like English and low-resource Dravidian languages like Tamil, Kannada, Tulu, etc. These texts, referred to as code-mixed texts, are enriching social media since they are written in two or more languages using either a common language script or various language scripts. However, the complex nature of code-mixed text makes it challenging to analyze. In this paper, we, team MUCSD, describe the Machine Learning (ML) models submitted to the “Sentiment Analysis in Tamil and Tulu” shared task at DravidianLangTech@RANLP 2023. The proposed methodology makes use of ML models such as Linear Support Vector Classifier (LinearSVC), Logistic Regression (LR), and an ensemble model (LR, DT, and SVM) to perform Sentiment Analysis (SA) in Tamil and Tulu languages. The predictions of the proposed LinearSVC model submitted to the shared task obtained 8th and 9th rank for Tamil-English and Tulu-English, respectively.

pdf bib abs
MUCS@DravidianLangTech2023: Malayalam Fake News Detection Using Machine Learning Approach
Sharal Coelho | Asha Hegde | Kavya G | Hosahalli Lakshmaiah Shashirekha

Social media is widely used to spread fake news, which affects a larger population. Detecting fake news spread on social media platforms is therefore considered a very important task. To address the challenges in the identification of fake news in the Malayalam language, in this paper, we, team MUCS, describe the Machine Learning (ML) models submitted to the “Fake News Detection in Dravidian Languages” shared task at DravidianLangTech@RANLP 2023. Three different models, namely, Multinomial Naive Bayes (MNB), Logistic Regression (LR), and an ensemble model (MNB, LR, and SVM), are trained using Term Frequency - Inverse Document Frequency (TF-IDF) of word unigrams. Among the three models, the ensemble model performed best, with a macro F1-score of 0.83, and placed 3rd in the shared task.

Our work aims to identify negative comments associated with the Counter-speech, Xenophobia, Homophobia, Transphobia, Misandry, Misogyny, and None-of-the-above categories. To identify these categories in the given dataset, we propose three kinds of models: traditional machine learning techniques, a deep learning model, and a transfer learning model, BERT. For the Tamil dataset, we train the models on the training set and test them on the validation set. Our team participated in the shared task organised by DravidianLangTech and secured 4th rank in the task of abusive comment detection in Tamil with a macro F1 score of 0.35. Our run was also submitted for abusive comment detection in code-mixed Tamil-English and secured 6th rank with a macro F1 score of 0.42.

Sentiment Analysis is a process that involves analyzing digital text to determine the emotional tone, such as positive, negative, neutral, or unknown. Sentiment Analysis of code-mixed languages presents challenges in natural language processing due to the complexity of code-mixed data, which combines vocabulary and grammar from multiple languages and creates unique structures. The scarcity of annotated data and the unstructured nature of code-mixed data are major challenges. To address these challenges, we explored various techniques, including Machine Learning models such as Decision Trees, Random Forests, Logistic Regression, and Gaussian Naïve Bayes; a Deep Learning model, Long Short-Term Memory (LSTM); and a Transfer Learning model, BERT. In this work, we obtained the dataset from the DravidianLangTech shared task by participating in the competition and accessing the train, development, and test data for the Tamil language. The results demonstrated promising performance in sentiment analysis of code-mixed text. Among all the models, the deep learning model LSTM provides the best accuracy of 0.61 for the Tamil language.

pdf bib abs
CSSCUTN@DravidianLangTech: Abusive comments Detection in Tamil and Telugu
Kathiravan Pannerselvam | Saranya Rajiakodi | Rahul Ponnusamy | Sajeetha Thavareesan

Code-mixing is the word- or phrase-level act of interchanging two or more languages within a sentence, during a conversation or in written text. This phenomenon is widespread on social media platforms, and understanding the underlying abusive comments in a code-mixed sentence is a complex challenge. We present our system submitted to the DravidianLangTech Shared Task on Abusive Comment Detection in Tamil and Telugu. Our approach involves building a multiclass abusive detection model that recognizes 8 different labels. The provided samples are code-mixed Tamil-English text, where Tamil is represented in romanised form. We focused on the multiclass classification subtask, and we leveraged Support Vector Machine (SVM), Random Forest (RF), and Logistic Regression (LR). Our method earned the ninth rank among all competing systems for the classification of abusive comments in code-mixed text. Our proposed classifier achieves an accuracy of 0.99 and an F1-score of 0.99 on a balanced dataset using TF-IDF with SVM. It can be used effectively to detect abusive comments in Tamil-English code-mixed text.