Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop

Neele Falk, Sara Papi, Mike Zhang (Editors)


Anthology ID:
2024.eacl-srw
Month:
March
Year:
2024
Address:
St. Julian’s, Malta
Venue:
EACL
SIG:
Publisher:
Association for Computational Linguistics
URL:
https://aclanthology.org/2024.eacl-srw
DOI:
PDF:
https://aclanthology.org/2024.eacl-srw.pdf

pdf bib
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop
Neele Falk | Sara Papi | Mike Zhang

pdf bib
AutoAugment Is What You Need: Enhancing Rule-based Augmentation Methods in Low-resource Regimes
Juhwan Choi | Kyohoon Jin | Junho Lee | Sangmin Song | YoungBin Kim

Text data augmentation is a complex problem due to the discrete nature of sentences. Although rule-based augmentation methods are widely adopted in real-world applications because of their simplicity, they suffer from potential semantic damage. Previous researchers have suggested easy data augmentation with soft labels (softEDA), employing label smoothing to mitigate this problem. However, finding the best factor for each model and dataset is challenging; therefore, using softEDA in real-world applications is still difficult. In this paper, we propose adapting AutoAugment to solve this problem. The experimental results suggest that the proposed method can boost existing augmentation methods and that rule-based methods can enhance cutting-edge pretrained language models. We offer the source code.
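To make the soft-label idea concrete, here is a minimal illustrative sketch (not the authors' code) of pairing a rule-augmented sentence with a smoothed label; the `smoothing` factor below is the hyperparameter the abstract describes as hard to tune and which the paper proposes to search automatically.

```python
# Illustrative sketch: a toy rule-based augmentation plus softEDA-style label smoothing.
# The augmentation rule, sentence, and smoothing value are placeholders, not the paper's setup.
import random

def random_swap(tokens, n_swaps=1):
    """Toy rule-based augmentation: swap two random token positions."""
    tokens = tokens[:]
    for _ in range(n_swaps):
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def soft_label(true_class, num_classes, smoothing=0.1):
    """Smooth a one-hot label: the true class keeps 1 - smoothing,
    the remaining mass is spread uniformly over the other classes."""
    eps = smoothing / (num_classes - 1)
    return [1.0 - smoothing if c == true_class else eps for c in range(num_classes)]

sentence = "the movie was surprisingly good".split()
augmented = random_swap(sentence)
label = soft_label(true_class=1, num_classes=2, smoothing=0.1)
print(" ".join(augmented), label)   # augmented text paired with its soft label
```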

pdf bib
Generating Diverse Translation with Perturbed kNN-MT
Yuto Nishida | Makoto Morishita | Hidetaka Kamigaito | Taro Watanabe

Generating multiple translation candidates would enable users to choose the one that satisfies their needs. Although there has been work on diversified generation, there exists room for improving the diversity mainly because the previous methods do not address the overcorrection problem—the model underestimates a prediction that is largely different from the training data, even if that prediction is likely. This paper proposes methods that generate more diverse translations by introducing perturbed k-nearest neighbor machine translation (kNN-MT). Our methods expand the search space of kNN-MT and help incorporate diverse words into candidates by addressing the overcorrection problem. Our experiments show that the proposed methods drastically improve candidate diversity and control the degree of diversity by tuning the perturbation’s magnitude.
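For readers unfamiliar with kNN-MT, the following simplified sketch shows the standard interpolation between a model distribution and a nearest-neighbour distribution, with a noise-perturbed query standing in for the paper's perturbation; the datastore, noise scale, and placeholder model distribution are assumptions, not the authors' implementation.

```python
# Simplified kNN-MT-style sketch: mix the model's next-token distribution with one
# built from datastore neighbours retrieved using a perturbed query vector.
import numpy as np

def knn_distribution(query, keys, values, vocab_size, k=4, temperature=10.0):
    """Retrieve k nearest datastore entries and turn their distances into
    a probability distribution over the vocabulary."""
    dists = np.linalg.norm(keys - query, axis=1)
    idx = np.argsort(dists)[:k]
    weights = np.exp(-dists[idx] / temperature)
    probs = np.zeros(vocab_size)
    for i, w in zip(idx, weights):
        probs[values[i]] += w
    return probs / probs.sum()

rng = np.random.default_rng(0)
vocab_size, dim = 100, 16
keys = rng.normal(size=(1000, dim))           # datastore decoder states (toy data)
values = rng.integers(0, vocab_size, 1000)    # target tokens paired with the keys
query = rng.normal(size=dim)                  # current decoder state

perturbed_query = query + rng.normal(scale=0.5, size=dim)  # perturbation widens the neighbourhood
p_model = np.full(vocab_size, 1.0 / vocab_size)            # placeholder model distribution
p_knn = knn_distribution(perturbed_query, keys, values, vocab_size)
lam = 0.5
p_final = lam * p_knn + (1 - lam) * p_model                # standard kNN-MT interpolation
print(p_final.argmax())                                    # most likely next token id
```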

pdf bib
The KIND Dataset: A Social Collaboration Approach for Nuanced Dialect Data Collection
Asma Yamani | Raghad Alziyady | Reem AlYami | Salma Albelali | Leina Albelali | Jawharah Almulhim | Amjad Alsulami | Motaz Alfarraj | Rabeah Al-Zaidy

Nuanced dialects are linguistic varieties that pose several challenges for NLP models and techniques. One of the main challenges is the limited number of datasets available for extensive research and experimentation. We propose an approach for efficiently collecting nuanced dialectal datasets that are not only of high quality but also versatile enough to serve multiple purposes. To test our approach, we collect the KIND corpus, a collection of fine-grained Arabic dialect data. The data consists of short texts and, unlike many nuanced dialectal datasets, is curated manually through social collaboration efforts rather than crawled from social media. The collaborative approach is incentivized through educational gamification and competitions, with the community itself benefiting from the open-source dataset. Our approach aims to: (1) cover dialects from under-represented groups and fine-grained dialectal varieties, (2) provide aligned parallel corpora for translation between Modern Standard Arabic (MSA) and multiple dialects to enable translation and comparison studies, and (3) promote innovative approaches to nuanced dialect data collection. We describe the steps of the competition as well as the resulting datasets and the competing data collection systems. The KIND dataset is shared with the research community.

pdf bib
Can Stanza be Used for Part-of-Speech Tagging Historical Polish?
Maria Irena Szawerna

The goal of this paper is to evaluate the performance of Stanza, a part-of-speech (POS) tagger developed for modern Polish, on historical text in order to assess its possible use for automating the annotation of other historical texts. While the reliability of applying POS taggers to historical data has been discussed before, most of the research focuses on languages whose grammar differs from Polish, meaning that those results need not be fully applicable in this case. The evaluation of Stanza is conducted on two sets of 10,286 and 3,270 manually annotated tokens from a piece of historical Polish writing (1899), and the errors are analyzed qualitatively and quantitatively. The results show good performance of the tagger, especially for Universal Part-of-Speech (UPOS) tags, which is promising for using the tagger for automatic annotation in larger projects, and pinpoint some common features of misclassified tokens.
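As a minimal sketch of the kind of evaluation described, the snippet below tags a pretokenized Polish sentence with Stanza's modern-Polish model and compares the predicted UPOS tags against a gold annotation; the sentence and gold tags are illustrative placeholders, not the paper's data.

```python
# Minimal sketch: tag Polish text with Stanza and compute UPOS accuracy against gold tags.
import stanza

stanza.download("pl", verbose=False)
nlp = stanza.Pipeline("pl", processors="tokenize,pos", tokenize_pretokenized=True)

tokens = [["Nie", "wiem", ",", "co", "czynić", "."]]          # illustrative sentence
gold_upos = ["PART", "VERB", "PUNCT", "PRON", "VERB", "PUNCT"]  # illustrative gold annotation

doc = nlp(tokens)
pred_upos = [word.upos for sent in doc.sentences for word in sent.words]
accuracy = sum(p == g for p, g in zip(pred_upos, gold_upos)) / len(gold_upos)
print(pred_upos, f"UPOS accuracy: {accuracy:.2%}")
```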

pdf bib
Toward Zero-Shot Instruction Following
Renze Lou | Wenpeng Yin

This work proposes a challenging yet more realistic setting for zero-shot cross-task generalization: zero-shot instruction following, which presumes the existence of a paragraph-style task definition while no demonstrations are available. To better learn the task supervision from the definition, we propose two strategies: first, automatically identifying the critical sentences in the definition; second, a ranking objective that forces the model to generate the gold outputs with higher probability when those critical parts are highlighted in the definition. The joint effect of the two strategies yields state-of-the-art performance on Super-NaturalInstructions. Our code is available on GitHub.

pdf bib
UnMASKed: Quantifying Gender Biases in Masked Language Models through Linguistically Informed Job Market Prompts
Iñigo Parra

Language models (LMs) have become pivotal in the realm of technological advancements. While their capabilities are vast and transformative, they often encode societal biases present in the human-produced datasets used for their training. This research delves into the inherent biases present in masked language models (MLMs), with a specific focus on gender biases. This study evaluated six prominent models: BERT, RoBERTa, DistilBERT, BERT-multilingual, XLM-RoBERTa, and DistilBERT-multilingual. The methodology employed a novel dataset, bifurcated into two subsets: one containing prompts that encouraged models to generate subject pronouns in English, and the other requiring models to return the probabilities of verbs, adverbs, and adjectives linked to the prompts’ gender pronouns. The analysis reveals stereotypical gender alignment in all models, with multilingual variants showing comparatively reduced biases.
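The kind of pronoun-probability probe described can be sketched with a standard fill-mask query; the model, template, and pronoun targets below are illustrative assumptions and not the paper's dataset or prompts.

```python
# Illustrative MLM probe: compare the probabilities a masked LM assigns to gendered
# subject pronouns in a job-related template.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
prompt = "[MASK] works as a nurse at the hospital."   # placeholder template

for result in fill(prompt, targets=["he", "she"]):
    print(result["token_str"], round(result["score"], 4))
```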

pdf bib
Distribution Shifts Are Bottlenecks: Extensive Evaluation for Grounding Language Models to Knowledge Bases
Yiheng Shu | Zhiwei Yu

Grounding language models (LMs) to knowledge bases (KBs) helps to obtain rich and accurate facts. However, it remains challenging because of the enormous size, complex structure, and partial observability of KBs. One reason is that current benchmarks fail to reflect robustness challenges and fairly evaluate models. This paper analyzes whether these robustness challenges arise from distribution shifts, including environmental, linguistic, and modal aspects. These shifts affect the ability of LMs to cope with unseen schemas, adapt to language variations, and perform few-shot learning. Thus, the paper proposes extensive evaluation protocols and conducts experiments to demonstrate that, despite utilizing our proposed data augmentation method, both advanced small and large language models exhibit poor robustness in these aspects. We conclude that current LMs are too fragile to navigate complex environments due to distribution shifts. This underscores the need for future research focusing on data collection, evaluation protocols, and learning paradigms.

pdf bib
AttriSage: Product Attribute Value Extraction Using Graph Neural Networks
Rohan Potta | Mallika Asthana | Siddhant Yadav | Nidhi Goyal | Sai Patnaik | Parul Jain

Extracting the attribute values of a product from a given product description is essential for e-commerce functions like product recommendation, search, and information retrieval, and is therefore central to understanding products in e-commerce; greater extraction accuracy gives any retailer an edge. The burdensome aspect of this problem lies in the diversity of products and their attributes and values. Existing solutions typically employ large language models or sequence-tagging approaches to capture the context of a given product description and extract attribute values. However, they do so with limited accuracy, which serves as the underlying motivation to explore a more comprehensive solution. In this paper, we present a novel approach to attribute value extraction from product descriptions that leverages graphs and graph neural networks. Our proposed method demonstrates improvements in attribute value extraction accuracy compared to baseline sequence-tagging approaches.

pdf bib
HypoTermQA: Hypothetical Terms Dataset for Benchmarking Hallucination Tendency of LLMs
Cem Uluoglakci | Tugba Temizel

Hallucinations pose a significant challenge to the reliability and alignment of Large Language Models (LLMs), limiting their widespread acceptance beyond chatbot applications. Despite ongoing efforts, hallucinations remain a prevalent challenge in LLMs. The detection of hallucinations itself is also a formidable task, frequently requiring manual labeling or constrained evaluations. This paper introduces an automated scalable framework that combines benchmarking LLMs’ hallucination tendencies with efficient hallucination detection. We leverage LLMs to generate challenging tasks related to hypothetical phenomena, subsequently employing them as agents for efficient hallucination detection. The framework is domain-agnostic, allowing the use of any language model for benchmark creation or evaluation in any domain. We introduce the publicly available HypoTermQA Benchmarking Dataset, on which state-of-the-art models’ performance ranged between 3% and 11%, and evaluator agents demonstrated a 6% error rate in hallucination prediction. The proposed framework provides opportunities to test and improve LLMs. Additionally, it has the potential to generate benchmarking datasets tailored to specific domains, such as law, health, and finance.

pdf bib
Arabic Synonym BERT-based Adversarial Examples for Text Classification
Norah Alshahrani | Saied Alshahrani | Esma Wali | Jeanna Matthews

Text classification systems have been proven vulnerable to adversarial text examples: modified versions of original text examples that often go unnoticed by human eyes, yet can force text classification models to alter their predictions. To date, most research quantifying the impact of adversarial text attacks has been applied only to models trained on English. In this paper, we introduce the first word-level study of adversarial attacks in Arabic. Specifically, we use a synonym (word-level) attack based on a Masked Language Modeling (MLM) task with a BERT model in a black-box setting to assess the robustness of state-of-the-art Arabic text classification models to adversarial attacks. To evaluate the grammatical and semantic similarity of the adversarial examples produced by our synonym BERT-based attack, we invite four human evaluators to assess and compare them with their original counterparts. We also study the transferability of these newly produced Arabic adversarial examples to various models and investigate the effectiveness of defense mechanisms against them on the BERT models. We find that fine-tuned BERT models were more susceptible to our synonym attacks than the other Deep Neural Network (DNN) models we trained, such as WordCNN and WordLSTM. We also find that fine-tuned BERT models were more susceptible to transferred attacks. Lastly, we find that fine-tuned BERT models successfully regain at least 2% in accuracy after applying adversarial training as an initial defense mechanism.
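A schematic sketch of a word-level MLM substitution attack is shown below; it is a stand-in for the general technique, not the authors' pipeline, and the multilingual BERT model, toy victim classifier, and example sentence are all placeholder assumptions.

```python
# Schematic word-level attack: mask one word, let an MLM propose replacements,
# and keep the first replacement that flips a (toy) victim classifier's prediction.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-multilingual-cased")  # stand-in MLM

def victim_classifier(text):
    """Placeholder for the model under attack; returns a class label."""
    return "positive" if "good" in text else "negative"

def attack(sentence, position, top_k=10):
    tokens = sentence.split()
    original_label = victim_classifier(sentence)
    masked = tokens[:position] + [fill.tokenizer.mask_token] + tokens[position + 1:]
    for cand in fill(" ".join(masked), top_k=top_k):
        adversarial = " ".join(tokens[:position] + [cand["token_str"]] + tokens[position + 1:])
        if victim_classifier(adversarial) != original_label:
            return adversarial
    return None

print(attack("the food was good and fresh", position=3))
```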

pdf bib
A Hypothesis-Driven Framework for the Analysis of Self-Rationalising Models
Marc Braun | Jenny Kunz

The self-rationalising capabilities of LLMs are appealing because the generated explanations can give insights into the plausibility of the predictions. However, how faithful the explanations are to the predictions is questionable, raising the need to explore the patterns behind them further. To this end, we propose a hypothesis-driven statistical framework. We use a Bayesian network to implement a hypothesis about how a task (in our example, natural language inference) is solved, and its internal states are translated into natural language with templates. Those explanations are then compared to LLM-generated free-text explanations using automatic and human evaluations. This allows us to judge how similar the LLM’s and the Bayesian network’s decision processes are. We demonstrate the usage of our framework with an example hypothesis and two realisations in Bayesian networks. The resulting models do not exhibit a strong similarity to GPT-3.5. We discuss the implications of this as well as the framework’s potential to approximate LLM decisions better in future work.

pdf bib
Align before Attend: Aligning Visual and Textual Features for Multimodal Hateful Content Detection
Eftekhar Hossain | Omar Sharif | Mohammed Moshiul Hoque | Sarah Masud Preum

Multimodal hateful content detection is a challenging task that requires complex reasoning across visual and textual modalities. Therefore, creating a meaningful multimodal representation that effectively captures the interplay between visual and textual features through intermediate fusion is critical. Conventional fusion techniques are unable to attend to the modality-specific features effectively. Moreover, most studies exclusively concentrated on English and overlooked other low-resource languages. This paper proposes a context-aware attention framework for multimodal hateful content detection and assesses it for both English and non-English languages. The proposed approach incorporates an attention layer to meaningfully align the visual and textual features. This alignment enables selective focus on modality-specific features before fusing them. We evaluate the proposed approach on two benchmark hateful meme datasets, viz. MUTE (Bengali code-mixed) and MultiOFF (English). Evaluation results demonstrate our proposed approach’s effectiveness with F1-scores of 69.7% and 70.3% for the MUTE and MultiOFF datasets. The scores show approximately 2.5% and 3.2% performance improvement over the state-of-the-art systems on these datasets. Our implementation is available at https://github.com/eftekhar-hossain/Bengali-Hateful-Memes.

pdf bib
Topic-guided Example Selection for Domain Adaptation in LLM-based Machine Translation
Seth Aycock | Rachel Bawden

Current machine translation (MT) systems perform well in the domains on which they were trained, but adaptation to unseen domains remains a challenge. Rather than fine-tuning on domain data or modifying the architecture for training, an alternative approach exploits large language models (LLMs), which are performant across NLP tasks, especially when presented with in-context examples. We focus on adapting a pre-trained LLM to a domain at inference time through in-context example selection. For MT, examples are usually selected randomly from a development set, though some more recent methods select them on the more intuitive basis of similarity to the test source. We employ topic models to select examples based on abstract semantic relationships below the level of a domain. We test the relevance of these statistical models and use them to select informative examples even for out-of-domain inputs, experimenting on 7 diverse domains and 11 language pairs of differing resource levels. Our method outperforms baselines on challenging multilingual out-of-domain tests, though it does not match the performance of strong baselines in the in-language setting. We find that adding few-shot examples and related keywords consistently improves translation quality, that example diversity must be balanced with source similarity, and that our pipeline is overly restrictive for example selection when a targeted development set is available.
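A simplified sketch of topic-based example selection is given below; it uses LDA over a toy example pool and picks the examples whose topic distributions are closest to the test source, which illustrates the general idea but is not the paper's exact pipeline, models, or data.

```python
# Simplified sketch: fit LDA over an example pool, then select in-context examples
# whose topic distributions are most similar to that of the test source sentence.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

pool = [
    "the patient was given 20 mg of the drug daily",
    "the court ruled that the contract was void",
    "install the package and run the test suite",
    "symptoms improved after two weeks of treatment",
]
test_source = "the trial reported adverse effects in some patients"

vec = CountVectorizer()
X = vec.fit_transform(pool)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

pool_topics = lda.transform(X)                          # topic mixture per pool example
src_topics = lda.transform(vec.transform([test_source]))
scores = pool_topics @ src_topics.T                     # similarity in topic space
best = np.argsort(-scores.ravel())[:2]                  # two closest examples
print([pool[i] for i in best])
```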

pdf bib
Reforging: A Method for Constructing a Linguistically Valid Japanese CCG Treebank
Asa Tomita | Hitomi Yanaka | Daisuke Bekki

The linguistic validity of Combinatory Categorial Grammar (CCG) parsing results relies heavily on the treebanks used for training and evaluation, so treebank construction is crucial. Yet the current Japanese CCG treebank is known to have inaccuracies in its analyses of Japanese syntactic structures, including passive and causative constructions. While ABCTreebank, a treebank for ABC grammar, was created to improve the analysis of argument structures in particular, it lacks the detailed syntactic features required for Japanese CCG. In contrast, the Japanese CCG parser lightblue efficiently provides detailed syntactic features, but it does not accurately capture argument structures. We propose a method to generate a linguistically valid Japanese CCG treebank with detailed information by combining the strengths of ABCTreebank and lightblue. We develop an algorithm that filters lightblue’s lexical items using ABCTreebank, effectively converting lightblue’s output into a linguistically valid CCG treebank. To evaluate our treebank, we manually assess the CCG syntactic structures and semantic representations and analyze conversion rates.

pdf bib
Thesis Proposal: Detecting Agency Attribution
Igor Ryazanov | Johanna Björklund

We explore computational methods for perceived agency attribution in natural language data. We consider ‘agency’ as the freedom and capacity to act, and the corresponding Natural Language Processing (NLP) task involves automatically detecting attributions of agency to entities in text. Our theoretical framework draws on semantic frame analysis, role labelling and related techniques. In initial experiments, we focus on the perceived agency of AI systems. To achieve this, we analyse a dataset of English-language news coverage of AI-related topics, published within one year surrounding the release of the Large Language Model-based service ChatGPT, a milestone in the general public’s awareness of AI. Building on this, we propose a schema to annotate a dataset for agency attribution and formulate additional research questions to answer by applying NLP models.

pdf bib
A Thesis Proposal: ClaimInspector Framework: A Hybrid Approach to Data Annotation using Fact-Checked Claims and LLMs
Basak Bozkurt

This thesis explores the challenges and limitations encountered in automated fact-checking, with a specific emphasis on data annotation in the context of misinformation. Despite the widespread presence of misinformation in multiple formats and across various channels, current efforts concentrate narrowly on textual claims sourced mainly from Twitter, resulting in datasets with considerably limited scope. Furthermore, the absence of automated control measures, coupled with reliance on very limited human annotation, increases the risk of noisy data within these datasets. This thesis proposal examines the existing methods, elucidates their limitations, and explores the potential integration of claim detection subtasks and Large Language Models to mitigate these issues. It introduces ClaimInspector, a novel framework designed for the systematic collection of multimodal data from the internet. By implementing this framework, the thesis will contribute a dataset comprising fact-checks alongside the corresponding claims made by politicians. Overall, this thesis aims to enhance the accuracy and efficiency of annotation processes, thereby contributing to automated fact-checking efforts.

pdf bib
Large Language Models for Mathematical Reasoning: Progresses and Challenges
Janice Ahn | Rishu Verma | Renze Lou | Di Liu | Rui Zhang | Wenpeng Yin

Mathematical reasoning serves as a cornerstone for assessing the fundamental cognitive capabilities of human intelligence. In recent times, there has been a notable surge in the development of Large Language Models (LLMs) geared towards the automated resolution of mathematical problems. However, the landscape of mathematical problem types is vast and varied, with LLM-oriented techniques undergoing evaluation across diverse datasets and settings. This diversity makes it challenging to discern the true advancements and obstacles within this burgeoning field. This survey endeavors to address four pivotal dimensions: i) a comprehensive exploration of the various mathematical problems and their corresponding datasets that have been investigated; ii) an examination of the spectrum of LLM-oriented techniques that have been proposed for mathematical problem-solving; iii) an overview of factors and concerns affecting LLMs in solving math; and iv) an elucidation of the persisting challenges within this domain. To the best of our knowledge, this survey stands as one of the first extensive examinations of the landscape of LLMs in the realm of mathematics, providing a holistic perspective on the current state, accomplishments, and future challenges in this rapidly evolving field.

pdf bib
Representation and Generation of Machine Learning Test Functions
Souha Hassine | Steven Wilson

Writing tests for machine learning (ML) code is a crucial step towards ensuring the correctness and reliability of ML software. At the same time, Large Language Models (LLMs) have been adopted at a rapid pace for various code generation tasks, making it a natural choice for many developers who need to write ML tests. However, the implications of using these models, and how the LLM-generated tests differ from human-written ones, are relatively unexplored. In this work, we examine the use of LLMs to extract representations of ML source code and tests in order to understand the semantic relationships between human-written test functions and LLM-generated ones, and annotate a set of LLM-generated tests for several important qualities including usefulness, documentation, and correctness. We find that programmers prefer LLM-generated tests to those selected using retrieval-based methods, and in some cases, to those written by other humans.

pdf bib
The Generative AI Paradox in Evaluation: “What It Can Solve, It May Not Evaluate”
Juhyun Oh | Eunsu Kim | Inha Cha | Alice Oh

This paper explores the assumption that Large Language Models (LLMs) skilled in generation tasks are equally adept as evaluators. We assess the performance of three LLMs and one open-source LM in Question-Answering (QA) and evaluation tasks using the TriviaQA (Joshi et al., 2017) dataset. Results indicate a significant disparity, with LLMs exhibiting lower performance in evaluation tasks compared to generation tasks. Intriguingly, we discover instances of unfaithful evaluation where models accurately evaluate answers in areas where they lack competence, underscoring the need to examine the faithfulness and trustworthiness of LLMs as evaluators. This study contributes to the understanding of “the Generative AI Paradox” (West et al., 2023), highlighting a need to explore the correlation between generative excellence and evaluation proficiency, and the necessity to scrutinize the faithfulness aspect in model evaluations.

pdf bib
Generative Data Augmentation using LLMs improves Distributional Robustness in Question Answering
Arijit Chowdhury | Aman Chadha

Robustness in Natural Language Processing continues to be a pertinent issue, where state-of-the-art models under-perform under naturally shifted distributions. In the context of Question Answering, work on domain adaptation methods continues to be a growing body of research. However, very little attention has been given to the notion of domain generalization under natural distribution shifts, where the target domain is unknown. With drastic improvements in the quality of and access to generative models, we answer the question: How do generated datasets influence the performance of QA models under natural distribution shifts? We perform experiments on 4 different datasets under varying amounts of distribution shift, and analyze how “in-the-wild” generation can help achieve domain generalization. We take a two-step generation approach, generating both contexts and QA pairs to augment existing datasets. Through our experiments, we demonstrate how augmenting reading comprehension datasets with generated data leads to better robustness towards natural distribution shifts.

pdf bib
Japanese-English Sentence Translation Exercises Dataset for Automatic Grading
Naoki Miura | Hiroaki Funayama | Seiya Kikuchi | Yuichiroh Matsubayashi | Yuya Iwase | Kentaro Inui

This paper proposes the task of automatic assessment of Sentence Translation Exercises (STEs), which have been used in the early stages of L2 language learning. We formalize the task as grading student responses for each rubric criterion pre-specified by the educators. We then create a dataset for STEs between Japanese and English comprising 21 questions and a total of 3,498 student responses (167 per question on average). The responses were collected from students and crowd workers. Using this dataset, we demonstrate the performance of baselines including a fine-tuned BERT model and GPT-3.5 with few-shot learning. Experimental results show that the fine-tuned BERT baseline was able to classify correct responses with approximately 90% F1, but only less than 80% F1 for incorrect responses. Furthermore, GPT-3.5 with few-shot learning performs worse than the BERT model, indicating that our newly proposed task presents a challenging issue even for state-of-the-art large language models.

pdf bib
The Impact of Integration Step on Integrated Gradients
Masahiro Makino | Yuya Asazuma | Shota Sasaki | Jun Suzuki

Integrated Gradients (IG) serve as a potent tool for explaining the internal structure of a language model. The calculation of IG requires numerical integration, wherein the number of steps serves as a critical hyperparameter. The step count can drastically alter the results, inducing considerable errors in interpretability. To scrutinize the effect of step variation on IG, we measured the difference between theoretical and observed IG totals for each step amount. Our findings indicate that the ideal number of steps to maintain minimal error varies from instance to instance. Consequently, we advocate for customizing the step count for each instance. Our study is the first to quantitatively analyze the variation of IG values with the number of steps.
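The dependence on the step count can be illustrated with a toy Riemann-sum approximation of IG on a simple analytic function (not the paper's code or models): by the completeness property, the attributions should sum to F(x) - F(baseline), and the gap shrinks as the number of integration steps grows.

```python
# Toy illustration: Riemann-sum approximation of Integrated Gradients and the
# completeness gap |sum_i IG_i - (F(x) - F(baseline))| as a function of step count.
import numpy as np

def F(x):                       # toy "model": a simple nonlinear scalar function
    return x[0] ** 2 * x[1] + np.sin(x[1])

def grad_F(x):                  # its analytic gradient
    return np.array([2 * x[0] * x[1], x[0] ** 2 + np.cos(x[1])])

def integrated_gradients(x, baseline, steps):
    alphas = (np.arange(steps) + 0.5) / steps    # midpoint rule along the straight path
    grads = np.mean([grad_F(baseline + a * (x - baseline)) for a in alphas], axis=0)
    return (x - baseline) * grads

x, baseline = np.array([1.5, 2.0]), np.zeros(2)
target = F(x) - F(baseline)
for steps in (5, 20, 100, 500):
    ig = integrated_gradients(x, baseline, steps)
    print(steps, abs(ig.sum() - target))         # error shrinks as steps increase
```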

pdf bib
GesNavi: Gesture-guided Outdoor Vision-and-Language Navigation
Aman Jain | Teruhisa Misu | Kentaro Yamada | Hitomi Yanaka

The Vision-and-Language Navigation (VLN) task involves navigating mobility using linguistic commands and has applications in developing interfaces for autonomous mobility. In reality, natural human communication also encompasses non-verbal cues like hand gestures and gaze. Such gesture-guided instructions have been explored in Human-Robot Interaction systems for effective interaction, particularly in object-referring expressions. However, a notable gap exists in tackling gesture-based demonstrative expressions in outdoor VLN tasks. To address this, we introduce a novel dataset of gesture-guided outdoor VLN instructions with demonstrative expressions, designed with a focus on complex instructions requiring multi-hop reasoning across the multiple input modalities. In addition, our work includes a comprehensive analysis of the collected data and a comparative evaluation against existing datasets.

pdf bib
Can docstring reformulation with an LLM improve code generation?
Nicola Dainese | Alexander Ilin | Pekka Marttinen

Generating code is an important application of Large Language Models (LLMs), and the task of function completion is one of the core open challenges in this context. Existing approaches focus on training, fine-tuning, or prompting LLMs to generate better outputs given the same input. We propose a novel and complementary approach: optimizing part of the input, the docstring (the summary of a function’s purpose and usage), via reformulation with an LLM, in order to improve code generation. We develop two baseline methods for optimizing code generation via docstring reformulation and test them on the original HumanEval benchmark and multiple curated variants that are made more challenging by realistically worsening the docstrings. Our results show that, when operating on docstrings reformulated by an LLM instead of the original (or worsened) inputs, the performance of a number of open-source LLMs does not change significantly. This finding demonstrates an unexpected robustness of current open-source LLMs to the details of the docstrings. We conclude by examining a series of questions, accompanied by in-depth analyses, pertaining to the sensitivity of current open-source LLMs to the details in the docstrings, the potential for improvement via docstring reformulation, and the limitations of the methods employed in this work.

pdf bib
Benchmarking Diffusion Models for Machine Translation
Yunus Demirag | Danni Liu | Jan Niehues

Diffusion models have recently shown great potential on many generative tasks. In this work, we explore diffusion models for machine translation (MT). We adapt two prominent diffusion-based text generation models, Diffusion-LM and DiffuSeq, to perform machine translation. As the diffusion models generate non-autoregressively (NAR), we draw parallels to NAR machine translation models. With a comparison to conventional Transformer-based translation models, as well as to the Levenshtein Transformer, an established NAR MT model, we show that the multimodality problem that limits NAR machine translation performance is also a challenge to diffusion models. We demonstrate that knowledge distillation from an autoregressive model improves the performance of diffusion-based MT. A thorough analysis of the translation quality for inputs of different lengths shows that the diffusion models struggle more with long-range dependencies than the other models.

pdf bib
Forged-GAN-BERT: Authorship Attribution for LLM-Generated Forged Novels
Kanishka Silva | Ingo Frommholz | Burcu Can | Fred Blain | Raheem Sarwar | Laura Ugolini

The advancement of generative Large Language Models (LLMs) capable of producing human-like texts introduces challenges related to the authenticity of text documents. This requires exploring potential forgery scenarios within the context of authorship attribution, especially in the literary domain. In particular, two aspects of doubted authorship may arise in novels: a novel may be falsely attributed to a renowned author, or it may copy the writing style of a well-known novel. To address these concerns, we introduce Forged-GAN-BERT, a modified GAN-BERT-based model that improves the classification of forged novels through two data-augmentation aspects: via the Forged Novels Generator (i.e., ChatGPT) and via the generator in the GAN. Compared to other transformer-based models, the proposed Forged-GAN-BERT model demonstrates improved performance, with F1 scores of 0.97 and 0.71 for identifying forged novels in single-author and multi-author classification settings. Additionally, we explore different prompt categories for generating the forged novels and analyse the quality of the generated texts using different similarity measures, including ROUGE-1, Jaccard Similarity, Overlap Coefficient, and Cosine Similarity.
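The lexical similarity measures named above can be sketched in a few lines; the implementations below are simple bag-of-words versions applied to two toy sentences, meant only to illustrate the metrics, not the authors' evaluation code.

```python
# Small sketch of bag-of-words similarity measures: Jaccard, overlap coefficient, cosine.
from collections import Counter
import math

def jaccard(a, b):
    A, B = set(a.split()), set(b.split())
    return len(A & B) / len(A | B)

def overlap_coefficient(a, b):
    A, B = set(a.split()), set(b.split())
    return len(A & B) / min(len(A), len(B))

def cosine(a, b):
    A, B = Counter(a.split()), Counter(b.split())
    dot = sum(A[w] * B[w] for w in A)
    norm = math.sqrt(sum(v * v for v in A.values())) * math.sqrt(sum(v * v for v in B.values()))
    return dot / norm

original = "the old manor stood silent beneath the winter sky"   # toy texts
forged = "the ancient manor stood quiet beneath a winter sky"
print(jaccard(original, forged), overlap_coefficient(original, forged), cosine(original, forged))
```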

pdf bib
Thesis Proposal: Detecting Empathy Using Multimodal Language Model
Md Rakibul Hasan | Md Zakir Hossain | Aneesh Krishna | Shafin Rahman | Tom Gedeon

Empathy is crucial in numerous social interactions, including human-robot, patient-doctor, teacher-student, and customer-call centre conversations. Despite its importance, empathy detection in videos continues to be a challenging task because of the subjective nature of empathy and often remains under-explored. Existing studies have relied on scripted or semi-scripted interactions in text-, audio-, or video-only settings that fail to capture the complexities and nuances of real-life interactions. This PhD research aims to fill these gaps by developing a multimodal language model (MMLM) that detects empathy in audiovisual data. In addition to leveraging existing datasets, the proposed study involves collecting real-life interaction video and audio. This study will leverage optimisation techniques like neural architecture search to deliver an optimised small-scale MMLM. Successful implementation of this project has significant implications in enhancing the quality of social interactions as it enables real-time measurement of empathy and thus provides potential avenues for training for better empathy in interactions.

pdf bib
Toward Sentiment Aware Semantic Change Analysis
Roksana Goworek | Haim Dubossarsky

This student paper explores the potential of augmenting computational models of semantic change with sentiment information. It tests the efficacy of this approach on the English SemEval task on Lexical Semantic Change and its associated historical corpora. We first establish the feasibility of our approach by demonstrating that existing models extract reliable sentiment information from historical corpora, and then validate that words that underwent semantic change also show greater sentiment change compared to historically stable words. We then integrate sentiment information into standard models of semantic change for individual words and test whether this can improve the overall performance of the latter, showing mixed results. This research contributes to our understanding of language change by providing the first attempt to enrich standard models of semantic change with additional information. It taps into the multifaceted nature of language change, which should not be reduced to a binary or scalar report of change but should incorporate additional dimensions, sentiment being only one of them. As such, this student paper suggests novel directions for future work on integrating additional, more nuanced information about change and its interpretation for finer-grained semantic change analysis.

pdf bib
Dynamic Task-Oriented Dialogue: A Comparative Study of Llama-2 and Bert in Slot Value Generation
Tiziano Labruna | Sofia Brenna | Bernardo Magnini

Recent advancements in instruction-based language models have demonstrated exceptional performance across various natural language processing tasks. We present a comprehensive analysis of the performance of two open-source language models, BERT and Llama-2, in the context of dynamic task-oriented dialogues. Focusing on the Restaurant domain and utilizing the MultiWOZ 2.4 dataset, our investigation centers on the models’ ability to generate predictions for masked slot values within text. The dynamic aspect is introduced through simulated domain changes, mirroring real-world scenarios where new slot values are incrementally added to a domain over time. This study contributes to the understanding of instruction-based models’ effectiveness in dynamic natural language understanding tasks when compared to traditional language models and emphasizes the significance of open-source, reproducible models in advancing research within the academic community.