Findings of the Association for Computational Linguistics: EACL 2024

Anthology ID:: 2024.findings-eacl
Month:: March
Year:: 2024
Address:: St. Julian’s, Malta
Venue:: Findings
SIG:
Publisher:: Association for Computational Linguistics
URL:: https://aclanthology.org/2024.findings-eacl
DOI:
Bib Export formats:: BibTeX MODS XML EndNote
PDF:: https://aclanthology.org/2024.findings-eacl.pdf

pdf bib
Findings of the Association for Computational Linguistics: EACL 2024
Yvette Graham | Matthew Purver

Fine-grained few-shot entity extraction in the chemical domain faces two unique challenges. First, compared with entity extraction tasks in the general domain, sentences from chemical papers usually contain more entities. Moreover, entity extraction models usually have difficulty extracting entities of long-tailed types. In this paper, we propose Chem-FINESE, a novel sequence-to-sequence (seq2seq) based few-shot entity extraction approach, to address these two challenges. Our Chem-FINESE has two components: a seq2seq entity extractor to extract named entities from the input sentence and a seq2seq self-validation module to reconstruct the original input sentence from extracted entities. Inspired by the fact that a good entity extraction system needs to extract entities faithfully, our new self-validation module leverages entity extraction results to reconstruct the original input sentence. Besides, we design a new contrastive loss to reduce excessive copying during the extraction process. Finally, we release ChemNER+, a new fine-grained chemical entity extraction dataset that is annotated by domain experts with the ChemNER schema. Experiments in few-shot settings with both ChemNER+ and CHEMET datasets show that our newly proposed framework has contributed up to 8.26% and 6.84% absolute F1-score gains respectively.

pdf bib abs
GPTs Are Multilingual Annotators for Sequence Generation Tasks
Juhwan Choi | Eunju Lee | Kyohoon Jin | YoungBin Kim

Data annotation is an essential step for constructing new datasets. However, the conventional approach of data annotation through crowdsourcing is both time-consuming and expensive. In addition, the complexity of this process increases when dealing with low-resource languages owing to the difference in the language pool of crowdworkers. To address these issues, this study proposes an autonomous annotation method by utilizing large language models, which have been recently demonstrated to exhibit remarkable performance. Through our experiments, we demonstrate that the proposed method is not just cost-efficient but also applicable for low-resource language annotation. Additionally, we constructed an image captioning dataset using our approach and are committed to open this dataset for future study. We have opened our source code for further study and reproducibility.

pdf bib abs
Next Visit Diagnosis Prediction via Medical Code-Centric Multimodal Contrastive EHR Modelling with Hierarchical Regularisation
Heejoon Koo

Predicting next visit diagnosis using Electronic Health Records (EHR) is an essential task in healthcare, critical for devising proactive future plans for both healthcare providers and patients. Nonetheless, many preceding studies have not sufficiently addressed the heterogeneous and hierarchical characteristics inherent in EHR data, inevitably leading to sub-optimal performance. To this end, we propose NECHO, a novel medical code-centric multimodal contrastive EHR learning framework with hierarchical regularisation. First, we integrate multifaceted information encompassing medical codes, demographics, and clinical notes using a tailored network design and a pair of bimodal contrastive losses, all of which pivot around a medical codes representation. We also regularise modality-specific encoders using a parental level information in medical ontology to learn hierarchical structure of EHR data. A series of experiments on MIMIC-III data demonstrates effectiveness of our approach.

pdf bib abs
FlexiQA: Leveraging LLM’s Evaluation Capabilities for Flexible Knowledge Selection in Open-domain Question Answering
Yuhan Chen | Shuqi Li | Rui Yan

Nowadays, large language models (LLMs) have demonstrated their ability to be a powerful knowledge generator of generate-then-read paradigm for open-domain question answering (ODQA). However this new paradigm mainly suffers from the “hallucination” and struggles to handle time-sensitive issue because of its expensive knowledge update costs. On the other hand, retrieve-then-read, as a traditional paradigm, is more limited by the relevance of acquired knowledge to the given question. In order to combine the strengths of both paradigms, and overcome their respective shortcomings, we design a new pipeline called “FlexiQA”, in which we utilize the diverse evaluation capabilities of LLMs to select knowledge effectively and flexibly. First, given a question, we prompt a LLM as a discriminator to identify whether it is time-sensitive. For time-sensitive questions, we follow the retrieve-then-read paradigm to obtain the answer. For the non time-sensitive questions, we further prompt the LLM as an evaluator to select a better document from two perspectives: factuality and relevance. Based on the selected document, we leverage a reader to get the final answer. We conduct extensive experiments on three widely-used ODQA benchmarks, the experimental results fully confirm the effectiveness of our approach.

The recent advancements in the realm of Automatic Speech Recognition (ASR) post-processing have been primarily driven by sequence-to-sequence paradigms. Despite their effectiveness, these methods often demand substantial amounts of data, necessitating the expensive recruitment of phonetic transcription experts to rectify the erroneous outputs of ASR systems, thereby creating the desired training data. Back TranScription (BTS) alleviates this issue by generating ASR inputs from clean text via a Text-to-Speech (TTS) system. While initial studies on BTS exhibited promise, they were constrained by a limited dataset of just 200,000 sentence pairs, leaving the scalability of this method in question. In this study, we delve into the potential scalability of BTS. We introduce the “Hyper-BTS” dataset, a corpus approximately five times larger than that utilized in prior research. Additionally, we present innovative criteria for categorizing error types within ASR post-processing. This not only facilitates a more comprehensive qualitative analysis, which was absent in preceding studies, but also enhances the understanding of ASR error patterns. Our empirical results, both quantitative and qualitative, suggest that the enlarged scale of the Hyper-BTS dataset sufficiently addresses a vast majority of the ASR error categories. We make the Hyper-BTS dataset publicly available.

We present ParrotTTS, a modularized text-to-speech synthesis model leveraging disentangled self-supervised speech representations. It can train a multi-speaker variant effectively using transcripts from a single speaker. ParrotTTS adapts to a new language in low resource setup and generalizes to languages not seen while training the self-supervised backbone. Moreover, without training on bilingual or parallel examples, ParrotTTS can transfer voices across languages while preserving the speaker-specific characteristics, e.g., synthesizing fluent Hindi speech using a French speaker’s voice and accent. We present extensive results in monolingual and multi-lingual scenarios. ParrotTTS outperforms state-of-the-art multi-lingual text-to-speech (TTS) models using only a fraction of paired data as latter. Speech samples from ParrotTTS and code can be found at https://parrot-tts.github.io/tts/

pdf bib abs
NavHint: Vision and Language Navigation Agent with a Hint Generator
Yue Zhang | Quan Guo | Parisa Kordjamshidi

The existing work on vision and language navigation mainly relies on navigation-related losses to establish the connection between vision and language modalities, neglecting aspects of helping the navigation agent build a deep understanding of the visual environment.In our work, we provide indirect supervision to the navigation agent through a hint generator that provides detailed visual descriptions.The hint generator assists the navigation agent in developing a global understanding of the visual environment. It directs the agent’s attention toward related navigation details, including the relevant sub-instruction, potential challenges in recognition and ambiguities in grounding, and the targeted viewpoint description. To train the hint generator, we construct a synthetic dataset based on landmarks in the instructions and visible and distinctive objects in the visual environment.We evaluate our method on the R2R and R4R datasets and achieve state-of-the-art on several metrics. The experimental results demonstrate that generating hints not only enhances the navigation performance but also helps improve the agent’s interpretability.

pdf bib abs
Text or Image? What is More Important in Cross-Domain Generalization Capabilities of Hate Meme Detection Models?
Piush Aggarwal | Jawar Mehrabanian | Weigang Huang | Özge Alacam | Torsten Zesch

This paper delves into the formidable challenge of cross-domain generalization in multimodal hate meme detection, presenting compelling findings. We provide evidence supporting the hypothesis that only the textual component of hateful memes enables the multimodal classifier to generalize across different domains, while the image component proves highly sensitive to a specific training dataset. The evidence includes demonstrations showing that hate-text classifiers perform similarly to hate-meme classifiers in a zero-shot setting. Simultaneously, the introduction of captions generated from images of memes to the hate-meme classifier worsens performance by an average F1 of 0.02. Through blackbox explanations, we identify a substantial contribution of the text modality (average of 83%), which diminishes with the introduction of meme’s image captions (52%). Additionally, our evaluation on a newly created confounder dataset reveals higher performance on text confounders as compared to image confounders with average ∆F1 of 0.18.

pdf bib abs
Where are we Still Split on Tokenization?
Rob Goot

Many Natural Language Processing (NLP) tasks are labeled on the token level, forthese tasks, the first step is to identify the tokens (tokenization). Becausethis step is often considered to be a solved problem, gold tokenization iscommonly assumed. In this paper, we propose an efficient method fortokenization with subword-based language models, and reflect on the status ofperformance on the tokenization task by evaluating on 122 languages in 20different scripts. We show that our proposed model performs on par with thestate-of-the-art, and that tokenization performance is mainly dependent on theamount and consistency of annotated data. We conclude that besidesinconsistencies in the data and exceptional cases the task can be consideredsolved for Latin languages for in-dataset settings (>99.5 F1). However,performance is 0.75 F1 point lower on average for datasets in other scripts andperformance deteriorates in cross-dataset setups.

Large language models excel in text generation and generalization, however they face challenges in text editing tasks, especially in correcting spelling errors and mistyping.In this paper, we present a methodology for generative spelling correction (SC), tested on English and Russian languages and potentially can be extended to any language with minor changes. Our research mainly focuses on exploring natural spelling errors and mistyping in texts and studying how those errors can be emulated in correct sentences to enrich generative models’ pre-train procedure effectively. We investigate the effects of emulations in various text domains and examine two spelling corruption techniques: 1) first one mimics human behavior when making a mistake through leveraging statistics of errors from a particular dataset, and 2) second adds the most common spelling errors, keyboard miss clicks, and some heuristics within the texts.We conducted experiments employing various corruption strategies, models’ architectures, and sizes in the pre-training and fine-tuning stages and evaluated the models using single-domain and multi-domain test sets. As a practical outcome of our work, we introduce SAGE (Spell checking via Augmentation and Generative distribution Emulation).

pdf bib abs
How Does In-Context Learning Help Prompt Tuning?
Simeng Sun | Yang Liu | Dan Iter | Chenguang Zhu | Mohit Iyyer

Fine-tuning large language models is becoming ever more impractical due to their rapidly-growing scale. This motivates the use of parameter-efficient adaptation methods such as prompt tuning (PT), which adds a small number of tunable embeddings to an otherwise frozen model, and in-context learning (ICL), in which demonstrations of the task are provided to the model in natural language without any additional training. Recently, (CITATION) propose “instruction prompt tuning” (IPT), which combines PT with ICL by concatenating a natural language demonstration with learned prompt embeddings. While all of these methods have proven effective on different tasks, how they interact with each other remains unexplored. In this paper, we empirically study when and how in-context examples improve prompt tuning by measuring the effectiveness of ICL, PT, and IPT on five text generation tasks with multiple base language models. We observe that (1) IPT does not always outperform PT, and in fact requires the in-context demonstration to be semantically similar to the test input to yield improvements; (2) PT is unstable and exhibits high variance, but combining PT and ICL (into IPT) consistently reduces variance across all five tasks; and(3) prompts learned for a specific source task via PT exhibit positive transfer when paired with in-context examples of a different target task. Our results offer actionable insights on choosing a suitable parameter-efficient adaptation method for a given task.

pdf bib abs
Large Language Models for Psycholinguistic Plausibility Pretesting
Samuel Amouyal | Aya Meltzer-Asscher | Jonathan Berant

In psycholinguistics, the creation of controlled materials is crucial to ensure that research outcomes are solely attributed to the intended manipulations and not influenced by extraneous factors. To achieve this, psycholinguists typically pretest linguistic materials, where a common pretest is to solicit plausibility judgments from human evaluators on specific sentences. In this work, we investigate whether Language Models (LMs) can be used to generate these plausibility judgements. We investigate a wide range of LMs across multiple linguistic structures and evaluate whether their plausibility judgements correlate with human judgements. We find that GPT-4 plausibility judgements highly correlate with human judgements across the structures we examine, whereas other LMs correlate well with humans on commonly used syntactic structures. We then test whether this correlation implies that LMs can be used instead of humans for pretesting. We find that when coarse-grained plausibility judgements are needed, this works well, but when fine-grained judgements are necessary, even GPT-4 does not provide satisfactory discriminative power.

pdf bib abs
Modeling Aspect Sentiment Coherency via Local Sentiment Aggregation
Heng Yang | Ke Li

Aspect sentiment coherency is an intriguing yet underexplored topic in the field of aspect-based sentiment classification. This concept reflects the common pattern where adjacent aspects often share similar sentiments. Despite its prevalence, current studies have not fully recognized the potential of modeling aspect sentiment coherency, including its implications in adversarial defense. To model aspect sentiment coherency, we propose a novel local sentiment aggregation (LSA) paradigm based on constructing a differential-weighted sentiment aggregation window. We have rigorously evaluated our model through experiments, and the results affirm the proficiency of LSA in terms of aspect coherency prediction and aspect sentiment classification. For instance, it outperforms existing models and achieves state-of-the-art sentiment classification performance across five public datasets. Furthermore, we demonstrate the promising ability of LSA in ABSC adversarial defense, thanks to its sentiment coherency modeling. To encourage further exploration and application of this concept, we have made our code publicly accessible. This will provide researchers with a valuable tool to delve into sentiment coherency modeling in future research.

pdf bib abs
An Examination of the Robustness of Reference-Free Image Captioning Evaluation Metrics
Saba Ahmadi | Aishwarya Agrawal

Recently, reference-free metrics such as CLIPScore (Hessel et al., 2021), UMIC (Lee et al., 2021), and PAC-S (Sarto et al., 2023) have been proposed for automatic reference-free evaluation of image captions. Our focus lies in evaluating the robustness of these metrics in scenarios that require distinguishing between two captions with high lexical overlap but very different meanings. Our findings reveal that despite their high correlation with human judgments, CLIPScore, UMIC, and PAC-S struggle to identify fine-grained errors. While all metrics exhibit strong sensitivity to visual grounding errors, their sensitivity to caption implausibility errors is limited. Furthermore, we found that all metrics are sensitive to variations in the size of image-relevant objects mentioned in the caption, while CLIPScore and PAC-S are also sensitive to the number of mentions of image-relevant objects in the caption. Regarding linguistic aspects of a caption, all metrics show weak comprehension of negation, and CLIPScore and PAC-S are insensitive to the structure of the caption to a great extent. We hope our findings will guide further improvements in reference-free evaluation of image captioning.

Simultaneous interpretation is an especially challenging form of translation because it requires converting speech from one language to another in real-time. Though prior work has relied on out-of-the-box machine translation metrics to evaluate interpretation data, we hypothesize that strategies common in high-quality human interpretations, such as summarization, may not be handled well by standard machine translation metrics. In this work, we examine both qualitatively and quantitatively four potential barriers to evaluation of interpretation: disfluency, summarization, paraphrasing, and segmentation. Our experiments reveal that, while some machine translation metrics correlate fairly well with human judgments of interpretation quality, much work is still needed to account for strategies of interpretation during evaluation. As a first step to address this, we develop a fine-tuned model for interpretation evaluation, and achieve better correlation with human judgments than the state-of-the-art machine translation metrics.

pdf bib abs
Inconsistent dialogue responses and how to recover from them
Mian Zhang | Lifeng Jin | Linfeng Song | Haitao Mi | Dong Yu

One critical issue for chat systems is to stay consistent about preferences, opinions, beliefs and facts of itself, which has been shown a difficult problem. In this work, we study methods to assess and bolster utterance consistency of chat systems. A dataset is first developed for studying the inconsistencies, where inconsistent dialogue responses, explanations of the inconsistencies, and recovery utterances are authored by annotators. This covers the life span of inconsistencies, namely introduction, understanding, and resolution. Building on this, we introduce a set of tasks centered on dialogue consistency, specifically focused on its detection and resolution. Our experimental findings indicate that our dataset significantly helps the progress in identifying and resolving conversational inconsistencies, and current popular large language models like ChatGPT which are good at resolving inconsistencies however still struggle with detection.

pdf bib abs
MUG: Interactive Multimodal Grounding on User Interfaces
Tao Li | Gang Li | Jingjie Zheng | Purple Wang | Yang Li

We present MUG, a novel interactive task for multimodal grounding where a user and an agent work collaboratively on an interface screen. Prior works modeled multimodal UI grounding in one round: the user gives a command and the agent responds to the command. Yet, in a realistic scenario, a user command can be ambiguous when the target action is inherently difficult to articulate in natural language. MUG allows multiple rounds of interactions such that upon seeing the agent responses, the user can give further commands for the agent to refine or even correct its actions. Such interaction is critical for improving grounding performances in real-world use cases. To investigate the problem, we create a new dataset that consists of 77,820 sequences of human user-agent interaction on mobile interfaces in which 20% involves multiple rounds of interactions. To establish benchmark, we experiment with a range of modeling variants and evaluation strategies, including both offline and online evaluation—the online strategy consists of both human evaluation and automatic with simulators. Our experiments show that iterative interaction significantly improves the absolute task completion by 18% over the entire test set and 31% over the challenging split. Our results lay the foundation for further investigation of the problem.

pdf bib abs
PRILoRA: Pruned and Rank-Increasing Low-Rank Adaptation
Nadav Benedek | Lior Wolf

With the proliferation of large pre-trained language models (PLMs), fine-tuning all model parameters becomes increasingly inefficient, particularly when dealing with numerous downstream tasks that entail substantial training and storage costs. Several approaches aimed at achieving parameter-efficient fine-tuning (PEFT) have been proposed. Among them, Low-Rank Adaptation (LoRA) stands out as an archetypal method, incorporating trainable rank decomposition matrices into each target module. Nevertheless, LoRA does not consider the varying importance of each layer. To address these challenges, we introduce PRILoRA, which linearly allocates a different rank for each layer, in an increasing manner, and performs pruning throughout the training process, considering both the temporary magnitude of weights and the accumulated statistics of the input to any given layer. We validate the effectiveness of PRILoRA through extensive experiments on eight GLUE benchmarks, setting a new state of the art.

pdf bib abs
Revamping Multilingual Agreement Bidirectionally via Switched Back-translation for Multilingual Neural Machine Translation
Hongyuan Lu | Haoyang Huang | Dongdong Zhang | Furu Wei | Wai Lam

Despite the fact that multilingual agreement (MA) has shown its importance for multilingual neural machine translation (MNMT), current methodologies in the field have two shortages: (i) require parallel data between multiple language pairs, which is not always realistic and (ii) optimize the agreement in an ambiguous direction, which hampers the translation performance. We present Bidirectional Multilingual Agreement via Switched Back-translation (BMA-SBT), a novel and universal multilingual agreement framework for fine-tuning pre-trained MNMT models, which (i) exempts the need for aforementioned parallel data by using a novel method called switched BT that creates synthetic text written in another source language using the translation target and (ii) optimizes the agreement bidirectionally with the Kullback-Leibler Divergence loss. Experiments indicate that BMA-SBT clearly improves the strong baselines on the task of MNMT with three benchmarks: TED Talks, News, and Europarl. In-depth analyzes indicate that BMA-SBT brings additive improvements to the conventional BT method.

pdf bib abs
mPLM-Sim: Better Cross-Lingual Similarity and Transfer in Multilingual Pretrained Language Models
Peiqin Lin | Chengzhi Hu | Zheyu Zhang | Andre Martins | Hinrich Schuetze

Recent multilingual pretrained language models (mPLMs) have been shown to encode strong language-specific signals, which are not explicitly provided during pretraining. It remains an open question whether it is feasible to employ mPLMs to measure language similarity, and subsequently use the similarity results to select source languages for boosting cross-lingual transfer. To investigate this, we propose mPLM-Sim, a language similarity measure that induces the similarities across languages from mPLMs using multi-parallel corpora. Our study shows that mPLM-Sim exhibits moderately high correlations with linguistic similarity measures, such as lexicostatistics, genealogical language family, and geographical sprachbund. We also conduct a case study on languages with low correlation and observe that mPLM-Sim yields more accurate similarity results. Additionally, we find that similarity results vary across different mPLMs and different layers within an mPLM. We further investigate whether mPLM-Sim is effective for zero-shot cross-lingual transfer by conducting experiments on both low-level syntactic tasks and high-level semantic tasks. The experimental results demonstrate that mPLM-Sim is capable of selecting better source languages than linguistic measures, resulting in a 1%-2% improvement in zero-shot cross-lingual transfer performance.

This paper serves as a foundational step towards the development of a linguistically motivated and technically relevant evaluation suite for Greek NLP. We initiate this endeavor by introducing four expert-verified evaluation tasks, specifically targeted at natural language inference, word sense disambiguation (through example comparison or sense selection) and metaphor detection. More than language-adapted replicas of existing tasks, we contribute two innovations which will resonate with the broader resource and evaluation community. Firstly, our inference dataset is the first of its kind, marking not just one, but rather all possible inference labels, accounting for possible shifts due to e.g. ambiguity or polysemy. Secondly, we demonstrate a cost-efficient method to obtain datasets for under-resourced languages. Using ChatGPT as a language-neutral parser, we transform the Dictionary of Standard Modern Greek into a structured format, from which we derive the other three tasks through simple projections. Alongside each task, we conduct experiments using currently available state of the art machinery. Our experimental baselines affirm the challenging nature of our tasks and highlight the need for expedited progress in order for the Greek NLP ecosystem to keep pace with contemporary mainstream research.

pdf bib abs
A Comprehensive Evaluation of Inductive Reasoning Capabilities and Problem Solving in Large Language Models
Chen Bowen | Rune Sætre | Yusuke Miyao

Inductive reasoning is fundamental to both human and artificial intelligence. The inductive reasoning abilities of current Large Language Models (LLMs) are evaluated in this research.We argue that only considering induction of rules is too narrow and unrealistic, since inductive reasoning is usually mixed with other abilities, like rules application, results/rules validation, and updated information integration.We probed the LLMs with a set of designed symbolic tasks and found that even state-of-the-art (SotA) LLMs fail significantly, showing the inability of LLMs to perform these intuitively simple tasks.Furthermore, we found that perfect accuracy in a small-size problem does not guarantee the same accuracy in a larger-size version of the same problem, provoking the question of how we can assess the LLMs’ actual problem-solving capabilities.We also argue that Chain-of-Thought prompts help the LLMs by decomposing the problem-solving process, but the LLMs still learn limitedly.Furthermore, we reveal that few-shot examples assist LLM generalization in out-of-domain (OOD) cases, albeit limited. The LLM starts to fail when the problem deviates from the provided few-shot examples.

pdf bib abs
Towards efficient self-supervised representation learning in speech processing
Luis Lugo | Valentin Vielzeuf

Self-supervised learning has achieved impressive results in speech processing, but current models are computationally expensive, generating environmental concerns because of their high energy consumption. Therefore, we propose an efficient self-supervised approach to address high computational costs, using a single GPU during 24 to 48 hours of pretraining. The proposed approach combines linear, convolutional, and self-attention layers with several optimizations, including dynamic batching, flash attention, mixed-precision training, gradient accumulation, and acoustic feature extraction with input preprocessing. Computational cost estimations for our proposed model represent up to two orders of magnitude improvements in computational efficiency against existing speech models.

pdf bib abs
Improving Cross-Domain Low-Resource Text Generation through LLM Post-Editing: A Programmer-Interpreter Approach
Zhuang Li | Levon Haroutunian | Raj Tumuluri | Philip Cohen | Reza Haf

Post-editing has proven effective in improving the quality of text generated by large language models (LLMs) such as GPT-3.5 or GPT-4, particularly when direct updating of their parameters to enhance text quality is infeasible or expensive. However, relying solely on smaller language models for post-editing can limit the LLMs’ ability to generalize across domains. Moreover, the editing strategies in these methods are not optimally designed for text generation tasks. To address these limitations, we propose a neural programmer-interpreter approach that preserves the domain generalization ability of LLMs while editing their output. The editing actions in this framework are specifically devised for text generation. Extensive experiments demonstrate that the programmer-interpreter significantly enhances GPT-3.5’s performance in logical form-to-text conversion and low-resource machine translation, surpassing other state-of-the-art (SOTA) LLM post-editing methods in cross-domain settings.

pdf bib abs
Noise Contrastive Estimation-based Matching Framework for Low-Resource Security Attack Pattern Recognition
Tu Nguyen | Nedim Šrndić | Alexander Neth

Techniques, Tactics and Procedures (TTP) mapping is an important and difficult task in the application of cyber threat intelligence (CTI) extraction for threat reports. TTPs are typically expressed in semantic forms within security knowledge bases like MITRE ATT&CK, serving as textual high-level descriptions for sophisticated attack patterns. Conversely, attacks in CTI threat reports are detailed in a combination of natural and technical language forms, presenting a significant challenge even for security experts to establish correlations or mappings with the corresponding TTPs.Conventional learning approaches often target the TTP mapping problem in the classical multiclass/label classification setting. This setting hinders the learning capabilities of the model, due to the large number of classes (i.e., TTPs), the inevitable skewness of the label distribution and the complex hierarchical structure of the label space. In this work, we approach the problem in a different learning paradigm, such that the assignment of a text to a TTP label is essentially decided by the direct semantic similarity between the two, thus, reducing the complexity of competing solely over the large labeling space. In order that, we propose a neural matching architecture that incorporates a sampling based learn-to-compare mechanism, facilitating the learning process of the matching model despite constrained resources.

pdf bib abs
Large Language Models for Scientific Information Extraction: An Empirical Study for Virology
Mahsa Shamsabadi | Jennifer D’Souza | Sören Auer

In this paper, we champion the use of structured and semantic content representation of discourse-based scholarly communication, inspired by tools like Wikipedia infoboxes or structured Amazon product descriptions. These representations provide users with a concise overview, aiding scientists in navigating the dense academic landscape. Our novel automated approach leverages the robust text generation capabilities of LLMs to produce structured scholarly contribution summaries, offering both a practical solution and insights into LLMs’ emergent abilities.For LLMs, the prime focus is on improving their general intelligence as conversational agents. We argue that these models can also be applied effectively in information extraction (IE), specifically in complex IE tasks within terse domains like Science. This paradigm shift replaces the traditional modular, pipelined machine learning approach with a simpler objective expressed through instructions. Our results show that finetuned FLAN-T5 with 1000x fewer parameters than the state-of-the-art GPT-davinci is competitive for the task.

pdf bib abs
Re3val: Reinforced and Reranked Generative Retrieval
EuiYul Song | Sangryul Kim | Haeju Lee | Joonkee Kim | James Thorne

Generative retrieval models encode pointers to information in a corpus as an index within the model’s parameters. These models serve as part of a larger pipeline, where retrieved information conditions generation for knowledge-intensive NLP tasks. However, we identify two limitations: the generative retrieval does not account for contextual information. Secondly, the retrieval can’t be tuned for the downstream readers as decoding the page title is a non-differentiable operation. This paper introduces Re3val, trained with generative reranking and reinforcement learning using limited data. Re3val leverages context acquired via Dense Passage Retrieval to rerank the retrieved page titles and utilizes REINFORCE to maximize rewards generated by constrained decoding. Additionally, we generate questions from our pre-training dataset to mitigate epistemic uncertainty and bridge the domain gap between the pre-training and fine-tuning datasets. Subsequently, we extract and rerank contexts from the KILT database using the rerank page titles. Upon grounding the top five reranked contexts, Re3val demonstrates the Top 1 KILT scores compared to all other generative retrieval models across five KILT datasets.

pdf bib abs
Entity Linking in the Job Market Domain
Mike Zhang | Rob Goot | Barbara Plank

In Natural Language Processing, entity linking (EL) has centered around Wikipedia, but yet remains underexplored for the job market domain. Disambiguating skill mentions can help us get insight into the current labor market demands. In this work, we are the first to explore EL in this domain, specifically targeting the linkage of occupational skills to the ESCO taxonomy (le Vrang et al., 2014). Previous efforts linked coarse-grained (full) sentences to a corresponding ESCO skill. In this work, we link more fine-grained span-level mentions of skills. We tune two high-performing neural EL models, a bi-encoder (Wu et al., 2020) and an autoregressive model (Cao et al., 2021), on a synthetically generated mention–skill pair dataset and evaluate them on a human-annotated skill-linking benchmark. Our findings reveal that both models are capable of linking implicit mentions of skills to their correct taxonomy counterparts. Empirically, BLINK outperforms GENRE in strict evaluation, but GENRE performs better in loose evaluation (accuracy@k).

pdf bib abs
(Chat)GPT v BERT Dawn of Justice for Semantic Change Detection
Francesco Periti | Haim Dubossarsky | Nina Tahmasebi

In the universe of Natural Language Processing, Transformer-based language models like BERT and (Chat)GPT have emerged as lexical superheroes with great power to solve open research problems. In this paper, we specifically focus on the temporal problem of semantic change, and evaluate their ability to solve two diachronic extensions of the Word-in-Context (WiC) task: TempoWiC and HistoWiC. In particular, we investigate the potential of a novel, off-the-shelf technology like ChatGPT (and GPT) 3.5 compared to BERT, which represents a family of models that currently stand as the state-of-the-art for modeling semantic change. Our experiments represent the first attempt to assess the use of (Chat)GPT for studying semantic change. Our results indicate that ChatGPT performs significantly worse than the foundational GPT version. Furthermore, our results demonstrate that (Chat)GPT achieves slightly lower performance than BERT in detecting long-term changes but performs significantly worse in detecting short-term changes.

pdf bib abs
Towards Unified Uni- and Multi-modal News Headline Generation
Mateusz Krubiński | Pavel Pecina

Thanks to the recent progress in vision-language modeling and the evolving nature of news consumption, the tasks of automatic summarization and headline generation based on multimodal news articles have been gaining popularity. One of the limitations of the current approaches is caused by the commonly used sophisticated modular architectures built upon hierarchical cross-modal encoders and modality-specific decoders, which restrict the model’s applicability to specific data modalities – once trained on, e.g., text+video pairs there is no straightforward way to apply the model to text+image or text-only data. In this work, we propose a unified task formulation that utilizes a simple encoder-decoder model to generate headlines from uni- and multi-modal news articles. This model is trained jointly on data of several modalities and extends the textual decoder to handle the multimodal output.

The ability of Large Language Models (LLMs) to encode syntactic and semantic structures of language is well examined in NLP. Additionally, analogy identification, in the form of word analogies are extensively studied in the last decade of language modeling literature. In this work we specifically look at how LLMs’ abilities to capture sentence analogies (sentences that convey analogous meaning to each other) vary with LLMs’ abilities to encode syntactic and semantic structures of sentences. Through our analysis, we find that LLMs’ ability to identify sentence analogies is positively correlated with their ability to encode syntactic and semantic structures of sentences. Specifically, we find that the LLMs which capture syntactic structures better, also have higher abilities in identifying sentence analogies.

pdf bib abs
Contextualization Distillation from Large Language Model for Knowledge Graph Completion
Dawei Li | Zhen Tan | Tianlong Chen | Huan Liu

While textual information significantly enhances the performance of pre-trained language models (PLMs) in knowledge graph completion (KGC), the static and noisy nature of existing corpora collected from Wikipedia articles or synsets definitions often limits the potential of PLM-based KGC models. To surmount these challenges, we introduce the Contextualization Distillation strategy, a versatile plug-in-and-play approach compatible with both discriminative and generative KGC frameworks. Our method begins by instructing large language models (LLMs) to transform compact, structural triplets into context-rich segments. Subsequently, we introduce two tailored auxiliary tasks—reconstruction and contextualization—allowing smaller KGC models to assimilate insights from these enriched triplets. Comprehensive evaluations across diverse datasets and KGC techniques highlight the efficacy and adaptability of our approach, revealing consistent performance enhancements irrespective of underlying pipelines or architectures. Moreover, our analysis makes our method more explainable and provides insight into how to generate high-quality corpora for KGC, as well as the selection of suitable distillation tasks.

pdf bib abs
Differentially Private Natural Language Models: Recent Advances and Future Directions
Lijie Hu | Ivan Habernal | Lei Shen | Di Wang

Recent developments in deep learning have led to great success in various natural language processing (NLP) tasks. However, these applications may involve data that contain sensitive information. Therefore, how to achieve good performance while also protecting the privacy of sensitive data is a crucial challenge in NLP. To preserve privacy, Differential Privacy (DP), which can prevent reconstruction attacks and protect against potential side knowledge, is becoming a de facto technique for private data analysis. In recent years, NLP in DP models (DP-NLP) has been studied from different perspectives, which deserves a comprehensive review. In this paper, we provide the first systematic review of recent advances in DP deep learning models in NLP. In particular, we first discuss some differences and additional challenges of DP-NLP compared with the standard DP deep learning. Then, we investigate some existing work on DP-NLP andpresent its recent developments from three aspects: gradient perturbation based methods, embedding vector perturbation based methods, and ensemble model based methods. We also discuss some challenges and future directions.

pdf bib abs
Learning to Compare Financial Reports for Financial Forecasting
Ross Koval | Nicholas Andrews | Xifeng Yan

Public companies in the US are required to publish annual reports that detail their recent financial performance, present the current state of ongoing business operations, and discuss future prospects. However, they typically contain over 25,000 words across all sections, large amounts of industry and legal jargon, and a high percentage of boilerplate content that does not change much year-to-year. These unique characteristics present challenges for many generic pretrained language models because it is likely that only a small percentage of the long report that reflects salient information contains meaningful signal about the future prospects of the company. In this work, we curate a large-scale dataset of paired financial reports and introduce two novel, challenging tasks of predicting long-horizon company risk and correlation that evaluate the ability of the model to recognize cross-document relationships with complex, nuanced signals. We explore and present a comprehensive set of methods and experiments, and establish strong baselines designed to learn to identify subtle similarities and differences between long documents. Furthermore, we demonstrate that it is possible to predict company risk and correlation solely from the text of their financial reports and further that modeling the cross-document interactions at a fine-grained level provides significant benefit. Finally, we probe the best performing model through quantitative and qualitative interpretability methods to reveal some insight into the underlying task signal.

Geoparsing is a fundamental technique for analyzing geo-entity information in text, which is useful for geographic applications, e.g., tourist spot recommendation. We focus on document-level geoparsing that considers geographic relatedness among geo-entity mentions and present a Japanese travelogue dataset designed for training and evaluating document-level geoparsing systems. Our dataset comprises 200 travelogue documents with rich geo-entity information: 12,171 mentions, 6,339 coreference clusters, and 2,551 geo-entities linked to geo-database entries.

pdf bib abs
Knowledge Generation for Zero-shot Knowledge-based VQA
Rui Cao | Jing Jiang

Previous solutions to knowledge-based visual question answering (K-VQA) retrieve knowledge from external knowledge bases and use supervised learning to train the K-VQA model.Recently pre-trained LLMs have been used as both a knowledge source and a zero-shot QA model for K-VQA and demonstrated promising results.However, these recent methods do not explicitly show the knowledge needed to answer the questions and thus lack interpretability.Inspired by recent work on knowledge generation from LLMs for text-based QA, in this work we propose and test a similar knowledge-generation-based K-VQA method, which first generates knowledge from an LLM and then incorporates the generated knowledge for K-VQA in a zero-shot manner. We evaluate our method on two K-VQA benchmarks and found that our method performs better than previous zero-shot K-VQA methods and our generated knowledge is generally relevant and helpful.

pdf bib abs
Simple Temperature Cool-down in Contrastive Framework for Unsupervised Sentence Representation Learning
Yoo Hyun Jeong | Myeong Soo Han | Dong-Kyu Chae

In this paper, we proposes a simple, tricky method to improve sentence representation of unsupervised contrastive learning. Even though contrastive learning has achieved great performances in both visual representation learning (VRL) and sentence representation learning (SRL) fields, we focus on the fact that there is a gap between characteristics and training dynamics of VRL and SRL. We first examine the role of temperature to bridge the gap between VRL and SRL, and find some temperature-dependent elements in SRL; i.e., a higher temperature causes overfitting of the uniformity while improving the alignment in earlier phase of training. Then, we design a temperature cool-down technique based on this observation, which helps PLMs to be more suitable for contrastive learning via preparation of uniform representation space. Our experimental results on widely-utilized benchmarks demonstrate the effectiveness and extensiblity of our method.

pdf bib abs
Bootstrap Your Own PLM: Boosting Semantic Features of PLMs for Unsuperivsed Contrastive Learning
Yoo Hyun Jeong | Myeong Soo Han | Dong-Kyu Chae

This paper aims to investigate the possibility of exploiting original semantic features of PLMs (pre-trained language models) during contrastive learning in the context of SRL (sentence representation learning). In the context of feature modification, we identified a method called IFM (implicit feature modification), which reduces the tendency of contrastive models for VRL (visual representation learning) to rely on feature-suppressing shortcut solutions. We observed that IFM did not work well for SRL, which may be due to differences between the nature of VRL and SRL. We propose BYOP, which boosts well-represented features, taking the opposite idea of IFM, under the assumption that SimCSE’s dropout-noise-based augmentation may be too simple to modify high-level semantic features, and that the features learned by PLMs are semantically meaningful and should be boosted, rather than removed. Extensive experiments lend credence to the logic of BYOP, which considers the nature of SRL.

pdf bib abs
Personalized Abstractive Summarization by Tri-agent Generation Pipeline
Wen Xiao | Yujia Xie | Giuseppe Carenini | Pengcheng He

Tailoring outputs from large language models, like ChatGPT, to implicit user preferences remains a challenge despite their impressive generative capabilities. In this paper, we propose a tri-agent generation pipeline comprising a generator, an instructor, and an editor to enhance output personalization. The generator produces an initial output, the instructor automatically generates editing instructions based on user preferences, and the editor refines the output to align with those preferences. The inference-only large language model (ChatGPT) serves as both the generator and editor, with a smaller model acting as the instructor to guide output generation. We train the instructor using editor-steered reinforcement learning, leveraging feedback from a large-scale editor model to optimize instruction generation. Experimental results on two abstractive summarization datasets demonstrate the effectiveness of our approach in generating outputs that better meet user expectations.

pdf bib abs
Revisiting the Markov Property for Machine Translation
Cunxiao Du | Hao Zhou | Zhaopeng Tu | Jing Jiang

In this paper, we re-examine the Markov property in the context of neural machine translation. We design a Markov Autoregressive Transformer (MAT) and undertake a comprehensive assessment of its performance across four WMT benchmarks. Our findings indicate that MAT with an order larger than 4 can generate translations with quality on par with that of conventional autoregressive transformers. In addition, counter-intuitively, we also find that the advantages of utilizing a higher-order MAT do not specifically contribute to the translation of longer sentences.

pdf bib abs
Reward Engineering for Generating Semi-structured Explanation
Jiuzhou Han | Wray Buntine | Ehsan Shareghi

Semi-structured explanation depicts the implicit process of a reasoner with an explicit representation. This explanation highlights how available information in a specific query is utilised and supplemented with information a reasoner produces from its internal weights towards generating an answer. Despite the recent improvements in generative capabilities of language models, producing structured explanations to verify a model’s true reasoning capabilities remains a challenge. This issue is particularly pronounced for not-so-large LMs (e.g., FLAN-T5-XXL). In this work, we first underscore the limitations of supervised fine-tuning (SFT) in tackling this challenge, and then introduce a carefully crafted reward engineering method in reinforcement learning (RL) to better address this problem. We investigate multiple reward aggregation methods and provide a detailed discussion which sheds light on the promising potential of RL for future research. Our proposed method on two semi-structured explanation generation benchmarks (ExplaGraph and COPA-SSE) achieves new state-of-the-art results.

pdf bib abs
Towards Context-Based Violence Detection: A Korean Crime Dialogue Dataset
Minju Kim | Heuiyeen Yeen | Myoung-Wan Koo

In order to enhance the security of society, there is rising interest in artificial intelligence (AI) to help detect and classify in advanced violence in daily life. The field of violence detection has introduced various datasets, yet context-based violence detection predominantly focuses on vision data, with a notable lack of NLP datasets. To overcome this, this paper presents the first Korean dialogue dataset for classifying violence that occurs in online settings: the Korean Crime Dialogue Dataset (KCDD). KCDD contains 22,249 dialogues created by crowd workers assuming offline scenarios. It has four criminal classes that meet international legal standards and one clean class (Serious Threats, Extortion or Blackmail, Harassment in the Workplace, Other Harassment, and Clean Dialogue). Plus, we propose a strong baseline for the proposed dataset, Relationship-Aware BERT. The model shows that understanding varying relationships among interlocutors improves the performance of crime dialogue classification. We hope that the proposed dataset will be used to detect cases of violence and aid people in danger. The KCDD dataset and corresponding baseline implementations can be found at the following link: https://sites.google.com/view/kcdd.

pdf bib abs
Capturing the Relationship Between Sentence Triplets for LLM and Human-Generated Texts to Enhance Sentence Embeddings
Na Min An | Sania Waheed | James Thorne

Deriving meaningful sentence embeddings is crucial in capturing the semantic relationship between texts. Recent advances in building sentence embedding models have centered on replacing traditional human-generated text datasets with those generated by LLMs. However, the properties of these widely used LLM-generated texts remain largely unexplored. Here, we evaluate the quality of the LLM-generated texts from four perspectives (Positive Text Repetition, Length Difference Penalty, Positive Score Compactness, and Negative Text Implausibility) and find that there exists an inherent difference between human and LLM-generated datasets. To further enhance sentence embeddings using both human and LLM-generated datasets, we propose a novel loss function that incorporates Positive-Negative sample Augmentation (PNA) within the contrastive learning objective. Our results demonstrate that PNA effectively mitigates the sentence anisotropy problem in Wikipedia corpus (-7% compared to CLHAIF) and simultaneously improves the Spearman’s correlation in standard Semantic Textual Similarity (STS) tasks (+1.47% compared to CLHAIF).

pdf bib abs
Harmonizing Code-mixed Conversations: Personality-assisted Code-mixed Response Generation in Dialogues
Shivani Kumar | Tanmoy Chakraborty

Code-mixing, the blending of multiple languages within a single conversation, introduces a distinctive challenge, particularly in the context of response generation. Capturing the intricacies of code-mixing proves to be a formidable task, given the wide-ranging variations influenced by individual speaking styles and cultural backgrounds. In this study, we explore response generation within code-mixed conversations. We introduce a novel approach centered on harnessing the Big Five personality traits acquired in an unsupervised manner from the conversations to bolster the performance of response generation. These inferred personality attributes are seamlessly woven into the fabric of the dialogue context, using a novel fusion mechanism, . It uses an effective two-step attention formulation to fuse the dialogue and personality information. This fusion not only enhances the contextual relevance of generated responses but also elevates the overall performance of the model. Our experimental results, grounded in a dataset comprising of multi-party Hindi-English code-mix conversations, highlight the substantial advantages offered by personality-infused models over their conventional counterparts. This is evident in the increase observed in ROUGE and BLUE scores for the response generation task when the identified personality is seamlessly integrated into the dialogue context. Qualitative assessment for personality identification and response generation aligns well with our quantitative results.

pdf bib abs
Morality is Non-Binary: Building a Pluralist Moral Sentence Embedding Space using Contrastive Learning
Jeongwoo Park | Enrico Liscio | Pradeep Murukannaiah

Recent advances in NLP show that language models retain a discernible level of knowledge in deontological ethics and moral norms. However, existing works often treat morality as binary, ranging from right to wrong. This simplistic view does not capture the nuances of moral judgment. Pluralist moral philosophers argue that human morality can be deconstructed into a finite number of elements, respecting individual differences in moral judgment. In line with this view, we build a pluralist moral sentence embedding space via a state-of-the-art contrastive learning approach. We systematically investigate the embedding space by studying the emergence of relationships among moral elements, both quantitatively and qualitatively. Our results show that a pluralist approach to morality can be captured in an embedding space. However, moral pluralism is challenging to deduce via self-supervision alone and requires a supervised approach with human labels.

pdf bib abs
Prosody in Cascade and Direct Speech-to-Text Translation: a case study on Korean Wh-Phrases
Giulio Zhou | Tsz Kin Lam | Alexandra Birch | Barry Haddow

Speech-to-Text Translation (S2TT) has typically been addressed with cascade systems, where speech recognition systems generate a transcription that is subsequently passed to a translation model. While there has been a growing interest in developing direct speech translation systems to avoid propagating errors and losing non-verbal content, prior work in direct S2TT has struggled to conclusively establish the advantages of integrating the acoustic signal directly into the translation process. This work proposes using contrastive evaluation to quantitatively measure the ability of direct S2TT systems to disambiguate utterances where prosody plays a crucial role. Specifically, we evaluated Korean-English translation systems on a test set containing wh-phrases, for which prosodic features are necessary to produce translations with the correct intent, whether it’s a statement, a yes/no question, a wh-question, and more. Our results clearly demonstrate the value of direct translation systems over cascade translation models, with a notable 12.9% improvement in overall accuracy in ambiguous cases, along with up to a 15.6% increase in F1 scores for one of the major intent categories. To the best of our knowledge, this work stands as the first to provide quantitative evidence that direct S2TT models can effectively leverage prosody. The code for our evaluation is openly accessible and freely available for review and utilisation.

This paper aims to quantitatively evaluate the performance of ChatGPT, an interactive large language model, on inter-sentential relations such as temporal relations, causal relations, and discourse relations. Given ChatGPT’s promising performance across various tasks, we proceed to carry out thorough evaluations on the whole test sets of 11 datasets, including temporal and causal relations, PDTB2.0-based, and dialogue-based discourse relations. To ensure the reliability of our findings, we employ three tailored prompt templates for each task, including the zero-shot prompt template, zero-shot prompt engineering (PE) template, and in-context learning (ICL) prompt template, to establish the initial baseline scores for all popular sentence-pair relation classification tasks for the first time. Through our study, we discover that ChatGPT exhibits exceptional proficiency in detecting and reasoning about causal relations, albeit it may not possess the same level of expertise in identifying the temporal order between two events. While it is capable of identifying the majority of discourse relations with existing explicit discourse connectives, the implicit discourse relation remains a formidable challenge. Concurrently, ChatGPT demonstrates subpar performance in the dialogue discourse parsing task that requires structural understanding in a dialogue before being aware of the discourse relation.

pdf bib abs
Backtracing: Retrieving the Cause of the Query
Rose Wang | Pawan Wirawarn | Omar Khattab | Noah Goodman | Dorottya Demszky

Many online content portals allow users to ask questions to supplement their understanding (e.g., of lectures). While information retrieval (IR) systems may provide answers for such user queries, they do not directly assist content creators—such as lecturers who want to improve their content—identify segments that caused a user to ask those questions.We introduce the task of backtracing, in which systems retrieve the text segment that most likely caused a user query.We formalize three real-world domains for which backtracing is important in improving content delivery and communication: understanding the cause of (a) student confusion in the Lecture domain, (b) reader curiosity in the News Article domain, and (c) user emotion in the Conversation domain.We evaluate the zero-shot performance of popular information retrieval methods and language modeling methods, including bi-encoder, re-ranking and likelihood-based methods and ChatGPT.While traditional IR systems retrieve semantically relevant information (e.g., details on “projection matrices” for a query “does projecting multiple times still lead to the same point?”), they often miss the causally relevant context (e.g., the lecturer states “projecting twice gets me the same answer as one projection”). Our results show that there is room for improvement on backtracing and it requires new retrieval approaches.We hope our benchmark serves to improve future retrieval systems for backtracing, spawning systems that refine content generation and identify linguistic triggers influencing user queries.

pdf bib abs
Unsupervised Multilingual Dense Retrieval via Generative Pseudo Labeling
Chao-Wei Huang | Chen-An Li | Tsu-Yuan Hsu | Chen-Yu Hsu | Yun-Nung Chen

Dense retrieval methods have demonstrated promising performance in multilingual information retrieval, where queries and documents can be in different languages. However, dense retrievers typically require a substantial amount of paired data, which poses even greater challenges in multilingual scenarios. This paper introduces UMR, an ̲Unsupervised ̲Multilingual dense ̲Retriever trained without any paired data. Our approach leverages the sequence likelihood estimation capabilities of multilingual language models to acquire pseudo labels for training dense retrievers. We propose a two-stage framework which iteratively improves the performance of multilingual dense retrievers. Experimental results on two benchmark datasets show that UMR outperforms supervised baselines, showcasing the potential of training multilingual retrievers without paired data, thereby enhancing their practicality. All of our source code, data, and models are available: https://github.com/MiuLab/UMR

pdf bib abs
Investigating grammatical abstraction in language models using few-shot learning of novel noun gender
Priyanka Sukumaran | Conor Houghton | Nina Kazanina

Humans can learn a new word and infer its grammatical properties from very few examples. They have an abstract notion of linguistic properties like grammatical gender and agreement rules that can be applied to novel syntactic contexts and words. Drawing inspiration from psycholinguistics, we conduct a noun learning experiment to assess whether an LSTM and a decoder-only transformer can achieve human-like abstraction of grammatical gender in French. Language models were tasked with learning the gender of a novel noun embedding from a few examples in one grammatical agreement context and predicting agreement in another, unseen context. We find that both language models effectively generalise novel noun gender from one to two learning examples and apply the learnt gender across agreement contexts, albeit with a bias for the masculine gender category. Importantly, the few-shot updates were only applied to the embedding layers, demonstrating that models encode sufficient gender information within the word-embedding space. While the generalisation behaviour of models suggests that they represent grammatical gender as an abstract category, like humans, further work is needed to explore the details of how exactly this is implemented. For a comparative perspective with human behaviour, we conducted an analogous one-shot novel noun gender learning experiment, which revealed that native French speakers, like language models, also exhibited a masculine gender bias and are not excellent one-shot learners either.

Data Augmentation (DA) is frequently used to provide additional training data without extra human annotation automatically.However, data augmentation may introduce noisy data that impairs training.To guarantee the quality of augmented data,existing methods either assume no noise exists in the augmented data and adopt consistency training or use simple heuristics such as training loss and diversity constraints to filter out “noisy” data.However, those filtered examples may still contain useful information, and dropping them completely causes a loss of supervision signals.In this paper, based on the assumption that the original dataset is cleaner than the augmented data, we propose an on-the-fly denoising technique for data augmentation that learns from soft augmented labels provided by an organic teacher model trained on the cleaner original data.To further prevent overfitting on noisy labels, a simple self-regularization module is applied to force the model prediction to be consistent across two distinct dropouts.Our method can be applied to general augmentation techniques and consistently improve the performance on both text classification and question-answering tasks.

This research explores strategies for steering the output of large language models (LLMs) towards specific styles, such as sentiment, emotion, or writing style, by adding style vectors to the activations of hidden layers during text generation. We show that style vectors can be simply computed from recorded layer activations for input texts in a specific style in contrast to more complex training-based approaches. Through a series of experiments, we demonstrate the effectiveness of activation engineering using such style vectors to influence the style of generated text in a nuanced and parameterisable way, distinguishing it from prompt engineering. The presented research constitutes a significant step towards developing more adaptive and effective AI-empowered interactive systems.

pdf bib abs
Consistent Joint Decision-Making with Heterogeneous Learning Models
Hossein Rajaby Faghihi | Parisa Kordjamshidi

This paper introduces a novel decision-making framework that promotes consistency among decisions made by diverse models while utilizing external knowledge. Leveraging the Integer Linear Programming(ILP) framework, we map predictions from various models into globally normalized and comparable values by incorporating information about decisions’ prior probability, confidence (uncertainty), and the models’ expected accuracy. Our empirical study demonstrates the superiority of our approach over conventional baselines on multiple datasets.

pdf bib abs
Quantifying Association Capabilities of Large Language Models and Its Implications on Privacy Leakage
Hanyin Shao | Jie Huang | Shen Zheng | Kevin Chang

The advancement of large language models (LLMs) brings notable improvements across various applications, while simultaneously raising concerns about potential private data exposure. One notable capability of LLMs is their ability to form associations between different pieces of information, but this raises concerns when it comes to personally identifiable information (PII). This paper delves into the association capabilities of language models, aiming to uncover the factors that influence their proficiency in associating information. Our study reveals that as models scale up, their capacity to associate entities/information intensifies, particularly when target pairs demonstrate shorter co-occurrence distances or higher co-occurrence frequencies. However, there is a distinct performance gap when associating commonsense knowledge versus PII, with the latter showing lower accuracy. Despite the proportion of accurately predicted PII being relatively small, LLMs still demonstrate the capability to predict specific instances of email addresses and phone numbers when provided with appropriate prompts. These findings underscore the potential risk to PII confidentiality posed by the evolving capabilities of LLMs, especially as they continue to expand in scale and power.

pdf bib abs
Probing Critical Learning Dynamics of PLMs for Hate Speech Detection
Sarah Masud | Mohammad Aflah Khan | Vikram Goyal | Md Shad Akhtar | Tanmoy Chakraborty

Despite the widespread adoption, there is a lack of research into how various critical aspects of pretrained language models (PLMs) affect their performance in hate speech detection. Through five research questions, our findings and recommendations lay the groundwork for empirically investigating different aspects of PLMs’ use in hate speech detection. We deep dive into comparing different pretrained models, evaluating their seed robustness, finetuning settings, and the impact of pretraining data collection time. Our analysis reveals early peaks for downstream tasks during pretraining, the limited benefit of employing a more recent pretraining corpus, and the significance of specific layers during finetuning. We further call into question the use of domain-specific models and highlight the need for dynamic datasets for benchmarking hate speech detection.

pdf bib abs
Embible: Reconstruction of Ancient Hebrew and Aramaic Texts Using Transformers
Niv Fono | Harel Moshayof | Eldar Karol | Itai Assraf | Mark Last

Hebrew and Aramaic inscriptions serve as an essential source of information on the ancient history of the Near East. Unfortunately, some parts of the inscribed texts become illegible over time. Special experts, called epigraphists, use time-consuming manual procedures to estimate the missing content. This problem can be considered an extended masked language modeling task, where the damaged content can comprise single characters, character n-grams (partial words), single complete words, and multi-word n-grams.This study is the first attempt to apply the masked language modeling approach to corrupted inscriptions in Hebrew and Aramaic languages, both using the Hebrew alphabet consisting mostly of consonant symbols. In our experiments, we evaluate several transformer-based models, which are fine-tuned on the Biblical texts and tested on three different percentages of randomly masked parts in the testing corpus. For any masking percentage, the highest text completion accuracy is obtained with a novel ensemble of word and character prediction models.

pdf bib abs
Stateful Memory-Augmented Transformers for Efficient Dialogue Modeling
Qingyang Wu | Zhou Yu

Transformer models have achieved great performance in dialogue generation tasks. However, their inability to process long dialogue history often leads to truncation of the context. To address this problem, we propose a novel memory-augmented transformer that is compatible with existing pre-trained encoder-decoder models and enables efficient preservation of the dialogue history information. The new model incorporates a separate memory module alongside the pre-trained transformer, which can effectively interchange information between the memory states and the current input context. We evaluate the efficiency of our model on three dialogue datasets and two language modeling datasets. Experimental results show that our method has achieved superior efficiency and performance compared to other pre-trained Transformer baselines.

In this study, we present an investigation into the anisotropy dynamics and intrinsic dimension of embeddings in transformer architectures, focusing on the dichotomy between encoders and decoders. Our findings reveal that the anisotropy profile in transformer decoders exhibits a distinct bell-shaped curve, with the highest anisotropy concentrations in the middle layers. This pattern diverges from the more uniformly distributed anisotropy observed in encoders. In addition, we found that the intrinsic dimension of embeddings increases in the initial phases of training, indicating an expansion into higher-dimensional space. This fact is then followed by a compression phase towards the end of training with dimensionality decrease, suggesting a refinement into more compact representations. Our results provide fresh insights to the understanding of encoders and decoders embedding properties.

Euphemisms are found across the world’s languages, making them a universal linguistic phenomenon. As such, euphemistic data may have useful properties for computational tasks across languages. In this study, we explore this premise by training a multilingual transformer model (XLM-RoBERTa) to disambiguate potentially euphemistic terms (PETs) in multilingual and cross-lingual settings. In line with current trends, we demonstrate that zero-shot learning across languages takes place. We also show cases where multilingual models perform better on the task compared to monolingual models by a statistically significant margin, indicating that multilingual data presents additional opportunities for models to learn about cross-lingual, computational properties of euphemisms. In a follow-up analysis, we focus on universal euphemistic “categories” such as death and bodily functions among others. We test to see whether cross-lingual data of the same domain is more important than within-language data of other domains to further understand the nature of the cross-lingual transfer.

pdf bib abs
PromptExplainer: Explaining Language Models through Prompt-based Learning
Zijian Feng | Hanzhang Zhou | Zixiao Zhu | Kezhi Mao

Pretrained language models have become workhorses for various natural language processing (NLP) tasks, sparking a growing demand for enhanced interpretability and transparency. However, prevailing explanation methods, such as attention-based and gradient-based strategies, largely rely on linear approximations, potentially causing inaccuracies such as accentuating irrelevant input tokens. To mitigate the issue, we develop PromptExplainer, a novel method for explaining language models through prompt-based learning. PromptExplainer aligns the explanation process with the masked language modeling (MLM) task of pretrained language models and leverages the prompt-based learning framework for explanation generation. It disentangles token representations into the explainable embedding space using the MLM head and extracts discriminative features with a verbalizer to generate class-dependent explanations. Extensive experiments demonstrate that PromptExplainer significantly outperforms state-of-the-art explanation methods.

pdf bib abs
Do-Not-Answer: Evaluating Safeguards in LLMs
Yuxia Wang | Haonan Li | Xudong Han | Preslav Nakov | Timothy Baldwin

With the rapid evolution of large language models (LLMs), new and hard-to-predict harmful capabilities are emerging. This requires developers to identify potential risks through the evaluation of “dangerous capabilities” in order to responsibly deploy LLMs. Here we aim to facilitate this process. In particular, we collect an open-source dataset to evaluate the safeguards in LLMs, to facilitate the deployment of safer open-source LLMs at a low cost. Our dataset is curated and filtered to consist only of instructions that responsible language models should not follow. We assess the responses of six popular LLMs to these instructions, and we find that simple BERT-style classifiers can achieve results that are comparable to GPT-4 on automatic safety evaluation. Our data and code are available at https://github.com/Libr-AI/do-not-answer

pdf bib abs
Do Language Models Know When They’re Hallucinating References?
Ayush Agrawal | Mirac Suzgun | Lester Mackey | Adam Kalai

State-of-the-art language models (LMs) are notoriously susceptible to generating hallucinated information. Such inaccurate outputs not only undermine the reliability of these models but also limit their use and raise serious concerns about misinformation and propaganda. In this work, we focus on hallucinated book and article references and present them as the “model organism” of language model hallucination research, due to their frequent and easy-to-discern nature. We posit that if a language model cites a particular reference in its output, then it should ideally possess sufficient information about its authors and content, among other relevant details. Using this basic insight, we illustrate that one can identify hallucinated references without ever consulting any external resources, by asking a set of direct or indirect queries to the language model about the references. These queries can be considered as “consistency checks.” Our findings highlight that while LMs, including GPT-4, often produce inconsistent author lists for hallucinated references, they also often accurately recall the authors of real references. In this sense, the LM can be said to “know” when it is hallucinating references. Furthermore, these findings show how hallucinated references can be dissected to shed light on their nature.

pdf bib abs
Bridging Cultural Nuances in Dialogue Agents through Cultural Value Surveys
Yong Cao | Min Chen | Daniel Hershcovich

The cultural landscape of interactions with dialogue agents is a compelling yet relatively unexplored territory. It’s clear that various sociocultural aspects—from communication styles and beliefs to shared metaphors and knowledge—profoundly impact these interactions. To delve deeper into this dynamic, we introduce cuDialog, a first-of-its-kind benchmark for dialogue generation with a cultural lens. We also develop baseline models capable of extracting cultural attributes from dialogue exchanges, with the goal of enhancing the predictive accuracy and quality of dialogue agents. To effectively co-learn cultural understanding and multi-turn dialogue predictions, we propose to incorporate cultural dimensions with dialogue encoding features. Our experimental findings highlight that incorporating cultural value surveys boosts alignment with references and cultural markers, demonstrating its considerable influence on personalization and dialogue quality. To facilitate further exploration in this exciting domain, we publish our benchmark publicly accessible at https://github.com/yongcaoplus/cuDialog.

pdf bib abs
CEO: Corpus-based Open-Domain Event Ontology Induction
Nan Xu | Hongming Zhang | Jianshu Chen

Existing event-centric NLP models often only apply to the pre-defined ontology, which significantly restricts their generalization capabilities.This paper presents CEO, a novel Corpus-based Event Ontology induction model to relax the restriction imposed by pre-defined event ontologies. Without direct supervision, CEO leverages distant supervision from available summary datasets to detect corpus-wise salient events and exploits external event knowledge to force events within a short distance to have close embeddings. Experiments on three popular event datasets show that the schema induced by CEO has better coverage and higher accuracy than previous methods. Moreover, CEO is the first event ontology induction model that can induce a hierarchical event ontology with meaningful names on eleven open-domain corpora, making the induced schema more trustworthy and easier to be further curated. We anonymously release our dataset, codes, and induced ontology.

pdf bib abs
Rethinking STS and NLI in Large Language Models
Yuxia Wang | Minghan Wang | Preslav Nakov

Recent years, have seen the rise of large language models (LLMs), where practitioners use task-specific prompts; this was shown to be effective for a variety of tasks. However, when applied to semantic textual similarity (STS) and natural language inference (NLI), the effectiveness of LLMs turns out to be limited by low-resource domain accuracy, model overconfidence, and difficulty to capture the disagreements between human judgements. With this in mind, here we try to rethink STS and NLI in the era of LLMs. We first evaluate the performance of STS and NLI in the clinical/biomedical domain, and then we assess LLMs’ predictive confidence and their capability of capturing collective human opinions. We find that these old problems are still to be properly addressed in the era of LLMs.

pdf bib abs
Learning High-Quality and General-Purpose Phrase Representations
Lihu Chen | Gael Varoquaux | Fabian Suchanek

Phrase representations play an important role in data science and natural language processing, benefiting various tasks like Entity Alignment, Record Linkage, Fuzzy Joins, and Paraphrase Classification.The current state-of-the-art method involves fine-tuning pre-trained language models for phrasal embeddings using contrastive learning. However, we have identified areas for improvement. First, these pre-trained models tend to be unnecessarily complex and require to be pre-trained on a corpus with context sentences.Second, leveraging the phrase type and morphology gives phrase representations that are both more precise and more flexible.We propose an improved framework to learn phrase representations in a context-free fashion.The framework employs phrase type classification as an auxiliary task and incorporates character-level information more effectively into the phrase representation.Furthermore, we design three granularities of data augmentation to increase the diversity of training samples.Our experiments across a wide range of tasks reveal that our approach generates superior phrase embeddings compared to previous methods while requiring a smaller model size.

pdf bib abs
Explaining Language Model Predictions with High-Impact Concepts
Ruochen Zhao | Tan Wang | Yongjie Wang | Shafiq Joty

To encourage fairness and transparency, there exists an urgent demand for deriving reliable explanations for large language models (LLMs). One promising solution is concept-based explanations, i.e., human-understandable concepts from internal representations. However, due to the compositional nature of languages, current methods mostly discover correlational explanations instead of causal features. Therefore, we propose a novel framework to provide impact-aware explanations for users to understand the LLM’s behavior, which are robust to feature changes and influential to the model’s predictions. Specifically, we extract predictive high-level features (concepts) from the model’s hidden layer activations. Then, we innovatively optimize for features whose existence causes the output predictions to change substantially. Extensive experiments on real and synthetic tasks demonstrate that our method achieves superior results on predictive impact, explainability, and faithfulness compared to the baselines, especially for LLMs.

pdf bib abs
Understanding and Mitigating Spurious Correlations in Text Classification with Neighborhood Analysis
Oscar Chew | Hsuan-Tien Lin | Kai-Wei Chang | Kuan-Hao Huang

Recent research has revealed that machine learning models have a tendency to leverage spurious correlations that exist in the training set but may not hold true in general circumstances. For instance, a sentiment classifier may erroneously learn that the token “performances” is commonly associated with positive movie reviews.Relying on these spurious correlations degrades the classifier’s performance when it deploys on out-of-distribution data.In this paper, we examine the implications of spurious correlations through a novel perspective called neighborhood analysis. The analysis uncovers how spurious correlations lead unrelated words to erroneously cluster together in the embedding space. Driven by the analysis, we design a metric to detect spurious tokens and also propose a family of regularization methods, NFL (doN’t Forget your Language) to mitigate spurious correlations in text classification.Experiments show that NFL can effectively prevent erroneous clusters and significantly improve the robustness of classifiers without auxiliary data. The code is publicly available at https://github.com/oscarchew/doNt-Forget-your-Language.

Factual consistency detection has gotten raised attention in the task of abstractive summarization. Many existing works rely on synthetic training data, which may not accurately reflect or match the inconsistencies produced by summarization models. In this paper, we first systematically analyze the shortcomings of the current methods in synthesizing inconsistent summaries. Current synthesis methods may fail to produce inconsistencies of coreference errors and discourse errors, per our quantitative and qualitative study. Then, employing the parameter-efficient finetuning (PEFT) technique, we discover that a competitive factual consistency detector can be achieved using thousands of real model-generated summaries with human annotations. Our study demonstrates the importance of real machine-generated texts with human annotation in NLG evaluation as our model outperforms the SOTA on the CoGenSumm, FactCC, Frank, and SummEval datasets.

pdf bib abs
IndiVec: An Exploration of Leveraging Large Language Models for Media Bias Detection with Fine-Grained Bias Indicators
Luyang Lin | Lingzhi Wang | Xiaoyan Zhao | Jing Li | Kam-Fai Wong

This study focuses on media bias detection, crucial in today’s era of influential social media platforms shaping individual attitudes and opinions. In contrast to prior work that primarily relies on training specific models tailored to particular datasets, resulting in limited adaptability and subpar performance on out-of-domain data, we introduce a general bias detection framework, IndiVec, built upon large language models. IndiVec begins by constructing a fine-grained media bias database, leveraging the robust instruction-following capabilities of large language models and vector database techniques. When confronted with new input for bias detection, our framework automatically selects the most relevant indicator from the vector database and employs majority voting to determine the input’s bias label. IndiVec excels compared to previous methods due to its adaptability (demonstrating consistent performance across diverse datasets from various sources) and explainability (providing explicit top-k indicators to interpret bias predictions). Experimental results on four political bias datasets highlight IndiVec’s significant superiority over baselines. Furthermore, additional experiments and analysis provide profound insights into the framework’s effectiveness.

Large Language Models (LLMs) excel in various Natural Language Processing (NLP) tasks, yet their evaluation, particularly in languages beyond the top 20, remains inadequate due to existing benchmarks and metrics limitations. Employing LLMs as evaluators to rank or score other models’ outputs emerges as a viable solution, addressing the constraints tied to human annotators and established benchmarks. In this study, we explore the potential of LLM-based evaluators in enhancing multilingual evaluation by calibrating them against 20K human judgments across three text-generation tasks, five metrics, and eight languages. Our analysis reveals a bias in LLM-based evaluators towards higher scores, underscoring the necessity of calibration with native speaker judgments, especially in low-resource and non-Latin script languages, to ensure accurate evaluation of LLM performance across diverse languages.

pdf bib abs
Computational Morphology and Lexicography Modeling of Modern Standard Arabic Nominals
Christian Khairallah | Reham Marzouk | Salam Khalifa | Mayar Nassar | Nizar Habash

Modern Standard Arabic (MSA) nominals present many morphological and lexical modeling challenges that have not been consistently addressed previously. This paper attempts to define the space of such challenges, and leverage a recently proposed morphological framework to build a comprehensive and extensible model for MSA nominals. Our model design addresses the nominals’ intricate morphotactics, as well as their paradigmatic irregularities. Our implementation showcases enhanced accuracy and consistency compared to a commonly used MSA morphological analyzer and generator. We make our models publicly available.

pdf bib abs
Relabeling Minimal Training Subset to Flip a Prediction
Jinghan Yang | Linjie Xu | Lequan Yu

When facing an unsatisfactory prediction from a machine learning model, users can be interested in investigating the underlying reasons and exploring the potential for reversing the outcome. We ask: To flip the prediction on a test point x_t, how to identify the smallest training subset 𝒮_t that we need to relabel?We propose an efficient algorithm to identify and relabel such a subset via an extended influence function for binary classification models with convex loss.We find that relabeling fewer than 2% of the training points can always flip a prediction.This mechanism can serve multiple purposes: (1) providing an approach to challenge a model prediction by altering training points; (2) evaluating model robustness with the cardinality of the subset (i.e., |𝒮_t|); we show that |𝒮_t| is highly related to the noise ratio in the training set and |𝒮_t| is correlated with but complementary to predicted probabilities; and (3) revealing training points lead to group attribution bias. To the best of our knowledge, we are the first to investigate identifying and relabeling the minimal training subset required to flip a given prediction.

pdf bib abs
Why Generate When You Can Discriminate? A Novel Technique for Text Classification using Language Models
Sachin Pawar | Nitin Ramrakhiyani | Anubhav Sinha | Manoj Apte | Girish Palshikar

In this paper, we propose a novel two-step technique for text classification using autoregressive Language Models (LM). In the first step, a set of perplexity and log-likelihood based numeric features are elicited from an LM for a text instance to be classified. Then, in the second step, a classifier based on these features is trained to predict the final label. The classifier used is usually a simple machine learning classifier like Support Vector Machine (SVM) or Logistic Regression (LR) and it is trained using a small set of training examples. We believe, our technique presents a whole new way of exploiting the available training instances, in addition to the existing ways like fine-tuning LMs or in-context learning. Our approach stands out by eliminating the need for parameter updates in LMs, as required in fine-tuning, and does not impose limitations on the number of training examples faced while building prompts for in-context learning. We evaluate our technique across 5 different datasets and compare with multiple competent baselines.

pdf bib abs
Autism Detection in Speech – A Survey
Nadine Probol | Margot Mieskes

There has been a range of studies of how autism is displayed in voice, speech, and language. We analyse studies from the biomedical, as well as the psychological domain, but also from the NLP domain in order to find linguistic, prosodic and acoustic cues. Our survey looks at all three domains. We define autism and which comorbidities might influence the correct detection of the disorder. We especially look at observations such as verbal and semantic fluency, prosodic features, but also disfluencies and speaking rate. We also show word-based approaches and describe machine learning and transformer-based approaches both on the audio data as well as the transcripts. Lastly, we conclude, while there already is a lot of research, female patients seem to be severely under-researched. Also, most NLP research focuses on traditional machine learning methods instead of transformers. Additionally, we were unable to find research combining both features from audio and transcripts.

pdf bib abs
Improving Multimodal Classification of Social Media Posts by Leveraging Image-Text Auxiliary Tasks
Danae Sanchez Villegas | Daniel Preotiuc-Pietro | Nikolaos Aletras

Effectively leveraging multimodal information from social media posts is essential to various downstream tasks such as sentiment analysis, sarcasm detection or hate speech classification. Jointly modeling text and images is challenging because cross-modal semantics might be hidden or the relation between image and text is weak. However, prior work on multimodal classification of social media posts has not yet addressed these challenges. In this work, we present an extensive study on the effectiveness of using two auxiliary losses jointly with the main task during fine-tuning multimodal models. First, Image-Text Contrastive (ITC) is designed to minimize the distance between image-text representations within a post, thereby effectively bridging the gap between posts where the image plays an important role in conveying the post’s meaning. Second, Image-Text Matching (ITM) enhances the model’s ability to understand the semantic relationship between images and text, thus improving its capacity to handle ambiguous or loosely related posts. We combine these objectives with five multimodal models, demonstrating consistent improvements of up to 2.6 F1 score across five diverse social media datasets. Our comprehensive analysis shows the specific scenarios where each auxiliary task is most effective.

pdf bib abs
What the Weight?! A Unified Framework for Zero-Shot Knowledge Composition
Carolin Holtermann | Markus Frohmann | Navid Rekabsaz | Anne Lauscher

The knowledge encapsulated in a model is the core factor determining its final performance on downstream tasks. Much research in NLP has focused on efficient methods for storing and adapting different types of knowledge, e.g., in dedicated modularized structures, and on how to effectively combine these, e.g., by learning additional parameters. However, given the many possible options, a thorough understanding of the mechanisms involved in these compositions is missing, and hence it remains unclear which strategies to utilize. To address this research gap, we propose a novel framework for zero-shot module composition, which encompasses existing and some novel variations for selecting, weighting, and combining parameter modules under a single unified notion. Focusing on the scenario of domain knowledge and adapter layers, our framework provides a systematic unification of concepts, allowing us to conduct the first comprehensive benchmarking study of various zero-shot knowledge composition strategies. In particular, we test two module combination methods and five selection and weighting strategies for their effectiveness and efficiency in an extensive experimental setup. Our results highlight the efficacy of ensembling but also hint at the power of simple though often-ignored weighting methods. Further in-depth analyses allow us to understand the role of weighting vs. top-k selection, and show that, to a certain extent, the performance of adapter composition can even be predicted.

pdf bib abs
IndiFoodVQA: Advancing Visual Question Answering and Reasoning with a Knowledge-Infused Synthetic Data Generation Pipeline
Pulkit Agarwal | Settaluri Sravanthi | Pushpak Bhattacharyya

Large Vision Language Models (VLMs) like GPT-4, LLaVA, and InstructBLIP exhibit extraordinary capabilities for both knowledge understanding and reasoning. However, the reasoning capabilities of such models on sophisticated problems that require external knowledge of a specific domain have not been assessed well, due to the unavailability of necessary datasets. In this work, we release a first-of-its-kind dataset called IndiFoodVQA with around 16.7k data samples, consisting of explicit knowledge-infused questions, answers, and reasons. We also release IndiFoodKG, a related Knowledge Graph (KG) with 79k triples. The data has been created with minimal human intervention via an automated pipeline based on InstructBlip and GPT-3.5. We also present a methodology to extract knowledge from the KG and use it to both answer and reason upon the questions. We employ different models to report baseline zero-shot and fine-tuned results. Fine-tuned VLMs on our data showed an improvement of ~25% over the corresponding base model, highlighting the fact that current VLMs need domain-specific fine-tuning to excel in specialized settings. Our findings reveal that (1) explicit knowledge infusion during question generation helps in making questions that have more grounded knowledge, and (2) proper knowledge retrieval can often lead to better-answering potential in such cases. The data and code is available at https://github.com/SLSravanthi/IndifoodVQA.

pdf bib abs
MAPLE: Micro Analysis of Pairwise Language Evolution for Few-Shot Claim Verification
Xia Zeng | Arkaitz Zubiaga

Claim verification is an essential step in the automated fact-checking pipeline which assesses the veracity of a claim against a piece of evidence. In this work, we explore the potential of few-shot claim verification, where only very limited data is available for supervision. We propose MAPLE (Micro Analysis of Pairwise Language Evolution), a pioneering approach that explores the alignment between a claim and its evidence with a small seq2seq model and a novel semantic measure. Its innovative utilization of micro language evolution path leverages unlabelled pairwise data to facilitate claim verification while imposing low demand on data annotations and computing resources. MAPLE demonstrates significant performance improvements over SOTA baselines SEED, PET and LLaMA 2 across three fact-checking datasets: FEVER, Climate FEVER, and SciFact. Data and code are available.

pdf bib abs
Leveraging Open Information Extraction for More Robust Domain Transfer of Event Trigger Detection
David Dukić | Kiril Gashteovski | Goran Glavaš | Jan Snajder

Event detection is a crucial information extraction task in many domains, such as Wikipedia or news. The task typically relies on trigger detection (TD) – identifying token spans in the text that evoke specific events. While the notion of triggers should ideally be universal across domains, domain transfer for TD from high- to low-resource domains results in significant performance drops. We address the problem of negative transfer in TD by coupling triggers between domains using subject-object relations obtained from a rule-based open information extraction (OIE) system. We demonstrate that OIE relations injected through multi-task training can act as mediators between triggers in different domains, enhancing zero- and few-shot TD domain transfer and reducing performance drops, in particular when transferring from a high-resource source domain (Wikipedia) to a low(er)-resource target domain (news). Additionally, we combine this improved transfer with masked language modeling on the target domain, observing further TD transfer gains. Finally, we demonstrate that the gains are robust to the choice of the OIE system.

pdf bib abs
Exploring efficient zero-shot synthetic dataset generation for Information Retrieval
Tiago Almeida | Sérgio Matos

The broad integration of neural retrieval models into Information Retrieval (IR) systems is significantly impeded by the high cost and laborious process associated with the manual labelling of training data. Similarly, synthetic training data generation, a potential workaround, often requires expensive computational resources due to the reliance on large language models. This work explored the potential of small language models for efficiently creating high-quality synthetic datasets to train neural retrieval models. We aim to identify an optimal method to generate synthetic datasets, enabling training neural reranking models in document collections where annotated data is unavailable. We introduce a novel methodology, grounded in the principles of information theory, to select the most appropriate documents to be used as context for question generation. Then, we employ a small language model for zero-shot conditional question generation, supplemented by a filtering mechanism to ensure the quality of generated questions. Extensive evaluation on five datasets unveils the potential of our approach, outperforming unsupervised retrieval methods such as BM25 and pretrained monoT5. Our findings indicate that an efficiently generated “silver-standard” dataset allows effective training of neural rerankers in unlabeled scenarios. To ensure reproducibility and facilitate wider application, we will release a code repository featuring an accessible API for zero-shot synthetic question generation.

pdf bib abs
Clustering-based Sampling for Few-Shot Cross-Domain Keyphrase Extraction
Prakamya Mishra | Lincy Pattanaik | Arunima Sundar | Nishant Yadav | Mayank Kulkarni

Keyphrase extraction is the task of identifying a set of keyphrases present in a document that captures its most salient topics. Scientific domain-specific pre-training has led to achieving state-of-the-art keyphrase extraction performance with a majority of benchmarks being within the domain. In this work, we explore how to effectively enable the cross-domain generalization capabilities of such models without requiring the same scale of data. We primarily focus on the few-shot setting in non-scientific domain datasets such as OpenKP from the Web domain & StackEx from the StackExchange forum. We propose to leverage topic information intrinsically available in the data, to build a novel clustering-based sampling approach that facilitates selecting a few samples to label from the target domain facilitating building robust and performant models. This approach leads to large gains in performance of up to 26.35 points in F1 when compared to selecting few-shot samples uniformly at random. We also explore the setting where we have access to labeled data from the model’s pretraining domain corpora and perform gradual training which involves slowly folding in target domain data to the source domain data. Here we demonstrate further improvements in the model performance by up to 12.76 F1 points.

pdf bib abs
Random Smooth-based Certified Defense against Text Adversarial Attack
Zeliang Zhang | Wei Yao | Susan Liang | Chenliang Xu

Certified defense methods have identified their effectiveness against textual adversarial examples, which train models on the worst-case text generated by substituting words in original texts with synonyms. However, due to the discrete word embedding representations, the large search space hinders the robust training efficiency, resulting in significant time consumption. To overcome this challenge, motivated by the observation that synonym embedding has a small distance, we propose to treat the word substitution as a continuous perturbation on the word embedding representation. The proposed method Text-RS applies random smooth techniques to approximate the word substitution operation, offering a computationally efficient solution that outperforms conventional discrete methods and improves the robustness in training. The evaluation results demonstrate its effectiveness in defending against multiple textual adversarial attacks.

pdf bib abs
Clarifying the Path to User Satisfaction: An Investigation into Clarification Usefulness
Hossein A. Rahmani | Xi Wang | Mohammad Aliannejadi | Mohammadmehdi Naghiaei | Emine Yilmaz

Clarifying questions are an integral component of modern information retrieval systems, directly impacting user satisfaction and overall system performance. Poorly formulated questions can lead to user frustration and confusion, negatively affecting the system’s performance. This research addresses the urgent need to identify and leverage key features that contribute to the classification of clarifying questions, enhancing user satisfaction. To gain deeper insights into how different features influence user satisfaction, we conduct a comprehensive analysis, considering a broad spectrum of lexical, semantic, and statistical features, such as question length and sentiment polarity. Our empirical results provide three main insights into the qualities of effective query clarification: (1) specific questions are more effective than generic ones; (2) the subjectivity and emotional tone of a question play a role; and (3) shorter and more ambiguous queries benefit significantly from clarification. Based on these insights, we implement feature-integrated user satisfaction prediction using various classifiers, both traditional and neural-based, including random forest, BERT, and large language models. Our experiments show a consistent and significant improvement, particularly in traditional classifiers, with a minimum performance boost of 45%. This study presents invaluable guidelines for refining the formulation of clarifying questions and enhancing both user satisfaction and system performance.

Cross-lingual transfer of language models trained on high-resource languages like English has been widely studied for many NLP tasks, but focus on conversational tasks has been rather limited. This is partly due to the high cost of obtaining non-English conversational data, which results in limited coverage. In this work, we introduce for cross-lingual alignment pretraining, a parallel and large-scale multilingual conversation dataset that we created by translating the English-only Schema-Guided Dialogue (SGD) dataset (Rastogi et al., 2020) into 105 other languages. XSGD contains about 330k utterances per language. To facilitate aligned cross-lingual representations, we develop an efficient prompt-tuning-based method for learning alignment prompts. We also investigate two different classifiers: NLI-based and vanilla classifiers, and test cross-lingual capability enabled by the aligned prompts. We evaluate our model’s cross-lingual generalization capabilities on two conversation tasks: slot-filling and intent classification. Our results demonstrate strong and efficient modeling ability of NLI-based classifiers and the large cross-lingual transfer improvements achieved by our aligned prompts, particularly in few-shot settings. We also conduct studies on large language models (LLMs) such as text-davinci-003 and ChatGPT in both zero- and few-shot settings. While LLMs exhibit impressive performance in English, their cross-lingual capabilities in other languages, particularly low-resource ones, are limited.

pdf bib abs
Correcting Language Model Outputs by Editing Salient Layers
Kshitij Mishra | Tamer Soliman | Anil Ramakrishna | Aram Galstyan | Anoop Kumar

Large language models can accumulate incorrect or outdated knowledge as the real world evolves. Compared to typical solutions such as retraining, retrieval augmented generation, model editing offers an effective yet low cost solution to address this issue. However, existing model editing algorithms employ manual selection of edit layers, which requires prior domain knowledge or expensive architecture-specific empirical layer selection methods, such as causal tracing. In this work, we propose SaLEM (Salient Layers Editing Model), an efficient solution for data driven layer selection for the model editing task. Our solution utilizes layer-wise saliency maps for layer selection, and matches the accuracy of prior approaches but with only 1/3 of their edits, enabling efficient updates to the parametric knowledge in large language models.

Many approaches to Natural Language Processing tasks often treat them as single-step problems, where an agent receives an instruction, executes it, and is evaluated based on the final outcome. However, language is inherently interactive, as evidenced by the back-and-forth nature of human conversations. In light of this, we posit that human-AI collaboration should also be interactive, with humans monitoring the work of AI agents and providing feedback that the agent can understand and utilize. Further, the AI agent should be able to detect when it needs additional information and proactively ask for help. Enabling this scenario would lead to more natural, efficient, and engaging human-AI collaboration.In this paper, we investigate these directions using the challenging task established by the IGLU competition, an interactive grounded language understanding task in a MineCraft-like world. We delve into multiple types of help players can give to the AI to guide it and analyze the impact of this help on behavior, resulting in performance improvements and an end-to-end interactive system.

pdf bib abs
Goodhart’s Law Applies to NLP’s Explanation Benchmarks
Jennifer Hsia | Danish Pruthi | Aarti Singh | Zachary Lipton

Despite the rising popularity of saliency-based explanations, the research community remains at an impasse, facing doubts concerning their purpose, efficacy, and tendency to contradict each other. Seeking to unite the community’s efforts around common goals, several recent works have proposed evaluation metrics. In this paper, we critically examine two sets of metrics: the ERASER metrics (comprehensiveness and sufficiency) and the EVAL-X metrics, focusing our inquiry on natural language processing. First, we show that we can inflate a model’s comprehensiveness and sufficiency scores dramatically without altering its predictions or explanations on in-distribution test inputs. Our strategy exploits the tendency for extracted explanations and their complements to be “out-of-support” relative to each other and in-distribution inputs. Next, we demonstrate that the EVAL-X metrics can be inflated arbitrarily by a simple method that encodes the label, even though EVAL-X is precisely motivated to address such exploits. Our results raise doubts about the ability of current metrics to guide explainability research, underscoring the need for a broader reassessment of what precisely these metrics are intended to capture.

pdf bib abs
Syllable-level lyrics generation from melody exploiting character-level language model
Zhe Zhang | Karol Lasocki | Yi Yu | Atsuhiro Takasu

The generation of lyrics tightly connected to accompanying melodies involves establishing a mapping between musical notes and syllables of lyrics. This process requires a deep understanding of music constraints and semantic patterns at syllable-level, word-level, and sentence-level semantic meanings. However, pre-trained language models specifically designed at the syllable level are publicly unavailable. To solve these challenging issues, we propose to exploit fine-tuning character-level language models for syllable-level lyrics generation from symbolic melody. In particular, our method aims to fine-tune a character-level pre-trained language model, allowing to incorporation of linguistic knowledge of the language model into the beam search process of a syllable-level Transformer generator network. Besides, by exploring ChatGPT-based evaluation of generated lyrics in addition to human subjective evaluation, we prove that our approach improves the coherence and correctness of generated lyrics, without the need to train expensive new language models.

Foundational large language models (LLMs) can be instruction-tuned to perform open-domain question answering, facilitating applications like chat assistants. While such efforts are often carried out in a single language, we empirically analyze cost-efficient strategies for multilingual scenarios. Our study employs the Alpaca dataset and machine translations of it to form multilingual data, which is then used to tune LLMs through either low-rank adaptation or full-parameter training. Under a controlled computation budget, comparisons show that multilingual tuning is on par or better than tuning a model for each language. Furthermore, multilingual tuning with downsampled data can be as powerful and more robust. Our findings serve as a guide for expanding language support through instruction tuning.

Large language models (LLMs) have demonstrated impressive performance on a number of natural language processing tasks, such as question answering and text summarization. However, their performance on sequence labeling tasks such as intent classification and slot filling (IC-SF), which is a central component in personal assistant systems, lags significantly behind discriminative models. Furthermore, there is a lack of substantive research on robustness of LLMs to various perturbations in the input prompts. The contributions of this paper are three-fold. First, we show that fine-tuning sufficiently large LLMs can produce IC-SF performance comparable to discriminative models. Next, we systematically analyze the performance deterioration of those fine-tuned models due to three distinct yet relevant types of input perturbations - oronyms, synonyms, and paraphrasing. Finally, we propose an efficient mitigation approach, Prompt Perturbation Consistency Learning (PPCL), which works by regularizing the divergence between losses from clean and perturbed samples. Our experiments show that PPCL can recover on an average 59% and 69% of the performance drop for IC and SF tasks, respectively. Furthermore, PPCL beats data augmentation approach while using ten times fewer augmented data samples.

pdf bib abs
Enhancing Society-Undermining Disinformation Detection through Fine-Grained Sentiment Analysis Pre-Finetuning
Tsung-Hsuan Pan | Chung-Chi Chen | Hen-Hsen Huang | Hsin-Hsi Chen

In the era of the digital world, while freedom of speech has been flourishing, it has also paved the way for disinformation, causing detrimental effects on society. Legal and ethical criteria are insufficient to address this concern, thus necessitating technological intervention. This paper presents a novel method leveraging pre-finetuning concept for efficient detection and removal of disinformation that may undermine society, as deemed by judicial entities. We argue the importance of detecting this type of disinformation and validate our approach with real-world data derived from court orders. Following a study that highlighted four areas of interest for rumor analysis, our research proposes the integration of a fine-grained sentiment analysis task in the pre-finetuning phase of language models, using the GoEmotions dataset. Our experiments validate the effectiveness of our approach in enhancing performance significantly. Furthermore, we explore the application of our approach across different languages using multilingual language models, showing promising results. To our knowledge, this is the first study that investigates the role of sentiment analysis pre-finetuning in disinformation detection.

Recent studies have revealed that language model distillation can become less effective when there is a significant capacity gap between the teacher and the student models. In order to bridge the gap, teacher assistant-based distillation has been introduced, in which the selection of the teacher assistant plays a crucial role in transferring knowledge from the teacher to the student. However, existing approaches for teacher assistant-based distillation require numerous trials to find the optimal teacher assistant.In this paper, we propose a novel approach called Minimal Distillation Schedule (MiniDisc), which enables the scheduling of an optimal teacher assistant in just one trial for extreme model compression (e.g, to 5% scale). In particular, we empirically show that the performance of the student is positively correlated with the scale-performance tradeoff of the teacher assistant. We then introduce a new 𝜆-tradeoff metric that quantifies the optimality of the teacher assistant without the need for trial distillation to the student. By employing a sandwich framework, MiniDisc can select the optimal teacher assistant with the best 𝜆-tradeoff.We extensively evaluate MiniDisc through a series of experiments on the GLUE benchmark. The results demonstrate that our approach achieved an improved efficiency compared to various state-of-the-art baselines. Furthermore, we showcase the scalability of MiniDisc by applying it to a language model with billions of parameters.

pdf bib abs
Event Semantic Classification in Context
Haoyu Wang | Hongming Zhang | Kaiqiang Song | Dong Yu | Dan Roth

In this work, we focus on a fundamental yet underexplored problem, event semantic classification in context, to help machines gain a deeper understanding of events. We classify events from six perspectives: modality, affirmation, specificity, telicity, durativity, and kinesis. These properties provide essential cues regarding the occurrence and grounding of events, changes of status that events can bring about, and the connection between events and time. To this end, this paper introduces a novel dataset collected for the semantic classification tasks and several effective models. By incorporating these event properties into downstream tasks, we demonstrate that understanding the fine-grained event semantics benefits downstream event understanding and reasoning via experiments on event extraction, temporal relation extraction, and subevent relation extraction.

pdf bib abs
Local and Global Contexts for Conversation
Zuoquan Lin | Xinyi Shen

The context in conversation is the dialog history crucial for multi-turn dialogue. Learning from the relevant contexts in dialog history for grounded conversation is a challenging problem. Local context is the most neighbor and more sensitive to the subsequent response, and global context is relevant to a whole conversation far beyond neighboring utterances. Currently, pretrained transformer models for conversation challenge capturing the correlation and connection between local and global contexts. We introduce a local and global conversation model (LGCM) for general-purpose conversation in open domain. It is a local-global hierarchical transformer model that excels at accurately discerning and assimilating the relevant contexts necessary for generating responses. It employs a local encoder to grasp the local context at the level of individual utterances and a global encoder to understand the broader context at the dialogue level. The seamless fusion of these locally and globally contextualized encodings ensures a comprehensive comprehension of the conversation. Experiments on popular datasets show that LGCM outperforms the existing conversation models on the performance of automatic metrics with significant margins.

pdf bib abs
Aspect-based Key Point Analysis for Quantitative Summarization of Reviews
An Tang | Xiuzhen Zhang | Minh Dinh

Key Point Analysis (KPA) is originally for summarizing arguments, where short sentences containing salient viewpoints are extracted as key points (KPs) and quantified for their prevalence as salience scores. Recently, KPA was applied to summarize reviews, but the study still relies on sentence-based KP extraction and matching, which leads to two issues: sentence-based extraction can result in KPs of overlapping opinions on the same aspects, and sentence-based matching of KP to review comment can be inaccurate, resulting in inaccurate salience scores. To address the above issues, in this paper, we propose Aspect-based Key Point Analysis (ABKPA), a novel framework for quantitative review summarization. Leveraging the readily available aspect-based sentiment analysis (ABSA) resources of reviews to automatically annotate silver labels for matching aspect-sentiment pairs, we propose a contrastive learning model to effectively match KPs to reviews and quantify KPs at the aspect level. Especially, the framework ensures extracting KP of distinct aspects and opinions, leading to more accurate opinion quantification. Experiments on five business categories of the popular Yelp review dataset show that ABKPA outperforms state-of-the-art baselines. Source code and data are available at: https://github.com/antangrocket1312/ABKPA

pdf bib abs
Improving Semantic Control in Discrete Latent Spaces with Transformer Quantized Variational Autoencoders
Yingji Zhang | Danilo Carvalho | Marco Valentino | Ian Pratt-Hartmann | Andre Freitas

Achieving precise semantic control over the latent spaces of Variational AutoEncoders (VAEs) holds significant value for downstream tasks in NLP as the underlying generative mechanisms could be better localised, explained and improved upon. Recent research, however, has struggled to achieve consistent results, primarily due to the inevitable loss of semantic information in the variational bottleneck and limited control over the decoding mechanism. To overcome these challenges, we investigate discrete latent spaces in Vector Quantized Variational AutoEncoder (VQVAE) to improve semantic control and generation in Transformer-based VAEs. In particular, We propose T5VQVAE, a novel model that leverages the controllability of VQVAE to guide the self-attention mechanism in T5, exploiting its full generalization capabilities. Experimental results indicate that T5VQVAE outperforms existing state-of-the-art VAE models, including Optimus, in terms of control and preservation of semantic information across different tasks such as auto-encoding of sentences and mathematical expressions, text transfer, and inference. Moreover, T5VQVAE exhibits improved reasoning capabilities, suggesting potential applications for downstream natural language and symbolic inference tasks.

pdf bib abs
High-quality Data-to-Text Generation for Severely Under-Resourced Languages with Out-of-the-box Large Language Models
Michela Lorandi | Anya Belz

The performance of NLP methods for severely under-resourced languages cannot currently hope to match the state of the art in NLP methods for well resourced languages. We explore the extent to which pretrained large language models (LLMs) can bridge this gap, via the example of data-to-text generation for Irish, Welsh, Breton and Maltese. We test LLMs on these under-resourced languages and English, in a range of scenarios. We find that LLMs easily set the state of the art for the under-resourced languages by substantial margins, as measured by both automatic and human evaluations. For all our languages, human evaluation shows on-a-par performance with humans for our best systems, but BLEU scores collapse compared to English, casting doubt on the metric’s suitability for evaluating non-task-specific systems. Overall, our results demonstrate the great potential of LLMs to bridge the performance gap for under-resourced languages.

pdf bib abs
Antonym vs Synonym Distinction using InterlaCed Encoder NETworks (ICE-NET)
Muhammad Ali | Yan Hu | Jianbin Qin | Di Wang

Antonyms vs synonyms distinction is a core challenge in lexico-semantic analysis and automated lexical resource construction. These pairs share a similar distributional context which makes it harder to distinguish them. Leading research in this regard attempts to capture the properties of the relation pairs, i.e., symmetry, transitivity, and trans-transitivity. However, the inability of existing research to appropriately model the relation-specific properties limits their end performance. In this paper, we propose InterlaCed Encoder NETworks (i.e., ICE-NET) for antonym vs synonym distinction, that aim to capture and model the relation-specific properties of the antonyms and synonyms pairs in order to perform the classification task in a performance-enhanced manner. Experimental evaluation using the benchmark datasets shows that ICE-NET outperforms the existing research by a relative score of upto 1.8% in F1-measure.

Fine-tuning and testing a multilingual large language model is a challenge for low-resource languages (LRLs) since it is an expensive process. While previous studies have predicted the performance of natural language processing (NLP) tasks using machine learning methods, they primarily focus on high-resource languages, overlooking LRLs and shifts across domains. Focusing on LRLs, we investigate three factors (the size of the fine-tuning corpus, domain similarity between fine-tuning and testing corpora, and language similarity between source and target languages), which can potentially impact the model performance by using classical regression models. Our results indicate that domain similarity has the most important impact on predicting the performance of Machine Translation models.

Large-scale neural network models combining text and images have made incredible progress in recent years. However, it remains an open question to what extent such models encode compositional representations of the concepts over which they operate, such as correctly identifying ‘red cube’ by reasoning over the constituents ‘red’ and ‘cube’. In this work, we focus on the ability of a large pretrained vision and language model (CLIP) to encode compositional concepts and to bind variables in a structure-sensitive way (e.g., differentiating ‘cube behind sphere’ from ‘sphere behind cube’). To inspect the performance of CLIP, we compare several architectures from research on compositional distributional semantics models (CDSMs), a line of research that attempts to implement traditional compositional linguistic structures within embedding spaces. We benchmark them on three synthetic datasets – single-object, two-object, and relational – designed to test concept binding. We find that CLIP can compose concepts in a single-object setting, but in situations where concept binding is needed, performance drops dramatically. At the same time, CDSMs also perform poorly, with best performance at chance level.

pdf bib abs
Code-Switching and Back-Transliteration Using a Bilingual Model
Daniel Weisberg Mitelman | Nachum Dershowitz | Kfir Bar

The challenges of automated transliteration and code-switching–detection in Judeo-Arabic texts are addressed. We introduce two novel machine-learning models, one focused on transliterating Judeo-Arabic into Arabic, and another aimed at identifying non-Arabic words, predominantly Hebrew and Aramaic. Unlike prior work, our models are based on a bilingual Arabic-Hebrew language model, providing a unique advantage in capturing shared linguistic nuances. Evaluation results show that our models outperform prior solutions for the same tasks. As a practical contribution, we present a comprehensive pipeline capable of taking Judeo-Arabic text, identifying non-Arabic words, and then transliterating the Arabic portions into Arabic script. This work not only advances the state of the art but also offers a valuable toolset for making Judeo-Arabic texts more accessible to a broader Arabic-speaking audience.

pdf bib abs
Tsetlin Machine Embedding: Representing Words Using Logical Expressions
Bimal Bhattarai | Ole-Christoffer Granmo | Lei Jiao | Rohan Yadav | Jivitesh Sharma

Embedding words in vector space is a fundamental first step in state-of-the-art natural language processing (NLP). Typical NLP solutions employ pre-defined vector representations to improve generalization by co-locating similar words in vector space. For instance, Word2Vec is a self-supervised predictive model that captures the context of words using a neural network. Similarly, GLoVe is a popular unsupervised model incorporating corpus-wide word co-occurrence statistics. Such word embedding has significantly boosted important NLP tasks, including sentiment analysis, document classification, and machine translation. However, the embeddings are dense floating-point vectors, making them expensive to compute and difficult to interpret. In this paper, we instead propose to represent the semantics of words with a few defining words that are related using propositional logic. To produce such logical embeddings, we introduce a Tsetlin Machine-based autoencoder that learns logical clauses self-supervised. The clauses consist of contextual words like black, cup, and hot to define other words like coffee, thus being human-understandable. We evaluate our embedding approach on several intrinsic and extrinsic benchmarks, outperforming GLoVe on six classification tasks. Furthermore, we investigate the interpretability of our embedding using the logical representations acquired during training. We also visualize word clusters in vector space, demonstrating how our logical embedding co-locate similar words.

pdf bib abs
Reading Between the Tweets: Deciphering Ideological Stances of Interconnected Mixed-Ideology Communities
Zihao He | Ashwin Rao | Siyi Guo | Negar Mokhberian | Kristina Lerman

Recent advances in NLP have improved our ability to understand the nuanced worldviews of online communities. Existing research focused on probing ideological stances treats liberals and conservatives as separate groups. However, this fails to account for the nuanced views of the organically formed online communities and the connections between them. In this paper, we study discussions of the 2020 U.S. election on Twitter to identify complex interacting communities. Capitalizing on this interconnectedness, we introduce a novel approach that harnesses message passing when finetuning language models (LMs) to probe the nuanced ideologies of these communities. By comparing the responses generated by LMs and real-world survey results, our method shows higher alignment than existing baselines, highlighting the potential of using LMs in revealing complex ideologies within and across interconnected mixed-ideology communities.

pdf bib abs
Unified Embeddings for Multimodal Retrieval via Frozen LLMs
Ziyang Wang | Heba Elfardy | Markus Dreyer | Kevin Small | Mohit Bansal

In this work, We present Unified Embeddings for Multimodal Retrieval (UniMuR), a simple but effective approach that embeds multimodal inputs and retrieves visual and textual outputs via frozen Large Language Models (LLMs). Specifically, UniMuR jointly retrieves multimodal outputs via a unified multimodal embedding and applies dual alignment training to account for both visual and textual semantics. Thus, unlike previous approaches, UniMuR significantly reduces LLM’s modality bias towards generating text-only outputs. Meanwhile, the proposed unified multimodal embedding mitigates the inconsistency between visual and textual outputs and provides coherent multimodal outputs. Furthermore, benefiting from the joint training of visual and textual semantics, UniMuR also achieves strong image/text retrieval ability. Compared to existing approaches, UniMuR achieves better zero-shot multimodal response retrieval performance on MMDialog, improving the overall R@1 by 6.5% while boosting the image retrieval rate and having better cross-modal consistency on multimodal outputs. UniMuR also achieves 2.4% and 3.9% improvement on context-based image retrieval tasks on MMDialog and VisDial respectively when compared to previous approaches, validating its generalization ability across multiple tasks.

pdf bib abs
Assessing the Portability of Parameter Matrices Trained by Parameter-Efficient Finetuning Methods
Mohammed Mohammed | Anya Belz

As the cost of training ever larger language models has grown, so has the interest in reusing previously learnt knowledge. Transfer learning methods have shown how reusing non-task-specific knowledge can help in subsequent task-specific learning.In this paper, we investigate the inverse: porting whole functional modules that encode task-specific knowledge from one model to another. We designed a study comprising 1,440 training/testing runs to test the portability of modules trained by parameter-efficient finetuning (PEFT) techniques, using sentiment analysis as an example task. We test portability in a wide range of scenarios, involving different PEFT techniques and different pretrained host models, among other dimensions. We compare the performance of ported modules with that of equivalent modules trained (i) from scratch, and (ii) from parameters sampled from the same distribution as the ported module.We find that the ported modules far outperform the two alternatives tested, but that there are interesting differences between the four PEFT techniques tested.We conclude that task-specific knowledge in the form of structurally modular sets of parameters as produced by PEFT techniques is highly portable, but that degree of success depends on type of PEFT and on differences between originating and receiving pretrained models.

pdf bib abs
Exploiting Class Probabilities for Black-box Sentence-level Attacks
Raha Moraffah | Huan Liu

Sentence-level attacks craft adversarial sentences that are synonymous with correctly-classified sentences but are misclassified by the text classifiers. Under the black-box setting, classifiers are only accessible through their feedback to queried inputs, which is predominately available in the form of class probabilities. Even though utilizing class probabilities results in stronger attacks, due to the challenges of using them for sentence-level attacks, existing attacks use either no feedback or only the class labels. Overcoming the challenges, we develop a novel algorithm that uses class probabilities for black-box sentence-level attacks, investigate the effectiveness of using class probabilities on the attack’s success, and examine the question if it is worthy or practical to use class probabilities by black-box sentence-level attacks. We conduct extensive evaluations of the proposed attack comparing with the baselines across various classifiers and benchmark datasets.

pdf bib abs
Learning Label Hierarchy with Supervised Contrastive Learning
Ruixue Lian | William Sethares | Junjie Hu

Supervised contrastive learning (SCL) frameworks treat each class as independent and thus consider all classes to be equally important. This neglects the common scenario in which label hierarchy exists, where fine-grained classes under the same category show more similarity than very different ones. This paper introduces a family of Label-Aware SCL methods (LA-SCL) that incorporates hierarchical information to SCL by leveraging similarities between classes, resulting in creating a more well-structured and discriminative feature space. This is achieved by first adjusting the distance between instances based on measures of the proximity of their classes with the scaled instance-instance-wise contrastive. An additional instance-center-wise contrastive is introduced to move within-class examples closer to their centers, which are represented by a set of learnable label parameters. The learned label parameters can be directly used as a nearest neighbor classifier without further finetuning. In this way, a better feature representation is generated with improvements of intra-cluster compactness and inter-cluster separation. Experiments on three datasets show that the proposed LA-SCL works well on text classification of distinguishing a single label among multi-labels, outperforming the baseline supervised approaches. Our code is publicly available ¹.

Current conversational AI systems based on large language models (LLMs) are known to generate unsafe responses agreeing to offensive user input or including toxic content. Previous research aimed to alleviate the toxicity by fine-tuning LLM with manually annotated safe dialogue histories. However, the dependency on additional tuning requires substantial costs. To remove the dependency, we propose GrounDial, where response safety is achieved by grounding responses to commonsense social rules without requiring fine-tuning. A hybrid approach of in-context learning and human-norm-guided decoding of GrounDial enables the response to be quantitatively and qualitatively safer even without additional data or tuning.

pdf bib abs
Trainable Hard Negative Examples in Contrastive Learning for Unsupervised Abstractive Summarization
Haojie Zhuang | Wei Emma Zhang | Chang Dong | Jian Yang | Quan Sheng

Contrastive learning has demonstrated promising results in unsupervised abstractive summarization. However, existing methods rely on manually crafted negative examples, demanding substantial human effort and domain knowledge. Moreover, these human-generated negative examples may be poor in quality and lack adaptability during model training. To address these issues, we propose a novel approach that learns trainable negative examples for contrastive learning in unsupervised abstractive summarization, which eliminates the need for manual negative example design. Our framework introduces an adversarial optimization process between a negative example network and a representation network (including the summarizer and encoders). The negative example network is trained to synthesize hard negative examples that are close to the positive examples, driving the representation network to improve the quality of the generated summaries. We evaluate our method on two benchmark datasets for unsupervised abstractive summarization and observe significant performance improvements compared to strong baseline models.

pdf bib abs
Low-Resource Counterspeech Generation for Indic Languages: The Case of Bengali and Hindi
Mithun Das | Saurabh Pandey | Shivansh Sethi | Punyajoy Saha | Animesh Mukherjee

With the rise of online abuse, the NLP community has begun investigating the use of neural architectures to generate counterspeech that can “counter” the vicious tone of such abusive speech and dilute/ameliorate their rippling effect over the social network. However, most of the efforts so far have been primarily focused on English. To bridge the gap for low-resource languages such as Bengali and Hindi, we create a benchmark dataset of 5,062 abusive speech/counterspeech pairs, of which 2,460 pairs are in Bengali, and 2,602 pairs are in Hindi. We implement several baseline models considering various interlingual transfer mechanisms with different configurations to generate suitable counterspeech to set up an effective benchmark. We observe that the monolingual setup yields the best performance. Further, using synthetic transfer, language models can generate counterspeech to some extent; specifically, we notice that transferability is better when languages belong to the same language family.

pdf bib abs
Teaching Probabilistic Logical Reasoning to Transformers
Aliakbar Nafar | K. Brent Venable | Parisa Kordjamshidi

In this paper, we evaluate the capability of transformer-based language models in making inferences over uncertain text that includes uncertain rules of reasoning. We cover both Pre-trained Language Models (PLMs) and generative Large Language Models (LLMs). Our evaluation results show that both generations of language models struggle with reasoning over uncertain text. We propose a novel end-to-end fine-tuning approach, Probabilistic Constraint Training (PCT), that utilizes probabilistic logical rules as constraints in the fine-tuning phase without relying on these rules in the inference stage. To assess the effectiveness of PCT, we utilize the related corpora and, additionally, create a new and more challenging benchmark that, unlike the previous ones, uses instance-specific rules. Our study demonstrates that PCT improves the transformer-based language model’s intrinsic reasoning and makes their probabilistic logical reasoning process more explicit and explainable. Furthermore, PCT equips these models to effectively handle novel situations, including higher reasoning depth, new domains, and complex probabilistic structures.

pdf bib abs
On Measuring Context Utilization in Document-Level MT Systems
Wafaa Mohammed | Vlad Niculae

Document-level translation models are usually evaluated using general metrics such as BLEU, which are not informative about the benefits of context. Current work on context-aware evaluation, such as contrastive methods, only measure translation accuracy on words that need context for disambiguation. Such measures cannot reveal whether the translation model uses the correct supporting context. We propose to complement accuracy-based evaluation with measures of context utilization. We find that perturbation-based analysis (comparing models’ performance when provided with correct versus random context) is an effective measure of overall context utilization. For a finer-grained phenomenon-specific evaluation, we propose to measure how much the supporting context contributes to handling context-dependent discourse phenomena. We show that automatically-annotated supporting context gives similar conclusions to human-annotated context and can be used as alternative for cases where human annotations are not available. Finally, we highlight the importance of using discourse-rich datasets when assessing context utilization.

pdf bib abs
Solving NLP Problems through Human-System Collaboration: A Discussion-based Approach
Masahiro Kaneko | Graham Neubig | Naoaki Okazaki

Humans work together to solve common problems by having discussions, explaining, and agreeing or disagreeing with each other.Similarly, if a system can have discussions with human partners when solving tasks, it has the potential to improve the system’s performance and reliability.In previous research on explainability, it has only been possible for systems to make predictions and for humans to ask questions about them, rather than having a mutual exchange of opinions.This research aims to create a dataset and computational framework for systems that discuss and refine their predictions through dialogue. Through experiments, we show that the proposed system can have beneficial discussions with humans, improving the accuracy by up to 25 points on a natural language inference task.

pdf bib abs
Autoregressive Score Generation for Multi-trait Essay Scoring
Heejin Do | Yunsu Kim | Gary Lee

Recently, encoder-only pre-trained models such as BERT have been successfully applied in automated essay scoring (AES) to predict a single overall score. However, studies have yet to explore these models in multi-trait AES, possibly due to the inefficiency of replicating BERT-based models for each trait. Breaking away from the existing sole use of *encoder*, we propose an autoregressive prediction of multi-trait scores (ArTS), incorporating a *decoding* process by leveraging the pre-trained T5. Unlike prior regression or classification methods, we redefine AES as a score-generation task, allowing a single model to predict multiple scores. During decoding, the subsequent trait prediction can benefit by conditioning on the preceding trait scores. Experimental results proved the efficacy of ArTS, showing over 5% average improvements in both prompts and traits.

pdf bib abs
CMA-R: Causal Mediation Analysis for Explaining Rumour Detection
Lin Tian | Xiuzhen Zhang | Jey Han Lau

We apply causal mediation analysis to explain the decision-making process of neural models for rumour detection on Twitter.Interventions at the input and network level reveal the causal impacts of tweets and words in the model output.We find that our approach CMA-R – Causal Mediation Analysis for Rumour detection – identifies salient tweets that explain model predictions and show strong agreement with human judgements for critical tweets determining the truthfulness of stories.CMA-R can further highlight causally impactful words in the salient tweets, providing another layer of interpretability and transparency into these blackbox rumour detection systems. Code is available at: https://github.com/ltian678/cma-r.

pdf bib abs
Morphology Aware Source Term Masking for Terminology-Constrained NMT
Ander Corral | Xabier Saralegi

Terminology-constrained NMT systems facilitate the forced translation of domain-specific vocabulary. A notable method in this context is the “copy-and-inflect” approach, which appends the target term lemmas of constraints to their corresponding source terms in the input sentence. In this work, we propose a novel adaptation of the “copy-and-inflect” method, referred to as “morph-masking”. Our method involves masking the source terms of the constraints from the input sentence while retaining essential grammatical information. Our approach is based on the hypothesis that “copy-and-inflect” systems have access to both source and target terms, allowing them to generate the correct surface form of the constraint by either translating the source term itself or properly inflecting the target term lemma. Through extensive validation of our method in two translation directions with different levels of source morphological complexity, Basque to Spanish and English to German, we have demonstrated that “morph-masking” is capable of providing a harder constraint signal, resulting in a notable improvement over the “copy-and-inflect” method (up to 38% in term accuracy), especially in challenging constraint scenarios.

pdf bib abs
Improving Backchannel Prediction Leveraging Sequential and Attentive Context Awareness
Yo-Han Park | Wencke Liermann | Yong-Seok Choi | Kong Joo Lee

Backchannels, which refer to short and often affirmative or empathetic responses from a listener during a conversation, play a crucial role in effective communication. In this paper, we introduce CABP(Context-Aware Backchannel Prediction), a sequential and attentive context approach aimed at enhancing backchannel prediction performance. Additionally, CABP leverages the pretrained wav2vec model for encoding audio signal. Experimental results show that CABP performs better than context-free models, with performance improvements of 1.3% and 1.8% in Korean and English datasets, respectively. Furthermore, when utilizing the pretrained wav2vec model, CABP consistently demonstrates the best performance, achieving performance improvements of 4.4% and 3.1% in Korean and English datasets.

pdf bib abs
SENSE-LM : A Synergy between a Language Model and Sensorimotor Representations for Auditory and Olfactory Information Extraction
Cédric Boscher | Christine Largeron | Véronique Eglin | Elöd Egyed-Zsigmond

The five human senses – vision, taste, smell, hearing, and touch – are key concepts that shape human perception of the world. The extraction of sensory references (i.e., expressions that evoke the presence of a sensory experience) in textual corpus is a challenge of high interest, with many applications in various areas. In this paper, we propose SENSE-LM, an information extraction system tailored for the discovery of sensory references in large collections of textual documents. Based on the novel idea of combining the strength of large language models and linguistic resources such as sensorimotor norms, it addresses the task of sensory information extraction at a coarse-grained (sentence binary classification) and fine-grained (sensory term extraction) level.Our evaluation of SENSE-LM for two sensory functions, Olfaction and Audition, and comparison with state-of-the-art methods emphasize a significant leap forward in automating these complex tasks.

pdf bib abs
Analyzing the Role of Part-of-Speech in Code-Switching: A Corpus-Based Study
Jie Chi | Peter Bell

Code-switching (CS) is a common linguistic phenomenon wherein speakers fluidly transition between languages in conversation. While the cognitive processes driving CS remain a complex domain, earlier investigations have shed light on its multifaceted triggers. This study delves into the influence of Part-of-Speech (POS) on the propensity of bilinguals to engage in CS, employing a comprehensive analysis of Spanish-English and Mandarin-English corpora. Compared with prior research, our findings not only affirm the existence of a statistically significant connection between POS and the likelihood of CS across language pairs, but notably find this relationship exhibits its maximum strength in proximity to CS instances, progressively diminishing as tokens distance themselves from these CS points.

pdf bib abs
In-Contextual Gender Bias Suppression for Large Language Models
Daisuke Oba | Masahiro Kaneko | Danushka Bollegala

Despite their impressive performance in a wide range of NLP tasks, Large Language Models (LLMs) have been reported to encode worrying-levels of gender biases. Prior work has proposed debiasing methods that require human labelled examples, data augmentation and fine-tuning of LLMs, which are computationally costly. Moreover, one might not even have access to the model parameters for performing debiasing such as in the case of closed LLMs such as GPT-4. To address this challenge, we propose bias suppression that prevents biased generations of LLMs by simply providing textual preambles constructed from manually designed templates and real-world statistics, without accessing to model parameters. We show that, using CrowsPairs dataset, our textual preambles covering counterfactual statements can suppress gender biases in English LLMs such as LLaMA2. Moreover, we find that gender-neutral descriptions of gender-biased objects can also suppress their gender biases. Moreover, we show that bias suppression has acceptable adverse effect on downstream task performance with HellaSwag and COPA.

pdf bib abs
Parameter-Efficient Fine-Tuning: Is There An Optimal Subset of Parameters to Tune?
Max Ploner | Alan Akbik

The ever-growing size of pretrained language models (PLM) presents a significant challenge for efficiently fine-tuning and deploying these models for diverse sets of tasks within memory-constrained environments.In light of this, recent research has illuminated the possibility of selectively updating only a small subset of a model’s parameters during the fine-tuning process.Since no new parameters or modules are added, these methods retain the inference speed of the original model and come at no additional computational cost. However, an open question pertains to which subset of parameters should best be tuned to maximize task performance and generalizability. To investigate, this paper presents comprehensive experiments covering a large spectrum of subset selection strategies. We comparatively evaluate their impact on model performance as well as the resulting model’s capability to generalize to different tasks.Surprisingly, we find that the gains achieved in performance by elaborate selection strategies are, at best, marginal when compared to the outcomes obtained by tuning a random selection of parameter subsets. Our experiments also indicate that selection-based tuning impairs generalizability to new tasks.

This article proposes a new family of LLM-based topic coherence metrics called Contextualized Topic Coherence (CTC) and inspired by standard human topic evaluation methods. CTC metrics simulate human-centered coherence evaluation while maintaining the efficiency of other automated methods. We compare the performance of our CTC metrics and five other baseline metrics on seven topic models and show that CTC metrics better reflect human judgment, particularly for topics extracted from short text collections by avoiding highly scored topics that are meaningless to humans.

pdf bib abs
ProMISe: A Proactive Multi-turn Dialogue Dataset for Information-seeking Intent Resolution
Yash Butala | Siddhant Garg | Pratyay Banerjee | Amita Misra

Users of AI-based virtual assistants and search systems encounter challenges in articulating their intents while seeking information on unfamiliar topics, possibly due to complexity of the user’s intent or the lack of meta-information on the topic. We posit that an iterative suggested question-answering (SQA) conversation can improve the trade-off between the satisfaction of the user’s intent while keeping the information exchange natural and cognitive load of the interaction minimal on the users. In this paper, we evaluate a novel setting ProMISe by means of a sequence of interactions between a user, having a predefined information-seeking intent, and an agent that generates a set of SQA pairs at each step to aid the user to get closer to their intent. We simulate this two-player setting to create a multi-turn conversational dataset of SQAs and user choices (1025 dialogues comprising 4453 turns and 17812 SQAs) using human-feedback, chain-of-thought prompting and web-retrieval augmented large language models. We evaluate the quality of the SQs in the dataset on attributes such as diversity, specificity, grounding, etc, and benchmark the performance of different language models for the task of replicating user behavior.

pdf bib abs
CODET: A Benchmark for Contrastive Dialectal Evaluation of Machine Translation
Md Mahfuz Ibn Alam | Sina Ahmadi | Antonios Anastasopoulos

Neural machine translation (NMT) systems exhibit limited robustness in handling source-side linguistic variations. Their performance tends to degrade when faced with even slight deviations in language usage, such as different domains or variations introduced by second-language speakers. It is intuitive to extend this observation to encompass dialectal variations as well, but the work allowing the community to evaluate MT systems on this dimension is limited. To alleviate this issue, we compile and release CODET, a contrastive dialectal benchmark encompassing 891 different variations from twelve different languages. We also quantitatively demonstrate the challenges large MT models face in effectively translating dialectal variants. All the data and code have been released.

pdf bib abs
QAEVENT: Event Extraction as Question-Answer Pairs Generation
Milind Choudhary | Xinya Du

We propose a novel representation of document-level events as question and answer pairs (QAEVENT). Under this paradigm: (1) questions themselves can define argument roles without the need for predefined schemas, which will cover a comprehensive list of event arguments from the document; (2) it allows for more scalable and faster annotations from crowdworkers without linguistic expertise. Based on our new paradigm, we collect a novel and wide-coverage dataset. Our examinations show that annotations with the QA representations produce high-quality data for document-level event extraction, both in terms of human agreement level and high coverage of roles comparing to the pre-defined schema. We present and compare representative approaches for generating event question answer pairs on our benchmark.

pdf bib abs
Sequence Shortening for Context-Aware Machine Translation
Paweł Maka | Yusuf Semerci | Jan Scholtes | Gerasimos Spanakis

Context-aware Machine Translation aims to improve translations of sentences by incorporating surrounding sentences as context. Towards this task, two main architectures have been applied, namely single-encoder (based on concatenation) and multi-encoder models. In this study, we show that a special case of multi-encoder architecture, where the latent representation of the source sentence is cached and reused as the context in the next step, achieves higher accuracy on the contrastive datasets (where the models have to rank the correct translation among the provided sentences) and comparable BLEU and COMET scores as the single- and multi-encoder approaches. Furthermore, we investigate the application of Sequence Shortening to the cached representations. We test three pooling-based shortening techniques and introduce two novel methods - Latent Grouping and Latent Selecting, where the network learns to group tokens or selects the tokens to be cached as context. Our experiments show that the two methods achieve competitive BLEU and COMET scores and accuracies on the contrastive datasets to the other tested methods while potentially allowing for higher interpretability and reducing the growth of memory requirements with increased context size.

pdf bib abs
Jigsaw Pieces of Meaning: Modeling Discourse Coherence with Informed Negative Sample Synthesis
Shubhankar Singh

Coherence in discourse is fundamental for comprehension and perception. Much research on coherence modeling has focused on better model architectures and training setups optimizing on the permuted document task, where random permutations of a coherent document are considered incoherent. However, there’s very limited work on creating “informed” synthetic incoherent samples that better represent or mimic incoherence. We source a diverse positive corpus for local coherence and propose six rule-based methods leveraging information from Constituency trees, Part-of-speech, semantic overlap and more, for “informed” negative sample synthesis for better representation of incoherence. We keep a straightforward training setup for local coherence modeling by fine-tuning popular transformer models, and aggregate local scores for global coherence. We evaluate on a battery of independent downstream tasks to assess the impact of improved negative sample quality. We assert that a step towards optimality for coherence modeling requires better negative sample synthesis in tandem with model improvements.

pdf bib abs
Non-Exchangeable Conformal Language Generation with Nearest Neighbors
Dennis Ulmer | Chrysoula Zerva | Andre Martins

Quantifying uncertainty in automatically generated text is important for letting humans check potential hallucinations and making systems more reliable. Conformal prediction is an attractive framework to provide predictions imbued with statistical guarantees, however, its application to text generation is challenging since any i.i.d. assumptions are not realistic. In this paper, we bridge this gap by leveraging recent results on *non-exchangeable* conformal prediction, which still ensures bounds on coverage. The result, *non-exchangeable conformal nucleus sampling*, is a novel extension of the conformal prediction framework to generation based on nearest neighbors. Our method can be used post-hoc for an arbitrary model without extra training and supplies token-level, calibrated prediction sets equipped with statistical guarantees. Experiments in machine translation and language modeling show encouraging results in generation quality. By also producing tighter prediction sets with good coverage, we thus give a more theoretically principled way to perform sampling with conformal guarantees.

The long-standing goal of dense retrievers in abtractive open-domain question answering (ODQA) tasks is to learn to capture evidence passages among relevant passages for any given query, such that the reader produce factually correct outputs from evidence passages. One of the key challenge is the insufficient amount of training data with the supervision of the answerability of the passages. Recent studies rely on iterative pipelines to annotate answerability using signals from the reader, but their high computational costs hamper practical applications. In this paper, we instead focus on a data-driven approach and propose Evidentiality-Aware Dense Passage Retrieval (EADPR), which leverages synthetic distractor samples to learn to discriminate evidence passages from distractors. We conduct extensive experiments to validate the effectiveness of our proposed method on multiple abstractive ODQA tasks.

pdf bib abs
Self-training Strategies for Sentiment Analysis: An Empirical Study
Haochen Liu | Sai Rallabandi | Yijing Wu | Parag Dakle | Preethi Raghavan

Sentiment analysis is a crucial task in natural language processing that involves identifying and extracting subjective sentiment from text. Self-training has recently emerged as an economical and efficient technique for developing sentiment analysis models by leveraging a small amount of labeled data and a large amount of unlabeled data. However, given a set of training data, how to utilize them to conduct self-training makes a significant difference in the final performance of the model. We refer to this methodology as the self-training strategy. In this paper, we present an empirical study of various self-training strategies for sentiment analysis. First, we investigate the influence of the self-training strategy and hyper-parameters on the performance of traditional small language models (SLMs) in various few-shot settings. Second, we also explore the feasibility of leveraging large language models (LLMs) to help self-training. We propose and empirically compare several self-training strategies with the intervention of LLMs. Extensive experiments are conducted on three real-world sentiment analysis datasets.

pdf bib abs
Language is All a Graph Needs
Ruosong Ye | Caiqi Zhang | Runhui Wang | Shuyuan Xu | Yongfeng Zhang

The emergence of large-scale pre-trained language models has revolutionized various AI research domains. Transformers-based Large Language Models (LLMs) have gradually replaced CNNs and RNNs to unify fields of computer vision and natural language processing. Compared with independent data like images, videos or texts, graphs usually contain rich structural and relational information. Meanwhile, languages, especially natural language, being one of the most expressive mediums, excels in describing complex structures. However, existing work on incorporating graph problems into the generative language modeling framework remains very limited. Considering the rising prominence of LLMs, it becomes essential to explore whether LLMs can also replace GNNs as the foundation model for graphs. In this paper, we propose InstructGLM (Instruction-finetuned Graph Language Model) with highly scalable prompts based on natural language instructions. We use natural language to describe multi-scale geometric structure of the graph and then instruction finetune an LLM to perform graph tasks, which enables Generative Graph Learning. Our method surpasses all GNN baselines on ogbn-arxiv, Cora and PubMed datasets, underscoring its effectiveness and sheds light on generative LLMs as new foundation model for graph machine learning. Our code is available at https://github.com/agiresearch/InstructGLM.

pdf bib abs
Unraveling the Dynamics of Semi-Supervised Hate Speech Detection: The Impact of Unlabeled Data Characteristics and Pseudo-Labeling Strategies
Florian Ludwig | Klara Dolos | Ana Alves-Pinto | Torsten Zesch

Despite advances in machine learning based hate speech detection, the need for larges amounts of labeled training data for state-of-the-art approaches remains a challenge for their application. Semi-supervised learning addresses this problem by leveraging unlabeled data and thus reducing the amount of annotated data required. Underlying this approach is the assumption that labeled and unlabeled data follow similar distributions. This assumption however may not always hold, with consequences for real world applications. We address this problem by investigating the dynamics of pseudo-labeling, a commonly employed form of semi-supervised learning, in the context of hate speech detection. Concretely we analysed the influence of data characteristics and of two strategies for selecting pseudo-labeled samples: threshold- and ratio-based. The results show that the influence of data characteristics on the pseudo-labeling performances depends on other factors, such as pseudo-label selection strategies or model biases. Furthermore, the effectiveness of pseudo-labeling in classification performance is determined by the interaction between the number, hate ratio and accuracy of the selected pseudo-labels. Analysis of the results suggests an advantage of the threshold-based approach when labeled and unlabeled data arise from the same domain, whilst the ratio-based approach may be recommended in the opposite situation.

Using large language models (LMs) for query or document expansion can improve generalization in information retrieval. However, it is unknown whether these techniques are universally beneficial or only effective in specific settings, such as for particular retrieval models, dataset domains, or query types. To answer this, we conduct the first comprehensive analysis of LM-based expansion. We find that there exists a strong negative correlation between retriever performance and gains from expansion: expansion improves scores for weaker models, but generally harms stronger models. We show this trend holds across a set of eleven expansion techniques, twelve datasets with diverse distribution shifts, and twenty-four retrieval models. Through qualitative error analysis, we hypothesize that although expansions provide extra information (potentially improving recall), they add additional noise that makes it difficult to discern between the top relevant documents (thus introducing false positives). Our results suggest the following recipe: use expansions for weaker models or when the target dataset significantly differs from training corpus in format; otherwise, avoid expansions to keep the relevance signal clear.

Understanding context is key to understanding human language, an ability which Large Language Models (LLMs) have been increasingly seen to demonstrate to an impressive extent. However, though the evaluation of LLMs encompasses various domains within the realm of Natural Language Processing, limited attention has been paid to probing their linguistic capability of understanding contextual features. This paper introduces a context understanding benchmark by adapting existing datasets to suit the evaluation of generative models. This benchmark comprises of four distinct tasks and nine datasets, all featuring prompts designed to assess the models’ ability to understand context. First, we evaluate the performance of LLMs under the in-context learning pretraining scenario. Experimental results indicate that pre-trained dense models struggle with understanding more nuanced contextual features when compared to state-of-the-art fine-tuned models. Second, as LLM compression holds growing significance in both research and real-world applications, we assess the context understanding of quantized models under in-context-learning settings. We find that 3-bit post-training quantization leads to varying degrees of performance reduction on our benchmark. We conduct an extensive analysis of these scenarios to substantiate our experimental results.

Negotiation is a crucial ability in human communication. Recently, there has been a resurgent research interest in negotiation dialogue systems, whose goal is to create intelligent agents that can assist people in resolving conflicts or reaching agreements. Although there have been many explorations into negotiation dialogue systems, a systematic review of this task has not been performed to date. We aim to fill this gap by investigating recent studies in the field of negotiation dialogue systems, and covering benchmarks, evaluations and methodologies within the literature. We also discuss potential future directions, including multi-modal, multi-party and cross-cultural negotiation scenarios. Our goal is to provide the community with a systematic overview of negotiation dialogue systems and to inspire future research.

pdf bib abs
Towards Understanding Counseling Conversations: Domain Knowledge and Large Language Models
Younghun Lee | Dan Goldwasser | Laura Schwab Reese

Understanding the dynamics of counseling conversations is an important task, yet it is a challenging NLP problem regardless of the recent advance of Transformer-based pre-trained language models. This paper proposes a systematic approach to examine the efficacy of domain knowledge and large language models (LLMs) in better representing conversations between a crisis counselor and a help seeker. We empirically show that state-of-the-art language models such as Transformer-based models and GPT models fail to predict the conversation outcome. To provide richer context to conversations, we incorporate human-annotated domain knowledge and LLM-generated features; simple integration of domain knowledge and LLM features improves the model performance by approximately 15%. We argue that both domain knowledge and LLM-generated features can be exploited to better characterize counseling conversations when they are used as an additional context to conversations.

pdf bib abs
Better Explain Transformers by Illuminating Important Information
Linxin Song | Yan Cui | Ao Luo | Freddy Lecue | Irene Li

Transformer-based models excel in various natural language processing (NLP) tasks, attracting countless efforts to explain their inner workings. Prior methods explain Transformers by focusing on the raw gradient and attention as token attribution scores, where non-relevant information is often considered during explanation computation, resulting in confusing results. In this work, we propose highlighting the important information and eliminating irrelevant information by a refined information flow on top of the layer-wise relevance propagation (LRP) method. Specifically, we consider identifying syntactic and positional heads as important attention heads and focus on the relevance obtained from these important heads. Experimental results demonstrate that irrelevant information does distort output attribution scores and then should be masked during explanation computation. Compared to eight baselines on both classification and question-answering datasets, our method consistently outperforms with over 3% to 33% improvement on explanation metrics, providing superior explanation performance. Our anonymous code repository is available at: https://anonymous.4open.science/r/MLRP-E676/

pdf bib abs
Testing the Depth of ChatGPT’s Comprehension via Cross-Modal Tasks Based on ASCII-Art: GPT3.5’s Abilities in Regard to Recognizing and Generating ASCII-Art Are Not Totally Lacking
David Bayani

In the months since its release, ChatGPT and its underlying model, GPT3.5, have garnered massive attention, due to their potent mix of capability and accessibility. While a niche industry of papers have emerged examining the scope of capabilities these models possess, language — whether natural or stylized like code — has been the vehicle to exchange information with the network. Drawing inspiration from the multi-modal knowledge we’d expect an agent with true understanding to possess, we examine GPT3.5’s aptitude for visual tasks, where the inputs feature ASCII-art without overt distillation into a lingual summary. In particular, we scrutinize its performance on carefully designed image recognition and generation tasks. An extended version of this write-up is available at: https://arxiv.org/abs/2307.16806 .

pdf bib abs
Cross-lingual Editing in Multilingual Language Models
Himanshu Beniwal | Kowsik D | Mayank Singh

The training of large language models (LLMs) necessitates substantial data and computational resources, and updating outdated LLMs entails significant efforts and resources. While numerous model editing techniques (METs) have emerged to efficiently update model outputs without retraining, their effectiveness in multilingual LLMs, where knowledge is stored in diverse languages, remains an underexplored research area. This research paper introduces the cross-lingual model editing (XME) paradigm, wherein a fact is edited in one language, and the subsequent update propagation is observed across other languages. To investigate the XME paradigm, we conducted experiments using BLOOM, mBERT, and XLM-RoBERTa using the two writing scripts: Latin (English, French, and Spanish) and Indic (Hindi, Gujarati, and Bengali). The results reveal notable performance limitations of state-of-the-art METs under the XME setting, mainly when the languages involved belong to two distinct script families. These findings highlight the need for further research and development of XME techniques to address these challenges. For more comprehensive information, the dataset used in this research and the associated code are publicly available at the following [URL](https://github.com/lingo-iitgn/XME).

Large language models (LLMs) have revolutionized natural language processing (NLP) by excelling at understanding and generating human-like text. However, their widespread deployment can be prohibitively expensive. SortedNet is a recent training technique for enabling dynamic inference by leveraging the modularity in networks and sorting sub-models based on computation/accuracy in a nested manner. We extend SortedNet to generative NLP tasks, making large language models dynamic without any Pre-Training and by only replacing Standard Fine-Tuning (SFT) with Sorted Fine-Tuning (SoFT). Our approach boosts model efficiency, eliminating the need for multiple models for various scenarios during inference. We show that this approach can unlock the potential of intermediate layers of transformers in generating the target output. Our sub-models remain integral components of the original model, minimizing storage requirements and transition costs between different computational/latency budgets. The efficacy of our proposed method was demonstrated by applying it to tune LLaMA 2 13B on the Stanford Alpaca dataset for instruction following and TriviaQA for closed-book question answering. Our results show the superior performance of sub-models in comparison to Standard Fine-Tuning and SFT+ICT (Early-Exit), all achieved with very efficient tuning and without additional memory usage during inference.

pdf bib abs
AccentFold: A Journey through African Accents for Zero-Shot ASR Adaptation to Target Accents
Abraham Owodunni | Aditya Yadavalli | Chris Emezue | Tobi Olatunji | Clinton Mbataku

Despite advancements in speech recognition, accented speech remains challenging. While previous approaches have focused on modeling techniques or creating accented speech datasets, gathering sufficient data for the multitude of accents, particularly in the African context, remains impractical due to their sheer diversity and associated budget constraints. To address these challenges, we propose AccentFold, a method that exploits spatial relationships between learned accent embeddings to improve downstream Automatic Speech Recognition (ASR). Our exploratory analysis of speech embeddings representing 100+ African accents reveals interesting spatial accent relationships highlighting geographic and genealogical similarities, capturing consistent phonological, and morphological regularities, all learned empirically from speech. Furthermore, we discover accent relationships previously uncharacterized by the Ethnologue. Through empirical evaluation, we demonstrate the effectiveness of AccentFold by showing that, for out-of-distribution (OOD) accents, sampling accent subsets for training based on AccentFold information outperforms strong baselines a relative WER improvement of 4.6%. AccentFold presents a promising approach for improving ASR performance on accented speech, particularly in the context of African accents, where data scarcity and budget constraints pose significant challenges. Our findings emphasize the potential of leveraging linguistic relationships to improve zero-shot ASR adaptation to target accents.

pdf bib abs
Hierarchical and Dynamic Prompt Compression for Efficient Zero-shot API Usage
Yichen Jiang | Marco Vecchio | Mohit Bansal | Anders Johannsen

Long prompts present a significant challenge for practical LLM-based systems that need to operate with low latency and limited resources. We investigate prompt compression for zero-shot dialogue systems that learn to use unseen APIs directly in-context from their documentation, which may take up hundreds of prompt tokens per API. We start from a recently introduced approach (Mu et al., 2023) that learns to compress the prompt into a few “gist token” activations during finetuning. However, this simple idea is ineffective in compressing API documentation, resulting in low accuracy compared to the baseline using an uncompressed prompt. In this work, we introduce two major improvements. First, we specialize gist tokens for different hierarchies within an API: we use one Gist_arg token for compressing an argument and one Gist_value token for compressing an acceptable value of a categorical argument. We then dynamically reveal Gist_value tokens only when they are needed. Second, we add a reconstruction loss to predict the API documentation from the gist tokens. On multiple API-calling tasks, our proposed system keeps the simplicity, efficiency, and large compression factor (20x on SGD) of the gist token approach while achieving significantly better accuracy.

Contrastive language-image pre-training (CLIP) models have demonstrated considerable success across various vision-language tasks, such as text-to-image retrieval, where the model is required to effectively process natural language input to produce an accurate visual output. However, current models still face limitations in dealing with linguistic variations in input queries, such as paraphrases, making it challenging to handle a broad range of user queries in real-world applications. In this study, we introduce a straightforward fine-tuning approach to enhance the representations of CLIP models for paraphrases. Our approach involves a two-step paraphrase generation process, where we automatically create two categories of paraphrases from web-scale image captions by leveraging large language models. Subsequently, we fine-tune the CLIP text encoder using these generated paraphrases while freezing the image encoder. Our resulting model, which we call ParaCLIP, exhibits significant improvements over baseline CLIP models across various tasks, including paraphrased retrieval (with rank similarity scores improved by up to 7.6% and 9.6%), Visual Genome Relation and Attribution, as well as seven semantic textual similarity tasks.

Educational question-answer generation has been extensively researched owing to its practical applicability. However, we have identified a persistent challenge concerning the evaluation of such systems. Existing evaluation methods often fail to produce objective results and instead exhibit a bias towards favoring high similarity to the ground-truth question-answer pairs. In this study, we demonstrate that these evaluation methods yield low human alignment and propose an alternative approach called Generative Interpretation (GI) to achieve more objective evaluations. Through experimental analysis, we reveal that GI outperforms existing evaluation methods in terms of human alignment, and even shows comparable performance with GPT3.5, only with BART-large.

pdf bib abs
Dive into the Chasm: Probing the Gap between In- and Cross-Topic Generalization
Andreas Waldis | Yufang Hou | Iryna Gurevych

Pre-trained language models (PLMs) perform well in In-Topic setups, where training and testing data come from the same topics. However, they face challenges in Cross-Topic scenarios where testing data is derived from distinct topics. This paper analyzes various PLMs with three probing-based experiments to better understand the reasons behind such generalization gaps. For the first time, we demonstrate that the extent of these generalization gaps and the sensitivity to token-level interventions vary significantly across PLMs. By evaluating large language models (LLMs), we show the usefulness of our analysis for these recent models. Overall, we observe diverse pre-training objectives and architectural regularization contribute to more robust PLMs and mitigate generalization gaps. Our research contributes to a deeper understanding and comparison of language models across different generalization scenarios.

pdf bib abs
LLM-GEm: Large Language Model-Guided Prediction of People’s Empathy Levels towards Newspaper Article
Md Rakibul Hasan | Md Zakir Hossain | Tom Gedeon | Shafin Rahman

Empathy – encompassing the understanding and supporting others’ emotions and perspectives – strengthens various social interactions, including written communication in healthcare, education and journalism. Detecting empathy using AI models by relying on self-assessed ground truth through crowdsourcing is challenging due to the inherent noise in such annotations. To this end, we propose a novel system, named Large Language Model-Guided Empathy _(LLM-GEm)_ prediction system. It rectifies annotation errors based on our defined annotation selection threshold and makes the annotations reliable for conventional empathy prediction models, e.g., BERT-based pre-trained language models (PLMs). Previously, demographic information was often integrated numerically into empathy detection models. In contrast, our _LLM-GEm_ leverages GPT-3.5 LLM to convert numerical data into semantically meaningful textual sequences, enabling seamless integration into PLMs. We experiment with three _NewsEmpathy_ datasets involving people’s empathy levels towards newspaper articles and achieve state-of-the-art test performance using a RoBERTa-based PLM. Code and evaluations are publicly available at [https://github.com/hasan-rakibul/LLM-GEm](https://github.com/hasan-rakibul/LLM-GEm).

pdf bib abs
ICE-Score: Instructing Large Language Models to Evaluate Code
Terry Yue Zhuo

Recent advancements in the field of natural language generation have facilitated the use of large language models to assess the quality of generated text. Although these models have shown promising results in tasks such as machine translation and summarization, their applicability in code intelligence tasks remains limited without human involvement. The complexity of programming concepts required for such tasks makes it difficult to develop evaluation metrics that align with human judgment. Token-matching-based metrics, such as BLEU, have demonstrated weak correlations with human practitioners in code intelligence tasks. Moreover, utilizing human-written test suites to evaluate functional correctness can be challenging in domains with low resources. To overcome these obstacles, we propose ICE-Score, a new evaluation metric via instructing large language models (LLMs) for code assessments. Our metric addresses the limitations of existing approaches by achieving superior correlations with functional correctness and human preferences, without the need for test oracles or references. We evaluate the efficacy of our metric on two different aspects (human preference and execution success) and four programming languages. Our results demonstrate that our metric surpasses state-of-the-art metrics for code generation, delivering high levels of accuracy and consistency across various programming languages and tasks. We also make our evaluation metric and datasets available to the public, encouraging further research in evaluating code intelligence tasks.

Eligibility criteria (EC) refer to a set of conditions an individual must meet to participate in a clinical trial, defining the study population and minimizing potential risks to patients. Previous research in clinical trial design has been primarily focused on searching for similar trials and generating EC within manual instructions, employing similarity-based performance metrics, which may not fully reflect human judgment. In this study, we propose a novel task of recommending EC based on clinical trial information, including trial titles, and introduce an automatic evaluation framework to assess the clinical validity of the EC recommendation model. Our new approach, known as CReSE (Contrastive learning and Rephrasing-based and Clinical Relevance-preserving Sentence Embedding), represents EC through contrastive learning and rephrasing via large language models (LLMs). The CReSE model outperforms existing language models pre-trained on the biomedical domain in EC clustering. Additionally, we have curated a benchmark dataset comprising 3.2M high-quality EC-title pairs extracted from 270K clinical trials available on ClinicalTrials.gov. The EC recommendation models achieve commendable performance metrics, with 49.0% precision@1 and 44.2% MAP@5 on our evaluation framework. We expect that our evaluation framework built on the CReSE model will contribute significantly to the development and assessment of the EC recommendation models in terms of clinical validity.

pdf bib abs
BMX: Boosting Natural Language Generation Metrics with Explainability
Christoph Leiter | Hoa Nguyen | Steffen Eger

State-of-the-art natural language generation evaluation metrics are based on black-box language models. Hence, recent works consider their explainability with the goals of better understandability for humans and better metric analysis, including failure cases. In contrast, we explicitly leverage explanations to boost the metrics’ performance. In particular, we perceive feature importance explanations as word-level scores, which we convert, via power means, into a segment-level score. We then combine this segment-level score with the original metric to obtain a better metric. Our tests show improvements for multiple metrics across MT and summarization datasets. While improvements on machine translation are small, they are strong for summarization. Notably, BMX with the LIME explainer and preselected parameters achieves an average improvement of 0.087 points in Spearman correlation on the system-level evaluation of SummEval.

pdf bib abs
Joint Inference of Retrieval and Generation for Passage Re-ranking
Wei Fang | Yung-Sung Chuang | James Glass

Passage retrieval is a crucial component of modern open-domain question answering (QA) systems, providing information for downstream QA components to generate accurate and transparent answers. In this study we focus on passage re-ranking, proposing a simple yet effective method, Joint Passage Re-ranking (JPR), that optimizes the mutual information between query and passage distributions, integrating both cross-encoders and generative models in the re-ranking process. Experimental results demonstrate that JPR outperforms conventional re-rankers and language model scorers in both open-domain QA retrieval settings and diverse retrieval benchmarks under zero-shot settings.

Despite advancements in conversational AI, language models encounter challenges to handle diverse conversational tasks, and existing dialogue dataset collections often lack diversity and comprehensiveness. To tackle these issues, we introduce DialogStudio: the largest and most diverse collection of dialogue datasets, unified under a consistent format while preserving their original information. Our collection encompasses data from open-domain dialogues, task-oriented dialogues, natural language understanding, conversational recommendation, dialogue summarization, and knowledge-grounded dialogues, making it an incredibly rich and diverse resource for dialogue research and model training.To further enhance the utility of DialogStudio, we identify the licenses for each dataset, design external knowledge and domain-aware prompts for selected dialogues to facilitate instruction-aware fine-tuning. To improve transparency and support dataset and task-based research, as well as language model pre-training, all datasets, licenses, codes, and models associated with DialogStudio will be made publicly accessible.

Linguistic features have a strong contribution in the context of the automatic assessment of text readability (ARA). They have been one of the anchors between the computational and theoretical models. With the development in the ARA field, the research moved to Deep Learning (DL). In an attempt to reconcile the mixed results reported in this context, we present a systematic comparison of 6 hybrid approaches along with standard Machine Learning and DL approaches, on 4 corpora (different languages and target audiences). The various experiments clearly highlighted two rather simple hybridization methods (soft label and simple concatenation). They also appear to be the most robust on smaller datasets and across various tasks and languages. This study stands out as the first to systematically compare different architectures and approaches to feature hybridization in DL, as well as comparing performance in terms of two languages and two target audiences of the text, which leads to a clearer pattern of results.

pdf bib abs
Establishing degrees of closeness between audio recordings along different dimensions using large-scale cross-lingual models
Maxime Fily | Guillaume Wisniewski | Severine Guillaume | Gilles Adda | Alexis Michaud

In the highly constrained context of low-resource language studies, we explore vector representations of speech from a pretrained model to determine their level of abstraction with regard to the audio signal. We propose a new unsupervised method using ABX tests on audio recordings with carefully curated metadata to shed light on the type of information present in the representations. ABX tests determine whether the representations computed by a multilingual speech model encode a given characteristic. Three experiments are devised: one on room acoustics aspects, one on linguistic genre, and one on phonetic aspects. The results confirm that the representations extracted from recordings with different linguistic/extra-linguistic characteristics differ along the same lines. Embedding more audio signal in one vector better discriminates extra-linguistic characteristics, whereas shorter snippets are better to distinguish segmental information. The method is fully unsupervised, potentially opening new research avenues for comparative work on under-documented languages.

pdf bib abs
The Queen of England is not England’s Queen: On the Lack of Factual Coherency in PLMs
Paul Youssef | Jörg Schlötterer | Christin Seifert

Factual knowledge encoded in Pre-trained Language Models (PLMs) enriches their representations and justifies their use as knowledge bases. Previous work has focused on probing PLMs for factual knowledge by measuring how often they can correctly predict an _object_ entity given a subject and a relation, and improving fact retrieval by optimizing the prompts used for querying PLMs. In this work, we consider a complementary aspect, namely the coherency of factual knowledge in PLMs, i.e., how often can PLMs predict the _subject_ entity given its initial prediction of the object entity. This goes beyond evaluating how much PLMs know, and focuses on the internal state of knowledge inside them. Our results indicate that PLMs have low coherency using manually written, optimized and paraphrased prompts, but including an evidence paragraph leads to substantial improvement. This shows that PLMs fail to model inverse relations and need further enhancements to be able to handle retrieving facts from their parameters in a coherent manner, and to be considered as knowledge bases.

pdf bib abs
HierarchyNet: Learning to Summarize Source Code with Heterogeneous Representations
Minh Nguyen | Nghi Bui | Truong Son Hy | Long Tran-Thanh | Tien Nguyen

Code representation is important to machine learning models in the code-related applications. Existing code summarization approaches primarily leverage Abstract Syntax Trees (ASTs) and sequential information from source code to generate code summaries while often overlooking the critical consideration of the interplay of dependencies among code elements and code hierarchy. However, effective summarization necessitates a holistic analysis of code snippets from three distinct aspects: lexical, syntactic, and semantic information. In this paper, we propose a novel code summarization approach utilizing Heterogeneous Code Representations (HCRs) and our specially designed HierarchyNet. HCRs adeptly capture essential code features at lexical, syntactic, and semantic levels within a hierarchical structure. HierarchyNet processes each layer of the HCR separately, employing a Heterogeneous Graph Transformer, a Tree-based CNN, and a Transformer Encoder. In addition, HierarchyNet demonstrates superior performance compared to fine-tuned pre-trained models, including CodeT5, and CodeBERT, as well as large language models that employ zero/few-shot settings, such as CodeLlama, StarCoder, and CodeGen. Implementation details can be found at https://github.com/FSoft-AI4Code/HierarchyNet.

pdf bib abs
Understanding the effects of language-specific class imbalance in multilingual fine-tuning
Vincent Jung | Lonneke Plas

We study the effect of one type of imbalance often present in real-life multilingual classification datasets: an uneven distribution of labels across languages. We show evidence that fine-tuning a transformer-based Large Language Model (LLM) on a dataset with this imbalance leads to worse performance, a more pronounced separation of languages in the latent space, and the promotion of uninformative features. We modify the traditional class weighing approach to imbalance by calculating class weights separately for each language and show that this helps mitigate those detrimental effects. These results create awareness of the negative effects of language-specific class imbalance in multilingual fine-tuning and the way in which the model learns to rely on the separation of languages to perform the task.

Writing formulas on spreadsheets, such as Microsoft Excel and Google Sheets, is a widespread practice among users performing data analysis. However, crafting formulas on spreadsheets remains a tedious and error-prone task for many end-users, particularly when dealing with complex operations. To alleviate the burden associated with writing spreadsheet formulas, this paper introduces a novel benchmark task called NL2Formula, with the aim to generate executable formulas that are grounded on a spreadsheet table, given a Natural Language (NL) query as input. To accomplish this, we construct a comprehensive dataset consisting of 70,799 paired NL queries and corresponding spreadsheet formulas, covering 21,670 tables and 37 types of formula functions. We realize the NL2Formula task by providing a sequence-to-sequence baseline implementation called fCoder. Experimental results validate the effectiveness of fCoder, demonstrating its superior performance compared to the baseline models. Furthermore, we also compare fCoder with an initial GPT-3.5 model (i.e., text-davinci-003). Lastly, through in-depth error analysis, we identify potential challenges in the NL2Formula task and advocate for further investigation.