SIGDial Conference (2023)


pdf (full)
bib (full)
Proceedings of the 16th International Natural Language Generation Conference

pdf bib
Proceedings of the 16th International Natural Language Generation Conference
C. Maria Keet | Hung-Yi Lee | Sina Zarrieß

pdf bib
Guided Beam Search to Improve Generalization in Low-Resource Data-to-Text Generation
Nicolas Garneau | Luc Lamontagne

In this paper, we introduce a new beam search algorithm that improves the generalization of neural generators to unseen examples, especially in low-resource data-to-text settings. Our algorithm aims to reduce the number of omissions and hallucinations during the decoding process. For this purpose, it relies on two regression models to explicitly characterize factual errors. We explain how to create a new dataset to train these models given an original training set of less than a thousand data points. We apply our approach in the low-resource, legal setting using the French Plum2Text dataset, as well as in English using WebNLG. We observe in our experiment that this combination improves the faithfulness of pre-trained neural text generators using both human and automatic evaluation. Moreover, our approach offers a level of interpretability by predicting the number of omissions and hallucinations present in a given generation with respect to the input data. Finally, we visualize our algorithm’s exploration of the hypothesis space at different steps during the decoding process.
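
The core mechanism, rescoring partial hypotheses with error-predicting regressors, can be pictured with a minimal sketch (the regressor interfaces and weights below are assumptions for illustration, not the authors' implementation):

```python
# Illustrative sketch: re-ranking beam hypotheses with two hypothetical regressors
# that predict the number of omissions and hallucinations of a candidate text with
# respect to the input data. Model interfaces and weights are assumptions.

def guided_rescore(hypotheses, input_data, omission_model, hallucination_model,
                   alpha=1.0, beta=1.0):
    """Re-rank (text, log_prob) beam hypotheses by penalizing predicted errors."""
    rescored = []
    for text, log_prob in hypotheses:
        omissions = omission_model.predict(input_data, text)
        hallucinations = hallucination_model.predict(input_data, text)
        # Fewer predicted omissions/hallucinations -> higher adjusted score.
        score = log_prob - alpha * omissions - beta * hallucinations
        rescored.append((text, score))
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)
```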

pdf bib
XF2T: Cross-lingual Fact-to-Text Generation for Low-Resource Languages
Shivprasad Sagare | Tushar Abhishek | Bhavyajeet Singh | Anubhav Sharma | Manish Gupta | Vasudeva Varma

Multiple business scenarios require an automated generation of descriptive human-readable text from structured input data. This has resulted in substantial work on fact-to-text generation systems recently. Unfortunately, previous work on fact-to-text (F2T) generation has focused primarily on English, mainly due to the high availability of relevant datasets. Only recently, the problem of cross-lingual fact-to-text (XF2T) was proposed for generation across multiple languages, along with a dataset, XAlign, for eight languages. However, there has been no rigorous work on the actual XF2T generation problem. We extend the XAlign dataset with annotated data for four more languages: Punjabi, Malayalam, Assamese and Oriya. We conduct an extensive study using popular Transformer-based text generation models on our extended multi-lingual dataset, which we call XAlignV2. Further, we investigate the performance of different text generation strategies: multiple variations of pretraining, fact-aware embeddings and structure-aware input encoding. Our extensive experiments show that a multi-lingual mT5 model which uses fact-aware embeddings with structure-aware input encoding leads to the best results (30.90 BLEU, 55.12 METEOR and 59.17 chrF++) across the twelve languages. We make our code, dataset and model publicly available, and hope that this will help advance further research in this critical area.

pdf bib
Preventing Generation of Verbatim Memorization in Language Models Gives a False Sense of Privacy
Daphne Ippolito | Florian Tramer | Milad Nasr | Chiyuan Zhang | Matthew Jagielski | Katherine Lee | Christopher Choquette Choo | Nicholas Carlini

Studying data memorization in neural language models helps us understand the risks (e.g., to privacy or copyright) associated with models regurgitating training data and aids in the development of countermeasures. Many prior works—and some recently deployed defenses—focus on “verbatim memorization”, defined as a model generation that exactly matches a substring from the training set. We argue that verbatim memorization definitions are too restrictive and fail to capture more subtle forms of memorization. Specifically, we design and implement an efficient defense that _perfectly_ prevents all verbatim memorization. And yet, we demonstrate that this “perfect” filter does not prevent the leakage of training data. Indeed, it is easily circumvented by plausible and minimally modified “style-transfer” prompts—and in some cases even the non-modified original prompts—to extract memorized information. We conclude by discussing potential alternative definitions and why defining memorization is a difficult yet crucial open question for neural language models.
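
One common way to realize a verbatim-memorization filter of the kind analyzed here is an n-gram blocklist over the training set; the sketch below is a generic illustration under that assumption (a plain Python set stands in for a memory-efficient structure such as a Bloom filter), not the specific defense evaluated in the paper:

```python
# Generic sketch of an n-gram filter that blocks verbatim training-set continuations.
# A plain set stands in for a memory-efficient structure (e.g. a Bloom filter).

def build_ngram_index(training_texts, n=8):
    index = set()
    for text in training_texts:
        tokens = text.split()
        for i in range(len(tokens) - n + 1):
            index.add(tuple(tokens[i:i + n]))
    return index

def violates_filter(generated_tokens, index, n=8):
    """True if the last n generated tokens appear verbatim in the training data."""
    if len(generated_tokens) < n:
        return False
    return tuple(generated_tokens[-n:]) in index
```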

pdf bib
Fine-Tuning GPT-3 for Synthetic Danish News Generation
Mina Almasi | Anton Schiønning

While GPT-3 has garnered significant attention for its capabilities in natural language generation, research on its use outside of English is still relatively limited. We focus on how GPT-3 can be fine-tuned for generating synthetic news articles in a low-resource language, namely Danish. The model’s performance is evaluated on the dimensions of human and machine detection in two separate experiments. When presented with either a real or a GPT-3 generated news article, human participants achieve a 58.1% classification accuracy. In contrast, a fine-tuned BERT classifier obtains a 92.7% accuracy on the same task. This discrepancy likely pertains to the fine-tuned GPT-3 model oversampling high-likelihood tokens in its text generation. Although this is undetectable to the human eye, it leaves a statistical discrepancy for machine classifiers to detect. We address how decisions in the experimental design favoured the machine classifiers over the human evaluators, and whether the produced synthetic articles are applicable in a real-world context.

pdf bib
GAN-LM: Generative Adversarial Network using Language Models for Downstream Applications
Dae Yon Hwang | Yaroslav Nechaev | Cyprien de Lichy | Renxian Zhang

In this work, we investigate Data Augmentation methods to improve the performance of state-of-the-art models for four different downstream tasks. Specifically, we propose the Generative Adversarial Network using Language Models (GAN-LM) approach, which combines a deep generative model with a pre-trained language model to produce diverse augmentations. We compare GAN-LM to various conventional methods at the non-contextual and contextual levels on four public datasets: ZESHEL for zero-shot entity linking, TREC for question classification, STS-B for sentence-pair semantic textual similarity (STS), and mSTS for multilingual sentence-pair STS. Additionally, we subsample these datasets to study the impact of such augmentations in low-resource settings where limited amounts of training data are available. Compared to the state-of-the-art methods in downstream tasks, we mostly achieve the best performance using the GAN-LM approach. Finally, we investigate ways of combining GAN-LM with other augmentation methods to complement our proposed approach. The developed code for reproducibility is included in the supplementary material.

pdf bib
Summaries as Captions: Generating Figure Captions for Scientific Documents with Automated Text Summarization
Chieh-Yang Huang | Ting-Yao Hsu | Ryan Rossi | Ani Nenkova | Sungchul Kim | Gromit Yeuk-Yin Chan | Eunyee Koh | C Lee Giles | Ting-Hao Huang

Good figure captions help paper readers understand complex scientific figures. Unfortunately, even published papers often have poorly written captions. Automatic caption generation could aid paper writers by providing good starting captions that can be refined for better quality. Prior work often treated figure caption generation as a vision-to-language task. In this paper, we show that it can be more effectively tackled as a text summarization task in scientific documents. We fine-tuned PEGASUS, a pre-trained abstractive summarization model, to specifically summarize figure-referencing paragraphs (e.g., “Figure 3 shows...”) into figure captions. Experiments on large-scale arXiv figures show that our method outperforms prior vision methods in both automatic and human evaluations. We further conducted an in-depth investigation focused on two key challenges: (i) the common presence of low-quality author-written captions and (ii) the lack of clear standards for good captions. Our code and data are available at: https://github.com/Crowd-AI-Lab/Generating-Figure-Captions-as-a-Text-Summarization-Task.

pdf bib
Models of reference production: How do they withstand the test of time?
Fahime Same | Guanyi Chen | Kees van Deemter

In recent years, many NLP studies have focused solely on performance improvement. In this work, we focus on the linguistic and scientific aspects of NLP. We use the task of generating referring expressions in context (REG-in-context) as a case study and start our analysis from GREC, a comprehensive set of shared tasks in English that addressed this topic over a decade ago. We ask what the performance of models would be if we assessed them (1) on more realistic datasets, and (2) using more advanced methods. We test the models using different evaluation metrics and feature selection experiments. We conclude that GREC can no longer be regarded as offering a reliable assessment of models’ ability to mimic human reference production, because the results are highly impacted by the choice of corpus and evaluation metrics. Our results also suggest that pre-trained language models are less dependent on the choice of corpus than classic Machine Learning models, and therefore make more robust class predictions.

pdf bib
Generating Faithful Text From a Knowledge Graph with Noisy Reference Text
Tahsina Hashem | Weiqing Wang | Derry Tanti Wijaya | Mohammed Eunus Ali | Yuan-Fang Li

Knowledge Graph (KG)-to-Text generation aims at generating fluent natural-language text that accurately represents the information of a given knowledge graph. While significant progress has been made in this task by exploiting the power of pre-trained language models (PLMs) with appropriate graph structure-aware modules, existing models still fall short of generating faithful text, especially when the ground-truth natural-language text contains additional information that is not present in the graph. In this paper, we develop a KG-to-text generation model that can generate faithful natural-language text from a given graph, in the presence of noisy reference text. Our framework incorporates two core ideas: Firstly, we utilize contrastive learning to enhance the model’s ability to differentiate between faithful and hallucinated information in the text, thereby encouraging the decoder to generate text that aligns with the input graph. Secondly, we empower the decoder to control the level of hallucination in the generated text by employing a controllable text generation technique. We evaluate our model’s performance through the standard quantitative metrics as well as a ChatGPT-based quantitative and qualitative analysis. Our evaluation demonstrates the superior performance of our model over state-of-the-art KG-to-text models on faithfulness.

pdf bib
Entropy-based Sampling for Abstractive Multi-document Summarization in Low-resource Settings
Laura Mascarell | Ribin Chalumattu | Julien Heitmann

Research in Multi-document Summarization (MDS) mostly focuses on the English language and depends on large MDS datasets that are not available for other languages. Some of these approaches concatenate the source documents, resulting in overlong model inputs. Existing transformer architectures are unable to process such long inputs entirely, omitting documents in the summarization process. Other solutions address this issue by implementing multi-stage approaches that also require changes in the model architecture. In this paper, we introduce various sampling approaches based on information entropy that allow us to perform MDS in a single stage. These approaches also consider all source documents without using MDS training data or changing the model’s architecture. In addition, we build an MDS test set of German news articles to assess the performance of our methods on abstractive multi-document summaries. Experimental results show that our entropy-based approaches outperform the previous state of the art on German MDS, while still remaining primarily abstractive. We release our code and MDS test set to encourage further research in German abstractive MDS.
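
A rough sketch of what entropy-based sentence selection for MDS could look like is given below; the concrete entropy formulation and input-budget handling are assumptions for illustration, not the paper's exact method:

```python
import math
from collections import Counter

# Illustrative sketch: rank source sentences by word-level information entropy and
# keep the highest-entropy ones until the model's input budget is filled.

def sentence_entropy(sentence):
    tokens = sentence.lower().split()
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def select_sentences(documents, max_tokens=1024):
    sentences = [s for doc in documents for s in doc.split(". ") if s.strip()]
    ranked = sorted(sentences, key=sentence_entropy, reverse=True)
    selected, used = [], 0
    for sentence in ranked:
        length = len(sentence.split())
        if used + length > max_tokens:
            continue
        selected.append(sentence)
        used += length
    return selected
```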

pdf bib
Claim Optimization in Computational Argumentation
Gabriella Skitalinskaya | Maximilian Spliethöver | Henning Wachsmuth

An optimal delivery of arguments is key to persuasion in any debate, both for humans and for AI systems. This requires the use of clear and fluent claims relevant to the given debate. Prior work has studied the automatic assessment of argument quality extensively. Yet, no approach actually improves the quality so far. To fill this gap, this paper proposes the task of claim optimization: to rewrite argumentative claims in order to optimize their delivery. As multiple types of optimization are possible, we approach this task by first generating a diverse set of candidate claims using a large language model, such as BART, taking into account contextual information. Then, the best candidate is selected using various quality metrics. In automatic and human evaluation on an English-language corpus, our quality-based candidate selection outperforms several baselines, improving 60% of all claims (worsening only 16%). Follow-up analyses reveal that, beyond copy editing, our approach often specifies claims with details, whereas it adds less evidence than humans do. Moreover, its capabilities generalize well to other domains, such as instructional texts.

pdf bib
ChatGPT’s Information Seeking Strategy: Insights from the 20-Questions Game
Leonardo Bertolazzi | Davide Mazzaccara | Filippo Merlo | Raffaella Bernardi

Large Language Models, and ChatGPT in particular, have recently grabbed the attention of the community and the media. Now that these models have reached high language proficiency, attention has been shifting toward their reasoning capabilities. In this paper, our main aim is to evaluate ChatGPT’s question generation in a task where language production should be driven by an implicit reasoning process. To this end, we employ the 20-Questions game, traditionally used within the Cognitive Science community to inspect the development of information-seeking strategies. This task requires a series of interconnected skills: asking informative questions, stepwise updating the hypothesis space, and stopping asking questions when enough information has been collected. We build hierarchical hypothesis spaces, exploiting feature norms collected from humans vs. ChatGPT itself, and we inspect the efficiency and informativeness of ChatGPT’s strategy. Our results show that ChatGPT’s performance gets closer to an optimal agent only when prompted to explicitly list the updated space stepwise.

pdf bib
This is not correct! Negation-aware Evaluation of Language Generation Systems
Miriam Anschütz | Diego Miguel Lozano | Georg Groh

Large language models underestimate the impact of negations on how much they change the meaning of a sentence. Therefore, learned evaluation metrics based on these models are insensitive to negations. In this paper, we propose NegBLEURT, a negation-aware version of the BLEURT evaluation metric. For that, we designed a rule-based sentence negation tool and used it to create the CANNOT negation evaluation dataset. Based on this dataset, we fine-tuned a sentence transformer and an evaluation metric to improve their negation sensitivity. Evaluating these models on existing benchmarks shows that our fine-tuned models outperform existing metrics on the negated sentences by far while preserving their base models’ performances on other perturbations.
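
For illustration, a rule-based negation step of the kind used to build such a dataset might look like the toy sketch below; the rule inventory is an assumption, and the actual CANNOT tool covers far more constructions:

```python
# Toy sketch of rule-based sentence negation for building a negation evaluation set.
# Only a handful of auxiliary-verb rules are shown; real tools handle many more cases.

AUX_NEGATIONS = {
    " is ": " is not ",
    " are ": " are not ",
    " was ": " was not ",
    " were ": " were not ",
    " can ": " cannot ",
}

def negate(sentence):
    for aux, negated in AUX_NEGATIONS.items():
        if aux in sentence:
            return sentence.replace(aux, negated, 1)
    return None  # no rule applies; a real tool would cover more constructions
```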

pdf bib
Guidance in Radiology Report Summarization: An Empirical Evaluation and Error Analysis
Jan Trienes | Paul Youssef | Jörg Schlötterer | Christin Seifert

Automatically summarizing radiology reports into a concise impression can reduce the manual burden of clinicians and improve the consistency of reporting. Previous work aimed to enhance content selection and factuality through guided abstractive summarization. However, two key issues persist. First, current methods heavily rely on domain-specific resources to extract the guidance signal, limiting their transferability to domains and languages where those resources are unavailable. Second, while automatic metrics like ROUGE show progress, we lack a good understanding of the errors and failure modes in this task. To bridge these gaps, we first propose a domain-agnostic guidance signal in the form of variable-length extractive summaries. Our empirical results on two English benchmarks demonstrate that this guidance signal improves upon unguided summarization while being competitive with domain-specific methods. Additionally, we run an expert evaluation of four systems according to a taxonomy of 11 fine-grained errors. We find that the most pressing differences between automatic summaries and those of radiologists relate to content selection, including omissions (up to 52%) and additions (up to 57%). We hypothesize that latent reporting factors and corpus-level inconsistencies may limit models’ ability to reliably learn content selection from the available data, presenting promising directions for future work.

pdf bib
A Zero-Shot Approach for Multi-User Task-Oriented Dialog Generation
Shiv Surya | Yohan Jo | Arijit Biswas | Alexandros Potamianos

Prior art investigating task-oriented dialog and automatic generation of such dialogs has focused on single-user dialogs between a single user and an agent. However, there is limited study on adapting such AI agents to multi-user conversations (involving multiple users and an agent). Multi-user conversations are richer than single-user conversations, containing social banter and collaborative decision making. The most significant challenge impeding such studies is the lack of suitable multi-user task-oriented dialogs with annotations of user belief states and system actions. One potential solution is multi-user dialog generation from single-user data. Many single-user dialog datasets already contain dialog state information (intents, slots), thus making them suitable candidates. In this work, we propose a novel approach for expanding single-user task-oriented dialogs (e.g. MultiWOZ) to multi-user dialogs in a zero-shot setting.

pdf bib
Beyond the Bias: Unveiling the Quality of Implicit Causality Prompt Continuations in Language Models
Judith Sieker | Oliver Bott | Torgrim Solstad | Sina Zarrieß

Recent studies have used human continuations of Implicit Causality (IC) prompts collected in linguistic experiments to evaluate discourse understanding in large language models (LLMs), focusing on the well-known IC coreference bias in the LLMs’ predictions of the next word following the prompt. In this study, we investigate how continuations of IC prompts can be used to evaluate the text generation capabilities of LLMs in a linguistically controlled setting. We conduct an experiment using two open-source GPT-based models, employing human evaluation to assess different aspects of continuation quality. Our findings show that LLMs struggle in particular with generating coherent continuations in this rather simple setting, indicating a lack of discourse knowledge beyond the well-known IC bias. Our results also suggest that a bias-congruent continuation does not necessarily equate to a higher continuation quality. Furthermore, our study draws upon insights from the Uniform Information Density hypothesis, testing different prompt modifications and decoding procedures and showing that sampling-based methods are particularly sensitive to the information density of the prompts.

pdf bib
Enhancing factualness and controllability of Data-to-Text Generation via data Views and constraints
Craig Thomson | Clement Rebuffel | Ehud Reiter | Laure Soulier | Somayajulu Sripada | Patrick Gallinari

Neural data-to-text systems lack the control and factual accuracy required to generate useful and insightful summaries of multidimensional data. We propose a solution in the form of data views, where each view describes an entity and its attributes along specific dimensions. A sequence of views can then be used as a high-level schema for document planning, with the neural model handling the complexities of micro-planning and surface realization. We show that our view-based system retains factual accuracy while offering high-level control of output that can be tailored based on user preference or other norms within the domain.

pdf bib
Memories for Virtual AI Characters
Fabian Landwehr | Erika Varis Doggett | Romann M. Weber

In this paper, we present a system for augmenting virtual AI characters with long-term memory, enabling them to remember facts about themselves, their world, and past experiences. We propose a memory-creation pipeline that converts raw text into condensed memories and a memory-retrieval system that utilizes these memories to generate character responses. Using a fact-checking pipeline based on GPT-4, our evaluation demonstrates that the character responses are grounded in the retrieved memories and maintain factual accuracy. We discuss the implications of our system for creating engaging and consistent virtual characters and highlight areas for future research, including large language model (LLM) guardrailing and virtual character personality development.

pdf bib
Metric-Based In-context Learning: A Case Study in Text Simplification
Subhadra Vadlamannati | Gözde Şahin

In-context learning (ICL) for large language models has proven to be a powerful approach for many natural language processing tasks. However, determining the best method to select examples for ICL is nontrivial as the results can vary greatly depending on the quality, quantity, and order of examples used. In this paper, we conduct a case study on text simplification (TS) to investigate how to select the best and most robust examples for ICL. We propose the Metric-Based In-context Learning (MBL) method, which utilizes commonly used TS metrics such as SARI, compression ratio, and BERT-Precision for selection. Through an extensive set of experiments with various-sized GPT models on standard TS benchmarks such as TurkCorpus and ASSET, we show that examples selected by the top SARI scores perform the best on larger models such as GPT-175B, while the compression ratio generally performs better on smaller models such as GPT-13B and GPT-6.7B. Furthermore, we demonstrate that MBL is generally robust to example orderings and out-of-domain test sets, and outperforms strong baselines and state-of-the-art finetuned language models. Finally, we show that the behavior of large GPT models can be implicitly controlled by the chosen metric. Our research provides a new framework for selecting examples in ICL, and demonstrates its effectiveness in text simplification tasks, breaking new ground for more accurate and efficient NLG systems.
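
A minimal sketch of metric-based example selection follows; the scoring function and prompt format are illustrative assumptions, not the paper's exact setup:

```python
# Illustrative sketch of metric-based example selection for in-context learning.
# `score_fn` stands in for a TS metric such as SARI or compression ratio computed
# on each candidate (source, simplification) pair.

def compression_ratio(source, simplification):
    return len(simplification.split()) / max(len(source.split()), 1)

def select_icl_examples(candidates, score_fn, k=4):
    """Pick the k highest-scoring (source, simplification) pairs as ICL examples."""
    ranked = sorted(candidates, key=lambda pair: score_fn(*pair), reverse=True)
    return ranked[:k]

def build_prompt(examples, test_source):
    lines = [f"Complex: {src}\nSimple: {tgt}" for src, tgt in examples]
    lines.append(f"Complex: {test_source}\nSimple:")
    return "\n\n".join(lines)
```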

pdf bib
Exploring the Naturalness of Cognitive Status-Informed Referring Form Selection Models
Gabriel Del Castillo | Grace Clark | Zhao Han | Tom Williams

Language-capable robots must be able to efficiently and naturally communicate about objects in the environment. A key part of communication is Referring Form Selection (RFS): the process of selecting a form like it, that, or the N to use when referring to an object. Recent cognitive status-informed computational RFS models have been evaluated in terms of goodness-of-fit to human data. But it is as yet unclear whether these models actually select referring forms that are any more natural than baseline alternatives, regardless of goodness-of-fit. Through a human subject study designed to assess this question, we show that even though cognitive status-informed referring form selection models achieve a good fit to human data, they do not (yet) produce concrete benefits in terms of naturality. On the other hand, our results show that human utterances also had high variability in perceived naturality, demonstrating the challenges of evaluating RFS naturality.

pdf bib
System-Initiated Transitions from Chit-Chat to Task-Oriented Dialogues with Transition Info Extractor and Transition Sentence Generator
Ye Liu | Stefan Ultes | Wolfgang Minker | Wolfgang Maier

In this work, we study dialogue scenarios that start from chit-chat but eventually switch to task-related services, and investigate how a unified dialogue model, which can engage in both chit-chat and task-oriented dialogues, takes the initiative during the dialogue mode transition from chit-chat to task-oriented in a coherent and cooperative manner. We first build a transition info extractor (TIE) that keeps track of the preceding chit-chat interaction and detects the potential user intention to switch to a task-oriented service. Meanwhile, in the unified model, a transition sentence generator (TSG) is extended through efficient Adapter tuning and transition prompt learning. When the TIE successfully finds task-related information in the preceding chit-chat, such as a transition domain (e.g., “train”), the TSG is activated automatically in the unified model to initiate this transition by generating a transition sentence under the guidance of the transition information extracted by the TIE. The experimental results show promising performance regarding the proactive transitions. We achieve an additional large improvement on the TIE model by utilizing Conditional Random Fields (CRF). The TSG can flexibly generate transition sentences while maintaining the unified capabilities of normal chit-chat and task-oriented response generation.

pdf bib
HL Dataset: Visually-grounded Description of Scenes, Actions and Rationales
Michele Cafagna | Kees van Deemter | Albert Gatt

Current captioning datasets focus on object-centric captions, describing the visible objects in the image, often ending up stating the obvious (for humans), e.g. “people eating food in a park”. Although these datasets are useful to evaluate the ability of Vision & Language models to recognize and describe visual content, they do not support controlled experiments involving model testing or fine-tuning, with more high-level captions, which humans find easy and natural to produce. For example, people often describe images based on the type of scene they depict (“people at a holiday resort”) and the actions they perform (“people having a picnic”). Such concepts are based on personal experience and contribute to forming common sense assumptions. We present the High-Level Dataset, a dataset extending 14,997 images from the COCO dataset, aligned with a new set of 134,973 human-annotated (high-level) captions collected along three axes: scenes, actions and rationales. We further extend this dataset with confidence scores collected from an independent set of readers, as well as a set of narrative captions generated synthetically, by combining each of the three axes. We describe this dataset and analyse it extensively. We also present baseline results for the High-Level Captioning task.

pdf bib
Validating Predictive Models Of Evaluative Language For Controllable Data2Text Generation
Maurice Langner | Ralf Klabunde

In data2text generation, tabular data is transformed into a text that expresses information from that source domain. While some text types, such as instructions, demand objective and neutral language without any expressive and evaluative content, many other text types are expected to provide expressions for these kinds of subjective meanings. In controllable, pipelined neural NLG, separate learning models, notably regression models, can be used to predict whether some feature deviates sufficiently strongly from an expected value, so that evaluative language would be appropriate for verbalizing this finding. In this paper, we present an empirical study on the comprehension of evaluative adverbs and adjectival modifiers in car reviews, a text type that is characterized by a mixture of factual information with evaluations expressing positive or negative surprise. We show to what extent regression-based decision boundaries for producing evaluative content in controllable data2text NLG match the reader’s expectations that are raised by those evaluative markers. Finally, we show that regression values in combination with the standard deviation of the technical input data constitute reasonable Boolean thresholds for both positive and negative surprise, which provide the basis for the development of more complex models that also include the scalar base of adverbs and modifiers.

pdf bib
The Next Chapter: A Study of Large Language Models in Storytelling
Zhuohan Xie | Trevor Cohn | Jey Han Lau

To enhance the quality of generated stories, recent story generation models have been investigating the utilization of higher-level attributes like plots or commonsense knowledge. The application of prompt-based learning with large language models (LLMs), exemplified by GPT-3, has exhibited remarkable performance in diverse natural language processing (NLP) tasks. This paper conducts a comprehensive investigation, utilizing both automatic and human evaluation, to compare the story generation capacity of LLMs with recent models across three datasets with variations in style, register, and length of stories. The results demonstrate that LLMs generate stories of significantly higher quality compared to other story generation models. Moreover, they exhibit a level of performance that competes with human authors, albeit with the preliminary observation that they tend to replicate real stories in situations involving world knowledge, resembling a form of plagiarism.

pdf bib
Trustworthiness of Children Stories Generated by Large Language Models
Prabin Bhandari | Hannah Brennan

Large Language Models (LLMs) have shown a tremendous capacity for generating literary text. However, their effectiveness in generating children’s stories has yet to be thoroughly examined. In this study, we evaluate the trustworthiness of children’s stories generated by LLMs using various measures, and we compare and contrast our results with both old and new children’s stories to better assess their significance. Our findings suggest that LLMs still struggle to generate children’s stories at the level of quality and nuance found in actual stories.

pdf bib
On Text Style Transfer via Style-Aware Masked Language Models
Sharan Narasimhan | Pooja H | Suvodip Dey | Maunendra Sankar Desarkar

Text Style Transfer (TST) is performable through approaches such as latent space disentanglement, cycle-consistency losses, prototype editing etc. The prototype editing approach, which is known to be quite successful in TST, involves two key phases: a) masking of source style-associated tokens and b) reconstruction of this source-style-masked sentence conditioned on the target style. We follow a similar transduction method, in which we transpose the more difficult direct source-to-target TST task to a simpler Style-Masked Language Model (SMLM) task, wherein, similar to BERT, the goal of our model is now to reconstruct the source sentence from its style-masked version. We arrive at the SMLM mechanism naturally by formulating prototype editing/transduction methods in a probabilistic framework, where TST resolves into estimating a hypothetical parallel dataset from a partially observed parallel dataset, wherein each domain is assumed to have a common latent style-masked prior. To generate this style-masked prior, we use “Explainable Attention” as our choice of attribution for a more precise style-masking step and also introduce a cost-effective and accurate “Attribution-Surplus” method for determining the position of masks from any arbitrary attribution model in O(1) time. We empirically show that this non-generational approach suits the “content preserving” criterion well for a task like TST, even for a complex style like Discourse Manipulation. Our model, the Style MLM, outperforms strong TST baselines and is on par with state-of-the-art TST models, which use complex architectures and orders of magnitude more parameters.
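
The style-masking step itself can be pictured with a small sketch, assuming token-level attribution scores from any attribution model; the threshold and mask token below are illustrative, not the paper's settings:

```python
# Illustrative sketch of style masking: tokens whose attribution to the source style
# exceeds a threshold are replaced with a mask token, and the SMLM is trained to
# reconstruct the sentence from this style-masked version.

def style_mask(tokens, attributions, threshold=0.5, mask_token="<mask>"):
    return [mask_token if score >= threshold else token
            for token, score in zip(tokens, attributions)]
```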

pdf bib
Affective Natural Language Generation of Event Descriptions through Fine-grained Appraisal Conditions
Yarik Menchaca Resendiz | Roman Klinger

Models for affective text generation have shown remarkable progress, but they commonly rely only on basic emotion theories or valence/arousal values as conditions. This is appropriate when the goal is to create explicit emotion statements (“The kid is happy.”). Emotions are, however, commonly communicated implicitly. For instance, the emotional interpretation of an event (“Their dog died.”) often does not require an explicit emotion statement. In psychology, appraisal theories explain the link between a cognitive evaluation of an event and the potentially developed emotion. They put the assessment of the situation on the spot, for instance regarding one’s own control or the responsibility for what happens. We hypothesize and subsequently show that including appraisal variables as conditions in a generation framework comes with two advantages. (1) The generation model is informed in greater detail about what makes a specific emotion and what properties it has. This leads to text generation that better fulfills the condition. (2) The variables of appraisal allow a user to perform a more fine-grained control of the generated text, by stating properties of a situation instead of only providing the emotion category. Our BART- and T5-based experiments with 7 emotions (Anger, Disgust, Fear, Guilt, Joy, Sadness, Shame) and 7 appraisals (Attention, Responsibility, Control, Circumstance, Pleasantness, Effort, Certainty) show that (1) adding appraisals during training improves the accuracy of the generated texts by 10 pp in F1. Further, (2) the texts with appraisal variables are longer and contain more details. This exemplifies the greater control for users.

pdf bib
Leveraging Low-resource Parallel Data for Text Style Transfer
Sourabrata Mukherjee | Ondrej Dusek

Text style transfer (TST) involves transforming a text into a desired style while approximately preserving its content. The biggest challenge in TST is the general lack of parallel data. Many existing approaches rely on complex models using substantial non-parallel data, with mixed results. In this paper, we leverage a pretrained BART language model with minimal parallel data and incorporate low-resource methods such as hyperparameter tuning, data augmentation, and self-training, which have not been explored in TST. We further include novel style-based rewards in the training loss. Through extensive experiments in sentiment transfer, a sub-task of TST, we demonstrate that our simple yet effective approaches achieve well-balanced results, surpassing non-parallel approaches and highlighting the usefulness of parallel data even in small amounts.

pdf bib
Reverse-Engineering Decoding Strategies Given Blackbox Access to a Language Generation System
Daphne Ippolito | Nicholas Carlini | Katherine Lee | Milad Nasr | Yun William Yu

Neural language models are increasingly deployed into APIs and websites that allow a user to pass in a prompt and receive generated text. Many of these systems do not reveal generation parameters. In this paper, we present methods to reverse-engineer the decoding method used to generate text (i.e., top-k or nucleus sampling). Our ability to discover which decoding strategy was used has implications for detecting generated text. Additionally, the process of discovering the decoding strategy can reveal biases caused by selecting decoding settings which severely truncate a model’s predicted distributions. We perform our attack on several families of open-source language models, as well as on production systems (e.g., ChatGPT).
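
The basic probing idea can be illustrated as follows; this is a simplified sketch, and `sample_next_token` is a hypothetical wrapper around the blackbox system rather than an API from the paper:

```python
from collections import Counter

# Simplified illustration of probing a blackbox generation API for truncated sampling:
# sample many single-token continuations of a fixed prompt and inspect the support size.

def estimate_support(sample_next_token, prompt, num_samples=2000):
    counts = Counter(sample_next_token(prompt) for _ in range(num_samples))
    # Under top-k sampling the empirical support size is bounded by k; under
    # untruncated sampling far more distinct tokens tend to appear.
    return len(counts), counts.most_common(10)
```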

pdf bib
Controlling keywords and their positions in text generation
Yuichi Sasazawa | Terufumi Morishita | Hiroaki Ozaki | Osamu Imaichi | Yasuhiro Sogawa

One of the challenges in text generation is to control text generation as intended by the user. Previous studies proposed specifying the keywords that should be included in the generated text. However, this approach is insufficient to generate text that reflects the user’s intent. For example, placing an important keyword at the beginning of the text would help attract the reader’s attention; however, existing methods do not enable such flexible control. In this paper, we tackle the novel task of controlling not only the keywords but also the position of each keyword in the generated text. To this end, we propose a task-independent method that uses special tokens to control the relative position of keywords. Experimental results on summarization and story generation tasks show that the proposed method can control keywords and their positions. The experimental results also demonstrate that controlling the keyword positions can generate summary texts that are closer to the user’s intent than the baseline.
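
One plausible way to encode keyword-position control with special tokens is sketched below; the token inventory and position bucketing are assumptions, not the exact scheme proposed in the paper:

```python
# Illustrative sketch: prefix the model input with keywords tagged by coarse
# relative-position tokens, so the generator is conditioned on where each keyword
# should appear in the output.

POSITION_TOKENS = ["<pos_begin>", "<pos_middle>", "<pos_end>"]

def bucket(relative_position):
    if relative_position < 1 / 3:
        return POSITION_TOKENS[0]
    if relative_position < 2 / 3:
        return POSITION_TOKENS[1]
    return POSITION_TOKENS[2]

def build_control_prefix(keywords_with_positions):
    """keywords_with_positions: list of (keyword, relative position in [0, 1])."""
    parts = [f"{bucket(pos)} {kw}" for kw, pos in keywords_with_positions]
    return " ".join(parts) + " <sep> "

# e.g. build_control_prefix([("earthquake", 0.0), ("rescue", 0.8)]) + source_text
```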

pdf bib
Tackling Hallucinations in Neural Chart Summarization
Saad Obaid ul Islam | Iza Škrjanec | Ondrej Dusek | Vera Demberg

Hallucinations in text generation occur when the system produces text that is not grounded in the input. In this work, we tackle the problem of hallucinations in neural chart summarization. Our analysis shows that the target side of chart summarization training datasets often contains additional information, leading to hallucinations. We propose a natural language inference (NLI) based method to preprocess the training data and show through human evaluation that our method significantly reduces hallucinations. We also found that shortening long-distance dependencies in the input sequence and adding chart-related information like title and legends improves the overall performance.
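
A minimal sketch of NLI-based cleaning of training targets is shown below, assuming an arbitrary NLI scorer `nli_entailment_prob(premise, hypothesis)`; the function names and threshold are illustrative:

```python
# Sketch of NLI-based preprocessing for chart summarization: keep only target
# sentences that an NLI model judges to be entailed by the linearized chart data.

def clean_target(linearized_chart, target_summary, nli_entailment_prob, threshold=0.5):
    kept = []
    for sentence in target_summary.split(". "):
        if not sentence.strip():
            continue
        if nli_entailment_prob(linearized_chart, sentence) >= threshold:
            kept.append(sentence)
    return ". ".join(kept)
```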

pdf bib
Learning Disentangled Meaning and Style Representations for Positive Text Reframing
Xu Sheng | Fumiyo Fukumoto | Jiyi Li | Go Kentaro | Yoshimi Suzuki

The positive text reframing (PTR) task, which generates a text that gives a positive perspective while preserving the sense of the input text, has attracted considerable attention as an NLP application. Due to the significant representation capability of the pre-trained language model (PLM), a beneficial baseline can be easily obtained by just fine-tuning the PLM. However, how to interpret a diversity of contexts to give a positive perspective is still an open problem. This is especially serious when the size of the training data is limited. In this paper, we present a PTR framework that learns representations where the meaning and style of text are structurally disentangled. The method utilizes pseudo-positive reframing datasets which are generated with two augmentation strategies. A simple but effective multi-task learning-based model is learned to fuse the generation capabilities from these datasets. Experimental results on the Positive Psychology Frames (PPF) dataset show that our approach outperforms the BART baseline on five and the T5 baseline on six evaluation metrics. Our source codes and data are available online.

pdf bib
Generating clickbait spoilers with an ensemble of large language models
Mateusz Woźny | Mateusz Lango

Clickbait posts are a widespread problem in the webspace. The generation of spoilers, i.e. short texts that neutralize clickbait by providing information that makes it uninteresting, is one of the proposed solutions to the problem. Current state-of-the-art methods are based on passage retrieval or question answering approaches and are limited to generating spoilers only in the form of a phrase or a passage. In this work, we propose an ensemble of fine-tuned large language models for clickbait spoiler generation. Our approach is not limited to phrase or passage spoilers, but is also able to generate multipart spoilers that refer to several non-consecutive parts of text. Experimental evaluation demonstrates that the proposed ensemble model outperforms the baselines in terms of BLEU, METEOR and BERTScore metrics.

pdf bib
Reducing named entity hallucination risk to ensure faithful summary generation
Eunice Akani | Benoit Favre | Frederic Bechet | Romain Gemignani

The faithfulness of abstractive text summarization at the named entities level is the focus of this study. We propose to add a new criterion to the summary selection method based on the “risk” of generating entities that do not belong to the source document. This method is based on the assumption that Out-Of-Document entities are more likely to be hallucinations. This assumption was verified by a manual annotation of the entities occurring in a set of generated summaries on the CNN/DM corpus. This study showed that only 29% of the entities outside the source document were inferrable by the annotators, leading to 71% of hallucinations among OOD entities. We test our selection method on the CNN/DM corpus and show that it significantly reduces the hallucination risk on named entities while maintaining competitive results with respect to automatic evaluation metrics like ROUGE.
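
The entity-level risk criterion can be sketched as follows; the NER component and the simple string-matching notion of "out of document" are assumptions for illustration, not the authors' exact procedure:

```python
# Minimal sketch of an entity-level hallucination-risk criterion for summary
# selection: prefer candidates whose named entities all occur in the source.
# `extract_entities` stands in for any NER component (e.g. a spaCy pipeline).

def ood_entity_ratio(summary, source, extract_entities):
    entities = extract_entities(summary)
    if not entities:
        return 0.0
    out_of_document = [e for e in entities if e.lower() not in source.lower()]
    return len(out_of_document) / len(entities)

def select_summary(candidates, source, extract_entities):
    # Lower out-of-document entity ratio = lower assumed hallucination risk.
    return min(candidates, key=lambda s: ood_entity_ratio(s, source, extract_entities))
```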

pdf bib
Building a dual dataset of text- and image-grounded conversations and summarisation in Gàidhlig (Scottish Gaelic)
David M. Howcroft | William Lamb | Anna Groundwater | Dimitra Gkatzia

Gàidhlig (Scottish Gaelic; gd) is spoken by about 57k people in Scotland, but remains an under-resourced language with respect to natural language processing in general and natural language generation (NLG) in particular. To address this gap, we developed the first datasets for Scottish Gaelic NLG, collecting both conversational and summarisation data in a single setting. Our task setup involves dialogues between a pair of speakers discussing museum exhibits, grounding the conversation in images and texts. Then, both interlocutors summarise the dialogue resulting in a secondary dialogue summarisation dataset. This paper presents the dialogue and summarisation corpora, as well as the software used for data collection. The corpus consists of 43 conversations (13.7k words) and 61 summaries (2.0k words), and will be released along with the data collection interface.

pdf bib
Generating Multiple Questions from Presentation Transcripts: A Pilot Study on Earnings Conference Calls
Yining Juan | Chung-Chi Chen | Hen-Hsen Huang | Hsin-Hsi Chen

In various scenarios, such as conference oral presentations, company managers’ talks, and politicians’ speeches, individuals often contemplate the potential questions that may arise from their presentations. This common practice prompts the research question addressed in this study: to what extent can models generate multiple questions based on a given presentation transcript? To investigate this, we conduct pilot explorations using earnings conference call transcripts, which serve as regular meetings between professional investors and company managers. We experiment with different task settings and methods and evaluate the results from various perspectives. Our findings highlight that incorporating key points retrieval techniques enhances the accuracy and diversity of the generated questions.

pdf bib
Mod-D2T: A Multi-layer Dataset for Modular Data-to-Text Generation
Simon Mille | Francois Lareau | Stamatia Dasiopoulou | Anya Belz

Rule-based text generators lack the coverage and fluency of their neural counterparts, but have two big advantages over them: (i) they are entirely controllable and do not hallucinate; and (ii) they can fully explain how an output was generated from an input. In this paper we leverage these two advantages to create large and reliable synthetic datasets with multiple human-intelligible intermediate representations. We present the Modular Data-to-Text (Mod-D2T) Dataset which incorporates ten intermediate-level representations between input triple sets and output text; the mappings from one level to the next can broadly be interpreted as the traditional modular tasks of an NLG pipeline. We describe the Mod-D2T dataset, evaluate its quality via manual validation and discuss its applications and limitations. Data, code and documentation are available at https://github.com/mille-s/Mod-D2T.

pdf (full)
bib (full)
Proceedings of the 16th International Natural Language Generation Conference: System Demonstrations

pdf bib
Proceedings of the 16th International Natural Language Generation Conference: System Demonstrations
C. Maria Keet | Hung-Yi Lee | Sina Zarrieß

pdf bib
Overview of MiReportor: Generating Reports for Multimodal Medical Images
Xuwen Wang | Hetong Ma | Zhen Guo | Jiao Li

This demo paper presents a brief introduction to MiReportor, a computer-aided medical imaging report generator, which leverages a unified framework of medical image understanding and generation to predict readable descriptions for medical images and assists radiologists in writing imaging reports.

pdf bib
enunlg: a Python library for reproducible neural data-to-text experimentation
David M. Howcroft | Dimitra Gkatzia

Over the past decade, a variety of neural architectures for data-to-text generation (NLG) have been proposed. However, each system typically has its own approach to pre- and post-processing and other implementation details. Diversity in implementations is desirable, but it also confounds attempts to compare model performance: are the differences due to the proposed architectures or are they a byproduct of the libraries used or a result of pre- and post-processing decisions made? To improve reproducibility, we re-implement several pre-Transformer neural models for data-to-text NLG within a single framework to facilitate direct comparisons of the models themselves and better understand the contributions of other design choices. We release our library at https://github.com/NapierNLP/enunlg to serve as a baseline for ongoing work in this area including research on NLG for low-resource languages where transformers might not be optimal.

pdf bib
VisuaLLM: Easy Web-based Visualization for Neural Language Generation
František Trebuňa | Ondrej Dusek

VisuaLLM is a Python library that enables interactive visualization of common tasks in natural language generation with pretrained language models (using HuggingFace’s model API), with tight integration of benchmark datasets and fine-grained generation control. The system runs as a local generation backend server and features a web-based frontend, allowing simple interface configuration by minimal Python code. The currently implemented views include data visualization, next-token prediction with probability distributions, and decoding parameter control, with simple extension to additional tasks.

pdf bib
Audio Commentary System for Real-Time Racing Game Play
Tatsuya Ishigaki | Goran Topić | Yumi Hamazono | Ichiro Kobayashi | Yusuke Miyao | Hiroya Takamura

Live commentaries are essential for enhancing spectators’ enjoyment and understanding during sports events or e-sports streams. We introduce a live audio commentator system designed specifically for a racing game, driven by the high demand in the e-sports field. While a player is playing a racing game, our system tracks real-time user play data including speed and steer rotations, and generates commentary to accompany the live stream. Human evaluation suggested that generated commentary enhances enjoyment and understanding of races compared to streams without commentary. Incorporating additional modules to improve diversity and detect irregular events, such as course-outs and collisions, further increases the preference for the output commentaries.

pdf (full)
bib (full)
Proceedings of the 16th International Natural Language Generation Conference: Generation Challenges

pdf bib
Proceedings of the 16th International Natural Language Generation Conference: Generation Challenges
Simon Mille

pdf bib
LOWRECORP: the Low-Resource NLG Corpus Building Challenge
Khyathi Raghavi Chandu | David M. Howcroft | Dimitra Gkatzia | Yi-Ling Chung | Yufang Hou | Chris Chinenye Emezue | Pawan Rajpoot | Tosin Adewumi

Most languages in the world do not have sufficient data available to develop neural-network-based natural language generation (NLG) systems. To alleviate this resource scarcity, we propose a novel challenge for the NLG community: low-resource language corpus development (LOWRECORP). We present an innovative framework to collect a single dataset with dual tasks to maximize the efficiency of data collection efforts and respect language consultant time. Specifically, we focus on a text-chat-based interface for two generation tasks – conversational response generation grounded in a source document and/or image and dialogue summarization (from the former task). The goal of this shared task is to collectively develop grounded datasets for local and low-resourced languages. To enable data collection, we make available web-based software that can be used to collect these grounded conversations and summaries. Submissions will be assessed for the size, complexity, and diversity of the corpora to ensure quality control of the datasets as well as any enhancements to the interface or novel approaches to grounding conversations.

pdf bib
Long Story Generation Challenge
Nikolay Mikhaylovskiy

We propose a shared task of human-like long story generation, the LSG Challenge, that asks models to output a consistent human-like long story (a Harry Potter generic audience fanfic in English), given a prompt of about 1K tokens. We suggest a novel statistical metric of text structuredness, the GloVe Autocorrelations Power/Exponential Law Mean Absolute Percentage Error Ratio (GAPELMAPER), together with the use of the previously known UNION metric and a human evaluation protocol. We hope that LSG can open new avenues for researchers to investigate sampling approaches, prompting strategies, autoregressive and non-autoregressive text generation architectures and break the barrier to generating consistent long (40K+ word) texts.

pdf bib
Visually Grounded Story Generation Challenge
Xudong Hong | Khushboo Mehra | Asad Sayeed | Vera Demberg

Recent large pre-trained models have achieved strong performance in multimodal language generation, which requires a joint effort of vision and language modeling. However, most previous generation tasks are based on single image input and produce short text descriptions that are not grounded on the input images. In this work, we propose a shared task on visually grounded story generation. The input is an image sequence, and the output is a story that is conditioned on the input images. This task is particularly challenging because: 1) the protagonists in the generated stories need to be grounded in the images and 2) the output story should be a coherent long-form text. We aim to advance the study of vision-based story generation by accepting submissions that propose new methods as well as new evaluation measures.

pdf bib
The VDG Challenge: Response Generation and Evaluation in Collaborative Visual Dialogue
Nikolai Ilinykh | Simon Dobnik

We propose the VDG Challenge: a shared task that addresses and benchmarks the task of utterance generation in collaborative visual dialogue. The task features two challenging datasets, an evaluation protocol and a tentative schedule. Our shared task will allow researchers to unravel problems of modelling multi-modal interaction and fit of the existing approaches in the NLP and NLG communities.

pdf bib
Identifying Feedback Types to Augment Feedback Comment Generation
Maja Stahl | Henning Wachsmuth

In the context of language learning, feedback comment generation is the task of generating hints or explanatory notes for learner texts that help understand why a part of the text is erroneous. This paper presents our approach to the Feedback Comment Generation Shared Task, collocated with the 16th International Natural Language Generation Conference (INLG 2023). The approach augments the generation of feedback comments with a self-supervised identification of feedback types in a multitask learning setting. Within the shared task, other approaches performed more effectively, yet the combined modeling of feedback type classification and feedback comment generation is superior to performing feedback comment generation only.

pdf bib
Error syntax aware augmentation of feedback comment generation dataset
Nikolay Babakov | Maria Lysyuk | Alexander Shvets | Lilya Kazakova | Alexander Panchenko

This paper presents a solution to the GenChal 2022 shared task dedicated to feedback comment generation for writing learning. In this task, given a text with an error and the span of the error, a system generates an explanatory note that helps the writer (language learner) improve their writing skills. Our solution is based on fine-tuning the T5 model on the initial dataset augmented according to the syntactic dependencies of the words located within the indicated error span. The solution of our team ‘nigula’ obtained second place according to manual evaluation by the organizers.

pdf bib
A Report on FCG GenChal 2022: Shared Task on Feedback Comment Generation for Language Learners
Ryo Nagata | Masato Hagiwara | Kazuaki Hanawa | Masato Mita

We report on the results of the first ever shared task on feedback comment generation for language learners held as Generation Challenge (GenChal) in INLG 2022, which we call FCG GenChal. Feedback comment generation for language learners is a task where, given a text and a span, a system generates, for the span, an explanatory note that helps the writer (language learner) improve their writing skills. We show how well we can generate feedback comments with present techniques. We also shed light on the task properties and the difficulties in this task, with insights into the task including data development, evaluation, and comparisons of generation systems.

pdf bib
Sentence-level Feedback Generation for English Language Learners: Does Data Augmentation Help?
Shabnam Behzad | Amir Zeldes | Nathan Schneider

In this paper, we present strong baselines for the task of Feedback Comment Generation for Writing Learning. Given a sentence and an error span, the task is to generate a feedback comment explaining the error. Sentences and feedback comments are both in English. We experiment with LLMs and also create multiple pseudo datasets for the task, investigating how it affects the performance of our system. We present our results for the task along with extensive analysis of the generated comments with the aim of aiding future studies in feedback comment generation for English language learners.

pdf bib
Retrieval, Masking, and Generation: Feedback Comment Generation using Masked Comment Examples
Mana Ihori | Hiroshi Sato | Tomohiro Tanaka | Ryo Masumura

In this paper, we propose a novel method, retrieval, masking, and generation, for feedback comment generation. Feedback comment generation is a task in which a system generates feedback comments such as hints or explanatory notes for language learners, given an input text and a position showing where to comment. In prior work, the retrieve-and-edit method, which retrieves feedback comments from a data pool and edits them, has been considered effective for this task. However, this method does not perform as well as other conventional methods because its model learns to edit tokens in the retrieved comments that do not need to be rewritten. To mitigate this problem, we propose a method that combines retrieval, masking, and generation based on the retrieve-and-edit method. Specifically, tokens of feedback comments retrieved from the data pool are masked, and this masked feedback comment is used as a template to generate feedback comments. The proposed method prevents unnecessary edits by not using the retrieved feedback comments directly but masking them first. Our experiments on feedback comment generation demonstrate that the proposed method outperforms conventional methods.
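
As a purely illustrative heuristic (not the authors' masking procedure), the masking step might be pictured like this:

```python
# Toy sketch of masking a retrieved feedback comment into a template: tokens judged
# to be specific to the retrieved example are replaced with a mask token, and a
# generator later fills the template conditioned on the learner's sentence.
# The keep/mask heuristic below is an assumption for illustration only.

MASK = "<mask>"

def mask_retrieved_comment(retrieved_comment, retrieved_source, learner_source):
    shared = set(retrieved_source.lower().split()) & set(learner_source.lower().split())
    masked = []
    for token in retrieved_comment.split():
        # Keep short function-like tokens and words shared with the learner's sentence;
        # mask example-specific content words.
        if token.lower() in shared or len(token) <= 3:
            masked.append(token)
        else:
            masked.append(MASK)
    return " ".join(masked)
```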

pdf bib
TMU Feedback Comment Generation System Using Pretrained Sequence-to-Sequence Language Models
Naoya Ueda | Mamoru Komachi

In this paper, we introduce our Tokyo Metropolitan University Feedback Comment Generation system submitted to the feedback comment generation task of the INLG 2023 Generation Challenge. In this task, a source sentence and the offset range of preposition uses are given as the input. Then, a system generates hints or explanatory notes about the preposition uses as the output. To tackle this generation task, we fine-tuned pretrained sequence-to-sequence language models. The models using BART and T5 showed significant improvement in BLEU score, demonstrating the effectiveness of pretrained sequence-to-sequence language models in this task. We found that using part-of-speech tag information as an auxiliary input improves the generation quality of feedback comments. Furthermore, we adopt a simple postprocessing method that can enhance the reliability of the generation. As a result, our system achieved an F1 score of 47.4 points in the BLEU-based evaluation and 60.9 points in the manual evaluation, which ranked second and third on the leaderboard.

pdf bib
The Tokyo Tech and AIST System at the GenChal 2022 Shared Task on Feedback Comment Generation
Shota Koyama | Hiroya Takamura | Naoaki Okazaki

This paper describes the Tokyo Tech and AIST system in the GenChal 2022 shared task, which is the first shared task of feedback comment generation. We adopted five methods: data cleaning, fine-tuning pre-trained models, correcting errors in learners’ sentences, appending a correcting operation, and filtering out irrelevant outputs. Our system achieved F1 = 43.4 on the test dataset.

pdf bib
Feedback comment generation using predicted grammatical terms
Kunitaka Jimichi | Kotaro Funakoshi | Manabu Okumura

The purpose of feedback comment generation is to provide useful feedback comments for a wide range of errors in learners’ essays from a language learning perspective. Since it is difficult to obtain appropriate comments at a practical level with rule-based or retrieval-based methods, we explore neural-based generative methods with pre-trained models. We further assume the effectiveness of considering grammatical terms in generating feedback comments. Specifically, this paper proposes T5-based models using predicted grammatical terms, submitted to FCG GenChal, and presents their results. By using correct grammatical terms, our model could improve the BLEU score by 19.0 points, compared with the baseline T5 without grammatical terms on the development dataset. Furthermore, by using predicted grammatical terms, our model could improve the manual evaluation score by 2.33 points, compared with the baseline T5 without grammatical terms on the test dataset.

pdf bib
AIWolfDial 2023: Summary of Natural Language Division of 5th International AIWolf Contest
Yoshinobu Kano | Neo Watanabe | Kaito Kagaminuma | Claus Aranha | Jaewon Lee | Benedek Hauer | Hisaichi Shibata | Soichiro Miki | Yuta Nakamura | Takuya Okubo | Soga Shigemura | Rei Ito | Kazuki Takashima | Tomoki Fukuda | Masahiro Wakutani | Tomoya Hatanaka | Mami Uchida | Mikio Abe | Akihiro Mikami | Takashi Otsuki | Zhiyang Qi | Kei Harada | Michimasa Inaba | Daisuke Katagami | Hirotaka Osawa | Fujio Toriumi

We held the 5th annual AIWolf international contest on automatically playing the Werewolf game “Mafia”, in which players try to identify liars through conversation. The contest aims to promote the development of agents capable of more natural, higher-level conversation involving longer contexts, personal relationships, semantics, pragmatics, and logic, and to reveal the capabilities and limits of generative AI. In the Natural Language Division of the contest, six Japanese-speaking agents from five teams and three English-speaking agents played games against each other. Using the game logs, we performed human subjective evaluations and a detailed log analysis. We found that overall system performance has improved substantially over the previous year, owing to recent advances in LLMs. However, it is far from perfect: the generated talks are sometimes inconsistent with the game actions, and it remains doubtful whether the agents infer roles through logical reasoning rather than superficial utterance generation. Although not explicitly observed in these logs, it would still be difficult to make an agent tell a lie, that is, pretend to be a villager while internally pursuing the opposite goal. Our future work includes revealing whether LLMs can realize the duality of the “liar”, in other words, holding both a “true” and a “false” view of the agent’s circumstances at the same time, including how these circumstances appear to other agents.

pdf bib
Team Zoom @ AutoMin 2023: Utilizing Topic Segmentation And LLM Data Augmentation For Long-Form Meeting Summarization
Felix Schneider | Marco Turchi

This paper describes Zoom’s submission to the Second Shared Task on Automatic Minuting at INLG 2023. We participated in Task A: generating abstractive summaries of meetings. Our final submission was a transformer model utilizing data from a similar domain and data augmentation by large language models, as well as content-based segmentation. The model produces summaries covering meeting topics and next steps and performs comparably to a large language model at a fraction of the cost. We also find that re-summarizing the summaries with the same model allows for an alternative, shorter summary.

pdf bib
Team Synapse @ AutoMin 2023: Leveraging BART-Based Models for Automatic Meeting Minuting
Kristýna Klesnilová | Michelle Elizabeth

This paper describes the approach we followed for our submission to the Second Run of the Automatic Minuting Shared Task. Our methodology centers around employing BART-based models fine-tuned on diverse summarization corpora. The segmented meeting transcripts are fed into the models, generating summaries that are subsequently combined and formatted into the final meeting minutes.

pdf bib
Team Iterate @ AutoMin 2023 - Experiments with Iterative Minuting
František Kmječ | Ondřej Bojar

This report describes the development of our system for automatic minuting created for the AutoMin 2023 Task A. As a baseline, we utilize a system based on the BART encoder-decoder model paired with a preprocessing pipeline similar to the one introduced by the winning solutions at AutoMin 2021. We then further explore the possibilities for iterative summarization by constructing an iterative minuting dataset from the provided data, fine-tuning on it, and feeding the model previously generated minutes. We also experiment with adding more context by utilizing the Longformer encoder-decoder model and fine-tuning it on the SAMSum dataset. Our submitted solution is the baseline approach, since we were unable to match its performance with our iterative variants. With the baseline, we achieve a ROUGE-1 score of 0.368 on the ELITR minuting corpus development set. Finally, we explore the performance of the quantized Vicuna 13B language model for summarization.

pdf bib
Darbarer @ AutoMin2023: Transcription simplification for concise minute generation from multi-party conversations
Ismaël Rousseau | Loïc Fosse | Youness Dkhissi | Geraldine Damnati | Gwénolé Lecorvé

This document reports the approach of our team Darbarer for the main task (Task A) of the AutoMin 2023 challenge. Our system is composed of four main modules. The first module relies on a text simplification model aiming at standardizing the utterances of the conversation and compressing the input in order to focus on informative content. The second module handles summarization by employing a straightforward segmentation strategy and a fine-tuned BART-based generative model. Then a titling module has been trained in order to propose a short description of each summarized block. Lastly, we apply a post-processing step aimed at enhancing readability through specific formatting rules. Our contributions lie in the first, third and last steps. Our system generates precise and concise minutes. We provide a detailed description of our modules, discuss the difficulty of evaluating their impact and propose an analysis of observed errors in our generated minutes.

pdf bib
Team NTR @ AutoMin 2023: Dolly LLM Improves Minuting Performance, Semantic Segmentation Doesn’t
Eugene Borisov | Nikolay Mikhaylovskiy

This paper documents the approach of Team NTR for the Second Shared Task on Automatic Minuting (AutoMin) at INLG 2023. The goal of this work is to develop a module for automatic generation of meeting minutes based on a meeting transcript produced by an Automated Speech Recognition (ASR) system (Task A). We consider minuting a supervised machine learning task on pairs of texts: the transcript of the meeting and its minutes. We use a two-stage minuting pipeline that consists of segmentation and summarization. We experiment with semantic segmentation, multi-language approaches, and the Dolly large language model, and achieve a ROUGE-1 F-score of 0.2455 and a BERTScore of 0.8063 on the English part of the ELITR test set, and a ROUGE-1 F-score of 0.2430 and a BERTScore of 0.8332 on the EuroParl dev set with the submitted Naive Segmentation + Dolly7b pipeline.
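The two-stage pipeline described above can be sketched in a few lines: segment the transcript naively into fixed-size blocks and summarize each block. The code below uses an off-the-shelf summarizer (BART-CNN) as a stand-in for the Dolly model and an invented toy transcript, so it illustrates the pipeline shape rather than the submitted system.

```python
# Sketch of a two-stage "segment then summarize" minuting pipeline.
# The fixed-size segmentation and the stand-in summarizer (BART-CNN instead
# of Dolly) are simplifying assumptions; the transcript is invented.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def naive_segments(turns, turns_per_segment=30):
    """Split the transcript into fixed-size blocks of turns."""
    for i in range(0, len(turns), turns_per_segment):
        yield " ".join(turns[i:i + turns_per_segment])

def minutes(turns):
    bullets = []
    for segment in naive_segments(turns):
        summary = summarizer(segment, max_length=60, min_length=10,
                             do_sample=False, truncation=True)[0]["summary_text"]
        bullets.append(f"- {summary}")
    return "\n".join(bullets)

transcript = [
    "PERSON1: Let's start with the project timeline.",
    "PERSON2: We are two weeks behind on the annotation work.",
]
print(minutes(transcript))
```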

pdf bib
Overview of the Second Shared Task on Automatic Minuting (AutoMin) at INLG 2023
Tirthankar Ghosal | Ondřej Bojar | Marie Hledíková | Tom Kocmi | Anna Nedoluzhko

In this article, we report the findings of the second shared task on Automatic Minuting (AutoMin), held as a Generation Challenge at the 16th International Natural Language Generation (INLG) Conference 2023. The second Automatic Minuting shared task is a successor to the first AutoMin, which took place in 2021. The primary objective of the AutoMin shared task is to garner participation of the speech and natural language processing and generation community to create automatic methods for generating minutes from multi-party meetings. Five teams from diverse backgrounds participated in the shared task this year. A lot has changed in the generative AI landscape since the last AutoMin, especially with the emergence of Large Language Models (LLMs) and their wide adoption for different downstream tasks. Most of the contributions are based on some form of LLM, and we also add current outputs of GPT-4 as a benchmark. Furthermore, we examine the applicability of GPT-4 for automatic scoring of minutes. Compared to the previous instance of AutoMin, we also add another domain, the minutes for EU Parliament sessions, and we experiment with a more fine-grained manual evaluation. More details on the event can be found at https://ufal.github.io/automin-2023/.

up

pdf (full)
bib (full)
Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue

pdf bib
Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue
Svetlana Stoyanchev | Shafiq Joty | David Schlangen | Ondrej Dusek | Casey Kennington | Malihe Alikhani

pdf bib
Sources of Noise in Dialogue and How to Deal with Them
Derek Chen | Zhou Yu

Training dialogue systems often entails dealing with noisy training examples and unexpected user inputs. Despite their prevalence, there is currently no accurate survey of dialogue noise, nor a clear sense of the impact of each noise type on task performance. This paper addresses this gap by first constructing a taxonomy of noise encountered by dialogue systems. In addition, we run a series of experiments to show how different models behave when subjected to varying levels and types of noise. Our results reveal that models are quite robust to label errors commonly tackled by existing denoising algorithms, but that performance suffers from dialogue-specific noise. Driven by these observations, we design a data cleaning algorithm specialized for conversational settings and apply it as a proof of concept for targeted dialogue denoising.

pdf bib
Investigating Explicitation of Discourse Connectives in Translation using Automatic Annotations
Frances Yung | Merel Scholman | Ekaterina Lapshinova-Koltunski | Christina Pollkläsener | Vera Demberg

Discourse relations have different patterns of marking across different languages. As a result, discourse connectives are often added, omitted, or rephrased in translation. Prior work has shown a tendency for explicitation of discourse connectives, but such work was conducted using restricted sample sizes due to the difficulty of connective identification and alignment. The current study exploits automatic methods to facilitate a large-scale study of connectives in English and German parallel texts. Our results, based on over 300 types and 18,000 instances of aligned connectives and an empirical approach to comparing the cross-lingual specificity gap, provide strong evidence for the Explicitation Hypothesis. We conclude that discourse relations are indeed more explicit in translation than in texts written originally in the same language. Automatic annotations allow us to carry out translation studies of discourse relations on a large scale. Our methodology using relative entropy to study the specificity of connectives also provides more fine-grained insights into translation patterns.
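The relative-entropy methodology mentioned above can be illustrated in a few lines: treat the connective counts in original and translated text as probability distributions and compute the KL divergence between them. The counts below are invented for illustration only.

```python
# Sketch: quantify how far the connective distribution of translated text
# diverges from originally-authored text via relative entropy (KL divergence).
# The counts are invented; real counts would come from the aligned corpus.
from collections import Counter
from scipy.stats import entropy

def distribution(counts, vocab):
    """Raw counts -> add-one-smoothed probability distribution over vocab."""
    total = sum(counts.get(c, 0) + 1 for c in vocab)
    return [(counts.get(c, 0) + 1) / total for c in vocab]

original = Counter({"but": 120, "because": 40, "although": 10, "since": 15})
translated = Counter({"but": 90, "because": 70, "although": 25, "since": 30})

vocab = sorted(set(original) | set(translated))
p = distribution(translated, vocab)
q = distribution(original, vocab)

# Higher values indicate a larger shift, e.g. connectives made more explicit.
print(f"KL(translated || original) = {entropy(p, q):.4f}")
```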

pdf bib
What’s Hard in English RST Parsing? Predictive Models for Error Analysis
Yang Janet Liu | Tatsuya Aoyama | Amir Zeldes

Despite recent advances in Natural Language Processing (NLP), hierarchical discourse parsing in the framework of Rhetorical Structure Theory remains challenging, and our understanding of the reasons for this is still limited. In this paper, we examine and model some of the factors associated with parsing difficulties in previous work: the existence of implicit discourse relations, challenges in identifying long-distance relations, out-of-vocabulary items, and more. In order to assess the relative importance of these variables, we also release two annotated English test sets with explicit correct and distracting discourse markers associated with gold-standard RST relations. Our results show that, as in shallow discourse parsing, the explicit/implicit distinction plays a role, but that long-distance dependencies are the main challenge, while lack of lexical overlap is less of a problem, at least for in-domain parsing. Our final model is able to predict where errors will occur with an accuracy of 76.3% for the bottom-up parser and 76.6% for the top-down parser.

pdf bib
Grounded Complex Task Segmentation for Conversational Assistants
Rafael Ferreira | David Semedo | Joao Magalhaes

Following complex instructions in conversational assistants can be quite daunting due to the shorter attention and memory spans compared to reading the same instructions. Hence, when conversational assistants walk users through the steps of complex tasks, there is a need to structure the task into manageable pieces of information of the right length and complexity. In this paper, we tackle the recipes domain and convert instructions structured for reading into instructions structured for conversation. We annotated the structure of instructions according to a conversational scenario, which provided insights into what is expected in this setting. To computationally model the conversational steps’ characteristics, we tested various Transformer-based architectures, showing that a token-based approach delivers the best results. A further user study showed that users tend to favor steps of manageable complexity and length, and that the proposed methodology can improve the original web-based instructional text. Specifically, 86% of the evaluated tasks were improved from a conversational suitability point of view.

pdf bib
A Statistical Approach for Quantifying Group Difference in Topic Distributions Using Clinical Discourse Samples
Grace O. Lawley | Peter A. Heeman | Jill K. Dolata | Eric Fombonne | Steven Bedrick

Topic distribution matrices created by topic models are typically used for document classification or as features in a separate machine learning algorithm. Existing methods for evaluating these topic distributions include metrics such as coherence and perplexity; however, there is a lack of statistically grounded evaluation tools. We present a statistical method for investigating group differences in the document-topic distribution vectors created by Latent Dirichlet Allocation (LDA) that uses Aitchison geometry to transform the vectors, multivariate analysis of variance (MANOVA) to compare sample means, and partial eta squared to calculate effect size. Using a corpus of dialogues between Autistic and Typically Developing (TD) children and trained examiners, we found that the topic distributions of Autistic children differed from those of TD children when responding to questions about social difficulties (p = .0083, partial eta squared = .19). Furthermore, the examiners’ topic distributions differed between the Autistic and TD groups when discussing emotions (p = .0035, partial eta squared = .20), social difficulties (p < .001, partial eta squared = .30), and friends (p = .0224, partial eta squared = .17). These results support the use of topic modeling in studying clinically relevant features of social communication such as topic maintenance.
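The statistical recipe in the abstract (Aitchison geometry, then MANOVA, then effect size) can be sketched on synthetic data as below. The document-topic matrix and group labels are simulated, one CLR component is dropped to avoid the collinearity of CLR vectors, and the effect-size remark assumes a one-way, two-group design, so this illustrates the procedure rather than reproducing the study's analysis.

```python
# Sketch: compare LDA document-topic vectors across two groups using the
# centred log-ratio (CLR) transform (Aitchison geometry) and MANOVA.
# The document-topic matrix below is synthetic, for illustration only.
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

rng = np.random.default_rng(0)
n_docs, n_topics = 60, 4
theta = rng.dirichlet([2.0] * n_topics, size=n_docs)   # doc-topic proportions
group = np.repeat(["autistic", "td"], n_docs // 2)

# CLR: log proportions minus the row-wise log geometric mean, which moves
# compositional vectors into unconstrained Euclidean space.
clr = np.log(theta) - np.log(theta).mean(axis=1, keepdims=True)

# CLR rows sum to zero, so drop one component to keep the covariance full rank.
df = pd.DataFrame(clr[:, :-1], columns=["t0", "t1", "t2"])
df["group"] = group

fit = MANOVA.from_formula("t0 + t1 + t2 ~ group", data=df).mv_test()
print(fit)
# For this one-way, two-group design, partial eta squared can be read off
# as Pillai's trace in the printed table (since s = 1).
```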

pdf bib
OpinionConv: Conversational Product Search with Grounded Opinions
Vahid Sadiri Javadi | Martin Potthast | Lucie Flek

When searching for products, the opinions of others play an important role in making informed decisions. Subjective experiences about a product can be a valuable source of information. This is also true in sales conversations, where a customer and a sales assistant exchange facts and opinions about products. However, training an AI for such conversations is complicated by the fact that language models do not possess authentic opinions due to their lack of real-world experience. We address this problem by leveraging product reviews as a rich source of product opinions to ground conversational AI in true subjective narratives. With OpinionConv, we develop the first conversational AI for simulating sales conversations. To validate the generated conversations, we conduct several user studies showing that the generated opinions are perceived as realistic. Our assessors also confirm the importance of opinions as an informative basis for decision making.

pdf bib
Dial-M: A Masking-based Framework for Dialogue Evaluation
Suvodip Dey | Maunendra Sankar Desarkar

In dialogue systems, automatically evaluating machine-generated responses is critical and challenging. Despite the tremendous progress in dialogue generation research, its evaluation heavily depends on human judgments. The standard word-overlapping based evaluation metrics are ineffective for dialogues. As a result, most of the recently proposed metrics are model-based and reference-free, which learn to score different aspects of a conversation. However, understanding each aspect requires a separate model, which makes them computationally expensive. To this end, we propose Dial-M, a Masking-based reference-free framework for Dialogue evaluation. The main idea is to mask the keywords of the current utterance and predict them, given the dialogue history and various conditions (like knowledge, persona, etc.), thereby making the evaluation framework simple and easily extensible for multiple datasets. Regardless of its simplicity, Dial-M achieves comparable performance to state-of-the-art metrics on several dialogue evaluation datasets. We also discuss the interpretability of our proposed metric along with error analysis.
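The masking-and-prediction idea behind Dial-M can be roughed out as follows: mask the keyword tokens of the response and measure how well a masked language model recovers them given the dialogue history. This is only a sketch of the idea, not the authors' implementation; the keyword list, model choice, and scoring by raw masked-LM loss are illustrative assumptions.

```python
# Rough sketch of a masking-based, reference-free dialogue score: mask keyword
# tokens of the response and score how well a masked LM recovers them from the
# context. Keyword choice and model are assumptions, not the authors' setup.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def masked_keyword_loss(history: str, response: str, keywords: list[str]) -> float:
    """Lower loss over masked keywords = response fits the context better."""
    enc = tokenizer(history, response, return_tensors="pt")
    input_ids = enc["input_ids"].clone()
    labels = torch.full_like(input_ids, -100)          # ignore unmasked positions
    keyword_ids = {i for kw in keywords
                   for i in tokenizer(kw, add_special_tokens=False)["input_ids"]}
    for pos, (tok, seg) in enumerate(zip(enc["input_ids"][0], enc["token_type_ids"][0])):
        if seg == 1 and tok.item() in keyword_ids:     # mask only in the response
            labels[0, pos] = tok
            input_ids[0, pos] = tokenizer.mask_token_id
    with torch.no_grad():
        out = model(input_ids=input_ids, attention_mask=enc["attention_mask"], labels=labels)
    return out.loss.item()

history = "I am looking for a cheap hotel in the city centre."
response = "The Alexander B&B is a cheap hotel in the centre."
print(masked_keyword_loss(history, response, ["cheap", "hotel", "centre"]))
```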

pdf bib
From Chatter to Matter: Addressing Critical Steps of Emotion Recognition Learning in Task-oriented Dialogue
Shutong Feng | Nurul Lubis | Benjamin Ruppik | Christian Geishauser | Michael Heck | Hsien-chin Lin | Carel van Niekerk | Renato Vukovic | Milica Gasic

Emotion recognition in conversations (ERC) is a crucial task for building human-like conversational agents. While substantial efforts have been devoted to ERC for chit-chat dialogues, the task-oriented counterpart is largely left unattended. Directly applying chit-chat ERC models to task-oriented dialogues (ToDs) results in suboptimal performance as these models overlook key features such as the correlation between emotions and task completion in ToDs. In this paper, we propose a framework that turns a chit-chat ERC model into a task-oriented one, addressing three critical aspects: data, features and objective. First, we devise two ways of augmenting rare emotions to improve ERC performance. Second, we use dialogue states as auxiliary features to incorporate key information from the goal of the user. Lastly, we leverage a multi-aspect emotion definition in ToDs to devise a multi-task learning objective and a novel emotion-distance weighted loss function. Our framework yields significant improvements for a range of chit-chat ERC models on EmoWOZ, a large-scale dataset for user emotions in ToDs. We further investigate the generalisability of the best resulting model to predict user satisfaction in different ToD datasets. A comparison with supervised baselines shows a strong zero-shot capability, highlighting the potential usage of our framework in wider scenarios.

pdf bib
Analyzing Differences in Subjective Annotations by Participants and Third-party Annotators in Multimodal Dialogue Corpus
Kazunori Komatani | Ryu Takeda | Shogo Okada

Estimating the subjective impressions of human users during a dialogue is necessary when constructing a dialogue system that can respond adaptively to their emotional states. However, such subjective impressions (e.g., how much the user enjoys the dialogue) are inherently ambiguous, and the annotation results provided by multiple annotators do not always agree because they depend on the subjectivity of the annotators. In this paper, we analyzed the annotation results using 13,226 exchanges from 155 participants in a multimodal dialogue corpus called Hazumi that we had constructed, where each exchange was annotated by five third-party annotators. We investigated the agreement between the subjective annotations given by the third-party annotators and the participants themselves, on both per-exchange annotations (i.e., participant’s sentiments) and per-dialogue (-participant) annotations (i.e., questionnaires on rapport and personality traits). We also investigated the conditions under which the annotation results are reliable. Our findings demonstrate that the dispersion of third-party sentiment annotations correlates with agreeableness of the participants, one of the Big Five personality traits.

pdf bib
Frame-oriented Summarization of Argumentative Discussions
Shahbaz Syed | Timon Ziegenbein | Philipp Heinisch | Henning Wachsmuth | Martin Potthast

Online discussions on controversial topics with many participants frequently include hundreds of arguments that cover different framings of the topic. But these arguments and frames are often spread across the various branches of the discussion tree structure. This makes it difficult for interested participants to follow the discussion in its entirety as well as to introduce new arguments. In this paper, we present a new rank-based approach to extractive summarization of online discussions focusing on argumentation frames that capture the different aspects of a discussion. Our approach includes three retrieval tasks to find arguments in a discussion that are (1) relevant to a frame of interest, (2) relevant to the topic under discussion, and (3) informative to the reader. Based on a joint ranking by these three criteria for a set of user-selected frames, our approach allows readers to quickly access an ongoing discussion. We evaluate our approach using a test set of 100 controversial Reddit ChangeMyView discussions, for which the relevance of a total of 1871 arguments was manually annotated.

pdf bib
Towards Multilingual Automatic Open-Domain Dialogue Evaluation
John Mendonca | Alon Lavie | Isabel Trancoso

The main limiting factor in the development of robust multilingual open-domain dialogue evaluation metrics is the lack of multilingual data and the limited availability of open-sourced multilingual dialogue systems. In this work, we propose a workaround for this lack of data by leveraging a strong multilingual pretrained encoder-based language model and augmenting existing English dialogue data using Machine Translation. We empirically show that the naive approach of finetuning a pretrained multilingual encoder model with translated data is insufficient to outperform the strong baseline of finetuning a multilingual model with only source data. Instead, the best approach consists of carefully curating the translated data using MT Quality Estimation metrics, excluding low-quality translations that hinder performance.

pdf bib
Dialog Action-Aware Transformer for Dialog Policy Learning
Huimin Wang | Wai Chung Kwan | Kam-Fai Wong

Recent works usually address dialog policy learning (DPL) by training a reinforcement learning (RL) agent to determine the best dialog action. However, existing works on deep RL require a large volume of agent-user interactions to achieve acceptable performance. In this paper, we propose to make full use of the plain-text knowledge from a pre-trained language model to accelerate the RL agent’s learning speed. Specifically, we design a dialog action-aware transformer encoder (DaTrans), which integrates a new fine-tuning procedure named the masked last action task to encourage DaTrans to be dialog-aware and to distill action-specific features. DaTrans is then further optimized in an RL setting with ongoing interactions and evolves through exploration in the dialog action space toward maximizing long-term accumulated rewards. The effectiveness and efficiency of the proposed model are demonstrated with both simulator evaluation and human evaluation.

pdf bib
The Wizard of Curiosities: Enriching Dialogues with Fun Facts
Frederico Vicente | Rafael Ferreira | David Semedo | Joao Magalhaes

Introducing curiosities in a conversation is a way to teach something new to a person in a pleasant and enjoyable way. Enriching dialogues with contextualized curiosities can improve users’ perception of a dialog system and their overall user experience. In this paper, we introduce a set of curated curiosities targeting dialogues in the cooking and DIY domains. In particular, we use real human-agent conversations collected in the context of the Amazon Alexa TaskBot challenge, a multimodal and multi-turn conversational setting. According to an A/B test with over 1000 conversations, curiosities not only increase user engagement but also provide an average relative rating improvement of 9.7%.

pdf bib
The Road to Quality is Paved with Good Revisions: A Detailed Evaluation Methodology for Revision Policies in Incremental Sequence Labelling
Brielen Madureira | Patrick Kahardipraja | David Schlangen

Incremental dialogue model components produce a sequence of output prefixes based on incoming input. Mistakes can occur due to local ambiguities or to wrong hypotheses, making the ability to revise past outputs a desirable property that can be governed by a policy. In this work, we formalise and characterise edits and revisions in incremental sequence labelling and propose metrics to evaluate revision policies. We then apply our methodology to profile the incremental behaviour of three Transformer-based encoders in various tasks, paving the road for better revision policies.

pdf bib
The effect of conversation type on entrainment: Evidence from laughter
Bogdan Ludusan | Petra Wagner

Entrainment is a phenomenon that occurs across several modalities and at different linguistic levels in conversation. Previous work has shown that its effects may be modulated by conversation extrinsic factors, such as the relation between the interlocutors or the speakers’ traits. The current study investigates the role of conversation type on laughter entrainment. Employing dyadic interaction materials in German, containing two conversation types (free dialogues and task-based interactions), we analyzed three measures of entrainment previously proposed in the literature. The results show that the entrainment effects depend on the type of conversation, with two of the investigated measures being affected by this factor. These findings represent further evidence towards the role of situational aspects as a mediating factor in conversation.

pdf bib
‘What are you referring to?’ Evaluating the Ability of Multi-Modal Dialogue Models to Process Clarificational Exchanges
Javier Chiyah-Garcia | Alessandro Suglia | Arash Eshghi | Helen Hastie

Referential ambiguities arise in dialogue when a referring expression does not uniquely identify the intended referent for the addressee. Addressees usually detect such ambiguities immediately and work with the speaker to repair it using meta-communicative, Clarificational Exchanges (CE): a Clarification Request (CR) and a response. Here, we argue that the ability to generate and respond to CRs imposes specific constraints on the architecture and objective functions of multi-modal, visually grounded dialogue models. We use the SIMMC 2.0 dataset to evaluate the ability of different state-of-the-art model architectures to process CEs, with a metric that probes the contextual updates that arise from them in the model. We find that language-based models are able to encode simple multi-modal semantic information and process some CEs, excelling with those related to the dialogue history, whilst multi-modal models can use additional learning objectives to obtain disentangled object representations, which become crucial to handle complex referential ambiguities across modalities overall.

pdf bib
PGTask: Introducing the Task of Profile Generation from Dialogues
Rui Ribeiro | Joao Paulo Carvalho | Luisa Coheur

Recent approaches have attempted to personalize dialogue systems by leveraging profile information in models. However, this knowledge is scarce and difficult to obtain, which makes the extraction/generation of profile information from dialogues a fundamental asset. To surpass this limitation, we introduce the Profile Generation Task (PGTask). We contribute a new dataset for this problem, comprising profile sentences aligned with related utterances, extracted from a corpus of dialogues. Furthermore, using state-of-the-art methods, we provide a benchmark for profile generation on this novel dataset. Our experiments disclose the challenges of profile generation, and we hope that this introduces a new research direction.

pdf bib
Question Generation to Elicit Users’ Food Preferences by Considering the Semantic Content
Jie Zeng | Yukiko Nakano | Tatsuya Sakato

To obtain a better understanding of user preferences for providing tailored services, dialogue systems have to generate semi-structured interviews that require flexible dialogue control while following a topic guide to accomplish the purpose of the interview. Toward this goal, this study proposes a semantics-aware GPT-3 fine-tuning model that generates interviews to acquire users’ food preferences. The model was trained using dialogue history and a semantic representation constructed from the communicative function and semantic content of the utterance. We conducted a user study for subjective evaluations alongside automatic objective evaluations, comparing against two baseline models: zero-shot ChatGPT and fine-tuned GPT-3. In the user study, the proposed model’s outputs received higher impression ratings than those of the baseline models and were comparable to real human interviews in terms of eliciting the interviewees’ food preferences.

pdf bib
Roll Up Your Sleeves: Working with a Collaborative and Engaging Task-Oriented Dialogue System
Lingbo Mo | Shijie Chen | Ziru Chen | Xiang Deng | Ashley Lewis | Sunit Singh | Samuel Stevens | Chang-You Tai | Zhen Wang | Xiang Yue | Tianshu Zhang | Yu Su | Huan Sun

We introduce TacoBot, a user-centered task-oriented digital assistant designed to guide users through complex real-world tasks with multiple steps. Covering a wide range of cooking and how-to tasks, we aim to deliver a collaborative and engaging dialogue experience. Equipped with language understanding, dialogue management, and response generation components supported by a robust search engine, TacoBot ensures efficient task assistance. To enhance the dialogue experience, we explore a series of data augmentation strategies using LLMs to train advanced neural models continuously. TacoBot builds upon our successful participation in the inaugural Alexa Prize TaskBot Challenge, where our team secured third place among ten competing teams. We offer TacoBot as an open-source framework that serves as a practical example for deploying task-oriented dialogue systems.

pdf bib
Leveraging Large Language Models for Automated Dialogue Analysis
Sarah E. Finch | Ellie S. Paek | Jinho D. Choi

Developing high-performing dialogue systems benefits from the automatic identification of undesirable behaviors in system responses. However, detecting such behaviors remains challenging, as it draws on a breadth of general knowledge and understanding of conversational practices. Although recent research has focused on building specialized classifiers for detecting specific dialogue behaviors, the behavior coverage is still incomplete and there is a lack of testing on real-world human-bot interactions. This paper investigates the ability of a state-of-the-art large language model (LLM), ChatGPT-3.5, to perform dialogue behavior detection for nine categories in real human-bot dialogues. We aim to assess whether ChatGPT can match specialized models and approximate human performance, thereby reducing the cost of behavior detection tasks. Our findings reveal that neither specialized models nor ChatGPT have yet achieved satisfactory results for this task, falling short of human performance. Nevertheless, ChatGPT shows promising potential and often outperforms specialized detection models. We conclude with an in-depth examination of the prevalent shortcomings of ChatGPT, offering guidance for future research to enhance LLM capabilities.

pdf bib
Are Large Language Models All You Need for Task-Oriented Dialogue?
Vojtěch Hudeček | Ondrej Dusek

Instruction-finetuned large language models (LLMs) gained a huge popularity recently, thanks to their ability to interact with users through conversation. In this work, we aim to evaluate their ability to complete multi-turn tasks and interact with external databases in the context of established task-oriented dialogue benchmarks. We show that in explicit belief state tracking, LLMs underperform compared to specialized task-specific models. Nevertheless, they show some ability to guide the dialogue to a successful ending through their generated responses if they are provided with correct slot values. Furthermore, this ability improves with few-shot in-domain examples.

pdf bib
Multi-party Goal Tracking with LLMs: Comparing Pre-training, Fine-tuning, and Prompt Engineering
Angus Addlesee | Weronika Sieińska | Nancie Gunson | Daniel Hernandez Garcia | Christian Dondrup | Oliver Lemon

This paper evaluates the extent to which current LLMs can capture task-oriented multi-party conversations (MPCs). We have recorded and transcribed 29 MPCs between patients, their companions, and a social robot in a hospital. We then annotated this corpus for multi-party goal-tracking and intent-slot recognition. People share goals, answer each other’s goals, and provide other people’s goals in MPCs - none of which occur in dyadic interactions. To understand user goals in MPCs, we compared three methods in zero-shot and few-shot settings: we fine-tuned T5, created pre-training tasks to train DialogLM using LED, and employed prompt engineering techniques with GPT-3.5-turbo, to determine which approach can complete this novel task with limited data. GPT-3.5-turbo significantly outperformed the others in a few-shot setting. The ‘reasoning’ style prompt, when given 7% of the corpus as example annotated conversations, was the best performing method. It correctly annotated 62.32% of the goal tracking MPCs, and 69.57% of the intent-slot recognition MPCs. A ‘story’ style prompt increased model hallucination, which could be detrimental if deployed in safety-critical settings. We conclude that multi-party conversations still challenge state-of-the-art LLMs.

pdf bib
ChatGPT vs. Crowdsourcing vs. Experts: Annotating Open-Domain Conversations with Speech Functions
Lidiia Ostyakova | Veronika Smilga | Kseniia Petukhova | Maria Molchanova | Daniel Kornev

This paper deals with the task of annotating open-domain conversations with speech functions. We propose a semi-automated method for annotating dialogs according to the topic-oriented, multi-layered taxonomy of speech functions, using hierarchical guidelines together with Large Language Models. These guidelines comprise simple questions about the topic and speaker change, sentence types, pragmatic aspects of the utterance, and examples that aid untrained annotators in understanding the taxonomy. We compare the results of dialog annotation performed by experts, crowdsourcing workers, and ChatGPT. To improve the performance of ChatGPT, several experiments utilising different prompt engineering techniques were conducted. We demonstrate that in some cases large language models can achieve human-like performance following a multi-step, tree-like annotation pipeline for complex discourse annotation, which is usually challenging and costly in terms of time and money when performed by humans.

pdf bib
DiactTOD: Learning Generalizable Latent Dialogue Acts for Controllable Task-Oriented Dialogue Systems
Qingyang Wu | James Gung | Raphael Shu | Yi Zhang

Dialogue act annotations are important to improve response generation quality in task-oriented dialogue systems. However, it can be challenging to use dialogue acts to control response generation in a generalizable way because different datasets and tasks may have incompatible annotations. While alternative methods that utilize latent action spaces or reinforcement learning do not require explicit annotations, they may lack interpretability or face difficulties defining task-specific rewards. In this work, we present a novel end-to-end latent dialogue act model (DiactTOD) that represents dialogue acts in a latent space. DiactTOD, when pre-trained on a large corpus, is able to predict and control dialogue acts to generate controllable responses using these latent representations in a zero-shot fashion. Our approach demonstrates state-of-the-art performance across a wide range of experimental settings on the MultiWOZ dataset, including zero-shot, few-shot, and full data fine-tuning with both end-to-end and policy optimization configurations.

pdf bib
Approximating Online Human Evaluation of Social Chatbots with Prompting
Ekaterina Svikhnushina | Pearl Pu

With conversational models becoming increasingly available to the general public, developing scalable and robust evaluation metrics is crucial to minimize potential social and psychological risks for the users. Existing evaluation metrics aim to automate offline user evaluation and approximate human judgment of pre-curated dialogs. However, they are limited in their ability to capture subjective perceptions of users who actually interact with the chatbots and might not generalize to real-world settings. To address this limitation, we propose an approach to approximate online human evaluation, leveraging large language models (LLMs) from the GPT-family. We introduce a new Dialog system Evaluation framework based on Prompting (DEP), which enables a fully automatic evaluation pipeline that replicates live user studies and achieves an impressive correlation with human judgment (up to Pearson r=0.95 on a system level). The DEP approach involves collecting synthetic chat logs of evaluated bots with an LLM in the other-play setting, where the LLM is carefully conditioned to follow a specific scenario. We further explore different prompting approaches to produce evaluation scores with the same LLM. The best-performing prompts, which contain few-shot demonstrations and instructions, show outstanding performance on the tested dataset and demonstrate the ability to generalize to other dialog corpora.

pdf bib
Dialogue Response Generation Using Completion of Omitted Predicate Arguments Based on Zero Anaphora Resolution
Ayaka Ueyama | Yoshinobu Kano

Human conversation attempts to build common ground consisting of shared beliefs, knowledge, and perceptions that form the premise for understanding utterances. Recent deep learning-based dialogue systems use human dialogue data to train a mapping from a dialogue history to responses, but common ground not directly expressed in words makes it difficult to generate coherent responses by learning statistical patterns alone. We propose Dialogue Completion using Zero Anaphora Resolution (DCZAR), a framework that explicitly completes omitted information in the dialogue history and generates responses from the completed dialogue history. In this study, we conducted automatic and human evaluations by applying several pretraining methods and datasets in Japanese in various combinations. Experimental results show that the DCZAR framework contributes to the generation of more coherent and engaging responses.

pdf bib
Syndicom: Improving Conversational Commonsense with Error-Injection and Natural Language Feedback
Christopher Richardson | Anirudh Sundar | Larry Heck

Commonsense reasoning is a critical aspect of human communication. Despite recent advances in conversational AI driven by large language models, commonsense reasoning remains a challenging task. In this work, we introduce Syndicom - a method for improving commonsense in dialogue response generation. Syndicom consists of two components. The first is a dataset of commonsense dialogues created from a knowledge graph and synthesized into natural language. This dataset includes both valid and invalid responses to dialogue contexts, along with natural language feedback (NLF) for the invalid responses. The second is a two-step procedure: training a model to predict natural language feedback for invalid responses, and then training a response generation model conditioned on the predicted NLF, the invalid response, and the dialogue. Syndicom is scalable and does not require reinforcement learning. We evaluate Syndicom on three tasks using a broad range of metrics. Syndicom achieves a relative improvement of 53% over ChatGPT on ROUGE-1, and human evaluators prefer Syndicom over ChatGPT 57% of the time. We will publicly release the code and the full dataset.

pdf bib
“What do others think?”: Task-Oriented Conversational Modeling with Subjective Knowledge
Chao Zhao | Spandana Gella | Seokhwan Kim | Di Jin | Devamanyu Hazarika | Alexandros Papangelis | Behnam Hedayatnia | Mahdi Namazifar | Yang Liu | Dilek Hakkani-Tur

Task-oriented Dialogue (TOD) systems aim to assist users in accomplishing specific goals, such as booking a hotel or a restaurant. Traditional TODs rely on domain-specific APIs/DBs or external factual knowledge to generate responses, which cannot accommodate subjective user requests (e.g., “Is the WiFi reliable?” or “Does the restaurant have a good atmosphere?”). To address this issue, we propose a novel task of subjective-knowledge-based TOD (SK-TOD). We also propose the first corresponding dataset, which contains subjective knowledge-seeking dialogue contexts and manually annotated responses grounded in subjective knowledge sources. When evaluated with existing TOD approaches, we find that this task poses new challenges such as aggregating diverse opinions from multiple knowledge snippets. We hope this task and dataset can promote further research on TOD and subjective content understanding. The code and the dataset are available at https://github.com/alexa/dstc11-track5.

pdf bib
UD_Japanese-CEJC: Dependency Relation Annotation on Corpus of Everyday Japanese Conversation
Mai Omura | Hiroshi Matsuda | Masayuki Asahara | Aya Wakasa

In this study, we developed Universal Dependencies (UD) resources for spoken Japanese in the Corpus of Everyday Japanese Conversation (CEJC). The CEJC is a large corpus of spoken language that encompasses various everyday conversations in Japanese and includes word delimitation and part-of-speech annotation. We newly annotated Long Word Unit delimitation and Bunsetsu (Japanese phrase)-based dependencies, including Bunsetsu boundaries, for the CEJC. The UD resources for Japanese were constructed from the CEJC in accordance with hand-maintained conversion rules, using two types of word delimitation, part-of-speech tags, and Bunsetsu-based syntactic dependency relations. Furthermore, we examined various issues pertaining to the construction of UD in the CEJC by comparing it with a written Japanese corpus and evaluating UD parsing accuracy.

pdf bib
Unravelling Indirect Answers to Wh-Questions: Corpus Construction, Analysis, and Generation
Zulipiye Yusupujiang | Jonathan Ginzburg

Indirect answers, crucial in human communication, serve to maintain politeness, avoid conflicts, and align with social customs. Although there has been a substantial number of studies on recognizing and understanding indirect answers to polar questions (often known as yes/no questions), there is a dearth of such work regarding wh-questions. This study takes up the challenge by constructing what is, to our knowledge, the first corpus of indirect answers to wh-questions. We analyze and interpret indirect answers to different wh-questions based on our carefully compiled corpus. In addition, we conducted a pilot study on generating indirect answers to wh-questions by fine-tuning the pre-trained generative language model DialoGPT (Zhang et al., 2020). Our results suggest this is a task that GPT finds difficult.

pdf bib
A New Dataset for Causality Identification in Argumentative Texts
Khalid Al Khatib | Michael Voelske | Anh Le | Shahbaz Syed | Martin Potthast | Benno Stein

Existing datasets for causality identification in argumentative texts have several limitations, such as the type of input text (e.g., only claims), causality type (e.g., only positive), and the linguistic patterns investigated (e.g., only verb connectives). To resolve these limitations, we build the Webis-Causality-23 dataset, with sophisticated inputs (all units from arguments), a balanced distribution of causality types, and a larger number of linguistic patterns denoting causality. The dataset contains 1485 examples derived by combining the two paradigms of distant supervision and uncertainty sampling to identify diverse, high-quality samples of causality relations and to annotate them in a cost-effective manner.

pdf bib
Controllable Generation of Dialogue Acts for Dialogue Systems via Few-Shot Response Generation and Ranking
Angela Ramirez | Kartik Agarwal | Juraj Juraska | Utkarsh Garg | Marilyn Walker

Dialogue systems need to produce responses that realize multiple types of dialogue acts (DAs) with high semantic fidelity. In the past, natural language generators (NLGs) for dialogue were trained on large parallel corpora that map from a domain-specific DA and its semantic attributes to an output utterance. Recent work shows that large language models (LLMs) offer new possibilities for controllable NLG using prompt-based learning. Here we develop a novel few-shot overgenerate-and-rank approach that achieves controlled generation of DAs. We compare eight few-shot prompt styles, including a novel method of generating from textual pseudo-references using a textual style transfer approach. We develop six automatic ranking functions that identify outputs with both the correct DA and high semantic accuracy at generation time. We test our approach on three domains and four LLMs. To our knowledge, this is the first work on NLG for dialogue that automatically ranks outputs using both DA and attribute accuracy. For completeness, we compare our results to fine-tuned few-shot models trained with 5 to 100 instances per DA. Our results show that several prompt settings achieve perfect DA accuracy and near-perfect semantic accuracy (99.81%), and perform better than few-shot fine-tuning.
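The overgenerate-and-rank loop described above reduces, schematically, to sampling several candidate realizations and keeping the one that best satisfies the dialogue act and its attributes. The sketch below uses a much simpler ranking function (verbatim attribute coverage) than the six rankers in the paper, and hard-codes candidates that would in practice be sampled from an LLM.

```python
# Schematic overgenerate-and-rank: score each candidate realization by how many
# slot values it realizes and keep the best. The coverage ranker is a simple
# stand-in for the paper's DA- and semantic-accuracy ranking functions.
def attribute_coverage(candidate: str, attributes: dict[str, str]) -> float:
    """Fraction of slot values realized verbatim in the candidate."""
    hits = sum(1 for value in attributes.values() if value.lower() in candidate.lower())
    return hits / max(len(attributes), 1)

def overgenerate_and_rank(candidates: list[str], attributes: dict[str, str]) -> str:
    return max(candidates, key=lambda c: attribute_coverage(c, attributes))

attrs = {"name": "Aromi", "food": "Italian", "area": "centre"}
# In practice these would be n sampled LLM completions for one prompt.
candidates = [
    "Aromi serves Italian food.",
    "Aromi is an Italian restaurant in the centre.",
    "There is a nice place in the centre.",
]
print(overgenerate_and_rank(candidates, attrs))   # picks the full-coverage output
```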

pdf bib
Reference Resolution and New Entities in Exploratory Data Visualization: From Controlled to Unconstrained Interactions with a Conversational Assistant
Abari Bhattacharya | Abhinav Kumar | Barbara Di Eugenio | Roderick Tabalba | Jillian Aurisano | Veronica Grosso | Andrew Johnson | Jason Leigh | Moira Zellner

In the context of data visualization, as in other grounded settings, referents are created by the task the agents engage in and are salient because they belong to the shared physical setting. Our focus is on resolving references to visualizations on large displays; crucially, reference resolution is directly involved in the process of creating new entities, namely new visualizations. First, we developed a reference resolution model for a conversational assistant. We trained the assistant on controlled dialogues for data visualizations involving a single user. Second, we ported the conversational assistant including its reference resolution model to a different domain, supporting two users collaborating on a data exploration task. We explore how the new setting affects reference detection and resolution; we compare the performance in the controlled vs unconstrained setting, and discuss the general lessons that we draw from this adaptation.

pdf bib
CONVERSER: Few-shot Conversational Dense Retrieval with Synthetic Data Generation
Chao-Wei Huang | Chen-Yu Hsu | Tsu-Yuan Hsu | Chen-An Li | Yun-Nung Chen

Conversational search provides a natural interface for information retrieval (IR). Recent approaches have demonstrated promising results in applying dense retrieval to conversational IR. However, training dense retrievers requires large amounts of in-domain paired data. This hinders the development of conversational dense retrievers, as abundant in-domain conversations are expensive to collect. In this paper, we propose Converser, a framework for training conversational dense retrievers with at most 6 examples of in-domain dialogues. Specifically, we utilize the in-context learning capability of large language models to generate conversational queries given a passage in the retrieval corpus. Experimental results on conversational retrieval benchmarks OR-QuAC and TREC CAsT 19 show that the proposed Converser achieves comparable performance to fully-supervised models, demonstrating the effectiveness of our proposed framework in few-shot conversational dense retrieval. All source code and generated datasets are available: https://github.com/MiuLab/CONVERSER

pdf bib
Speaker Role Identification in Call Centre Dialogues: Leveraging Opening Sentences and Large Language Models
Minh-Quoc Nghiem | Nichola Roberts | Dmitry Sityaev

This paper addresses the task of speaker role identification in call centre dialogues, focusing on distinguishing between the customer and the agent. We propose a text-based approach that utilises the identification of the agent’s opening sentence as a key feature for role classification. The opening sentence is identified using a model trained through active learning. By combining this information with a large language model, we accurately classify the speaker roles. The proposed approach is evaluated on a dataset of call centre dialogues and achieves 93.61% accuracy. This work contributes to the field by providing an effective solution for speaker role identification in call centre settings, with potential applications in interaction analysis and information retrieval.

pdf bib
Synthesising Personality with Neural Speech Synthesis
Shilin Gao | Matthew P. Aylett | David A. Braude | Catherine Lai

Matching the personality of a conversational agent to the personality of the user can significantly improve the user experience, with many successful examples in text-based chatbots. It is also important for a voice-based system to be able to alter the personality of the speech as perceived by the users. In this pilot study, fifteen voices were rated using the Big Five personality traits. Five content-neutral sentences were chosen for the listening tests. The audio data, together with two rated traits (Extroversion and Agreeableness), were used to train a neural speech synthesiser based on one male and one female voice. The effect of altering the personality trait features was evaluated in a second listening test. Both perceived extroversion and agreeableness in the synthetic voices were affected significantly. The controllable range was limited due to a lack of variance in the source audio data. The perceived personality traits correlated with each other and with the naturalness of the speech. Future work could make a chatbot speak in a voice with a pre-defined or adaptive personality by combining personality synthesis in speech with text-based personality generation.

pdf bib
Prompting, Retrieval, Training: An exploration of different approaches for task-oriented dialogue generation
Gonçalo Raposo | Luisa Coheur | Bruno Martins

Task-oriented dialogue systems need to generate appropriate responses to help fulfill users’ requests. This paper explores different strategies, namely prompting, retrieval, and fine-tuning, for task-oriented dialogue generation. Through a systematic evaluation, we aim to provide valuable insights and guidelines for researchers and practitioners working on developing efficient and effective dialogue systems for real-world applications. Evaluation is performed on the MultiWOZ and Taskmaster-2 datasets, and we test various versions of FLAN-T5, GPT-3.5, and GPT-4 models. Costs associated with running these models are analyzed, and dialogue evaluation is briefly discussed. Our findings suggest that when testing data differs from the training data, fine-tuning may decrease performance, favoring a combination of a more general language model and a prompting mechanism based on retrieved examples.

pdf bib
Bootstrapping a Conversational Guide for Colonoscopy Prep
Pulkit Arya | Madeleine Bloomquist | Subhankar Chakraborty | Andrew Perrault | William Schuler | Eric Fosler-Lussier | Michael White

Creating conversational systems for niche domains is a challenging task, further exacerbated by a lack of quality datasets. We explore the construction of safer conversational systems for guiding patients in preparing for colonoscopies. This has required a data generation pipeline to generate a minimum viable dataset to bootstrap a semantic parser, augmented by automatic paraphrasing. Our study suggests large language models (e.g., GPT-3.5 and GPT-4) are a viable alternative to crowd sourced paraphrasing, but conversational systems that rely upon language models’ ability to do temporal reasoning struggle to provide accurate responses. A neural-symbolic system that performs temporal reasoning on an intermediate representation of user queries shows promising results compared to an end-to-end dialogue system, improving the number of correct responses while vastly reducing the number of incorrect or misleading ones.

pdf bib
Applying Item Response Theory to Task-oriented Dialogue Systems for Accurately Determining User’s Task Success Ability
Ryu Hirai | Ao Guo | Ryuichiro Higashinaka

While task-oriented dialogue systems have improved, not all users can fully accomplish their tasks. Users with limited knowledge about the system may experience dialogue breakdowns or fail to achieve their tasks because they do not know how to interact with the system. For addressing this issue, it would be desirable to construct a system that can estimate the user’s task success ability and adapt to that ability. In this study, we propose a method that estimates this ability by applying item response theory (IRT), commonly used in education for estimating examinee abilities, to task-oriented dialogue systems. Through experiments predicting the probability of a correct answer to each slot by using the estimated task success ability, we found that the proposed method significantly outperformed baselines.
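The use of item response theory sketched in the abstract can be illustrated with a small Rasch-style (1-parameter logistic) model: each user has an ability, each slot a difficulty, and the probability of answering a slot correctly is a logistic function of their difference. The response matrix below is simulated and the joint maximum-likelihood fit is a simplification of how IRT parameters are usually estimated.

```python
# Sketch: estimate users' task success ability with a Rasch (1PL) IRT model,
# P(correct) = sigmoid(ability - difficulty). Data are simulated; the joint
# ML fit below is a simplification (abilities are identified only up to a shift).
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic sigmoid

rng = np.random.default_rng(1)
n_users, n_slots = 30, 8
true_ability = rng.normal(size=n_users)
true_difficulty = rng.normal(size=n_slots)
responses = rng.binomial(1, expit(true_ability[:, None] - true_difficulty[None, :]))

def neg_log_lik(params):
    ability, difficulty = params[:n_users], params[n_users:]
    p = expit(ability[:, None] - difficulty[None, :])
    eps = 1e-9
    return -np.sum(responses * np.log(p + eps) + (1 - responses) * np.log(1 - p + eps))

result = minimize(neg_log_lik, x0=np.zeros(n_users + n_slots), method="L-BFGS-B")
estimated_ability = result.x[:n_users]
# The estimated ability could then drive how the dialogue system adapts to the user.
print("estimated ability of user 0:", round(float(estimated_ability[0]), 2))
```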

pdf bib
An Open-Domain Avatar Chatbot by Exploiting a Large Language Model
Takato Yamazaki | Tomoya Mizumoto | Katsumasa Yoshikawa | Masaya Ohagi | Toshiki Kawamoto | Toshinori Sato

With the ambition to create avatars capable of human-level casual conversation, we developed an open-domain avatar chatbot, situated in a virtual reality environment, that employs a large language model (LLM). Introducing the LLM posed several challenges for multimodal integration, such as developing techniques to align diverse outputs and avatar control, as well as addressing the issue of slow generation speed. To address these challenges, we integrated various external modules into our system. Our system is based on the award-winning model from the Dialogue System Live Competition 5. Through this work, we hope to stimulate discussions within the research community about the potential and challenges of multimodal dialogue systems enhanced with LLMs.

pdf bib
Learning Multimodal Cues of Children’s Uncertainty
Qi Cheng | Mert Inan | Rahma Mbarki | Grace Grmek | Theresa Choi | Yiming Sun | Kimele Persaud | Jenny Wang | Malihe Alikhani

Understanding uncertainty plays a critical role in achieving common ground (Clark et al., 1983). This is especially important for multimodal AI systems that collaborate with users to solve a problem or guide the user through a challenging concept. In this work, for the first time, we present a dataset annotated in collaboration with developmental and cognitive psychologists for the purpose of studying nonverbal cues of uncertainty. We then present an analysis of the data, studying different roles of uncertainty and its relationship with task difficulty and performance. Lastly, we present a multimodal machine learning model that can predict uncertainty given a real-time video clip of a participant, which we find improves upon a baseline multimodal transformer model. This work informs research on cognitive coordination between human-human and human-AI and has broad implications for gesture understanding and generation. The anonymized version of our data and code will be publicly available upon the completion of the required consent forms and data sheets.

pdf bib
Grounding Description-Driven Dialogue State Trackers with Knowledge-Seeking Turns
Alexandru Coca | Bo-Hsiang Tseng | Jinghong Chen | Weizhe Lin | Weixuan Zhang | Tisha Anders | Bill Byrne

Schema-guided dialogue state trackers can generalise to new domains without further training, yet they are sensitive to the writing style of the schemata. Augmenting the training set with human or synthetic schema paraphrases improves the model robustness to these variations but can be either costly or difficult to control. We propose to circumvent these issues by grounding the state tracking model in knowledge-seeking turns collected from the dialogue corpus as well as the schema. Including these turns in prompts during finetuning and inference leads to marked improvements in model robustness, as demonstrated by large average joint goal accuracy and schema sensitivity improvements on SGD and SGD-X.

pdf bib
Resolving References in Visually-Grounded Dialogue via Text Generation
Bram Willemsen | Livia Qian | Gabriel Skantze

Vision-language models (VLMs) have been shown to be effective at image retrieval based on simple text queries, but text-image retrieval based on conversational input remains a challenge. Consequently, if we want to use VLMs for reference resolution in visually-grounded dialogue, the discourse processing capabilities of these models need to be augmented. To address this issue, we propose fine-tuning a causal large language model (LLM) to generate definite descriptions that summarize coreferential information found in the linguistic context of references. We then use a pretrained VLM to identify referents based on the generated descriptions, zero-shot. We evaluate our approach on a manually annotated dataset of visually-grounded dialogues and achieve results that, on average, exceed the performance of the baselines we compare against. Furthermore, we find that using referent descriptions based on larger context windows has the potential to yield higher returns.
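The abstract does not name the pretrained VLM; the sketch below illustrates the zero-shot referent-ranking step with CLIP via Hugging Face transformers, taking a generated definite description and a list of candidate image crops. The model checkpoint, function name, and example description are all assumptions for illustration only.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rank_referents(description: str, candidate_images: list):
    """Score candidate referent crops against a generated definite description."""
    inputs = processor(text=[description], images=candidate_images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_text   # (1, num_candidates)
    return logits.softmax(dim=-1).squeeze(0)

# description = "the red mug on the left shelf"   # produced by the fine-tuned LLM
# scores = rank_referents(description, crops)     # pick argmax as the referent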

pdf bib
Slot Induction via Pre-trained Language Model Probing and Multi-level Contrastive Learning
Hoang Nguyen | Chenwei Zhang | Ye Liu | Philip Yu

Recent advanced methods in Natural Language Understanding for Task-oriented Dialogue (TOD) Systems (e.g., intent detection and slot filling) require a large amount of annotated data to achieve competitive performance. In reality, token-level annotations (slot labels) are time-consuming and difficult to acquire. In this work, we study the Slot Induction (SI) task, whose objective is to induce slot boundaries without explicit knowledge of token-level slot annotations. We propose leveraging Unsupervised Pre-trained Language Model (PLM) Probing and a Contrastive Learning mechanism to exploit (1) unsupervised semantic knowledge extracted from the PLM, and (2) additional sentence-level intent label signals available from TOD. Our approach is shown to be effective in the SI task and capable of bridging the gap with token-level supervised models on two NLU benchmark datasets. When generalized to emerging intents, our SI objectives also provide enhanced slot label representations, leading to improved performance on the Slot Filling task.

pdf bib
The timing bottleneck: Why timing and overlap are mission-critical for conversational user interfaces, speech recognition and dialogue systems
Andreas Liesenfeld | Alianda Lopez | Mark Dingemanse

Speech recognition systems are a key intermediary in voice-driven human-computer interaction. Although speech recognition works well for pristine monologic audio, real-life use cases in open-ended interactive settings still present many challenges. We argue that timing is mission-critical for dialogue systems, and evaluate 5 major commercial ASR systems for their conversational and multilingual support. We find that word error rates for natural conversational data in 6 languages remain abysmal, and that overlap remains a key challenge (study 1). This impacts especially the recognition of conversational words (study 2), and in turn has dire consequences for downstream intent recognition (study 3). Our findings help to evaluate the current state of conversational ASR, contribute towards multidimensional error analysis and evaluation, and identify the phenomena that need the most attention on the way to building robust interactive speech technologies.
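As a point of reference for the reported word error rates, a minimal WER computation with the jiwer library looks like this; the reference and hypothesis strings are invented, not drawn from the study's data.

import jiwer

reference  = "yeah I mean we could do that couldn't we"
hypothesis = "yeah mean we could do that couldn't"

# Word error rate = (substitutions + deletions + insertions) / reference length.
print(jiwer.wer(reference, hypothesis))   # two deletions out of nine words, about 0.22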

pdf bib
Enhancing Task Bot Engagement with Synthesized Open-Domain Dialog
Miaoran Li | Baolin Peng | Michel Galley | Jianfeng Gao | Zhu (Drew) Zhang

The construction of dialog systems for various types of conversations, such as task-oriented dialog (TOD) and open-domain dialog (ODD), has been an active area of research. In order to more closely mimic human-like conversations that often involve the fusion of different dialog modes, it is important to develop systems that can effectively handle both TOD and ODD and access different knowledge sources. In this work, we present a new automatic framework to enrich TODs with synthesized ODDs. We also introduce the PivotBot model, which is capable of handling both TOD and ODD modes and can access different knowledge sources to generate informative responses. Evaluation results indicate the superior ability of the proposed model to switch smoothly between TOD and ODD tasks.

pdf bib
Enhancing Performance on Seen and Unseen Dialogue Scenarios using Retrieval-Augmented End-to-End Task-Oriented System
Jianguo Zhang | Stephen Roller | Kun Qian | Zhiwei Liu | Rui Meng | Shelby Heinecke | Huan Wang | Silvio Savarese | Caiming Xiong

End-to-end task-oriented dialogue (TOD) systems have achieved promising performance by leveraging sophisticated natural language understanding and natural language generation capabilities of pre-trained models. This work enables the TOD systems with more flexibility through a simple cache. The cache provides the flexibility to dynamically update the TOD systems and handle both existing and unseen dialogue scenarios. Towards this end, we first fine-tune a retrieval module to effectively retrieve the most relevant information entries from the cache. We then train end-to-end TOD models that can refer to and ground on both dialogue history and retrieved information during TOD generation. The introduced cache is straightforward to construct, and the backbone models of TOD systems are compatible with existing pre-trained generative models. Extensive experiments demonstrate the superior performance of our framework, with a notable improvement in non-empty joint goal accuracy by 6.7% compared to strong baselines.
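The abstract does not specify how the retrieval module is implemented; the sketch below shows one generic way to retrieve cache entries by embedding similarity. The encoder, cache contents, and query are illustrative assumptions, not taken from the paper.

import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical cache of information entries a TOD model might ground on.
cache = [
    "The Bridge Guest House has free wifi and parking.",
    "Curry Garden serves Indian food in the centre, moderate price range.",
    "Trains to London Kings Cross leave Cambridge every hour.",
]
cache_emb = encoder.encode(cache, normalize_embeddings=True)

def retrieve(dialogue_history: str, k: int = 2):
    """Return the k cache entries most similar to the dialogue history."""
    q = encoder.encode([dialogue_history], normalize_embeddings=True)
    scores = (cache_emb @ q.T).squeeze(-1)        # cosine similarity
    top = np.argsort(-scores)[:k]
    return [cache[i] for i in top]

print(retrieve("I need a guesthouse with free parking."))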

pdf bib
Transformer-based Multi-Party Conversation Generation using Dialogue Discourse Acts Planning
Alexander Chernyavskiy | Dmitry Ilvovsky

Recent transformer-based approaches to multi-party conversation generation may produce syntactically coherent but discursively inconsistent dialogues in some cases. To address this issue, we propose an approach to integrate a dialogue act planning stage into the end-to-end transformer-based generation pipeline. This approach consists of a transformer fine-tuning procedure based on linearized dialogue representations that include special discourse tokens. The obtained results demonstrate that incorporating discourse tokens into training sequences is sufficient to significantly improve dialogue consistency and overall generation quality. The suggested approach performs well, including for automatically annotated data. Apart from that, it is observed that increasing the weight of the discourse planning task in the loss function accelerates learning convergence.

pdf bib
Incorporating Annotator Uncertainty into Representations of Discourse Relations
S. Magalí López Cortez | Cassandra L. Jacobs

Annotation of discourse relations is a known difficult task, especially for non-expert annotators. In this paper, we investigate novice annotators’ uncertainty on the annotation of discourse relations on spoken conversational data. We find that dialogue context (single turn, pair of turns within speaker, and pair of turns across speakers) is a significant predictor of confidence scores. We compute distributed representations of discourse relations from co-occurrence statistics that incorporate information about confidence scores and dialogue context. We perform a hierarchical clustering analysis using these representations and show that weighting discourse relation representations with information about confidence and dialogue context coherently models our annotators’ uncertainty about discourse relation labels.

pdf bib
Investigating the Representation of Open Domain Dialogue Context for Transformer Models
Vishakh Padmakumar | Behnam Hedayatnia | Di Jin | Patrick Lange | Seokhwan Kim | Nanyun Peng | Yang Liu | Dilek Hakkani-Tur

The bulk of work adapting transformer models to open-domain dialogue represents dialogue context as the concatenated set of turns in natural language. However, it is unclear if this is the best approach. In this work, we investigate this question by means of an empirical controlled experiment varying the dialogue context format, from text-only formats (all recent utterances, summaries, selected utterances) to variants that are more structurally different (triples, AMR). We compare these formats based on fine-tuned model performance on two downstream tasks: knowledge selection and response generation. We find that simply concatenating the utterances works as a strong baseline in most cases, but is outperformed in longer contexts by a hybrid approach of combining a summary of the context with recent utterances. Through empirical analysis, our work highlights the need to examine the format of context representation and offers recommendations on adapting general-purpose language models to dialogue tasks.

pdf bib
C3: Compositional Counterfactual Contrastive Learning for Video-grounded Dialogues
Hung Le | Nancy Chen | Steven C.H. Hoi

Video-grounded dialogue systems aim to integrate video understanding and dialogue understanding to generate responses that are relevant to both the dialogue and video context. Most existing approaches employ deep learning models and have achieved remarkable performance, given the relatively small datasets available. However, the results are partially accomplished by exploiting biases in the datasets rather than developing multimodal reasoning, resulting in limited generalization. In this paper, we propose a novel approach of Compositional Counterfactual Contrastive Learning (C3) to develop contrastive training between factual and counterfactual samples in video-grounded dialogues. Specifically, we design factual/counterfactual samples based on the temporal steps in videos and tokens in dialogues and propose contrastive loss functions that exploit object-level or action-level variance. Different from prior approaches, we focus on contrastive hidden state representations among compositional output tokens to optimize the representation space in a generation setting. We achieved promising performance gains on the Audio-Visual Scene-Aware Dialogues (AVSD) benchmark and showed the benefits of our approach in grounding video and dialogue context.
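C3's actual objectives contrast object-level and action-level variance over hidden states; the snippet below shows only the generic InfoNCE building block such contrastive training typically rests on. The tensor names and temperature are hypothetical, for illustration only.

import torch
import torch.nn.functional as F

def contrastive_hidden_state_loss(anchor, positive, negatives, tau=0.1):
    """InfoNCE-style loss pulling factual-sample hidden states together and
    pushing counterfactual ones apart.
    anchor:    (d,)    hidden state of a generated token (factual input)
    positive:  (d,)    hidden state of the same token under a factual variant
    negatives: (n, d)  hidden states under counterfactual variants
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    pos = (anchor * positive).sum() / tau      # scalar similarity to the positive
    neg = (negatives @ anchor) / tau           # (n,) similarities to negatives
    logits = torch.cat([pos.unsqueeze(0), neg])
    # The positive sits at index 0, so the target class is 0.
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))

# loss = contrastive_hidden_state_loss(h_factual, h_factual_variant, h_counterfactual)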

pdf bib
No that’s not what I meant: Handling Third Position Repair in Conversational Question Answering
Vevake Balaraman | Arash Eshghi | Ioannis Konstas | Ioannis Papaioannou

The ability to handle miscommunication is crucial to robust and faithful conversational AI. People usually deal with miscommunication immediately as they detect it, using highly systematic interactional mechanisms called repair. One important type of repair is Third Position Repair (TPR) whereby a speaker is initially misunderstood but then corrects the misunderstanding as it becomes apparent after the addressee’s erroneous response. Here, we collect and publicly release REPAIR-QA, the first large dataset of TPRs in a conversational question answering (QA) setting. The data comprises the TPR turns, corresponding dialogue contexts, and candidate repairs of the original turn for execution of TPRs. We demonstrate the usefulness of the data by training and evaluating strong baseline models for executing TPRs. For stand-alone TPR execution, we perform both automatic and human evaluations on a fine-tuned T5 model, as well as OpenAI’s GPT-3 LLMs. Additionally, we extrinsically evaluate the LLMs’ TPR processing capabilities in the downstream conversational QA task. The results indicate poor out-of-the-box performance on TPRs by the GPT-3 models, which then significantly improves when exposed to REPAIR-QA.

pdf bib
When to generate hedges in peer-tutoring interactions
Alafate Abulimiti | Chloé Clavel | Justine Cassell

This paper explores the application of machine learning techniques to predict where hedging occurs in peer-tutoring interactions. The study uses a naturalistic face-to-face dataset annotated for natural language turns, conversational strategies, tutoring strategies, and nonverbal behaviors. These elements are processed into a vector representation of the previous turns, which serves as input to several machine learning models, including MLP and LSTM. The results show that embedding layers, capturing the semantic information of the previous turns, significantly improve the models’ performance. Additionally, the study provides insights into the importance of various features, such as interpersonal rapport and nonverbal behaviors, in predicting hedges by using Shapley values for feature explanation. We discover that the eye gaze of both the tutor and the tutee has a significant impact on hedge prediction. We further validate this observation through a follow-up ablation study.

pdf bib
PaperPersiChat: Scientific Paper Discussion Chatbot using Transformers and Discourse Flow Management
Alexander Chernyavskiy | Max Bregeda | Maria Nikiforova

The rate of scientific publications is increasing exponentially, necessitating a significant investment of time in order to read and comprehend the most important articles. While ancillary services exist to facilitate this process, they are typically closed-model and paid services or have limited capabilities. In this paper, we present PaperPersiChat, an open chatbot system designed for the discussion of scientific papers. This system supports summarization and question-answering modes within a single end-to-end chatbot pipeline, which is guided by discourse analysis. To expedite the development of similar systems, we also release the gathered dataset, which has no publicly available analogues.

pdf bib
FurChat: An Embodied Conversational Agent using LLMs, Combining Open and Closed-Domain Dialogue with Facial Expressions
Neeraj Cherakara | Finny Varghese | Sheena Shabana | Nivan Nelson | Abhiram Karukayil | Rohith Kulothungan | Mohammed Afil Farhan | Birthe Nesset | Meriam Moujahid | Tanvi Dinkar | Verena Rieser | Oliver Lemon

We demonstrate an embodied conversational agent that can function as a receptionist and generate a mixture of open and closed-domain dialogue along with facial expressions, by using a large language model (LLM) to develop an engaging conversation. We deployed the system onto a Furhat robot, which is highly expressive and capable of using both verbal and nonverbal cues during interaction. The system was designed specifically for the National Robotarium to interact with visitors through natural conversations, providing them with information about the facilities, research, news, upcoming events, etc. The system utilises the state-of-the-art GPT-3.5 model to generate such information along with domain-general conversations and facial expressions based on prompt engineering.

pdf bib
Towards Breaking the Self-imposed Filter Bubble in Argumentative Dialogues
Annalena Aicher | Daniel Kornmueller | Yuki Matsuda | Stefan Ultes | Wolfgang Minker | Keiichi Yasumoto

Human users tend to selectively ignore information that contradicts their pre-existing beliefs or opinions in their process of information seeking. These “self-imposed filter bubbles” (SFB) pose a significant challenge for cooperative argumentative dialogue systems aiming to build an unbiased opinion and a better understanding of the topic at hand. To address this issue, we develop a strategy for overcoming users’ SFB within the course of the interaction. By continuously modeling the user’s position in relation to the SFB, we are able to identify the respective arguments which maximize the probability of getting outside the SFB and present them to the user. We implemented this approach in an argumentative dialogue system and evaluated it in a laboratory user study with 60 participants to show its validity and applicability. The findings suggest that the strategy was successful in breaking users’ SFBs and promoting a more reflective and comprehensive discussion of the topic.

pdf bib
The Open-domain Paradox for Chatbots: Common Ground as the Basis for Human-like Dialogue
Gabriel Skantze | A. Seza Doğruöz

There is a surge in interest in the development of open-domain chatbots, driven by the recent advancements of large language models. The “openness” of the dialogue is expected to be maximized by providing minimal information to the users about the common ground they can expect, including the presumed joint activity. However, evidence suggests that the effect is the opposite. Asking users to “just chat about anything” results in a very narrow form of dialogue, which we refer to as the “open-domain paradox”. In this position paper, we explain this paradox through the theory of common ground as the basis for human-like communication. Furthermore, we question the assumptions behind open-domain chatbots and identify paths forward for enabling common ground in human-computer dialogue.

pdf bib
MERCY: Multiple Response Ranking Concurrently in Realistic Open-Domain Conversational Systems
Sarik Ghazarian | Behnam Hedayatnia | Di Jin | Sijia Liu | Nanyun Peng | Yang Liu | Dilek Hakkani-Tur

Automatic Evaluation (AE) and Response Selection (RS) models assign quality scores to various candidate responses and rank them in conversational setups. Prior response ranking research compares various models’ performance on synthetically generated test sets. In this work, we investigate the performance of model-based reference-free AE and RS models on our constructed response ranking datasets that mirror real-case scenarios of ranking candidates during inference time. The metrics’ unsatisfactory performance can be interpreted as low generalizability to more pragmatic conversational domains such as human-chatbot dialogs. To alleviate this issue we propose a novel RS model called MERCY that simulates human behavior in selecting the best candidate by taking into account distinct candidates concurrently and learns to rank them. In addition, MERCY leverages natural language feedback as another component to help the ranking task by explaining why each candidate response is relevant/irrelevant to the dialog context. This feedback is generated by prompting large language models in a few-shot setup. Our experiments show that MERCY outperforms baselines on the response ranking task in our curated realistic datasets.

pdf bib
Empathetic Response Generation for Distress Support
Anuradha Welivita | Chun-Hung Yeh | Pearl Pu

AI-driven chatbots are seen as an attractive solution to support people undergoing emotional distress. One of the main components of such a chatbot is the ability to empathize with the user. But a significant limitation in achieving this goal is the lack of a large dialogue dataset containing empathetic support for those undergoing distress. In this work, we curate a large-scale dialogue dataset that contains ≈1.3M peer support dialogues spanning more than 4K distress-related topics. We analyze the empathetic characteristics of this dataset using statistical and visual means. To demonstrate the utility of this dataset, we train four baseline neural dialogue models that can respond empathetically to distress prompts. Two of the baselines adapt existing architectures and the other two incorporate a framework identifying levels of cognitive and emotional empathy in responses. Automatic and human evaluation of these models validate the utility of the dataset in generating empathetic responses for distress support and show that identifying levels of empathy in peer-support responses facilitates generating responses that are lengthier, richer in empathy, and closer to the ground truth.

pdf bib
Reasoning before Responding: Integrating Commonsense-based Causality Explanation for Empathetic Response Generation
Yahui Fu | Koji Inoue | Chenhui Chu | Tatsuya Kawahara

Recent approaches to empathetic response generation try to incorporate commonsense knowledge or reasoning about the causes of emotions to better understand the user’s experiences and feelings. However, these approaches mainly focus on understanding the causalities of context from the user’s perspective, ignoring the system’s perspective. In this paper, we propose a commonsense-based causality explanation approach for diverse empathetic response generation that considers both the user’s perspective (user’s desires and reactions) and the system’s perspective (system’s intentions and reactions). We enhance ChatGPT’s ability to reason for the system’s perspective by integrating in-context learning with commonsense knowledge. Then, we integrate the commonsense-based causality explanation with both ChatGPT and a T5-based model. Experimental evaluations demonstrate that our method outperforms other comparable methods on both automatic and human evaluations.

up

bib (full)
Proceedings of the 1st Workshop on CounterSpeech for Online Abuse (CS4OA)

pdf bib
From Generic to Personalized: Investigating Strategies for Generating Targeted Counter Narratives against Hate Speech
Mekselina Doğanç | Ilia Markov

The spread of hate speech (HS) in the digital age poses significant challenges, with online platforms becoming breeding grounds for harmful content. While many natural language processing (NLP) studies have focused on identifying hate speech, few have explored the generation of counter narratives (CNs) as a means to combat it. Previous studies have shown that computational models often generate CNs that are dull and generic, and therefore do not resonate with hate speech authors. In this paper, we explore the personalization capabilities of computational models for generating more targeted and engaging CNs. We investigate various strategies for incorporating author profiling information into GPT-2 and GPT-3.5 models to enhance the personalization of CNs to combat online hate speech, focusing in particular on the age and gender information of HS authors when tailoring CNs targeted at HS spreaders. We discuss the challenges, opportunities, and future directions for incorporating user profiling information into CN interventions.

pdf bib
Weigh Your Own Words: Improving Hate Speech Counter Narrative Generation via Attention Regularization
Helena Bonaldi | Giuseppe Attanasio | Debora Nozza | Marco Guerini

Recent computational approaches for combating online hate speech involve the automatic generation of counter narratives by adapting Pretrained Transformer-based Language Models (PLMs) with human-curated data. This process, however, can lead to in-domain overfitting, resulting in models generating acceptable narratives only for hatred similar to training data, with little portability to other targets or to real-world toxic language. This paper introduces novel attention regularization methodologies to improve the generalization capabilities of PLMs for counter narrative generation. Overfitting to training-specific terms is then discouraged, resulting in more diverse and richer narratives. We experiment with two attention-based regularization techniques on a benchmark English dataset. Regularized models produce better counter narratives than state-of-the-art approaches in most cases, both in terms of automatic metrics and human evaluation, especially when hateful targets are not present in the training data. This work paves the way for better and more flexible counter-speech generation models, a task for which datasets are highly challenging to produce.
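The specific regularization techniques are not described in the abstract; as an illustration of the general idea, the sketch below penalizes low-entropy self-attention distributions so that no head over-commits to a handful of training-specific tokens. The penalty coefficient and the way it is combined with the LM loss are assumptions, not the paper's method.

import torch

def attention_entropy_penalty(attentions, eps=1e-9):
    """Mean negative entropy of self-attention distributions.
    `attentions`: tuple of (batch, heads, seq, seq) tensors, e.g. from a
    Hugging Face model called with output_attentions=True. Minimizing this
    penalty pushes attention weights toward higher entropy (more spread out)."""
    penalty = 0.0
    for attn in attentions:
        entropy = -(attn * (attn + eps).log()).sum(dim=-1)   # (batch, heads, seq)
        penalty = penalty + (-entropy.mean())
    return penalty / len(attentions)

# Hypothetical usage during fine-tuning:
# outputs = model(**batch, labels=batch["input_ids"], output_attentions=True)
# loss = outputs.loss + 0.01 * attention_entropy_penalty(outputs.attentions)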

pdf bib
Distilling Implied Bias from Hate Speech for Counter Narrative Selection
Nami Akazawa | Serra Sinem Tekiroğlu | Marco Guerini

Hate speech is a critical problem in our society and social media platforms are often an amplifier for this phenomenon. Recently the use of Counter Narratives (informative and non-aggressive responses) has been proposed as a viable solution to counter hateful content that goes beyond simple detection-removal strategies. In this paper we present a novel approach along this line of research, which utilizes the implied statement (bias) expressed in the hate speech to retrieve an appropriate counter narrative. To this end, we first trained and tested several LMs that, given a hateful post, generate the underlying bias and the target group. Then, for the counter narrative selection task, we experimented with several methodologies that either use or do not use the implied bias during the process. Experiments show that using the target group information allows the system to better focus on relevant content, and that using the implied statement to select counter narratives is better than the corresponding standard approach that does not use it. To our knowledge, this is the first attempt to build an automatic selection tool that uses hate speech implied bias to drive Counter Narrative selection.

pdf bib
Just Collect, Don’t Filter: Noisy Labels Do Not Improve Counterspeech Collection for Languages Without Annotated Resources
Pauline Möhle | Matthias Orlikowski | Philipp Cimiano

Counterspeech on social media is rare. Consequently, it is difficult to collect naturally occurring examples, in particular for languages without annotated datasets. In this work, we study methods to increase the relevance of social media samples for counterspeech annotation when we lack annotated resources. We use the example of sourcing German data for counterspeech annotations from Twitter. We monitor tweets from German politicians and activists to collect replies. To select relevant replies we a) find replies that match German abusive keywords or b) label replies for counterspeech using a multilingual classifier fine-tuned on English data. For both approaches and a baseline setting, we annotate a random sample and use bootstrap sampling to estimate the amount of counterspeech. We find that neither the multilingual model nor the keyword approach achieves significantly higher counts of true counterspeech than the baseline. Thus, keyword lists or multilingual classifiers are likely not worth the added complexity beyond purposive data collection: already without additional filtering, we gather a meaningful sample with 7.4% true counterspeech.
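A bootstrap estimate of the counterspeech rate can be reproduced with a few lines of NumPy; the sample size and label counts below are invented (chosen only so the mean echoes the reported 7.4%), not the study's data.

import numpy as np

rng = np.random.default_rng(0)

# 1 = annotated as true counterspeech, 0 = not (hypothetical annotated sample).
labels = np.array([0] * 463 + [1] * 37)

# Resample with replacement many times and look at the spread of the estimate.
boot = [rng.choice(labels, size=labels.size, replace=True).mean()
        for _ in range(10_000)]
low, high = np.percentile(boot, [2.5, 97.5])
print(f"counterspeech rate ≈ {labels.mean():.1%} (95% CI {low:.1%}–{high:.1%})")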

pdf bib
What Makes Good Counterspeech? A Comparison of Generation Approaches and Evaluation Metrics
Yi Zheng | Björn Ross | Walid Magdy

Counterspeech has been proposed as a solution to the proliferation of online hate. Research has shown that natural language processing (NLP) approaches could generate such counterspeech automatically, but there are competing ideas for how NLP models might be used for this task and a variety of evaluation metrics whose relationship to one another is unclear. We test three different approaches and collect ratings of the generated counterspeech for 1,740 tweet-participant pairs to systematically compare the counterspeech on three aspects: quality, effectiveness and user preferences. We examine which model performs best at which metric and which aspects of counterspeech predict user preferences. A free-form text generation approach using ChatGPT performs the most consistently well, though its generations are occasionally unspecific and repetitive. In our experiment, participants’ preferences for counterspeech are predicted by the quality of the counterspeech, not its perceived effectiveness. The results can help future research approach counterspeech evaluation more systematically.

up

bib (full)
Proceedings of The Eleventh Dialog System Technology Challenge

pdf bib
Exploring Prompt-based Multi-task Learning for Multimodal Dialog State Tracking and Immersive Multimodal Conversation
Yirong Chen | Ya Li | Tao Wang | Xiaofen Xing | Xiangmin Xu | Quan Liu | Cong Liu | Guoping Hu

With the rise of the metaverse, immersive multimodal conversation has attracted more and more researchers’ attention. Multimodal contexts will become more important for human-computer interaction in the metaverse, especially in the shopping domain. Unlike traditional conversation tasks, immersive multimodal conversation poses challenges such as multimodal ambiguous candidate identification and multimodal coreference resolution, which make dialog state tracking and response generation more difficult, as described in the SIMMC 2.1 challenge, a part of DSTC11. In particular, as the number of objects in the scene increases, the difficulty increases dramatically. We propose a prompt-based multi-task learning Encoder-Decoder, in which different subtasks use different prompts so that the model focuses on the current subtask. We achieved first place in ambiguous candidate identification and runner-up in multimodal coreference resolution (MM-Coref), multimodal dialog state tracking (MM-DST) and assistant response generation. Our code and model are made publicly available at https://github.com/scutcyr/dstc11-simmc2.1-scut-bds-lab.

pdf bib
Multi-Task Learning for Ambiguous Candidate Identification with Pre-trained Model
Daesik Jang | Hyewon Choi

Recently, research using multimodal datasets containing image and text information has been conducted actively. One of them is the SIMMC2.1 dataset. It is a more complicated dataset than text-only conversational question answering because the model must understand the relationship between images and text before predicting an answer. Therefore, there are limitations to answering such conversations using only text-based models such as BERT or GPT-2, so models with both image and language understanding abilities should be considered. We propose a new model that is effective for the ambiguous candidate identification task in the DSTC11 SIMMC2.1 track. It consists of a simple pipeline model structure with two steps. The first step is to check whether there is ambiguity in the current user utterance, and the second step is to extract objects mentioned in the ambiguous utterance of the user. We suggest a new learning framework with a pre-trained image model and text model that is effective for the ambiguous candidate identification task. Experiments show that the proposed method can improve model performance, and our model achieved 3rd place in sub-task 1 of the SIMMC2.1 track.

pdf bib
Improving Situated Conversational Agents with Step-by-Step Multi-modal Logic Reasoning
Yuxing Long | Huibin Zhang | Binyuan Hui | Zhenglu Yang | Caixia Yuan | Xiaojie Wang | Fei Huang | Yongbin Li

To fulfill complex user requirements in a situated conversational scenario, the agent needs to conduct step-by-step multi-modal logic reasoning, which includes locating objects, querying information and searching objects. However, existing methods omit this multi-step procedure and therefore carry the risk of shortcuts when making predictions. For example, they may directly copy the information from the dialogue history or simply use the textual description without performing visual reasoning. To address this issue and further boost the system performance, we apply dual process theory to plug a reasoner into the original transformer-based model for step-by-step reasoning. When System 2 completes multi-step reasoning, its output is regarded as the final prediction. Our proposed method achieved the 1st rank on the summed score across all four DSTC-11 SIMMC 2.1 sub-tasks.

pdf bib
Contrastively Pretrained Vision-Language Transformers and Domain Adaptation Methods for Multimodal TOD Systems
Youngjae Chang | Doo Young Kim | Jinyoung Kim | Keunha Kim | Hyunmook Cha | Suyoung Min | Youngjoong Ko | Kye-Hwan Lee | Joonwoo Park

The Situated Interactive MultiModal Conversations (SIMMC2.1) Challenge 2022 is hosted by the Eleventh Dialog System Technology Challenge (DSTC11). This is the third consecutive year multimodal dialog systems have been selected as an official track of the competition, driven by continued interest in the research community. The task of SIMMC is to create a shopping assistant agent that can communicate with customers in a virtual store. It requires processing store scenes and product catalogs along with the customer’s request. The task is decomposed into four steps, each of which becomes a subtask. In this work, we explore the common approaches to modeling multimodality and find the method with the most potential. We also identify a discrepancy in using pretrained language models for dialog tasks and devise a simple domain-adaptation method. Our model came in third place on the object coreference, dialog state tracking, and response generation tasks.

pdf bib
Multi-Stage Coarse-to-Fine Contrastive Learning for Conversation Intent Induction
Caiyuan Chu | Ya Li | Yifan Liu | Jia-Chen Gu | Quan Liu | Yongxin Ge | Guoping Hu

Intent recognition is critical for task-oriented dialogue systems. However, for emerging domains and new services, it is difficult to accurately identify the key intent of a conversation due to time-consuming data annotation and comparatively poor model transferability. Therefore, the automatic induction of dialogue intention is very important for intelligent dialogue systems. This paper presents our solution to Track 2 of Intent Induction from Conversations for Task-Oriented Dialogue at the Eleventh Dialogue System Technology Challenge (DSTC11). The essence of intention clustering lies in distinguishing the representation of different dialogue utterances. The key to automatic intention induction is that, for any given set of new data, the sentence representation obtained by the model can be well distinguished from different labels. Therefore, we propose a multi-stage coarse-to-fine contrastive learning model training scheme including unsupervised contrastive learning pre-training, supervised contrastive learning pre-training, and fine-tuning with joint contrastive learning and clustering to obtain a better dialogue utterance representation model for the clustering task. In the released DSTC11 Track 2 evaluation results, our proposed system ranked first on both subtasks of this track.

pdf bib
DORIC : Domain Robust Fine-Tuning for Open Intent Clustering through Dependency Parsing
Jihyun Lee | Seungyeon Seo | Yunsu Kim | Gary Geunbae Lee

We present our work on Track 2 of the Dialog System Technology Challenges 11 (DSTC11). DSTC11-Track2 aims to provide a benchmark for zero-shot, cross-domain, intent-set induction. In the absence of an in-domain training dataset, robust utterance representations that can be used across domains are necessary to induce users’ intentions. To achieve this, we leveraged a multi-domain dialogue dataset to fine-tune the language model and proposed extracting Verb-Object pairs to remove the artifacts of unnecessary information. Furthermore, we devised a method that generates each cluster’s name for the explainability of the clustered results. Our approach achieved 3rd place in the precision score and showed higher accuracy and normalized mutual information (NMI) scores than the baseline model on various domain datasets.
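Verb-Object pair extraction is typically done with a dependency parser; a minimal sketch using spaCy is shown below. The paper does not state which parser it uses, and the example utterance and output are illustrative.

import spacy

nlp = spacy.load("en_core_web_sm")

def verb_object_pairs(utterance: str):
    """Extract (verb lemma, object lemma) pairs via dependency parsing."""
    doc = nlp(utterance)
    return [(tok.head.lemma_, tok.lemma_)
            for tok in doc
            if tok.dep_ in ("dobj", "obj") and tok.head.pos_ == "VERB"]

print(verb_object_pairs("I want to book a flight and cancel my hotel reservation."))
# e.g. [('book', 'flight'), ('cancel', 'reservation')]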

pdf bib
A Two-Stage Progressive Intent Clustering for Task-Oriented Dialogue
Bingzhu Du | Nan Su | Yuchi Zhang | Yongliang Wang

Natural Language Understanding (NLU) is one of the most critical components of task-oriented dialogue, and it is often considered as an intent classification task. To achieve outstanding intent identification performance, system designers often need to hire a large number of domain experts to label the data, which is inefficient and costly. To address this problem, researchers’ attention has gradually shifted to automatic intent clustering methods, which employ low-resource unsupervised approaches to solve classification problems. The classical framework for clustering is deep clustering, which uses deep neural networks (DNNs) to jointly optimize non-clustering loss and clustering loss. However, for new conversational domains or services, utterances required to assign intents are scarce and the performance of DNNs is often dependent on large amounts of data. In addition, although re-clustering with the k-means algorithm after training the network usually leads to better results, k-means methods often suffer from poor stability. To address these problems, we propose an effective two-stage progressive approach to refine the clustering. Firstly, we pre-train the network with a contrastive loss using all conversation data and then optimize the clustering loss and contrastive loss simultaneously. Secondly, we propose adaptive progressive k-means to alleviate the randomness of vanilla k-means, achieving better performance and smaller deviation. Our method ranks second in DSTC11 Track2 Task 1, a benchmark for intent clustering of task-oriented dialogue, demonstrating the superiority and effectiveness of our method.

pdf bib
Analysis of Utterance Embeddings and Clustering Methods Related to Intent Induction for Task-Oriented Dialogue
Jeiyoon Park | Yoonna Jang | Chanhee Lee | Heuiseok Lim

The focus of this work is to investigate unsupervised approaches to overcome quintessential challenges in designing task-oriented dialog schema: assigning intent labels to each dialog turn (intent clustering) and generating a set of intents based on the intent clustering methods (intent induction). We postulate that there are two salient factors for automatic induction of intents: (1) clustering algorithm for intent labeling and (2) user utterance embedding space. We compare existing off-the-shelf clustering models and embeddings based on DSTC11 evaluation. Our extensive experiments demonstrate that the combined selection of utterance embedding and clustering method in the intent induction task should be carefully considered. We also show that pretrained MiniLM with Agglomerative clustering yields significant improvements in NMI, ARI, F1, accuracy and example coverage on intent induction tasks. The source codes are available at https://github.com/Jeiyoon/dstc11-track2.
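A minimal sketch of the highlighted combination, MiniLM sentence embeddings with Agglomerative clustering, is given below using sentence-transformers and scikit-learn; the utterances and the distance threshold are illustrative and not taken from the DSTC11 data.

from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

utterances = [
    "I want to reset my password",
    "How do I change my password?",
    "Cancel my subscription please",
    "I'd like to stop my membership",
]

embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(utterances)

# Without a known number of intents, cluster by a distance threshold instead.
# (scikit-learn >= 1.2 uses `metric=`; older versions call it `affinity=`.)
clustering = AgglomerativeClustering(n_clusters=None, distance_threshold=1.0,
                                     metric="cosine", linkage="average")
labels = clustering.fit_predict(embeddings)
for utt, label in zip(utterances, labels):
    print(label, utt)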

pdf bib
Multi-View Zero-Shot Open Intent Induction from Dialogues: Multi Domain Batch and Proxy Gradient Transfer
Hyukhun Koh | Haesung Pyun | Nakyeong Yang | Kyomin Jung

In Task-Oriented Dialogue (TOD) systems, detecting and inducing new intents are two main challenges for applying the system in the real world. In this paper, we suggest the semantic multiview model to resolve these two challenges: (1) SBERT for General Embedding (GE), (2) Multi Domain Batch (MDB) for dialogue domain knowledge, and (3) Proxy Gradient Transfer (PGT) for cluster-specialized semantics. MDB feeds diverse dialogue datasets to the model at once to tackle the multi-domain problem by learning knowledge from multiple domains. We introduce a novel method, PGT, which employs the Siamese network to fine-tune the model with a clustering method directly. Our model can learn how to cluster dialogue utterances by using PGT. Experimental results demonstrate that our multi-view model with MDB and PGT significantly improves the Open Intent Induction performance compared to baseline systems.

pdf bib
Adapting Text-based Dialogue State Tracker for Spoken Dialogues
Jaeseok Yoon | Seunghyun Hwang | Han Ran | Jeong-Uk Bang | Kee-Eung Kim

Although there have been remarkable advances in dialogue systems through the dialogue systems technology competition (DSTC), it remains one of the key challenges to building a robust task-oriented dialogue system with a speech interface. Most of the progress has been made for text-based dialogue systems since there are abundant datasets with written corpora while those with spoken dialogues are very scarce. However, as can be seen from voice assistant systems such as Siri and Alexa, it is of practical importance to transfer the success to spoken dialogues. In this paper, we describe our engineering effort in building a highly successful model that participated in the speech-aware dialogue systems technology challenge track in DSTC11. Our model consists of three major modules: (1) automatic speech recognition error correction to bridge the gap between the spoken and the text utterances, (2) text-based dialogue system (D3ST) for estimating the slots and values using slot descriptions, and (3) post-processing for recovering the error of the estimated slot value. Our experiments show that it is important to use an explicit automatic speech recognition error correction module, post-processing, and data augmentation to adapt a text-based dialogue state tracker for spoken dialogue corpora.

pdf bib
CopyT5: Copy Mechanism and Post-Trained T5 for Speech-Aware Dialogue State Tracking System
Cheonyoung Park | Eunji Ha | Yewon Jeong | Chi-young Kim | Haeun Yu | Joo-won Sung

In a real-world environment, Dialogue State Tracking (DST) should use speech recognition results to perform tasks. However, most existing DST research has been conducted in text-based environments. This study aims to build a model that efficiently performs Automatic Speech Recognition-based DST. To operate robustly against speech noise, we used CopyT5, which adopted a copy mechanism, and trained the model using augmented data including speech noise. Furthermore, CopyT5 performed post-training of T5 on the MultiWOZ dataset using the masked language modeling method in order to better learn the dialogue context. The copy mechanism also mitigated named entity errors that may occur during DST generation. Experiments confirmed that data augmentation, post-training, and the copy mechanism effectively improve DST performance.

pdf bib
OLISIA: a Cascade System for Spoken Dialogue State Tracking
Léo Jacqmin | Lucas Druart | Yannick Estève | Benoît Favre | Lina M Rojas | Valentin Vielzeuf

Though Dialogue State Tracking (DST) is a core component of spoken dialogue systems, recent work on this task mostly deals with chat corpora, disregarding the discrepancies between spoken and written language. In this paper, we propose OLISIA, a cascade system which integrates an Automatic Speech Recognition (ASR) model and a DST model. We introduce several adaptations in the ASR and DST modules to improve integration and robustness to spoken conversations. With these adaptations, our system ranked first in DSTC11 Track 3, a benchmark to evaluate spoken DST. We conduct an in-depth analysis of the results and find that normalizing the ASR outputs and adapting the DST inputs through data augmentation, along with increasing the size of the pre-trained models, all play an important role in reducing the performance discrepancy between written and spoken conversations.

pdf bib
Speech-Aware Multi-Domain Dialogue State Generation with ASR Error Correction Modules
Ridong Jiang | Wei Shi | Bin Wang | Chen Zhang | Yan Zhang | Chunlei Pan | Jung Jae Kim | Haizhou Li

Prior research on dialogue state tracking (DST) is mostly based on written dialogue corpora. For spoken dialogues, a DST model trained on written text must use the results (or hypotheses) of automatic speech recognition (ASR) as input. But ASR hypotheses often include errors, which leads to a significant performance drop for spoken dialogue state tracking. We address the issue by developing the following ASR error correction modules. First, we train a model to convert the ASR hypothesis to the ground truth user utterance, which can fix frequent patterns of errors. The model takes the ASR hypotheses of two ASR models as input and is fine-tuned in two stages. The corrected hypothesis is fed into a large-scale pre-trained encoder-decoder model (T5) for DST training and inference. Second, if an output slot value from the encoder-decoder model is a name, we compare it with names in a dictionary crawled from Web sites and, if feasible, replace it with the crawled name with the shortest edit distance. Third, we fix errors in temporal expressions in the ASR hypothesis by using hand-crafted rules. Experiment results on the DSTC 11 speech-aware dataset, which is built on the popular MultiWOZ task (version 2.1), show that our proposed method can effectively mitigate the performance drop when moving from written text to spoken conversations.
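The second module, dictionary-based name correction, can be approximated in a few lines; the sketch below uses difflib's similarity ranking rather than raw edit distance (they behave similarly for near-miss ASR errors), and the dictionary entries and cutoff are invented.

import difflib

# Hypothetical dictionary of venue names crawled from the Web.
name_dictionary = ["curry garden", "cambridge belfry", "gonville hotel"]

def correct_name(slot_value: str, cutoff: float = 0.6) -> str:
    """Replace a predicted name slot with the closest dictionary entry, if any."""
    matches = difflib.get_close_matches(slot_value, name_dictionary,
                                        n=1, cutoff=cutoff)
    return matches[0] if matches else slot_value

print(correct_name("curry gardens"))   # -> "curry garden"
print(correct_name("the mill pub"))    # no close match -> unchanged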

pdf bib
Three Ways of Using Large Language Models to Evaluate Chat
Ondřej Plátek | Vojtech Hudecek | Patricia Schmidtova | Mateusz Lango | Ondrej Dusek

This paper describes the systems submitted by team6 for ChatEval, the DSTC 11 Track 4 competition. We present three different approaches to predicting turn-level qualities of chatbot responses based on large language models (LLMs). We report improvement over the baseline using dynamic few-shot examples from a vector store for the prompts for ChatGPT. We also analyze the performance of the other two approaches and report needed improvements for future work. We developed the three systems over just two weeks, showing the potential of LLMs for this task. An ablation study conducted after the challenge deadline shows that the new Llama 2 models are closing the performance gap between ChatGPT and open-source LLMs. However, we find that the Llama 2 models do not benefit from few-shot examples in the same way as ChatGPT.

pdf bib
Parallel Corpora Alignment Framework for Multilingual and Robust Automatic Dialogue Evaluation
Xinglin Wang | Jiayi Shi | Peiwen Yuan | Kan Li

Open-domain automatic dialogue evaluation plays an important role in dialogue systems. While recent efforts are being put into making learning-based evaluation metrics correlate better with human evaluation, robust metrics for parallel corpora and multiple domains remain unexplored. Parallel corpora refer to corpora that express the same idea in different ways (e.g., translation, paraphrasing and back-translation). In this paper, we propose Parallel Corpora Alignment Framework (PCAF), which improves the consistency and robustness of model evaluation on parallel corpora. Firstly, parallel corpora are aligned in semantic space through parallel-corpora-aligned contrastive learning. Then, parallel-corpora-aligned distillation on multi-dataset is applied to further improve model’s generalization ability across multiple data domains. Our approach ranks second on the final test data of DSTC11 track4 subtask1 (“Multilingual Automatic Evaluation Metrics”, turn-level) and third on the subtask2 (“Robust Automatic Evaluation Metrics”, turn-level), which proves the strong generalization ability and robustness of our proposed approach.

pdf bib
Simple LLM Prompting is State-of-the-Art for Robust and Multilingual Dialogue Evaluation
John Mendonça | Patrícia Pereira | Helena Moniz | Joao Paulo Carvalho | Alon Lavie | Isabel Trancoso

Despite significant research effort in the development of automatic dialogue evaluation metrics, little thought is given to evaluating dialogues other than in English. At the same time, ensuring metrics are invariant to semantically similar responses is also an overlooked topic. In order to achieve the desired properties of robustness and multilinguality for dialogue evaluation metrics, we propose a novel framework that takes advantage of the strengths of current evaluation models with the newly-established paradigm of prompting Large Language Models (LLMs). Empirical results show our framework achieves state-of-the-art results in terms of mean Spearman correlation scores across several benchmarks and ranks first place on both the Robust and Multilingual tasks of the DSTC11 Track 4 “Automatic Evaluation Metrics for Open-Domain Dialogue Systems”, proving the evaluation capabilities of prompted LLMs.

pdf bib
Towards Optimizing Pre-trained Language Model Ensemble Learning for Task-oriented Dialogue System
Zhiyuan Zhu | Yusheng Liao | Zhe Chen | Yu Wang | Yunfeng Guan

Task-oriented dialogue systems that employ external knowledge to generate informative responses have become an important field of research. This paper outlines our contribution to Track 5 of the Eleventh Dialog System Technology Challenge (DSTC11), which focuses on constructing high-performing, subjective knowledge-enriched task-oriented dialogue systems. Specifically, we investigate the complementarity of various language models to tackle the diverse knowledge selection task that involves multiple external sources. Based on this investigation, we propose pre- and post-generation model ensemble approaches to mitigate potential biases inherent in using a single model for the knowledge selection task. Finally, we utilize the consensus decoding approach to combine fine-tuned ensemble models and improve the performance of the generation system. Our system ranked 1st in human evaluation, even outperforming human annotation.

pdf bib
Enhancing Task-Oriented Dialog System with Subjective Knowledge: A Large Language Model-based Data Augmentation Framework
Haein Jung | Heuiyeen Yeen | Jeehyun Lee | Minju Kim | Namo Bang | Myoung-Wan Koo

As Task-Oriented Dialog (TOD) systems have advanced, structured DB systems, which aim to collect relevant knowledge for answering user’s questions, have also progressed. Despite these advancements, these methods face challenges when dealing with subjective questions from users. To overcome this, DSTC11 released a subjective-knowledge-based TOD (SK-TOD) dataset and benchmark. This paper introduces a framework that effectively solves SK-TOD tasks by leveraging a Large Language Model (LLM). We demonstrate the proficient use of LLM for each sub-task, including an adapters-based method and knowledge-grounded data augmentation. Our proposed methods, which utilize LLM as an efficient tool, outperform baseline performance and approaches that directly use LLM as a one-step sub-task solver, showing superior task-specific optimization.

pdf bib
Semantic data augmentation for meaning maintenance on Task-Oriented Conversation with Large-size Language Model
Jaehwan Lee | Kwanyoung Son | Eugene Kim

This paper presents our approach to building a generalized model for Track 5 in DSTC11: “Task-oriented Conversational Modeling with Subjective Knowledge” which addresses the challenge of generating responses to users’ utterances based on a variety of factual and subjective knowledge. To tackle this challenge, we first augmented the training data by leveraging contextual word embedding and back translation, thereby increasing the quantity of available data. Then, we utilized a large-size language model to enhance the acceptability of the augmented data and fine-tuned the model using augmented data. Specifically, we applied the DeBERTa-v3-large model for knowledge detection and selection, and the BART-large model for response generation. Our best model achieved the seventh rank in the objective evaluation and the second rank in the final official human evaluation. These outcomes serve as solid evidence that data augmentation and using a large-size model were highly effective for developing a conversational model system that incorporates objective and subjective knowledge.

pdf bib
Ensemble Method via Ranking Model for Conversational Modeling with Subjective Knowledge
Xin Huang | Kye Min Tan | Richeng Duan | Bowei Zou

This paper describes our submission to the fifth track of the 11th Dialog System Technology Challenge (DSTC-11), which focuses on “Task-oriented Conversational Modeling with Subjective Knowledge”. We focus on response generation and leverage a ranking strategy to ensemble individual models of BART, Long-T5, and a fine-tuned large language model based on LLaMA. The strategy is supplemented by other techniques like low rank adaptation to maintain efficient utilization of these large models while still achieving optimal performance. The experiments show that the ensemble method outperforms individual models and the baseline method. Our model was ranked 1st place in ROUGE_1, 2nd place in ROUGE_L score and 4th place in human evaluation among a total of 14 participating teams.

pdf bib
Exploring Back Translation with Typo Noise for Enhanced Inquiry Understanding in Task-Oriented Dialogue
Jihyun Lee | Junseok Kim | Gary Geunbae Lee

This paper presents our approach to the DSTC11 Track 5 selection task, which focuses on retrieving appropriate natural language knowledge sources for task-oriented dialogue. We propose a typologically diverse back-translation method with typo noise, which can generate varied, structured user inquiries. Through our noised back translation, we augmented inquiries by combining three different typologies of language sources with five different typo noise injections. Our experiments demonstrate that typological variety and typo noise aid the model in generalizing to diverse user inquiries in dialogue. In the competition, where 14 teams participated, our approach achieved 5th rank on the exact match metric.
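The abstract does not detail the five typo noise injections; the toy function below illustrates the general idea of injecting keyboard-neighbour substitutions and random deletions into back-translated inquiries. All parameters, the neighbour table, and the example string are made up for illustration.

import random

KEYBOARD_NEIGHBOURS = {"a": "qs", "e": "wr", "i": "uo", "o": "ip", "n": "bm"}

def add_typo_noise(text: str, p: float = 0.05, seed: int = 0) -> str:
    """Randomly substitute or drop characters to simulate user typos."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        r = rng.random()
        if r < p and ch.lower() in KEYBOARD_NEIGHBOURS:   # neighbour substitution
            out.append(rng.choice(KEYBOARD_NEIGHBOURS[ch.lower()]))
        elif r < 1.5 * p:                                 # deletion
            continue
        else:
            out.append(ch)
    return "".join(out)

print(add_typo_noise("is breakfast included in the room rate?", p=0.1))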

pdf bib
Leveraging Few-Shot Data Augmentation and Waterfall Prompting for Response Generation
Lea Krause | Selene Báez Santamaría | Michiel van der Meer | Urja Khurana

This paper discusses our approaches for task-oriented conversational modelling using subjective knowledge, with a particular emphasis on response generation. Our methodology was shaped by an extensive data analysis that evaluated key factors such as response length, sentiment, and dialogue acts present in the provided dataset. We used few-shot learning to augment the data with newly generated subjective knowledge items and present three approaches for DSTC11: (1) task-specific model exploration, (2) incorporation of the most frequent question into all generated responses, and (3) a waterfall prompting technique using a combination of both GPT-3 and ChatGPT.

pdf bib
Leveraging Ensemble Techniques and Metadata for Subjective Knowledge-grounded Conversational Systems
Seongho Joo | Kang-il Lee | Kyungmin Min | Joongbo Shin | Janghoon Han | Seungpil Won | Kyomin Jung

The goal of DSTC11 track 5 is to build task-oriented dialogue systems that can effectively utilize external knowledge sources such as FAQs and reviews. This year’s challenge differs from previous ones as it includes subjective knowledge snippets and requires multiple snippets for a single turn. We propose a pipeline system for the challenge focusing on entity tracking, knowledge selection and response generation. Specifically, we devise a novel heuristic to ensemble the outputs from the rule-based method and neural model for entity tracking and knowledge selection. We also leverage metadata information in the knowledge source to handle fine-grained user queries. Our approach achieved the first place in objective evaluation and the third place in human evaluation of DSTC11 track 5.

pdf bib
A Difference-aware Ensemble Method for Task-oriented Dialogue with Subjective Knowledge
Changxin Ke | Churui Sun | Longxuan Ma | Wei-Nan Zhang | Ting Liu

We participate in the 11th Dialog System Technology Challenges (DSTC) track-5 called Task-oriented Conversational Modeling with Subjective Knowledge. Introducing subjective knowledge into task-oriented dialogue (TOD) can help the dialogue system understand the variables of subjective user needs and suit more dialogue scenarios. Track-5 includes several sub-tasks: 1) knowledge-seeking turn detection; 2) knowledge entity tracking; 3) knowledge entry selection; and 4) use of the selected knowledge entries for response generation. Besides the challenges specific to each sub-task, there are two challenges that cut across the sub-tasks. The first is that there are multiple valid knowledge entries for each knowledge-seeking turn, so the accuracy of knowledge entry selection is important for the quality of response generation. The second challenge is how to address the unseen dialogues/entities/entries in the validation and the test set. In this paper, we propose a difference-aware ensemble method to address these sub-tasks and the two challenges mentioned above. Our method helps to obtain more robust results and performs well on unseen instances. Among all the submissions for the test set, our method ranks 1st on the knowledge-seeking turn detection task and achieves 3rd place on the overall automatic evaluation score. Our code and data will be released on GitHub.

pdf bib
DSTC-11: Speech Aware Task-Oriented Dialog Modeling Track
Hagen Soltau | Izhak Shafran | Mingqiu Wang | Abhinav Rastogi | Wei Han | Yuan Cao

Most research on task-oriented dialog modeling is based on written text input. However, users often interact with practical dialog systems using speech. Typically, systems convert speech into text using an Automatic Speech Recognition (ASR) system, introducing errors. Furthermore, these systems do not address the differences between written and spoken language. Research on this topic has been stymied by the lack of a public corpus. Motivated by these considerations, our goal in hosting the speech-aware dialog state tracking challenge was to create a public corpus or task which can be used to investigate the performance gap between the written and spoken forms of input, to develop models that could alleviate this gap, and to establish whether Text-to-Speech-based (TTS) systems are a reasonable surrogate for the more labor-intensive human data collection. We created three spoken versions of the popular written-domain MultiWoz task – (a) TTS-Verbatim: written user inputs were converted into speech waveforms using a TTS system, (b) Human-Verbatim: humans spoke the user inputs verbatim, and (c) Human-Paraphrased: humans paraphrased the user inputs. Additionally, we provided different forms of ASR output to encourage wider participation from teams that may not have access to state-of-the-art ASR systems. These included ASR transcripts, word time stamps, and latent representations of the audio (audio encoder outputs). In this paper, we describe the corpus, report results from participating teams, provide preliminary analyses of their results, and summarize the current state of the art in this domain.

pdf bib
Overview of Situated and Interactive Multimodal Conversations (SIMMC) 2.1 Track at DSTC 11
Satwik Kottur | Seungwhan Moon

With ever-increasing interest in task-oriented dialog systems, the recent work on Situated and Interactive Multimodal Conversations (SIMMC 2.0) aims to develop personal assistants that interact with users, grounded in an immersive and co-observed setting of photo-realistic scenes. The dataset contains 11k task-oriented dialogs set in an interactive shopping scenario, spanning more than 117k utterances. In order to push research towards this next generation of virtual assistants, the SIMMC 2.1 challenge was conducted at the Eleventh Dialog System Technology Challenge (DSTC), with entries from across the world competing to achieve state-of-the-art performance on the SIMMC 2.1 task. In this report, we present and compare 13 SIMMC 2.1 model entries from 5 teams across the world to understand the progress made over the last three years (starting with the SIMMC 1.0 and 2.0 challenges) on multimodal task-oriented dialog systems. We hope that our analysis sheds light on components that showed promise, in addition to identifying the gaps for future research towards the grand goal of an immersive multimodal conversational agent.

pdf bib
Intent Induction from Conversations for Task-Oriented Dialogue Track at DSTC 11
James Gung | Raphael Shu | Emily Moeng | Wesley Rose | Salvatore Romeo | Arshit Gupta | Yassine Benajiba | Saab Mansour | Yi Zhang

With increasing demand for and adoption of virtual assistants, recent work has investigated ways to accelerate bot schema design through the automatic induction of intents or the induction of slots and dialogue states. However, a lack of dedicated benchmarks and standardized evaluation has made progress difficult to track and comparisons between systems difficult to make. This challenge track, held as part of the Eleventh Dialog Systems Technology Challenge, introduces a benchmark that aims to evaluate methods for the automatic induction of customer intents in a realistic setting of customer service interactions between human agents and customers. We propose two subtasks for progressively tackling the automatic induction of intents and corresponding evaluation methodologies. We then present three datasets suitable for evaluating the tasks and propose simple baselines. Finally, we summarize the submissions and results of the challenge track, for which we received submissions from 34 teams.

pdf bib
Overview of Robust and Multilingual Automatic Evaluation Metrics for Open-Domain Dialogue Systems at DSTC 11 Track 4
Mario Rodríguez-Cantelar | Chen Zhang | Chengguang Tang | Ke Shi | Sarik Ghazarian | João Sedoc | Luis Fernando D’Haro | Alexander I. Rudnicky

The advent and fast development of neural networks have revolutionized research on dialogue systems and subsequently triggered various challenges regarding their automatic evaluation. Automatic evaluation of open-domain dialogue systems, as an open challenge, has been the center of attention for many researchers. Despite consistent efforts to improve automatic metrics’ correlations with human evaluation, there have been very few attempts to assess their robustness over multiple domains and dimensions. Also, their focus is mainly on the English language. All of these challenges prompt the development of automatic evaluation metrics that are reliable in various domains, dimensions, and languages. This track in the 11th Dialogue System Technology Challenge (DSTC11) is part of the ongoing effort to promote robust and multilingual automatic evaluation metrics. This article describes the datasets and baselines provided to participants and discusses the submission and result details of the two proposed subtasks.

pdf bib
Task-Oriented Conversational Modeling with Subjective Knowledge Track in DSTC11
Seokhwan Kim | Spandana Gella | Chao Zhao | Di Jin | Alexandros Papangelis | Behnam Hedayatnia | Yang Liu | Dilek Z Hakkani-Tur

Conventional Task-oriented Dialogue (TOD) systems rely on domain-specific APIs/DBs or external factual knowledge to create responses. In DSTC11 track 5, we aim to provide a new, challenging task that accommodates subjective user requests (e.g., “Is the WiFi reliable?” or “Does the restaurant have a good atmosphere?”) into TOD. We release a benchmark dataset which contains subjective knowledge-seeking dialogue contexts and manually annotated responses that are grounded in subjective knowledge sources. The challenge track received a total of 48 entries from 14 participating teams.


up

bib (full)
Proceedings of the Workshop on Multimodal, Multilingual Natural Language Generation and Multilingual WebNLG Challenge (MM-NLG 2023)

pdf bib
Confidently Wrong: Exploring the Calibration and Expression of (Un)Certainty of Large Language Models in a Multilingual Setting
Lea Krause | Wondimagegnhue Tufa | Selene Baez Santamaria | Angel Daza | Urja Khurana | Piek Vossen

While the fluency and coherence of Large Language Models (LLMs) in text generation have seen significant improvements, their competency in generating appropriate expressions of uncertainty remains limited. Using a multilingual closed-book QA task and GPT-3.5, we explore how well LLMs are calibrated and express certainty across a diverse set of languages, including low-resource settings. Our results reveal strong performance in high-resource languages but a marked decline in performance in lower-resource languages. Across all, we observe an exaggerated expression of confidence in the model, which does not align with the correctness or likelihood of its responses. Our findings highlight the need for further research into accurate calibration of LLMs, especially in a multilingual setting.

pdf bib
Visual Question Generation in Bengali
Mahmud Hasan | Labiba Islam | Jannatul Ruma | Tasmiah Mayeesha | Rashedur Rahman

The task of Visual Question Generation (VQG) is to generate human-like questions relevant to a given image. As VQG is an emerging research field, existing works tend to focus only on resource-rich languages such as English due to the availability of datasets. In this paper, we propose the first Bengali Visual Question Generation task and develop a novel transformer-based encoder-decoder architecture that generates questions in Bengali when given an image. We propose multiple variants of the model - (i) image-only: a baseline model that generates questions from images without additional information, and (ii) image-category and image-answer-category: guided VQG, where we condition the model to generate questions based on the answer and the category of the expected question. These models are trained and evaluated on the translated VQAv2.0 dataset. Our quantitative and qualitative results establish the first state-of-the-art models for the VQG task in Bengali and demonstrate that our models are capable of generating grammatically correct and relevant questions. Our quantitative results show that our image-cat model achieves a BLEU-1 score of 33.12 and a BLEU-3 score of 7.56, the highest among the three variants. We also perform a human evaluation to assess the quality of the generated questions. Human evaluation suggests that the image-cat model is capable of generating goal-driven and attribute-specific questions and also stays relevant to the corresponding image.

pdf bib
Keeping an Eye on Context: Attention Allocation over Input Partitions in Referring Expression Generation
Simeon Schüz | Sina Zarrieß

In Referring Expression Generation, model inputs are often composed of different representations, including the visual properties of the intended referent, its relative position and size, and the visual context. Yet, the extent to which this information influences the generation process of black-box neural models is largely unclear. We investigate the relative weighting of target, location, and context information in the attention components of a Transformer-based generation model. Our results show a general target bias, which, however, depends on the content of the generated expressions, pointing to interesting directions for future research.

pdf bib
Are Language-and-Vision Transformers Sensitive to Discourse? A Case Study of ViLBERT
Ekaterina Voloshina | Nikolai Ilinykh | Simon Dobnik

Language-and-vision models have shown good performance in tasks such as image-caption matching and caption generation. However, it is challenging for such models to generate pragmatically correct captions which adequately reflect what is happening in one image or several images. It is crucial to evaluate this behaviour to understand the underlying reasons behind it. Here we explore to what extent contextual language-and-vision models are sensitive to different discourse, both textual and visual. In particular, we employ one of the multi-modal transformers (ViLBERT) and test whether it can match descriptions and images, differentiating them from distractors of different degrees of similarity that are sampled from different visual and textual contexts. We place our evaluation in the multi-sentence and multi-image setup, where images and sentences are expected to form a single narrative structure. We show that the model can distinguish different situations but is not sensitive to differences within one narrative structure. We also show that performance depends on the task itself, for example, which modality remains unchanged in non-matching pairs or how similar non-matching pairs are to original pairs.

pdf bib
Using Large Language Models for Zero-Shot Natural Language Generation from Knowledge Graphs
Agnes Axelsson | Gabriel Skantze

In any system that uses structured knowledge graph (KG) data as its underlying knowledge representation, KG-to-text generation is a useful tool for turning parts of the graph data into text that can be understood by humans. Recent work has shown that models that make use of pretraining on large amounts of text data can perform well on the KG-to-text task, even with relatively little training data on the specific graph-to-text task. In this paper, we build on this concept by using large language models to perform zero-shot generation based on nothing but the model’s understanding of the triple structure from what it can read. We show that ChatGPT achieves near state-of-the-art performance on some measures of the WebNLG 2020 challenge, but falls behind on others. Additionally, we compare factual, counter-factual and fictional statements, and show that there is a significant connection between what the LLM already knows about the data it is parsing and the quality of the output text.

pdf bib
The 2023 WebNLG Shared Task on Low Resource Languages. Overview and Evaluation Results (WebNLG 2023)
Liam Cripwell | Anya Belz | Claire Gardent | Albert Gatt | Claudia Borg | Marthese Borg | John Judge | Michela Lorandi | Anna Nikiforovskaya | William Soto Martinez

The WebNLG task consists of mapping a knowledge graph to a text verbalising the content of that graph. The 2017 WebNLG edition required participating systems to generate English text from a set of DBpedia triples, while the 2020 WebNLG+ challenge additionally included generation into Russian and semantic parsing of English and Russian texts. In contrast, WebNLG 2023 focuses on four under-resourced languages which are severely under-represented in research on text generation, namely Breton, Irish, Maltese and Welsh. In addition, WebNLG 2023 once again includes Russian. In this paper, we present the organisation of the shared task (data, timeline, evaluation), briefly describe the participating systems and summarise their results.

pdf bib
WebNLG-Interno: Utilizing FRED-T5 to address the RDF-to-text problem (WebNLG 2023)
Maxim Kazakov | Julia Preobrazhenskaya | Ivan Bulychev | Aleksandr Shain

We present our solution for the Russian RDF-to-text generation task of the WebNLG Challenge 2023. We use the pretrained large language model FRED-T5 (Zmitrovich et al., 2023) and finetune it on the train dataset. We also propose several types of prompts and run experiments to analyze their effectiveness. Our submission achieves 0.373 TER on the test dataset, taking first place according to the results of the automatic evaluation and outperforming the best result of the previous challenge by 0.025. The code of our solution is available at the following link: https://github.com/Ivan30003/webnlg_interno

pdf bib
Better Translation + Split and Generate for Multilingual RDF-to-Text (WebNLG 2023)
Nalin Kumar | Saad Obaid Ul Islam | Ondrej Dusek

This paper presents system descriptions of our submitted outputs for WebNLG Challenge 2023. We use mT5 in multi-task and multilingual settings to generate more fluent and reliable verbalizations of the given RDF triples. Furthermore, we introduce a partial decoding technique to produce more elaborate yet simplified outputs. Additionally, we demonstrate the significance of employing better translation systems in creating training data.

pdf bib
Data-to-text Generation for Severely Under-Resourced Languages with GPT-3.5: A Bit of Help Needed from Google Translate (WebNLG 2023)
Michela Lorandi | Anya Belz

LLMs are great at tasks involving English, which dominates their training data. We explore their ability to address tasks involving languages that are severely under-represented in their training data. More specifically, we do this in the context of data-to-text generation for Irish, Maltese, Welsh and Breton. During the prompt-engineering phase we tested GPT-3.5 and 4 with a range of prompt types and formats on a small sample of example input/output pairs. We then fully evaluated the two most promising prompts in two scenarios: (i) direct generation into the under-resourced languages, and (ii) generation into English followed by translation into the under-resourced languages. We find that few-shot prompting works better for direct generation into under-resourced languages, but that the difference disappears when pivoting via English. The few-shot + translation system variants were submitted to the WebNLG 2023 shared task, where they outperformed all other systems by substantial margins in all languages on all automatic metrics. We conclude that good performance can be achieved with state-of-the-art LLMs out of the box for under-resourced languages. However, the best results (for Welsh) of BLEU 25.12, ChrF++ 0.55, and TER 0.64 are well below those of the lowest-ranked English system at WebNLG’20, with BLEU 0.391, ChrF++ 0.579, and TER 0.564.

pdf bib
DCU/TCD-FORGe at WebNLG’23: Irish rules! (WebNLG 2023)
Simon Mille | Elaine Uí Dhonnchadha | Stamatia Dasiopoulou | Lauren Cassidy | Brian Davis | Anya Belz

In this paper, we describe the submission of Dublin City University (DCU) and Trinity College Dublin (TCD) for the WebNLG 2023 shared task. We present a fully rule-based pipeline for generating Irish texts from DBpedia triple sets which comprises 4 components: triple lexicalisation, generation of noninflected Irish text, inflection generation, and post-processing.

pdf bib
WebNLG Challenge 2023: Domain Adaptive Machine Translation for Low-Resource Multilingual RDF-to-Text Generation (WebNLG 2023)
Kancharla Aditya Hari | Bhavyajeet Singh | Anubhav Sharma | Vasudeva Varma

This paper presents our submission to the WebNLG Challenge 2023 for generating text in several low-resource languages from RDF triples. Our submission focuses on using machine translation for generating texts in Irish, Maltese, Welsh and Russian. While this is a simple and straightforward approach, recent works have shown that, with careful tuning of the MT component, using monolingual models for inference on multilingual tasks with the help of machine translation (translate-test) can outperform multilingual models and training multilingual models on machine-translated data (translate-train). Our results show that this approach demonstrates competitive performance for this task even with limited data.

up

bib (full)
Proceedings of the 1st Workshop on Taming Large Language Models: Controllability in the era of Interactive Assistants!

pdf bib
CST5: Data Augmentation for Code-Switched Semantic Parsing
Anmol Agarwal | Jigar Gupta | Rahul Goel | Shyam Upadhyay | Pankaj Joshi | Rengarajan Aravamudhan

Extending semantic parsers to code-switched input has been a challenging problem, primarily due to a lack of supervised training data. In this work, we introduce CST5, a new data augmentation technique that fine-tunes a T5 model using a small seed set (≈100 utterances) to generate code-switched utterances from English utterances. We show that CST5 generates high-quality code-switched data, both intrinsically (per human evaluation) and extrinsically, by comparing baseline models trained without data augmentation to models trained with the augmented data. Empirically, we observe that using CST5, one can achieve the same semantic parsing performance with up to 20x less labeled data. To aid further research in this area, we are also releasing (a) Hinglish-TOP, the largest human-annotated code-switched semantic parsing dataset to date, containing 10k human-annotated Hindi-English (Hinglish) code-switched utterances, and (b) over 170K CST5-generated code-switched utterances from the TOPv2 dataset. Human evaluation shows that both the human-annotated data and the CST5-generated data are of good quality.

pdf bib
PandaGPT: One Model To Instruction-Follow Them All
Yixuan Su | Tian Lan | Huayang Li | Jialu Xu | Yan Wang | Deng Cai

We present PandaGPT, an approach to emPower large lANguage moDels with visual and Auditory instruction-following capabilities. Our pilot experiments show that PandaGPT can perform complex tasks such as detailed image description generation, writing stories inspired by videos, and answering questions about audios. More interestingly, PandaGPT can take multimodal inputs simultaneously and compose their semantics naturally. For example, PandaGPT can connect how objects look in an image/video and how they sound in an audio. To do so, PandaGPT combines the multimodal encoders from ImageBind and the large language models from Vicuna. Notably, only aligned image-text pairs are required for the training of PandaGPT. Thanks to the strong capability of ImageBind in embedding data from different modalities into the same space, PandaGPT displays emergent, i.e. zero-shot, cross-modal behaviors for data other than image and text (e.g., video, audio, depth, thermal, and IMU). We hope that PandaGPT serves as an initial step toward building AGI that can perceive and understand inputs in different modalities holistically, as we humans do.

pdf bib
Emotion-Conditioned Text Generation through Automatic Prompt Optimization
Yarik Menchaca Resendiz | Roman Klinger

Conditional natural language generation methods often require either expensive fine-tuning or training a large language model from scratch. Both are unlikely to lead to good results without a substantial amount of data and computational resources. Prompt learning without changing the parameters of a large language model presents a promising alternative. It is a cost-effective approach, while still achieving competitive results. While this procedure is now established for zero- and few-shot text classification and structured prediction, it has received limited attention in conditional text generation. We present the first automatic prompt optimization approach for emotion-conditioned text generation with instruction-fine-tuned models. Our method uses an iterative optimization procedure that changes the prompt by adding, removing, or replacing tokens. As objective function, we only require a text classifier that measures the realization of the conditional variable in the generated text. We evaluate the method on emotion-conditioned text generation with a focus on event reports and compare it to manually designed prompts that also act as the seed for the optimization procedure. The optimized prompts achieve 0.75 macro-average F1 to fulfill the emotion condition in contrast to manually designed seed prompts with only 0.22 macro-average F1.

pdf bib
Mitigating Harms of LLMs via Knowledge Distillation for a Virtual Museum Tour Guide
Ashley Lewis | Michael White

LLMs are known to be very powerful, exhibiting both great benefits and great risk. We seek to leverage the benefits, in particular the ability to be fluent, conversational dialogue agents, while minimizing the risks, such as hallucination and toxic content. In this work we use knowledge distillation to create a virtual museum tour guide dialogue agent, employing ChatGPT as a teacher model for a smaller student model, T5-large. We find the T5 model shows competitive performance, significantly reduces instances of hallucination, and shows promise for reducing toxic content.

pdf bib
Evaluating Large Language Models for Document-grounded Response Generation in Information-Seeking Dialogues
Norbert Braunschweiler | Rama Doddipatla | Simon Keizer | Svetlana Stoyanchev

In this paper, we investigate the use of large language models (LLMs) like ChatGPT for document-grounded response generation in the context of information-seeking dialogues. For evaluation, we use the MultiDoc2Dial corpus of task-oriented dialogues in four social service domains previously used in the DialDoc 2022 Shared Task. Information-seeking dialogue turns are grounded in multiple documents providing relevant information. We generate dialogue completion responses by prompting a ChatGPT model using two methods: ChatCompletion and LlamaIndex. ChatCompletion uses knowledge from ChatGPT model pre-training, while LlamaIndex also extracts relevant information from documents. Observing that document-grounded response generation via LLMs cannot be adequately assessed by automatic evaluation metrics, as the LLM responses are significantly more verbose, we perform a human evaluation where annotators rate the output of the shared task winning system, the outputs of the two ChatGPT variants, and human responses. While both ChatGPT variants are more likely to include information not present in the relevant segments, possibly including hallucinations, they are rated higher than both the shared task winning system and human responses.

pdf bib
Enhancing Pipeline-Based Conversational Agents with Large Language Models
Mina Foosherian | Hendrik Purwins | Purna Rathnayake | Touhidul Alam | Rui Teimao | Klaus-Dieter Thoben

The latest advancements in AI and deep learning have led to a breakthrough in large language model (LLM)-based agents such as GPT-4. However, many commercial conversational agent development tools are pipeline-based and have limitations in holding a human-like conversation. This paper investigates the capabilities of LLMs to enhance pipeline-based conversational agents during two phases: 1) in the design and development phase and 2) during operations. In 1), LLMs can aid in generating training data, extracting entities and synonyms, localization, and persona design. In 2), LLMs can assist in contextualization, intent classification to prevent conversational breakdown and handle out-of-scope questions, auto-correcting utterances, rephrasing responses, formulating disambiguation questions, summarization, and enabling closed question-answering capabilities. We conducted informal experiments with GPT-4 in the private banking domain to demonstrate the scenarios above with a practical example. Companies may be hesitant to replace their pipeline-based agents with LLMs entirely due to privacy concerns and the need for deep integration within their existing ecosystems. A hybrid approach in which LLMs are integrated into pipeline-based agents allows companies to save the time and cost of building and running agents by capitalizing on the capabilities of LLMs while retaining the integration and privacy safeguards of their existing systems.

pdf bib
Style Locality for Controllable Generation with kNN Language Models
Gilles Nawezi | Lucie Flek | Charles Welch

Recent language models have been improved by the addition of external memory. Nearest neighbor language models retrieve similar contexts to assist in word prediction. The addition of locality levels allows a model to learn how to weight neighbors based on their location relative to the current text in source documents, and has been shown to further improve model performance. Nearest neighbor models have been explored for controllable generation but have not examined the use of locality levels. We present a novel approach for this purpose and evaluate it using automatic and human evaluation on politeness, formality, supportiveness, and toxicity textual data. We find that our model successfully controls style and provides a better fluency-style trade-off than previous work.

up

pdf (full)
bib (full)
Proceedings of the 19th Annual Meeting of the Young Researchers' Roundtable on Spoken Dialogue Systems

pdf bib
Proceedings of the 19th Annual Meeting of the Young Researchers' Roundtable on Spoken Dialogue Systems
Vojtech Hudecek | Patricia Schmidtova | Tanvi Dinkar | Javier Chiyah-Garcia | Weronika Sieinska

pdf bib
Processing Referential Ambiguities in Situated Dialogue Systems
Javier Chiyah-Garcia

Position paper for YRRSDS 2023

pdf bib
Safety and Robustness in Conversational AI
Tanvi Dinkar

In this position paper, I will present the research interests in my PostDoc on safety and robustness specific to conversational AI, including the relevant overlap from my PhD.

pdf bib
Incremental Speech Processing for Voice Assistant Accessibility
Angus Addlesee

Speech production is nuanced and unique to every individual, but today’s Spoken Dialogue Systems (SDSs) are trained to use general speech patterns to successfully improve performance on various evaluation metrics. However, these patterns do not apply to certain user groups - often the very people that can benefit the most from SDSs. For example, people with dementia produce more disfluent speech than the general population. The healthcare domain is now a popular setting for spoken dialogue and human-robot interaction research. This trend is similar when observing company behaviour. Charities promote industry voice assistants, the creators are getting HIPAA compliance, and their features sometimes target vulnerable user groups. It is therefore critical to adapt SDSs to be more accessible.

pdf bib
Advancing Spoken Dialog Systems for Manufacturing: From Conceptual Architecture and Taxonomy to Real Case Applications and Future Directions
Silvia Colabianchi

This research encompasses a comprehensive exploration of Spoken Dialogue Systems (SDSs) in the manufacturing sector. It begins by establishing a conceptual architecture and taxonomy to guide the design and selection of SDS elements. Real case applications, including worker safety and cybersecurity support, validate the research findings and highlight areas for improvement. Looking ahead, the study delves into the potential of Large Language Models (LLMs) and multi-modal applications. Emphasizing the importance of extreme personalization, the study highlights the need to cater to the diverse qualifications and preferences of workers. Additionally, it investigates the integration of SDSs with other sensory modalities, such as images, videos, and augmented or virtual reality scenarios, to enhance the user experience and productivity. The research also addresses crucial considerations related to knowledge base optimization. It examines semantic variations of words across different application contexts, the continuous updating of procedures and data, and the adaptability of SDSs to diverse dialects and linguistic abilities, particularly in low-schooling personnel scenarios. Privacy, industrial protection, and ethical concerns in the era of LLMs and external players like OpenAI are given due attention. The study explores the boundaries of knowledge that conversational systems should possess, advocating for transparency, explainability, and responsible data handling practices.

pdf bib
Conversational Grounding in Multimodal Dialog Systems
Biswesh Mohapatra

The process of “conversational grounding” is an interactive process that has been studied extensively in cognitive science, whereby participants in a conversation check to make sure their interlocutors understand what is being referred to. This interactive process uses multiple modes of communication to establish the information between the participants. This could include information provided through eye-gaze, head movements, intonation in speech, along with the content of the speech. While the process is essential to successful communication between humans and between humans and machines, work needs to be done on testing and building the capabilities of current dialogue systems in managing conversational grounding, especially in multimodal media of communication. Recent work such as that of Benotti and Blackburn has shown the importance of conversational grounding in dialog systems and how current systems fail at it. This is essential for the advancement of Embodied Conversational Agents and Social Robots. Thus my PhD project aims to test, understand and improve the functioning of current dialog models with respect to conversational grounding.

pdf bib
SQL Comment Generation and Additional Research Interests
Alyssa Allen

My research interests focus on natural language generation (NLG), in particular how to make system outputs more intuitive and comprehensible for the human user, and on conversational entrainment and alignment, from the perspective of how dialogue systems could or should personalize their responses to the human user. As it relates to NLG, my current work focuses on training a system to auto-generate comments for SQL queries produced by a Text-to-SQL parser. The goal is to make the connection between technical SQL language and the user’s question more transparent. My linguistic training lies primarily at the intersection of computational and socio-linguistics. As such, my curiosities in conversational entrainment and alignment focus on the extent to which conversational agents can or should adjust their language based on human characteristics such as age, race, or gender.

pdf bib
On Referring Language Use in Visually Grounded Dialogue
Bram Willemsen

Position paper for YRRSDS 2023

pdf bib
Challenges and Approaches in Designing Social SDS in the LLM Era
Koji Inoue

Large language models (LLMs) have brought about a significant transformation in spoken dialogue systems (SDSs). It is anticipated that these systems will be implemented in diverse robotic applications and employed in a variety of social settings. The author presents research interests aimed at realizing social SDSs from multiple perspectives, including task design, turn-taking mechanisms, and evaluation methodologies. Additionally, future research in social SDSs should delve into a deeper understanding of user mental states and their relationship with society via multi-party conversations. Finally, the author suggests topics for discussion regarding the future directions of SDS researchers in the LLM era.

pdf bib
Breakdowns and Repairs. Detecting Patterns that Lead to Breakdowns in Customer Service Messages
Anouck Braggaar

Many companies use dialogue systems for their customer service, and although there has been a rise in the usage of these systems (Costello and LoDolce, 2022), many of these systems still face challenges in comprehending and properly responding to the customer (Følstad et al., 2021). In our project we aim to figure out how to develop and improve these conversational agents. Part of this project (detailed in this paper) will focus on the detection of breakdown patterns and the possible solutions (repairs) to mitigate the negative results of these errors.

pdf bib
Towards More Natural Dialogues: Integrating Open-Domain Dialogue Skills into Task-Oriented Agents
Armand Stricker

Position paper on the intersection between chitchat and task-oriented dialogues (TODs), with a focus on integrating capabilities typically associated with chitchat systems into task-oriented agents.

pdf bib
The Future of Designing Spoken Dialogue Systems and Analyzing Written Conversations
Livia Qian

This is my position paper for YRRSDS 2023. In it, I write about the details of my research interests as well as past, current and future projects, talk about the status of spoken dialogue system research, include a short bio, and suggest topics for discussion.

pdf bib
Exploring the Synergy of Deep Learning and Anthropomorphism in Multimodal Dialogue Systems
Iwona Christop

This position paper is an overview of the author’s main research interests and work, covering deep learning techniques in audio classification, sign languages, and multimodality in dialogue systems. The author also shares her opinion on current and future research on dialogue agents, and suggests topics for discussion panels.

pdf bib
A Perspective on Anchoring and Dialogue History Propagation for Smoother Interactions with Spoken Task-Oriented Dialogue Systems
Lucas Druart

Task-Oriented Dialogue (TOD) systems provide interactive assistance to a user in order to accomplish a specific task such as making a reservation at a restaurant or booking a room in a hotel. Speech presents itself as a natural interface for TOD systems. A typical approach to implement them is to use a modular architecture (Gao et al., 2018). A core component of such dialogue systems is Spoken Language Understanding (SLU) whose goal is to extract the relevant information from the user’s utterances. While spoken dialogue was the focus of earlier work (Williams et al., 2013; Henderson et al., 2014), recent work has focused on text inputs with no regard for the specificities of spoken language (Wu et al., 2019; Heck et al., 2020; Feng et al., 2021). However, this approach fails to account for the differences between written and spoken language (Faruqui and Hakkani-Tür, 2022) such as disfluencies. My research focuses on Spoken Language Understanding in the context of Task-Oriented Dialogue. More specifically I am interested in the two following research directions: • Annotation schema for spoken TODs, • Integration of dialogue history for contextually coherent predictions.

pdf bib
More Human-Like Interaction in Spoken Dialogue Systems: Global Context for Natural Language Understanding and Multimodal Solutions
Kacper Dudzic

My position paper for the YRRSDS 2023 workshop.

pdf bib
Designing and Evaluating LLM-based Conversational Agents for Behaviour Change
Selina Meyer

My PhD focuses on conversational agents for behaviour change, with a focus on the feasibility of applying Large Language Models (LLMs) such as GPT-4 in this context.

pdf bib
Stylized Dialog Response Generation
Sourabrata Mukherjee

My primary research focus lies in the domain of Text Style Transfer (TST), a fascinating area within Natural Language Processing (NLP). TST involves the transformation of text into a desired style while approximately preserving its underlying content. In my research, I am also driven by the goal of incorporating TST techniques into NLP systems, particularly within the realm of dialogue systems. I am intrigued by the concept of Stylized Dialog Response Generation, which aims to enhance the versatility and adaptability of dialog systems in generating text responses with specific style attributes. By advancing our understanding of TST and its integration into dialogue systems, my research seeks to contribute to the broader field of human-computer interaction. Through the development of robust and versatile dialogue systems with enhanced style transfer capabilities, we can facilitate more engaging and personalized conversational experiences.

pdf bib
Take the Most out of Text Data Augmentation Strategies For Intent Clustering And Induction Based on DSTC 11 Track 2
Mikołaj Krzymiński

A brief introduction to the author’s key interests and research topics, which are multimodal dialogue systems and the impact of data augmentation on NLU performance. In addition, the author shares his biography and view on the future of dialogue assistants.

pdf bib
Advancing Dialogue Systems: Measuring User Satisfaction and Embracing Multimodality
Adrian Charkiewicz

This submission discusses my research interests in two areas: measuring user satisfaction in goal-oriented dialogue systems and exploring the potential of multi-modal interactions. For goal-oriented dialogue systems, I focus on evaluating and enhancing user satisfaction throughout the interaction process, aiming to propose innovative strategies and address the limitations of existing evaluation techniques. Additionally, I explore the benefits of multi-modal dialogue systems, highlighting their ability to provide more natural and immersive conversations by incorporating various communication modes such as speech, text, gestures, and visuals.

pdf bib
Information Extraction and Program Synthesis from Goal-Oriented Dialogue
Sopan Khosla

My research interests broadly lie in the area of Information Extraction from Spoken Dialogue, with a special focus on state modeling, anaphora resolution, program synthesis & planning, and intent classification in goal-oriented conversations. My aim is to create embedded dialogue systems that can interact with humans in a collaborative setup to solve tasks in a digital/non-digital environment. Most goal-oriented conversations usually involve experts and laypersons. The aim for the expert is to consider all the information provided by the layperson, identify the underlying set of issues or intents, and prescribe solutions. While human experts are very good at extracting such information, the AI agents that make up most automatic dialog systems today are not. Most existing assistants (or chatbots) only consider individual utterances and do not ground them in the context of the dialogue. My work in this direction has focused on making these systems more effective at extracting the most relevant information from the dialogue to help the human user reach their end goal.

pdf bib
Modelling Emotions in Task-Oriented Dialogue
Shutong Feng

My research interests lie in the area of modelling natural and human-like conversations, with a special focus on emotions in task-oriented dialogue (ToD) systems. ToD systems need to produce semantically and grammatically correct responses to fulfil the user’s goal. Being able to perceive and express emotions pushes them one more step towards achieving human-likeness. To begin with, I constructed a dataset with meaningful emotion labels as well as a wide coverage of emotions and linguistic features in ToDs. Then, I improved emotion recognition in conversations (ERC) in the task-oriented domain by exploiting key characteristics of ToDs. Currently, I am working towards enhancing ToD systems with emotions.

pdf bib
Incrementally Enriching the Common Ground: A Research Path
Brielen Madureira

I am broadly interested in evaluation of dialogue systems, in all its many facets: The data they are trained on, their ability to perform a task successfully, their skills with respect to various dialogue phenomena, their resemblance to human cognitive processes, and their ethical and societal impact. More specifically, my research topics focus on understanding the possibilities and limits of current multimodal neural network-based models to incrementally encode information for natural language understanding in general and also for building common ground and asking for clarification. Besides, I am interested in dialogue games as a means to elicit and collect dialogue data and to evaluate the abilities of dialogue models.

pdf bib
Commonsense Enabled Conversational Model and System-Initiated transitions in Unified SDSs
Ye Liu

My research work centers on how to enable a human-like interaction through generating contextual, emotional or proactive responses, both in task-oriented and in chitchat spoken dialogue systems (SDSs), because natural language generation (NLG) is an indispensable component in SDSs and can directly affect the user interactive experience of the entire dialogue system. In addition to NLG, I am also interested in natural language understanding (NLU), as it plays a crucial role in SDSs and is a prerequisite for dialogue systems to generate replies.

pdf bib
Causality Reasoning for Empathy-Enriched and Personality-Conditioned Spoken Dialogue System
Yahui Fu

The author’s objective centers around developing a spoken dialogue system (SDS) that can emulate the cognitive and conversational qualities of a human friend. Key attributes such as empathy, knowledge/causality reasoning, and personality are integral components of human interaction. The proposed approach involves the creation of an Empathy-enriched SDS, capable of comprehending human emotions and circumstances, thus providing companionship and assistance akin to a trusted friend. Additionally, the Causality-reasoning for SDS aims to ground the system in commonsense knowledge and equip it with the ability to reason about causalities, such as predicting user desires/reactions and system intentions/reactions, thereby enhancing the system’s intelligence and human-like behavior. Finally, the concept of a Personality-conditioned SDS involves enabling systems to exhibit distinct personalities, further enhancing the naturalness of human-robot interaction.

pdf bib
Tutorials and User Adaptation in Task Oriented Dialogue
Ryu Hirai

This position paper describes my research interests, spoken dialogue system research, and suggested topics for discussion.