Proceedings of the 24th Annual Conference of the European Association for Machine Translation

Mary Nurminen, Judith Brenner, Maarit Koponen, Sirkku Latomaa, Mikhail Mikhailov, Frederike Schierl, Tharindu Ranasinghe, Eva Vanmassenhove, Sergi Alvarez Vidal, Nora Aranberri, Mara Nunziatini, Carla Parra Escartín, Mikel Forcada, Maja Popovic, Carolina Scarton, Helena Moniz (Editors)

Anthology ID: 2023.eamt-1
Month: June
Year: 2023
Address: Tampere, Finland
Venue: EAMT
Publisher: European Association for Machine Translation
URL: https://aclanthology.org/2023.eamt-1
PDF: https://aclanthology.org/2023.eamt-1.pdf


Towards Efficient Universal Neural Machine Translation
Biao Zhang

While quality estimation (QE) can play an important role in the translation process, its effectiveness relies on the availability and quality of training data. For QE in particular, high-quality labeled data is often lacking due to the high cost and effort associated with labeling such data. Aside from the data scarcity challenge, QE models should also be generalizable, i.e., they should be able to handle data from different domains, both generic and specific. To alleviate these two main issues (data scarcity and domain mismatch), this paper combines domain adaptation and data augmentation within a robust QE system. Our method is to first train a generic QE model and then fine-tune it on a specific domain while retaining generic knowledge. Our results show a significant improvement for all the language pairs investigated, better cross-lingual inference, and superior performance in zero-shot learning scenarios as compared to state-of-the-art baselines.

Example-Based Machine Translation from Text to a Hierarchical Representation of Sign Language
Elise Bertin-Lemée | Annelies Braffort | Camille Challant | Claire Danet | Michael Filhol

This article presents an original method for Text-to-Sign Translation. It compensates for data scarcity using a domain-specific parallel corpus of alignments between text and hierarchical formal descriptions of Sign Language videos. Based on the detection of similarities present in the source text, the proposed algorithm recursively exploits matches and substitutions of aligned segments to build multiple candidate translations for a novel statement. This helps preserve Sign Language structures as much as possible, rather than falling back on literal translations too quickly, in a generative way. The resulting translations are in the form of AZee expressions, designed to be used as input to avatar synthesis systems. We present a test set tailored to showcase its potential for expressiveness and generation of idiomatic target language, and observed limitations. This work finally opens prospects on how to evaluate this kind of translation.

Unsupervised Feature Selection for Effective Parallel Corpus Filtering
Mikko Aulamo | Ona de Gibert | Sami Virpioja | Jörg Tiedemann

This work presents an unsupervised method of selecting filters and threshold values for the OpusFilter parallel corpus cleaning toolbox. The method clusters sentence pairs into noisy and clean categories and uses the features of the noisy cluster center as filtering parameters. Our approach utilizes feature importance analysis to disregard filters that do not differentiate between clean and noisy data. A randomly sampled subset of a given corpus is used for filter selection and ineffective filters are not run for the full corpus. We use a set of automatic evaluation metrics to assess the quality of translation models trained with data filtered by our method and data filtered with OpusFilter’s default parameters. The trained models cover English-German and English-Ukrainian in both directions. The proposed method outperforms the default parameters in all translation directions for almost all evaluation metrics.
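
As a rough illustration of the idea of clustering sentence pairs and reading thresholds off the noisy cluster center, the sketch below runs a plain two-cluster k-means over per-pair feature vectors. It is not OpusFilter code: the function names `two_means` and `keep_pair` are hypothetical, and it assumes every feature is scaled so that higher values mean cleaner data.

```python
def two_means(points, iters=10):
    """Two-cluster k-means over equal-length feature vectors.

    Assumption: higher feature values indicate cleaner sentence pairs,
    so the centers are seeded with the lowest- and highest-scoring points.
    """
    centers = [min(points, key=sum), max(points, key=sum)]
    for _ in range(iters):
        groups = [[], []]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            groups[dists.index(min(dists))].append(p)
        # Recompute each center as the mean of its group (keep it if empty).
        centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else c
            for g, c in zip(groups, centers)
        ]
    # Return the lower-scoring (noisy) center first, then the clean one.
    return sorted(centers, key=sum)


def keep_pair(features, thresholds):
    """Keep a sentence pair only if every feature exceeds its threshold."""
    return all(f > t for f, t in zip(features, thresholds))


# Toy feature vectors for six sentence pairs (values are made up).
points = [[0.9, 0.95], [0.85, 0.9], [0.95, 0.9],
          [0.2, 0.3], [0.3, 0.2], [0.25, 0.25]]
noisy_center, clean_center = two_means(points)
filtered = [p for p in points if keep_pair(p, noisy_center)]
```

In this toy setup the noisy center's coordinates act as per-filter thresholds; a feature importance step, as mentioned in the abstract, would additionally drop features on which the two centers barely differ.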

Filtering and rescoring the CCMatrix corpus for Neural Machine Translation training
Antoni Oliver González | Sergi Álvarez

There are several parallel corpora available for many language pairs, such as CCMatrix, built from mass downloads of web content and automatic detection of segments in one language and the translation equivalent in another. These techniques can produce large parallel corpora, but of questionable quality. In many cases, the segments are not in the required languages, or if they are, they are not translation equivalents. In this article, we present an algorithm for filtering out the segments in languages other than the required ones and re-scoring the segments using SBERT. A use case on the Spanish-Asturian and Spanish-Catalan CCMatrix corpus is presented.
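
A minimal sketch of the re-scoring step, assuming the score is the cosine similarity between the sentence embeddings of the two segments. The `embed` callable is a hypothetical stand-in for an SBERT encoder, and the `min_score` cut-off is an illustrative value, not one taken from the article.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rescore(pairs, embed, min_score=0.5):
    """Attach a similarity score to each pair and drop low-scoring ones.

    `embed` is a hypothetical function mapping a segment to a vector.
    """
    kept = []
    for src, tgt in pairs:
        score = cosine(embed(src), embed(tgt))
        if score >= min_score:
            kept.append((src, tgt, score))
    return kept


# Toy embeddings standing in for a real multilingual sentence encoder.
toy_vectors = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 1.0]}
result = rescore([("a", "b"), ("a", "c")], toy_vectors.get)
```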

BLEU Meets COMET: Combining Lexical and Neural Metrics Towards Robust Machine Translation Evaluation
Taisiya Glushkova | Chrysoula Zerva | André F. T. Martins

Although neural-based machine translation evaluation metrics, such as COMET or BLEURT, have achieved strong correlations with human judgements, they are sometimes unreliable in detecting certain phenomena that can be considered as critical errors, such as deviations in entities and numbers. In contrast, traditional evaluation metrics such as BLEU or chrF, which measure lexical or character overlap between translation hypotheses and human references, have lower correlations with human judgements but are sensitive to such deviations. In this paper, we investigate several ways of combining the two approaches in order to increase robustness of state-of-the-art evaluation methods to translations with critical errors. We show that by using additional information during training, such as sentence-level features and word-level tags, the trained metrics improve their capability to penalize translations with specific troublesome phenomena, which leads to gains in correlations with humans and on the recent DEMETR benchmark on several language pairs.

Exploiting large pre-trained models for low-resource neural machine translation
Aarón Galiano-Jiménez | Felipe Sánchez-Martínez | Víctor M. Sánchez-Cartagena | Juan Antonio Pérez-Ortiz

Pre-trained models have drastically changed the field of natural language processing by providing a way to leverage large-scale language representations to various tasks. Some pre-trained models offer general-purpose representations, while others are specialized in particular tasks, like neural machine translation (NMT). Multilingual NMT-targeted systems are often fine-tuned for specific language pairs, but there is a lack of evidence-based best-practice recommendations to guide this process. Moreover, the trend towards even larger pre-trained models has made it challenging to deploy them in the computationally restrictive environments typically found in developing regions where low-resource languages are usually spoken. We propose a pipeline to tune the mBART50 pre-trained model to 8 diverse low-resource language pairs, and then distil the resulting system to obtain lightweight and more sustainable models. Our pipeline conveniently exploits back-translation, synthetic corpus filtering, and knowledge distillation to deliver efficient, yet powerful bilingual translation models 13 times smaller than the original pre-trained ones, but with close performance in terms of BLEU.

Enhancing Supervised Learning with Contrastive Markings in Neural Machine Translation Training
Nathaniel Berger | Miriam Exel | Matthias Huck | Stefan Riezler

Supervised learning in Neural Machine Translation (NMT) standardly follows a teacher forcing paradigm where the conditioning context in the model’s prediction is constituted by reference tokens, instead of its own previous predictions. In order to alleviate this lack of exploration in the space of translations, we present a simple extension of standard maximum likelihood estimation by a contrastive marking objective. The additional training signals are extracted automatically from reference translations by comparing the system hypothesis against the reference, and used for up/down-weighting correct/incorrect tokens. The proposed new training procedure requires one additional translation pass over the training set, and does not alter the standard inference setup. We show that training with contrastive markings yields improvements on top of supervised learning, and is especially useful when learning from postedits where contrastive markings indicate human error corrections to the original hypotheses.

Return to the Source: Assessing Machine Translation Suitability
Francesco Fernicola | Silvia Bernardini | Federico Garcea | Adriano Ferraresi | Alberto Barrón-Cedeño

We approach the task of assessing the suitability of a source text for translation by transferring the knowledge from established MT evaluation metrics to a model able to predict MT quality a priori from the source text alone. To open the door to experiments in this regard, we start from reference English-German parallel corpora to build a corpus of 14,253 source text-quality score tuples. The tuples include four state-of-the-art metrics: cushLEPOR, BERTScore, COMET, and TransQuest. With this new resource at hand, we fine-tune XLM-RoBERTa, both in a single-task and a multi-task setting, to predict these evaluation scores from the source text alone. Results for this methodology are promising, with the single-task model able to approximate well-established MT evaluation and quality estimation metrics (without looking at the actual machine translations), achieving low RMSE values in the 0.1 to 0.2 range and Pearson correlation scores up to 0.688.
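
For readers unfamiliar with the two figures reported above, the following sketch shows how RMSE and the Pearson correlation are computed between predicted and reference scores. The function names and the sample values are illustrative only and are not taken from the paper.

```python
import math


def rmse(pred, gold):
    """Root mean squared error between predictions and reference scores."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gold)) / len(gold))


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


# Made-up predicted vs. metric scores for four source sentences.
predicted = [0.62, 0.48, 0.75, 0.30]
reference = [0.70, 0.40, 0.80, 0.35]
error = rmse(predicted, reference)
correlation = pearson(predicted, reference)
```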

Empirical Analysis of Beam Search Curse and Search Errors with Model Errors in Neural Machine Translation
Jianfei He | Shichao Sun | Xiaohua Jia | Wenjie Li

Beam search is the most popular decoding method for Neural Machine Translation (NMT) and is still a strong baseline compared with the newly proposed sampling-based methods. To better understand beam search, we investigate its two well-recognized issues, beam search curse and search errors, at the sentence level. We find that fewer than 30% of sentences in the test set experience these issues. Meanwhile, there is a related phenomenon: for the majority of sentences, the gold references have lower probabilities than the predictions from beam search. We also test with different levels of model errors, including a special test using training samples and models without regularization. We find that these phenomena still exist even for a model with an accuracy of 95%, although they are mitigated. These findings show that it is not promising to improve beam search by seeking higher probabilities in searching and further reducing its search errors. The relationship between the quality and the probability of predictions at the sentence level in our results provides useful information to find new ways to improve NMT.

An Empirical Study of Leveraging Knowledge Distillation for Compressing Multilingual Neural Machine Translation Models
Varun Gumma | Raj Dabre | Pratyush Kumar

Knowledge distillation (KD) is a well-known method for compressing neural models. However, works focusing on distilling knowledge from large multilingual neural machine translation (MNMT) models into smaller ones are practically nonexistent, despite the popularity and superiority of MNMT. This paper bridges this gap by presenting an empirical investigation of knowledge distillation for compressing MNMT models. We take Indic to English translation as a case study and demonstrate that commonly used language-agnostic and language-aware KD approaches yield models that are 4-5x smaller but also suffer from performance drops of up to 3.5 BLEU. To mitigate this, we then experiment with design considerations such as shallower versus deeper models, heavy parameter sharing, multistage training, and adapters. We observe that deeper compact models tend to be as good as shallower non-compact ones and that fine-tuning a distilled model on a high-quality subset slightly boosts translation quality. Overall, we conclude that compressing MNMT models via KD is challenging, indicating immense scope for further research.

This paper aims to investigate the effectiveness of the k-Nearest Neighbor Machine Translation model (kNN-MT) in real-world scenarios. kNN-MT is a retrieval-augmented framework that combines the advantages of parametric models with non-parametric datastores built using a set of parallel sentences. Previous studies have primarily focused on evaluating the model using only the BLEU metric and have not tested kNN-MT in real world scenarios. Our study aims to fill this gap by conducting a comprehensive analysis on various datasets comprising different language pairs and different domains, using multiple automatic metrics and expert evaluated Multidimensional Quality Metrics (MQM). We compare kNN-MT with two alternate strategies: fine-tuning all the model parameters and adapter-based finetuning. Finally, we analyze the effect of the datastore size on translation quality, and we examine the number of entries necessary to bootstrap and configure the index.
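
The combination of a parametric model with a non-parametric datastore can be sketched as follows, using the commonly described kNN-MT formulation: retrieved neighbors are turned into a distribution with a softmax over negative distances, which is then interpolated with the model's own distribution. The temperature and interpolation weight below are arbitrary illustrative values, and the retrieval step itself is assumed to have happened already.

```python
import math
from collections import defaultdict


def knn_distribution(neighbors, temperature=10.0):
    """Turn retrieved (distance, target_token) pairs into a distribution."""
    weights = defaultdict(float)
    for distance, token in neighbors:
        weights[token] += math.exp(-distance / temperature)
    total = sum(weights.values())
    return {token: w / total for token, w in weights.items()}


def interpolate(p_model, p_knn, lam=0.5):
    """Mix the model and datastore distributions with weight `lam`."""
    vocab = set(p_model) | set(p_knn)
    return {
        t: lam * p_knn.get(t, 0.0) + (1 - lam) * p_model.get(t, 0.0)
        for t in vocab
    }


# Toy example: three retrieved neighbors at equal distance.
p_knn = knn_distribution([(0.0, "cat"), (0.0, "cat"), (0.0, "dog")],
                         temperature=1.0)
p_final = interpolate({"cat": 0.5, "dog": 0.5}, p_knn, lam=0.5)
```

A larger datastore changes which neighbors are retrieved, which is one reason datastore size can affect translation quality.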

Evaluation of Chinese-English Machine Translation of Emotion-Loaded Microblog Texts: A Human Annotated Dataset for the Quality Assessment of Emotion Translation
Shenbin Qian | Constantin Orasan | Felix Do Carmo | Qiuliang Li | Diptesh Kanojia

In this paper, we focus on how current Machine Translation (MT) engines perform on the translation of emotion-loaded texts by evaluating outputs from Google Translate according to a framework proposed in this paper. We propose this evaluation framework based on the Multidimensional Quality Metrics (MQM) and perform detailed error analyses of the MT outputs. From our analysis, we observe that about 50% of MT outputs are erroneous in preserving emotions. After further analysis of the erroneous examples, we find that emotion carrying words and linguistic phenomena such as polysemous words, negation, abbreviation etc., are common causes for these translation errors.

Assessing the Importance of Frequency versus Compositionality for Subword-based Tokenization in NMT
Benoist Wolleb | Romain Silvestri | Georgios Vernikos | Ljiljana Dolamic | Andrei Popescu-Belis

Subword tokenization is the de facto standard for tokenization in neural language models and machine translation systems. Three advantages are frequently put forward in favor of subwords: shorter encoding of frequent tokens, compositionality of subwords, and ability to deal with unknown words. As their relative importance is not entirely clear yet, we propose a tokenization approach that enables us to separate frequency (the first advantage) from compositionality, thanks to the use of Huffman coding, which tokenizes words using a fixed number of symbols. Experiments with CS-DE, EN-FR and EN-DE NMT show that frequency alone accounts for approximately 90% of the BLEU scores reached by BPE, hence compositionality has less importance than previously thought.
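
To make the role of Huffman coding concrete, the sketch below builds a standard binary Huffman code over word frequencies, so that frequent words receive shorter symbol sequences while the symbols themselves carry no compositional meaning. The binary alphabet and the toy frequencies are simplifying assumptions; the abstract does not state the alphabet size used in the paper.

```python
import heapq
from itertools import count


def huffman_codes(freqs):
    """Build a binary Huffman code from a {word: frequency} mapping."""
    if len(freqs) == 1:
        return {word: "0" for word in freqs}
    tie_breaker = count()  # keeps heap entries comparable on equal frequency
    heap = [(f, next(tie_breaker), {w: ""}) for w, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, codes1 = heapq.heappop(heap)
        f2, _, codes2 = heapq.heappop(heap)
        # Prefix one symbol to every code in each merged subtree.
        merged = {w: "0" + c for w, c in codes1.items()}
        merged.update({w: "1" + c for w, c in codes2.items()})
        heapq.heappush(heap, (f1 + f2, next(tie_breaker), merged))
    return heap[0][2]


# Made-up word frequencies.
codes = huffman_codes({"the": 50, "cat": 20, "sat": 20, "zygote": 10})
encoded = [codes[w] for w in "the cat sat".split()]
```

Each word becomes a sequence of symbols whose length depends only on frequency, which is what allows frequency to be studied separately from compositionality.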

What Works When in Context-aware Neural Machine Translation?
Harritxu Gete | Thierry Etchegoyhen | Gorka Labaka

Document-level Machine Translation has emerged as a promising means to enhance automated translation quality, but it is currently unclear how effectively context-aware models use the available context during translation. This paper aims to provide insight into the current state of models based on input concatenation, with an in-depth evaluation on English–German and English–French standard datasets. We notably evaluate the impact of data bias, antecedent part-of-speech, context complexity, and the syntactic function of the elements involved in discursive phenomena. Our experimental results indicate that the selected models do improve the overall translation in context, with varying sensitivity to the different factors we examined. We notably show that the selected context-aware models operate markedly better on regular syntactic configurations involving subject antecedents and pronouns, with degraded performance as the configurations become more dissimilar.

Investigating the Translation Performance of a Large Multilingual Language Model: the Case of BLOOM
Rachel Bawden | François Yvon

The NLP community recently saw the release of a new large open-access multilingual language model, BLOOM (BigScience et al., 2022) covering 46 languages. We focus on BLOOM’s multilingual ability by evaluating its machine translation performance across several datasets (WMT, Flores-101 and DiaBLa) and language pairs (high- and low-resourced). Our results show that 0-shot performance suffers from overgeneration and generating in the wrong language, but this is greatly improved in the few-shot setting, with very good results for a number of language pairs. We study several aspects including prompt design, model sizes, cross-lingual transfer and the use of discursive context.

The MT@BZ corpus: machine translation & legal language
Flavia De Camillis | Egon W. Stemle | Elena Chiocchetti | Francesco Fernicola

The paper reports on the creation, annotation and curation of the MT@BZ corpus, a bilingual (Italian–South Tyrolean German) corpus of machine-translated legal texts from the officially multilingual Province of Bolzano, Italy. It is the first human error-annotated corpus (using an adapted SCATE taxonomy) of machine-translated legal texts in this language combination that includes a lesser-used standard variety. The data of the project will be made available on GitHub and another repository. The output of the customized engine achieved notably better BLEU, TER and chrF2 scores than the baseline. Over 50% of the segments needed no human revision due to customization. The most frequent error categories were mistranslations and bilingual (legal) terminology errors. Our contribution brings fine-grained insights to machine translation evaluation research, as it concerns a less common language combination, a lesser-used language variety and a societally relevant specialized domain. Such results are necessary to implement and inform the use of MT in institutional contexts of smaller language communities.

Investigating Lexical Sharing in Multilingual Machine Translation for Indian Languages
Sonal Sannigrahi | Rachel Bawden

Multilingual language models have shown impressive cross-lingual transfer ability across a diverse set of languages and tasks. To improve the cross-lingual ability of these models, some strategies include transliteration and finer-grained segmentation into characters as opposed to subwords. In this work, we investigate lexical sharing in multilingual machine translation (MT) from Hindi, Gujarati, Nepali into English. We explore the trade-offs that exist in translation performance between data sampling and vocabulary size, and we explore whether transliteration is useful in encouraging cross-script generalisation. We also verify how the different settings generalise to unseen languages (Marathi and Bengali). We find that transliteration does not give pronounced improvements and our analysis suggests that our multilingual MT models trained on original scripts are already robust to cross-script differences even for relatively low-resource languages.

Large Language Models Are State-of-the-Art Evaluators of Translation Quality
Tom Kocmi | Christian Federmann

We describe GEMBA, a GPT-based metric for assessment of translation quality, which works both with a reference translation and without. In our evaluation, we focus on zero-shot prompting, comparing four prompt variants in two modes, based on the availability of the reference. We investigate seven versions of GPT models, including ChatGPT. We show that our method for translation quality assessment only works with GPT 3.5 and larger models. Compared to results from WMT22’s Metrics shared task, our method achieves state-of-the-art accuracy in both modes when compared to MQM-based human labels. Our results are valid on the system level for all three WMT22 Metrics shared task language pairs, namely English into German, English into Russian, and Chinese into English. This provides a first glimpse into the usefulness of pre-trained, generative large language models for quality assessment of translations. We publicly release all our code and prompt templates used for the experiments described in this work, as well as all corresponding scoring results, to allow for external validation and reproducibility.

State Spaces Aren’t Enough: Machine Translation Needs Attention
Ali Vardasbi | Telmo Pessoa Pires | Robin Schmidt | Stephan Peitz

Structured State Spaces for Sequences (S4) is a recently proposed sequence model with successful applications in various tasks, e.g. vision, language modelling, and audio. Thanks to its mathematical formulation, it compresses its input to a single hidden state, and is able to capture long range dependencies while avoiding the need for an attention mechanism. In this work, we apply S4 to Machine Translation (MT), and evaluate several encoder-decoder variants on WMT’14 and WMT’16. In contrast with the success in language modeling, we find that S4 lags behind the Transformer by approximately 4 BLEU points, and that it counter-intuitively struggles with long sentences. Finally, we show that this gap is caused by S4’s inability to summarize the full source sentence in a single hidden state, and show that we can close the gap by introducing an attention mechanism.

Automatic Discrimination of Human and Neural Machine Translation in Multilingual Scenarios
Malina Chichirau | Rik van Noord | Antonio Toral

We tackle the task of automatically discriminating between human and machine translations. As opposed to most previous work, we perform experiments in a multilingual setting, considering multiple languages and multilingual pretrained language models. We show that a classifier trained on parallel data with a single source language (in our case German–English) can still perform well on English translations that come from different source languages, even when the machine translations were produced by other systems than the one it was trained on. Additionally, we demonstrate that incorporating the source text in the input of a multilingual classifier improves (i) its accuracy and (ii) its robustness on cross-system evaluation, compared to a monolingual classifier. Furthermore, we find that using training data from multiple source languages (German, Russian and Chinese) tends to improve the accuracy of both monolingual and multilingual classifiers. Finally, we show that bilingual classifiers and classifiers trained on multiple source languages benefit from being trained on longer text sequences, rather than on sentences.

Adaptive Machine Translation with Large Language Models
Yasmin Moslem | Rejwanul Haque | John D. Kelleher | Andy Way

Consistency is a key requirement of high-quality translation. It is especially important to adhere to pre-approved terminology and adapt to corrected translations in domain-specific projects. Machine translation (MT) has achieved significant progress in the area of domain adaptation. However, real-time adaptation remains challenging. Large-scale language models (LLMs) have recently shown interesting capabilities of in-context learning, where they learn to replicate certain input-output text generation patterns, without further fine-tuning. By feeding an LLM at inference time with a prompt that consists of a list of translation pairs, it can then simulate the domain and style characteristics. This work aims to investigate how we can utilize in-context learning to improve real-time adaptive MT. Our extensive experiments show promising results at translation time. For example, GPT-3.5 can adapt to a set of in-domain sentence pairs and/or terminology while translating a new sentence. We observe that the translation quality with few-shot in-context learning can surpass that of strong encoder-decoder MT systems, especially for high-resource languages. Moreover, we investigate whether we can combine MT from strong encoder-decoder models with fuzzy matches, which can further improve translation quality, especially for less supported languages. We conduct our experiments across five diverse language pairs, namely English-to-Arabic (EN-AR), English-to-Chinese (EN-ZH), English-to-French (EN-FR), English-to-Kinyarwanda (EN-RW), and English-to-Spanish (EN-ES).
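
The idea of feeding an LLM a prompt made of translation pairs can be sketched as below: the most similar entries of a translation memory are selected as fuzzy matches and laid out as demonstrations ahead of the new source sentence. The prompt layout, the function names, and the use of `difflib` for similarity are all assumptions made for illustration; the abstract does not specify them, and no LLM is called here.

```python
from difflib import SequenceMatcher


def top_fuzzy_matches(source, memory, k=2):
    """Return the k translation-memory pairs most similar to `source`."""
    def similarity(pair):
        return SequenceMatcher(None, source, pair[0]).ratio()
    return sorted(memory, key=similarity, reverse=True)[:k]


def build_prompt(source, matches, src_lang="English", tgt_lang="Spanish"):
    """Lay out the matches as few-shot demonstrations (hypothetical format)."""
    lines = []
    for src, tgt in matches:
        lines.append(f"{src_lang}: {src}")
        lines.append(f"{tgt_lang}: {tgt}")
    lines.append(f"{src_lang}: {source}")
    lines.append(f"{tgt_lang}:")
    return "\n".join(lines)


# Toy translation memory of (source, target) pairs.
memory = [
    ("Press the red button.", "Pulse el botón rojo."),
    ("Close the door.", "Cierre la puerta."),
    ("The weather is nice today.", "Hoy hace buen tiempo."),
]
new_sentence = "Press the green button."
matches = top_fuzzy_matches(new_sentence, memory, k=1)
prompt = build_prompt(new_sentence, matches)
```

The resulting string would be sent to the model at inference time, with the model expected to continue after the final target-language label.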

Segment-based Interactive Machine Translation at a Character Level
Angel Navarro | Miguel Domingo | Francisco Casacuberta

To produce high-quality translations, human translators need to review and correct machine translation hypotheses in what is known as post-editing. In order to reduce the human effort of this process, interactive machine translation proposed a collaborative framework in which human and machine work together to generate the translations. Among the many protocols proposed throughout the years, the segment-based one established a paradigm in which the post-editor was allowed to validate correct word sequences from a translation hypothesis and introduce a word correction to help the system improve the next hypothesis. In this work we propose an extension to this protocol: instead of the user having to type the complete word correction, the system will complete the user’s correction while they are typing. We evaluated our proposal under a simulated environment, achieving a significant reduction of the human effort.
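
A toy sketch of completing a word correction while the user types, and of how a simulated user could measure the saved effort in keystrokes. The static candidate list with scores is a hypothetical stand-in for what a translation system would propose in context; the paper's actual model and effort metrics are not reproduced here.

```python
def complete(prefix, candidates):
    """Return the highest-scoring candidate word starting with `prefix`."""
    matches = [(score, word) for word, score in candidates.items()
               if word.startswith(prefix)]
    return max(matches)[1] if matches else None


def keystrokes_needed(target, candidates):
    """Simulate a user typing `target` until the completion is correct."""
    for n in range(1, len(target) + 1):
        if complete(target[:n], candidates) == target:
            return n
    return len(target)


# Hypothetical candidate words with made-up scores.
candidates = {"house": 0.5, "home": 0.3, "horse": 0.2}
typed_for_horse = keystrokes_needed("horse", candidates)
typed_for_house = keystrokes_needed("house", candidates)
```

Here the simulated user types characters one at a time and stops as soon as the suggested completion equals the intended word.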

Gender-Fair Post-Editing: A Case Study Beyond the Binary
Manuel Lardelli | Dagmar Gromann

Machine Translation (MT) models are well-known to suffer from gender bias, especially for gender beyond a binary conception. Due to the multiplicity of language-specific strategies for gender representation beyond the binary, debiasing MT is extremely challenging. As an alternative, we propose a case study on gender-fair post-editing. In this study, six professional translators each post-edited three English to German machine translations. For each translation, participants were instructed to use a different gender-fair language strategy, that is, gender-neutral rewording, gender-inclusive characters, and a neosystem. The focus of this study is not on translation quality but rather on the ease of integrating gender-fair language into the post-editing process. Findings from non-participant observation and interviews show clear differences in temporal and cognitive effort between participants and strategy as well as in the success of using gender-fair language.

“Translationese” (and “post-editese”?) no more: on importing fuzzy conceptual tools from Translation Studies in MT research
Miguel A. Jimenez-Crespo

During recent years, MT research has imported a number of conceptual tools from Translation Studies such as “translationese” or “translation universals”. These notions were the object of intense conceptual debates in Corpus-Based Translation Studies (CBTS), and a number of seminal publications and conference forums recommended replacing them with less problematic terms such as “the language of translation” or “typical” or “general features of translated language”. This paper critically analyses the arguments put forward in the early 2000s in CBTS against the use of these terms, and whether the same issues apply to current MT research using them. Here, the paper will discuss (1) the impact of the negative or pejorative nature of the term “translationese” on the status of professional translators and translation products in academia and society, (2) the danger of over-generalizations or overextending claims found in specific and very limited textual subsets, and (3) the need to reframe the search of tendencies in translated language away from “universals” towards probabilistic, situational or conditional tendencies. It will be argued that MT research would benefit from clearly defined terms to deal with notions related to language variation in specific new variants of translation, proposing neutral terms such as “NMT translated language” or “the language of NMT”, as well as “general features/tendencies in NMT/PE translations”. A proposal will be made in order to reach a “convergence” of MT and TS research and the probabilistic and descriptive study of features of (NMT or human) translated language.

A social media NMT engine for a low-resource language combination
María Do Campo Bayón | Pilar Sánchez-Gijón

The aim of this article is to present a new Neural Machine Translation (NMT) engine from Spanish into Galician for the social media domain that was trained with a Twitter corpus. Our main goal is to outline the methods used to build the corpus and the steps taken to train the engine in a low-resource language context. We have evaluated the engine performance both with regular automatic metrics and with a new methodology based on the non-inferiority process and contrasted this information with an error classification human evaluation conducted by professional linguists. We will present the steps carried out following the conclusions of a previous pilot study, describe the new process followed, analyze the new engine and present the final conclusions.

Analysing Mistranslation of Emotions in Multilingual Tweets by Online MT Tools
Hadeel Saadany | Constantin Orasan | Rocio Caro Quintana | Felix Do Carmo | Leonardo Zilio

It is common for websites that contain User-Generated Text (UGT) to provide an automatic translation option to reach out to their linguistically diverse users. In such scenarios, the process of translating the users’ emotions is entirely automatic with no human intervention, neither for post-editing, nor for accuracy checking. In this paper, we assess whether automatic translation tools can be a successful real-life utility in transferring emotion in multilingual tweets. Our analysis shows that the mistranslation of the source tweet can lead to critical errors where the emotion is either completely lost or flipped to an opposite sentiment. We identify linguistic phenomena specific to Twitter data which pose a challenge in translation of emotions and show how frequent these features are in different language pairs. We also show that commonly-used quality metrics can lend false confidence in the performance of online MT tools specifically when the source emotion is distorted in telegraphic messages such as tweets.

DataLitMT – Teaching Data Literacy in the Context of Machine Translation Literacy
Janiça Hackenbuchner | Ralph Krüger

This paper presents the DataLitMT project conducted at TH Köln – University of Applied Sciences. The project develops learning resources for teaching data literacy in its translation-specific form of professional machine translation (MT) literacy to students of translation and specialised communication programmes at BA and MA levels. We discuss the need for data literacy teaching in a translation/specialised communication context, present the three theoretical pillars of the project (consisting of a Professional MT Literacy Framework, an MT-specific data literacy framework and a competence matrix derived from these frameworks) and give an overview of the learning resources developed as part of the project.

Do Humans Translate like Machines? Students’ Conceptualisations of Human and Machine Translation
Leena Salmi | Aletta G. Dorst | Maarit Koponen | Katinka Zeven

This paper explores how students conceptualise the processes involved in human and machine translation, and how they describe the similarities and differences between them. The paper presents the results of a survey involving university students (B.A. and M.A.) taking a course on translation who filled out an online questionnaire distributed in Finnish, Dutch and English. Our study finds that students often describe both human translation and machine translation in similar terms, suggesting they do not sufficiently distinguish between them and do not fully understand how machine translation works. The current study suggests that training in Machine Translation Literacy may need to focus more on the conceptualisations involved and how conceptual and vernacular misconceptions may affect how translators understand human and machine translation.

pdf bib abs
Adapting Machine Translation Education to the Neural Era: A Case Study of MT Quality Assessment
Lieve Macken | Bram Vanroy | Arda Tezcan

The use of automatic evaluation metrics to assess Machine Translation (MT) quality is well established in the translation industry. Whereas it is relatively easy to cover the word- and character-based metrics in an MT course, it is less obvious how to integrate the newer neural metrics. In this paper we discuss how we introduced the topic of MT quality assessment in a course for translation students. We selected three English source texts, each having a different difficulty level and style, and let the students translate the texts into their L1 and reflect upon translation difficulty. Afterwards, the students were asked to assess MT quality for the same texts using different methods and to critically reflect upon the results obtained. The students had access to the MATEO web interface, which contains word- and character-based metrics as well as neural metrics. The students used two different reference translations: their own translations and professional translations of the three texts. We not only synthesise the comments of the students, but also present the results of some cross-lingual analyses on nine different language pairs.

pdf bib abs
PE effort and neural-based automatic MT metrics: do they correlate?
Sergi Alvarez | Antoni Oliver

Neural machine translation (NMT) has shown overwhelmingly good results in recent times. This improvement in quality has boosted the presence of NMT in nearly all fields of translation. Most current translation industry workflows include post-editing (PE) of MT as part of their process. For many domains and language combinations, translators post-edit raw machine translation (MT) to produce the final document. However, this process can only work properly if the quality of the raw MT output can be assured. MT is usually evaluated using automatic scores, as they are much faster and cheaper than human evaluation. However, traditional automatic scores have not been good quality indicators and do not correlate with PE effort. We analyze the correlation of each of the three dimensions of PE effort (temporal, technical and cognitive) with COMET, a neural framework which has obtained outstanding results in recent MT evaluation campaigns.
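
As a rough illustration of the kind of analysis described in this abstract, the following minimal sketch computes a Pearson correlation between per-segment metric scores and one PE effort measurement. It is not the authors' method: the function name `pearson`, the choice of Pearson's coefficient, and the toy values are all assumptions made for illustration, and no real metric scores or timings are reproduced here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Toy, made-up values for illustration only (hypothetical, not from the paper):
# one automatic metric score and one temporal-effort value per segment.
metric_scores = [0.82, 0.75, 0.60, 0.41, 0.30]
pe_seconds = [12.0, 20.0, 35.0, 48.0, 70.0]

r = pearson(metric_scores, pe_seconds)
print(f"Pearson r = {r:.3f}")
```

With these toy values the coefficient is negative, since the segments with higher scores were given shorter editing times. A rank-based coefficient such as Spearman's may be preferable when the relationship is monotonic but not linear.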

pdf bib abs
Migrant communities living in the Netherlands and their use of MT in healthcare settings
Susana Valdez | Ana Guerberof Arenas | Kars Ligtenberg

As part of a larger project on the use of MT in healthcare settings among migrant communities, this paper investigates if, when, how and with what (potential) challenges migrants use MT based on a survey of 201 non-native speakers of Dutch currently living in the Netherlands. Three main findings stand out from our analysis. First, most migrants use MT to understand health information in Dutch and communicate with health professionals. How MT is used and received varies depending on the context and the L2 language level, as well as age, but not on the educational level. Second, some users face challenges of different kinds, including a lack of trust or perceived inaccuracies. Some of these challenges are related to comprehension, which brings us to our third point. We argue that a more nuanced understanding of medical translation is needed in expert-to-non-expert health communication. This questionnaire helped us identify several topics we hope to explore in the project’s next phase.

pdf bib abs
Measuring Machine Translation User Experience (MTUX): A Comparison between AttrakDiff and User Experience Questionnaire
Vicent Briva-Iglesias | Sharon O’Brien

Perceptions and experiences of machine translation (MT) users before, during, and after their interaction with MT systems, products or services have been overlooked both in academia and in industry. Traditionally, the focus has been on productivity and quality, often neglecting the human factor. We propose the concept of Machine Translation User Experience (MTUX) for assessing, evaluating, and getting further information about the user experiences of people interacting with MT. By conducting a human-computer interaction (HCI)-based study with 15 professional translators, we analyse which is the best method for measuring MTUX, and conclude by suggesting the use of the User Experience Questionnaire (UEQ). The measurement of MTUX will help every stakeholder in the MT industry: developers will be able to identify pain points for the users and solve them in the development process, resulting in better MTUX and higher adoption of MT systems or products by MT users.

pdf bib abs
Coming to Terms with Glossary Enforcement: A Study of Three Approaches to Enforcing Terminology in NMT
Fred Bane | Anna Zaretskaya | Tània Blanch Miró | Celia Soler Uguet | João Torres

Enforcing terminology constraints is less straightforward in neural machine translation (NMT) than in statistical machine translation. Current methods, such as alignment-based insertion or the use of factors or special tokens, each have their strengths and drawbacks. We describe the current state of research on terminology enforcement in transformer-based NMT models, and present the results of our investigation into the performance of three different approaches. In addition to reference-based quality metrics, we also evaluate the linguistic quality of the translations thus produced. Our results show that each approach is effective, though a negative impact on translation fluency remains evident.

pdf bib abs
Quality Analysis of Multilingual Neural Machine Translation Systems and Reference Test Translations for the English-Romanian language pair in the Medical Domain
Miguel Angel Rios Gaona | Raluca-Maria Chereji | Alina Secara | Dragos Ciobanu

Multilingual Neural Machine Translation (MNMT) models allow translation across multiple languages with a single system. We study the quality of a domain-adapted MNMT model in the medical domain for English-Romanian with automatic metrics and a human error typology annotation based on the Multidimensional Quality Metrics (MQM). We further expand the MQM typology to include terminology-specific error categories. We compare the out-of-domain MNMT with the in-domain adapted MNMT on a standard test dataset of abstracts from medical publications. The in-domain MNMT model outperforms the out-of-domain MNMT in all measured automatic metrics and produces fewer errors. In addition, we perform the manual annotation over the reference test dataset to study the quality of the reference translations. We identify a high number of omissions, additions, and mistranslations in the reference dataset, and comment on the assumed accuracy of existing datasets. Finally, we compare the correlation between the COMET, BERTScore, and chrF automatic metrics with the MQM annotated translations. COMET shows a better correlation with the MQM scores compared to the other metrics.

pdf bib abs
Computational analysis of different translations: by professionals, students and machines
Maja Popovic | Ekaterina Lapshinova-Koltunski | Maarit Koponen

In this work, we analyse different translated texts in terms of various text features. We compare two types of human translations, professional and students’, and machine translation outputs in terms of lexical and grammatical variety, sentence length, as well as frequencies of different POS tags and POS-trigrams. Our experiments are carried out on parallel translations into three languages, Croatian, Finnish and Russian, all originating from the same source English texts. Our results indicate that machine translations are closest to the source text, followed by student translations. Also, student translations are similar both to professional as well as to MT, sometimes even more to MT. Furthermore, we identify sets of features which are convenient for distinguishing machine from human translations.

pdf bib abs
Quality in Human and Machine Translation: An Interdisciplinary Survey
Bettina Hiebl | Dagmar Gromann

Quality assurance is a central component of human and machine translation. In translation studies, translation quality focuses on human evaluation and dimensions such as purpose, comprehensibility, and target audience, among many more. Within the field of machine translation, more operationalized definitions of quality lead to automated metrics relying on reference translations or quality estimation. A joint approach to defining and assessing translation quality holds the promise to be mutually beneficial. To contribute towards that objective, this systematic survey provides an interdisciplinary analysis of the concept of translation quality from both perspectives. Thereby, it seeks to inspire cross-fertilization between both fields and further development of an interdisciplinary concept of translation quality.

pdf bib abs
How can machine translation help generate Arab melodic improvisation?
Fadi Al-Ghawanmeh | Alexander Jensenius | Kamel Smaili

This article presents a system to generate Arab music improvisation using machine translation (MT). To reach this goal, we developed an MT model to translate a vocal improvisation into an automatic instrumental oud (Arab lute) response. Given the melodic and non-metric musical form, it was necessary to develop efficient textual representations in order for classical MT models to be as successful as in common NLP applications. We experimented with Statistical and Neural MT to train our parallel corpus (Vocal to Instrument) of 6991 sentences. The best model was then used to generate improvisation by iteratively translating the translations of the most common patterns of each maqām (n-grams), producing elaborated variations conditioned to listener feedback. We constructed a dataset of 717 instrumental improvisations to extract their n-grams. Objective evaluation of MT was conducted at two levels: a sentence-level evaluation using the BLEU metric, and a higher level evaluation using musically informed metrics. Objective measures were consistent with one another. Subjective evaluations by experts from the maqām music tradition were promising, and a useful reference for understanding objective results.

pdf bib abs
Do online Machine Translation Systems Care for Context? What About a GPT Model?
Sheila Castilho | Clodagh Quinn Mallon | Rahel Meister | Shengya Yue

This paper addresses the challenges of evaluating document-level machine translation (MT) in the context of recent advances in context-aware neural machine translation (NMT). It investigates how well online MT systems deal with six context-related issues, namely lexical ambiguity, grammatical gender, grammatical number, reference, ellipsis, and terminology, when a larger context span containing the solution for those issues is given as input. Results are compared to the translation outputs from the online ChatGPT. Our results show that, while the change of punctuation in the input yields great variability in the output translations, the context position does not seem to have a great impact. Moreover, the GPT model seems to outperform the NMT systems but performs poorly for Irish. The study aims to provide insights into the effectiveness of online MT systems in handling context and highlight the importance of considering contextual factors in evaluating MT systems.

Although machine translation systems are mostly designed to serve in the general domain, there is a growing tendency to adapt these systems to other domains like literary translation. In this paper, we focus on English-Turkish literary translation and develop machine translation models that take into account the stylistic features of translators. We fine-tune a pre-trained machine translation model on the manually aligned works of a particular translator. We make a detailed analysis of the effects of manual and automatic alignments, data augmentation methods, and corpus size on the translations. We propose an approach based on stylistic features to evaluate the style of a translator in the output translations. We show that the human translator's style can be recreated to a large extent in the target machine translations by adapting the models to the style of the translator.

pdf bib abs
Machine translation of anonymized documents with human-in-the-loop
Konstantinos Chatzitheodorou | M. Ángeles García Escrivá | Carmen Grau Lacal

In this paper, we introduce a workflow that utilizes human-in-the-loop for post-editing anonymized texts, with the aim of reconciling the competing needs of data privacy and data quality. By combining the strengths of machine translation and human post-editing, our methodology facilitates the efficient and effective translation of anonymized texts, while ensuring the confidentiality of sensitive information. Our experimental results validate that this approach is capable of providing all necessary information to the translators for producing high-quality translations effectively. Overall, our workflow offers a promising solution for organizations seeking to achieve both data privacy and data quality in their translation processes.

This work proposes an approach to use Part-Of-Speech (POS) information to automatically detect context-dependent Translation Units (TUs) from a Translation Memory database pertaining to the customer support domain. In line with our goal to minimize context-dependency in TUs, we show how this mechanism can be deployed to create new gender-neutral and context-independent TUs. Our experiments, conducted across Portuguese (PT), Brazilian Portuguese (PT-BR), Spanish (ES), and Spanish-Latam (ES-LATAM), show that the occurrence of certain POS with specific words is accurate in identifying context dependency. In a cross-client analysis, we found that ~10% of the most frequent 13,200 TUs were context-dependent, with gender determining context-dependency in 98% of all confirmed cases. We used these findings to suggest gender-neutral equivalents for the most frequent TUs with gender constraints. Our approach is in use in the Unbabel translation pipeline, and can be integrated into any other Neural Machine Translation (NMT) pipeline.

pdf bib abs
Improving Machine Translation in the E-commerce Luxury Space. A case study
José-Manuel De-la-Torre-Vilariño | Juan-Luis García-Mendoza | Alessia Petrucci

This case study presents a Multilingual e-commerce Project, whose principal aim is to create an improved system that translates product titles and descriptions, plus other content, in multiple languages. The project consisted of two main phases: a research-intensive solution using state-of-the-art Machine Translation systems and baseline language models for two language pairs, and the development of a Machine Translation system. The features implemented included Quality Estimation, model benchmarking, entity recognizers, and automatic domain detection. The mBART model was used to create the system for the specific domain of e-commerce for luxury items.

This paper illustrates a new methodology based on Test Suites (Avramidis et al., 2018) with focus on Business Critical Errors (BCEs) (Stewart et al., 2022) to evaluate the output of Machine Translation (MT) and Quality Estimation (QE) systems. We demonstrate the value of relying on semi-automatic evaluation done through scalable BCE-focused Test Suites to monitor both MT and QE systems’ performance for 8 language pairs (LPs) and a total of 4 error categories. This approach allows us to not only track the impact of new features and implementations in a real business environment, but also to identify strengths and weaknesses in models regarding different error types, and subsequently know what to improve henceforth.

In the context of an epidemiological study involving multilingual social media, this paper reports on the ability of machine translation systems to preserve content relevant for a document classification task designed to determine whether the social media text is related to covid. The results indicate that machine translation does provide a feasible basis for scaling epidemiological social media surveillance to multiple languages. Moreover, a qualitative error analysis revealed that the majority of classification errors are not caused by MT errors.

The European Patent Office (EPO) is an international organisation responsible for granting patents and promoting global cooperation in the intellectual property world. With three official languages (English, German, French) and a need to constantly access and manipulate information in multiple languages, machine translation is essential for the EPO. Over the last years we have developed internal machine translation engines, specifically for the translation of patent language. This article presents our data generation strategy: it describes our approach to the generation of parallel corpora of documents, training datasets of aligned sentences, and respective evaluation datasets. Details on the challenges and technical implementation are presented, as well as statistics of the training dataset generation process.

pdf bib abs
Terminology in Neural Machine Translation: A Case Study of the Canadian Hansard
Rebecca Knowles | Samuel Larkin | Marc Tessier | Michel Simard

Incorporating terminology into a neural machine translation (NMT) system is a feature of interest for many users of machine translation. In this case study of English-French Canadian Parliamentary text, we examine the performance of standard NMT systems at handling terminology and consider the tradeoffs between potential performance improvements and the efforts required to maintain terminological resources specifically for NMT.

pdf bib abs
Developing User-centred Approaches to Technological Innovation in Literary Translation (DUAL-T)
Paola Ruffo | Joke Daems | Lieve Macken

DUAL-T is an EU-funded project which aims at involving literary translators in the testing of technology-inclusive workflows. Participants will be asked to translate three short stories using, respectively, (1) a text editor combined with online resources, (2) a Computer-Aided Translation (CAT) tool, and (3) a Machine Translation Post-editing (MTPE) tool.

pdf bib abs
The Post-Edit Me! project
Marie-Aude Lefer | Romane Bodart | Adam Obrusnik | Justine Piette

In this paper, we present the Post-Edit Me! project, which aims to support machine translation post-editing training and learning in translator education, with particular emphasis on quality evaluation of students’ productions. We describe the main components of the project, from the perspectives of both translation lecturers and translation students, and the project’s outcomes to date, namely the MTPEAS annotation system used to assess students’ post-edited texts and the postedit.me app we are currently developing to automate the evaluation workflow.

The main goal of this project is to explore the techniques for training NMT systems applied to Spanish, Portuguese, Catalan, Galician, Asturian, Aragonese and Aranese. These languages belong to the same Romance family, but they are very different in terms of the linguistic resources available. Asturian, Aragonese and Aranese can be considered low resource languages. These characteristics make this setting an excellent place to explore training techniques for low-resource languages: transfer learning and multilingual systems, among others. The first months of the project have been dedicated to the compilation of monolingual and parallel corpora for Asturian, Aragonese and Aranese.

pdf bib abs
GAMETRAPP: Training app for post-editing neural machine translation using gamification in professional settings
Cristina Toledo Báez

The GAMETRAPP project, funded by the Spanish Ministry for Science and Innovation, aims to facilitate the training of industry professionals in full post-editing of neural machine translation by means of a gamified environment.

pdf bib abs
MATEO: MAchine Translation Evaluation Online
Bram Vanroy | Arda Tezcan | Lieve Macken

We present MAchine Translation Evaluation Online (MATEO), a project that aims to facilitate machine translation (MT) evaluation by means of an easy-to-use interface that can evaluate given machine translations with a battery of automatic metrics. It caters to both experienced and novice users who are working with MT, such as MT system builders, teachers and students of (machine) translation, and researchers.

SignON (https://signon-project.eu/) is a Horizon 2020 project, running from 2021 until the end of 2023, which addresses the lack of technology and services for the automatic translation between sign languages (SLs) and spoken languages, through an inclusive, human-centric solution, hence contributing to the repertoire of communication media for deaf, hard of hearing (DHH) and hearing individuals. In this paper, we present an update of the status of the project, describing the approaches developed to address the challenges and peculiarities of SL machine translation (SLMT).

pdf bib abs
GoSt-ParC-Sign: Gold Standard Parallel Corpus of Sign and spoken language
Mirella De Sisto | Vincent Vandeghinste | Lien Soetemans | Caro Brosens | Dimitar Shterionov

Good quality training data for Sign Language Machine Translation (SLMT) is extremely scarce, and this is one of the challenges that any project focusing on Machine Translation (MT) which also targets sign languages is currently facing. The goal of this ongoing project is to create a parallel corpus of authentic Flemish Sign Language (VGT) and written Dutch which can be employed as gold standard in automated sign language translation. The availability of a gold standard corpus like GoSt-ParC-Sign can facilitate the advances of SLMT; consequently, it supports and promotes inclusiveness in MT and, on a more general level, in language technology.

We present the most relevant results of the project MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages in its second year. To date, parallel and monolingual corpora have been produced for seven low-resourced European languages by crawling large amounts of textual data from selected top-level domains of the Internet; both human and automatic evaluation show its usefulness. In addition, several large language models pretrained on MaCoCu data have been published, as well as the code used to collect and curate the data.

This paper is a brief summary of the First WMT Shared Task on Sign Language Translation (WMT-SLT22), a project partly funded by EAMT. The focus of this shared task is automatic translation between signed and spoken languages. Details can be found on our website (https://www.wmt-slt.com/) or in the findings paper (Müller et al., 2022).

pdf bib abs
DECA: Democratic epistemic capacities in the age of algorithms
Maarit Koponen | Mary Nurminen | Nina Havumetsä | Juha Lång

The DECA project consortium investigates epistemic capacities, defined as an individual’s access to reliable knowledge, their ability to participate in knowledge production, and society’s capacity to make informed, sustainable policy decisions. In this paper, we focus specifically on the parts of the project examining the challenges posed by multilinguality in these processes and the potential role of MT in supporting access to, and production of, knowledge.

pdf bib abs
CorCoDial - Machine translation techniques for corpus-based computational dialectology
Yves Scherrer | Olli Kuparinen | Aleksandra Miletic

This paper presents CorCoDial, a research project funded by the Academy of Finland aiming to leverage machine translation technology for corpus-based computational dialectology. In this paper, we briefly present intermediate results of our project-related research.

pdf bib abs
How STAR Transit NXT can help translators measure and increase their MT post-editing efficiency
Julian Hamm | Judith Klein

As machine translation (MT) is being more tightly integrated into modern CAT-based translation workflows, measuring and increasing MT efficiency has become one of the main concerns of LSPs and companies trying to optimise their processes in terms of quality and performance. When it comes to measuring MT efficiency, STAR’s CAT tool Transit NXT offers post-editing distance (PED) and MT error categorisation as two core features of Transit’s comprehensive QA module. With DeepL glossary integration and MT confidence scores, translators will also have access to two new features which can help them increase their MT post-editing efficiency.
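
The abstract does not say how Transit NXT computes post-editing distance. As a hedged illustration of the general idea only, the sketch below assumes a character-level Levenshtein distance between the raw MT output and its post-edited version, normalised by the length of the longer string. The function names `levenshtein` and `post_edit_distance` are hypothetical and do not correspond to any STAR API.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # delete ch_a
                            curr[j - 1] + 1,      # insert ch_b
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def post_edit_distance(mt_output, post_edited):
    """Edit distance normalised to the range 0.0 (unchanged) to 1.0
    (fully rewritten). The normalisation is an assumption of this sketch."""
    longest = max(len(mt_output), len(post_edited))
    if longest == 0:
        return 0.0
    return levenshtein(mt_output, post_edited) / longest


# Hypothetical segment pair for illustration
ped = post_edit_distance("The cat sat in the mat.", "The cat sat on the mat.")
print(f"PED = {ped:.3f}")
```

A lower value means the translator changed less of the MT output. Word-level or token-level variants are equally plausible design choices and would give different figures for the same segment.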

PROPICTO is a project funded by the French National Research Agency and the Swiss National Science Foundation that aims at creating Speech-to-Pictograph translation systems, with a special focus on French as an input language. By developing such technologies, we intend to enhance communication access for non-French-speaking patients and people with cognitive impairments.

We describe the High Performance Language Technologies project (HPLT), a 3-year EU-funded project started in September 2022. HPLT will build a space combining petabytes of natural language data with large-scale model training. It will derive monolingual and bilingual datasets from the Internet Archive and CommonCrawl and build efficient and solid machine translation (MT) as well as large language models (LLMs). HPLT aims at providing free, sustainable and reusable datasets, models and workflows at scale using high-performance computing (HPC).