Ananya Mukherjee

2023

pdf bib abs
IIIT HYD’s Submission for WMT23 Test-suite Task
Ananya Mukherjee | Manish Shrivastava
Proceedings of the Eighth Conference on Machine Translation

This paper summarizes the results of our test suite evaluation on 12 machine translation systems submitted at the Shared Task of the 8th Conference of Machine Translation (WMT23) for English-German (en-de) language pair. Our test suite covers five specific domains (entertainment, environment, health, science, legal) and spans five distinct writing styles (descriptive, judgments, narrative, reporting, technical-writing). We present our analysis through automatic evaluation methods, conducted with a focus on domain-specific and writing style-specific evaluations.

pdf bib abs
MEE4 and XLsim : IIIT HYD’s Submissions’ for WMT23 Metrics Shared Task
Ananya Mukherjee | Manish Shrivastava
Proceedings of the Eighth Conference on Machine Translation

This paper presents our contributions to the WMT2023 shared metrics task, consisting of two distinct evaluation approaches: a) Unsupervised Metric (MEE4) and b) Supervised Metric (XLSim). MEE4 represents an unsupervised, reference-based assessment metric that quantifies linguistic features, encompassing lexical, syntactic, semantic, morphological, and contextual similarities, leveraging embeddings. In contrast, XLsim is a supervised reference-based evaluation metric, employing a Siamese Architecture, which regresses on Direct Assessments (DA) from previous WMT News Translation shared tasks from 2017-2022. XLsim is trained using XLM-RoBERTa (base) on English-German reference and mt pairs with human scores.

pdf bib abs
LTRC_IIITH’s 2023 Submission for Prompting Large Language Models as Explainable Metrics Task
Pavan Baswani | Ananya Mukherjee | Manish Shrivastava
Proceedings of the 4th Workshop on Evaluation and Comparison of NLP Systems

In this report, we share our contribution to the Eval4NLP Shared Task titled “Prompting Large Language Models as Explainable Metrics.” We build our prompts with a primary focus on effective prompting strategies, score-aggregation, and explainability for LLM-based metrics. We participated in the track for smaller models by submitting the scores along with their explanations. According to the Kendall correlation scores on the leaderboard, our MT evaluation submission ranks second-best, while our summarization evaluation submission ranks fourth, with only a 0.06 difference from the leading submission.

2022

pdf bib abs
Unsupervised Embedding-based Metric for MT Evaluation with Improved Human Correlation
Ananya Mukherjee | Manish Shrivastava
Proceedings of the Seventh Conference on Machine Translation (WMT)

In this paper, we describe our submission to the WMT22 metrics shared task. Our metric focuses on computing contextual and syntactic equivalences along with lexical, morphological, and semantic similarity. The intent is to capture the fluency and context of the MT outputs along with their adequacy. Fluency is captured using syntactic similarity and context is captured using sentence similarity leveraging sentence embeddings. The final sentence translation score is the weighted combination of three similarity scores: a) Syntactic Similarity b) Lexical, Morphological and Semantic Similarity, and c) Contextual Similarity. This paper outlines two improved versions of MEE i.e., MEE2 and MEE4. Additionally, we report our experiments on language pairs of en-de, en-ru and zh-en from WMT17-19 testset and further depict the correlation with human assessments.

pdf bib abs
REUSE: REference-free UnSupervised Quality Estimation Metric
Ananya Mukherjee | Manish Shrivastava
Proceedings of the Seventh Conference on Machine Translation (WMT)

This paper describes our submission to the WMT2022 shared metrics task. Our unsupervised metric estimates the translation quality at chunk-level and sentence-level. Source and target sentence chunks are retrieved by using a multi-lingual chunker. The chunk-level similarity is computed by leveraging BERT contextual word embeddings and sentence similarity scores are calculated by leveraging sentence embeddings of Language-Agnostic BERT models. The final quality estimation score is obtained by mean pooling the chunk-level and sentence-level similarity scores. This paper outlines our experiments and also reports the correlation with human judgements for en-de, en-ru and zh-en language pairs of WMT17, WMT18 and WMT19 test sets.