Michael Denkowski


2020

Proceedings of the 14th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)
Michael Denkowski | Christian Federmann
Proceedings of the 14th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

The Sockeye 2 Neural Machine Translation Toolkit at AMTA 2020
Tobias Domhan | Michael Denkowski | David Vilar | Xing Niu | Felix Hieber | Kenneth Heafield
Proceedings of the 14th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

Sockeye 2: A Toolkit for Neural Machine Translation
Felix Hieber | Tobias Domhan | Michael Denkowski | David Vilar
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

We present Sockeye 2, a modernized and streamlined version of the Sockeye neural machine translation (NMT) toolkit. New features include a simplified code base through the use of MXNet’s Gluon API, a focus on state-of-the-art model architectures, and distributed mixed precision training. These improvements result in faster training and inference, higher automatic metric scores, and a shorter path from research to production.

2018

The Sockeye Neural Machine Translation Toolkit at AMTA 2018
Felix Hieber | Tobias Domhan | Michael Denkowski | David Vilar | Artem Sokolov | Ann Clifton | Matt Post
Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

Bi-Directional Neural Machine Translation with Synthetic Parallel Data
Xing Niu | Michael Denkowski | Marine Carpuat
Proceedings of the 2nd Workshop on Neural Machine Translation and Generation

Despite impressive progress in high-resource settings, Neural Machine Translation (NMT) still struggles in low-resource and out-of-domain scenarios, often failing to match the quality of phrase-based translation. We propose a novel technique that combines back-translation and multilingual NMT to improve performance in these difficult cases. Our technique trains a single model for both directions of a language pair, allowing us to back-translate source or target monolingual data without requiring an auxiliary model. We then continue training on the augmented parallel data, enabling a cycle of improvement for a single model that can incorporate any source, target, or parallel data to improve both translation directions. As a byproduct, these models can reduce training and deployment costs significantly compared to uni-directional models. Extensive experiments show that our technique outperforms standard back-translation in low-resource scenarios, improves quality on cross-domain tasks, and effectively reduces costs across the board.
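
As a rough sketch of the training cycle described in this abstract, the following Python pseudocode builds an augmented data set for a single bi-directional model. The target-language tagging scheme and the model.translate interface are illustrative assumptions for this sketch, not the paper's actual implementation.

# Illustrative sketch, not the paper's code: one round of the
# bi-directional back-translation cycle using a single model.
# The tag scheme and model interface are assumptions.

def tag(sentence, target_lang):
    # Prepend a target-language token so one model handles both directions.
    return f"<2{target_lang}> {sentence}"

def augment(model, parallel, mono_src, mono_tgt, src="fr", tgt="en"):
    """Build an augmented training set from parallel + monolingual data."""
    data = []
    # Real parallel data, used in both directions by the same model.
    for s, t in parallel:
        data.append((tag(s, tgt), t))
        data.append((tag(t, src), s))
    # Back-translate monolingual target text into synthetic source ...
    for t in mono_tgt:
        synthetic_src = model.translate(tag(t, src))
        data.append((tag(synthetic_src, tgt), t))
    # ... and monolingual source text into synthetic target.
    for s in mono_src:
        synthetic_tgt = model.translate(tag(s, tgt))
        data.append((tag(synthetic_tgt, src), s))
    return data  # continue training the same model on this set, then repeat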

2017

Stronger Baselines for Trustable Results in Neural Machine Translation
Michael Denkowski | Graham Neubig
Proceedings of the First Workshop on Neural Machine Translation

Interest in neural machine translation has grown rapidly as its effectiveness has been demonstrated across language and data scenarios. New research regularly introduces architectural and algorithmic improvements that lead to significant gains over “vanilla” NMT implementations. However, these new techniques are rarely evaluated in the context of previously published techniques, specifically those that are widely used in state-of-the-art production and shared-task systems. As a result, it is often difficult to determine whether improvements from research will carry over to systems deployed for real-world use. In this work, we recommend three specific methods that are relatively easy to implement and result in much stronger experimental systems. Beyond reporting significantly higher BLEU scores, we conduct an in-depth analysis of where improvements originate and what inherent weaknesses of basic NMT models are being addressed. We then compare the relative gains afforded by several other techniques proposed in the literature when starting with vanilla systems versus our stronger baselines, showing that experimental conclusions may change depending on the baseline chosen. This indicates that choosing a strong baseline is crucial for reporting reliable experimental results.

2014

Real Time Adaptive Machine Translation for Post-Editing with cdec and TransCenter
Michael Denkowski | Alon Lavie | Isabel Lacruz | Chris Dyer
Proceedings of the EACL 2014 Workshop on Humans and Computer-assisted Translation

Meteor Universal: Language Specific Translation Evaluation for Any Target Language
Michael Denkowski | Alon Lavie
Proceedings of the Ninth Workshop on Statistical Machine Translation

Cognitive demand and cognitive effort in post-editing
Isabel Lacruz | Michael Denkowski | Alon Lavie
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas

The pause to word ratio, the number of pauses per word in a post-edited MT segment, is an indicator of cognitive effort in post-editing (Lacruz and Shreve, 2014). We investigate how low the pause threshold can reasonably be taken, and we propose that 300 ms is a good choice, as pioneered by Schilperoord (1996). We then seek to identify a good measure of the cognitive demand imposed by MT output on the post-editor, as opposed to the cognitive effort actually exerted by the post-editor during post-editing. Measuring cognitive demand is closely related to measuring MT utility, the MT quality as perceived by the post-editor. HTER, an extrinsic edit to word ratio that does not necessarily correspond to actual edits per word performed by the post-editor, is a well-established measure of MT quality, but it does not comprehensively capture cognitive demand (Koponen, 2012). We investigate intrinsic measures of MT quality, and so of cognitive demand, through edited-error to word metrics. We find that the transfer-error to word ratio predicts cognitive effort better than mechanical-error to word ratio (Koby and Champe, 2013). We identify specific categories of cognitively challenging MT errors whose error to word ratios correlate well with cognitive effort.
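
As a concrete illustration of the pause-to-word ratio defined above, the following Python sketch counts pauses (inter-keystroke gaps at or above the 300 ms threshold) per word in a single post-edited segment. The keystroke-log representation and function name are hypothetical, chosen only to make the ratio explicit.

# Minimal sketch, assuming a keystroke log given as inter-keystroke
# gaps in milliseconds for one post-edited segment.

def pause_to_word_ratio(inter_key_gaps_ms, segment_words, threshold_ms=300):
    """Number of pauses (gaps >= threshold) per word in the segment."""
    pauses = sum(1 for gap in inter_key_gaps_ms if gap >= threshold_ms)
    return pauses / max(len(segment_words), 1)

# Example: 3 gaps of at least 300 ms over a 10-word segment -> ratio 0.3
gaps = [120, 450, 80, 310, 95, 60, 700, 150, 90, 110, 130]
words = "the quick brown fox jumps over the lazy sleeping dog".split()
print(pause_to_word_ratio(gaps, words))  # 0.3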

Real time adaptive machine translation: cdec and TransCenter
Michael Denkowski | Alon Lavie | Isabel Lacruz | Chris Dyer
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas

cdec Realtime and TransCenter provide an end-to-end experimental setup for machine translation post-editing research. Realtime provides a framework for building adaptive MT systems that learn from post-editor feedback while TransCenter incorporates a web-based translation interface that connects users to these systems and logs post-editing activity. This combination allows the straightforward deployment of MT systems specifically for post-editing and analysis of translator productivity when working with adaptive systems. Both toolkits are freely available under open source licenses.

Learning from Post-Editing: Online Model Adaptation for Statistical Machine Translation
Michael Denkowski | Chris Dyer | Alon Lavie
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics

2013

Analyzing and Predicting MT Utility and Post-Editing Productivity in Enterprise-scale Translation Projects
Alon Lavie | Olga Beregovaya | Michael Denkowski | David Clarke
Proceedings of Machine Translation Summit XIV: User track

The CMU Machine Translation Systems at WMT 2013: Syntax, Synthetic Translation Options, and Pseudo-References
Waleed Ammar | Victor Chahuneau | Michael Denkowski | Greg Hanneman | Wang Ling | Austin Matthews | Kenton Murray | Nicola Segall | Alon Lavie | Chris Dyer
Proceedings of the Eighth Workshop on Statistical Machine Translation

2012

The CMU-Avenue French-English Translation System
Michael Denkowski | Greg Hanneman | Alon Lavie
Proceedings of the Seventh Workshop on Statistical Machine Translation

Challenges in Predicting Machine Translation Utility for Human Post-Editors
Michael Denkowski | Alon Lavie
Proceedings of the 10th Conference of the Association for Machine Translation in the Americas: Research Papers

As machine translation quality continues to improve, the idea of using MT to assist human translators becomes increasingly attractive. In this work, we discuss and provide empirical evidence of the challenges faced when adapting traditional MT systems to provide automatic translations for human post-editors to correct. We discuss the differences between this task and traditional adequacy-based tasks and the challenges that arise when using automatic metrics to predict the amount of effort required to post-edit translations. A series of experiments simulating a real-world localization scenario shows that current metrics under-perform on this task, even when tuned to maximize correlation with expert translator judgments, illustrating the need to rethink traditional MT pipelines when addressing the challenges of this translation task.

2011

Meteor 1.3: Automatic Metric for Reliable Optimization and Evaluation of Machine Translation Systems
Michael Denkowski | Alon Lavie
Proceedings of the Sixth Workshop on Statistical Machine Translation

2010

Extending the METEOR Machine Translation Evaluation Metric to the Phrase Level
Michael Denkowski | Alon Lavie
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

Exploring Normalization Techniques for Human Judgments of Machine Translation Adequacy Collected Using Amazon Mechanical Turk
Michael Denkowski | Alon Lavie
Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk

Turker-Assisted Paraphrasing for English-Arabic Machine Translation
Michael Denkowski | Hassan Al-Haj | Alon Lavie
Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk

METEOR-NEXT and the METEOR Paraphrase Tables: Improved Evaluation Support for Five Target Languages
Michael Denkowski | Alon Lavie
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

Choosing the Right Evaluation for Machine Translation: an Examination of Annotator and Automatic Metric Performance on Human Judgment Tasks
Michael Denkowski | Alon Lavie
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers

This paper examines the motivation, design, and practical results of several types of human evaluation tasks for machine translation. In addition to considering annotator performance and task informativeness over multiple evaluations, we explore the practicality of tuning automatic evaluation metrics to each judgment type in a comprehensive experiment using the METEOR-NEXT metric. We present results showing clear advantages of tuning to certain types of judgments and discuss causes of inconsistency when tuning to various judgment data, as well as sources of difficulty in the human evaluation tasks themselves.