Maria Zimina


2023

pdf bib
Investigating Techniques for a Deeper Understanding of Neural Machine Translation (NMT) Systems through Data Filtering and Fine-tuning Strategies
Lichao Zhu | Maria Zimina | Maud Bénard | Behnoosh Namdar | Nicolas Ballier | Guillaume Wisniewski | Jean-Baptiste Yunès
Proceedings of the Eighth Conference on Machine Translation

In the context of this biomedical shared task, we have implemented data filters to enhance the selection of relevant training data for fine- tuning from the available training data sources. Specifically, we have employed textometric analysis to detect repetitive segments within the test set, which we have then used for re- fining the training data used to fine-tune the mBart-50 baseline model. Through this approach, we aim to achieve several objectives: developing a practical fine-tuning strategy for training biomedical in-domain fr<>en models, defining criteria for filtering in-domain training data, and comparing model predictions, fine-tuning data in accordance with the test set to gain a deeper insight into the functioning of Neural Machine Translation (NMT) systems.

pdf bib
Translating Dislocations or Parentheticals : Investigating the Role of Prosodic Boundaries for Spoken Language Translation of French into English
Nicolas Ballier | Behnoosh Namdarzadeh | Maria Zimina | Jean-Baptiste Yunès
Proceedings of Machine Translation Summit XIX, Vol. 2: Users Track

This paper examines some of the effects of prosodic boundaries on ASR outputs and Spoken Language Translations into English for two competing French structures (“c’est” dislocation vs. “c’est” parentheticals). One native speaker of French read 104 test sentences that were then submitted to two systems. We compared the outputs of two toolkits, SYSTRAN Pure Neural Server (SPNS9) (Crego et al., 2016) and Whisper. For SPNS9, we compared the translation of the text file used for the reading with the translation of the transcription generated through Vocapia ASR. We also tested the transcription engine for speech recognition uploading an MP3 file and used the same procedure for AI Whisper’s Web-scale Supervised Pretraining for Speech Recognition system (Radford et al., 2022). We reported WER for the transcription tasks and the BLEU scores for the different models. We evidenced the variability of the punctuation in the ASR outputs and discussed it in relation to the duration of the utterance. We discussed the effects of the prosodic boundaries. We described the status of the boundary in the speech-to-text systems, discussing the consequence for the neural machine translation of the rendering of the prosodic boundary by a comma, a full stop, or any other punctuation symbol. We used the reference transcript of the reading phase to compute the edit distance between the reference transcript and the ASR output. We also used textometric analyses with iTrameur (Fleury and Zimina, 2014) for insights into the errors that can be attributed to ASR or to Neural Machine translation.

2022

pdf bib
The SPECTRANS System Description for the WMT22 Biomedical Task
Nicolas Ballier | Jean-baptiste Yunès | Guillaume Wisniewski | Lichao Zhu | Maria Zimina
Proceedings of the Seventh Conference on Machine Translation (WMT)

This paper describes the SPECTRANS submission for the WMT 2022 biomedical shared task. We present the results of our experiments using the training corpora and the JoeyNMT (Kreutzer et al., 2019) and SYSTRAN Pure Neural Server/ Advanced Model Studio toolkits for the language directions English to French and French to English. We compare the pre- dictions of the different toolkits. We also use JoeyNMT to fine-tune the model with a selection of texts from WMT, Khresmoi and UFAL data sets. We report our results and assess the respective merits of the different translated texts.

2014

pdf bib
Trameur: A Framework for Annotated Text Corpora Exploration
Serge Fleury | Maria Zimina
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: System Demonstrations