Krzysztof Marasek


2017

PJIIT’s systems for WMT 2017 Conference
Krzysztof Wolk | Krzysztof Marasek
Proceedings of the Second Conference on Machine Translation

2016

PJAIT Systems for the WMT 2016
Krzysztof Wolk | Krzysztof Marasek
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

2015

PJAIT systems for the IWSLT 2015 evaluation campaign enhanced by comparable corpora
Krzysztof Wolk | Krzysztof Marasek
Proceedings of the 12th International Workshop on Spoken Language Translation: Evaluation Campaign

Unsupervised comparable corpora preparation and exploration for bi-lingual translation equivalents
Krzysztof Wolk | Krzysztof Marasek
Proceedings of the 12th International Workshop on Spoken Language Translation: Papers

2014

Polish-English speech statistical machine translation systems for the IWSLT 2014
Krzysztof Wolk | Krzysztof Marasek
Proceedings of the 11th International Workshop on Spoken Language Translation: Evaluation Campaign

This research explores the effects of various training settings on Polish-English statistical machine translation systems for spoken language. Various elements of the TED parallel text corpora for the IWSLT 2014 evaluation campaign, together with Wikipedia-based comparable corpora that we prepared, were used as the basis for training language models and for development, tuning, and testing of the translation system. The BLEU, NIST, METEOR, and TER metrics were used to evaluate the effects of data preparation on translation results. Our experiments included systems that use lemma and morphological information on Polish words. We also conducted a deep analysis of the provided Polish data as preparatory work for the automatic data correction and cleaning phase.

2013

Polish-English speech statistical machine translation systems for the IWSLT 2013
Krzysztof Wolk | Krzysztof Marasek
Proceedings of the 10th International Workshop on Spoken Language Translation: Evaluation Campaign

This research explores the effects of various training settings on a Polish-to-English statistical machine translation system for spoken language. Various elements of the TED parallel text corpora for the IWSLT 2013 evaluation campaign were used as the basis for training language models and for development, tuning, and testing of the translation system. The BLEU, NIST, METEOR, and TER metrics were used to evaluate the effects of data preparation on translation results. Our experiments included systems that use stems and morphological information on Polish words. We also conducted a deep analysis of the provided Polish data as preparatory work for the automatic data correction and cleaning phase.

2012

TED Polish-to-English translation system for the IWSLT 2012
Krzysztof Marasek
Proceedings of the 9th International Workshop on Spoken Language Translation: Evaluation Campaign

This paper presents efforts to prepare a Polish-to-English SMT system for the TED lectures domain, to be evaluated during the IWSLT 2012 Conference. Our attempts cover systems that use stems and morphological information on Polish words (using two different tools), as well as stems and POS.

2008

Design and Data Collection for Spoken Polish Dialogs Database
Krzysztof Marasek | Ryszard Gubrynowicz
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Spoken corpora provide a critical resource for research, development, and evaluation of spoken dialog systems. This paper describes the telephone spoken dialog corpus for Polish created by the Polish-Japanese Institute of Information Technology team within the LUNA project (IST 033549). The main goal of this project is to create a robust natural spoken language understanding (SLU) toolkit, which can be used to improve speech-enabled telecom services in a multilingual context (Italian, French, and Polish). The corpus was collected at the call center of the Warsaw Transport Authority, manually transcribed, and richly annotated on acoustic, syntactic, and semantic levels. The most frequent user requests concern city traffic information (public transportation stops, routes, schedules, trip planning, etc.). The collected database consists of two parts: 500 human-human dialogs totaling approx. 670 minutes with a vocabulary of ca. 8,000 words, and 500 human-machine dialogs recorded using the Wizard-of-Oz paradigm. The syntactic and semantic annotation is carried out by another team (Mykowiecka et al., 2007). This database is the first one collected for spontaneous Polish speech recorded through telecommunication lines and will be used for development and evaluation of automatic speech recognition (ASR) and robust natural spoken language understanding (SLU) components.

2002

SPEECON – Speech Databases for Consumer Devices: Database Specification and Validation
Dorota Iskra | Beate Grosskopf | Krzysztof Marasek | Henk van den Heuvel | Frank Diehl | Andreas Kiessling
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

2000

SPEECON - Speech Data for Consumer Devices
Rainer Siemund | Harald Höge | Siegfried Kunzmann | Krzysztof Marasek
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)