Jakub Sido

2023

pdf bib abs
MQDD: Pre-training of Multimodal Question Duplicity Detection for Software Engineering Domain
Jan Pasek | Jakub Sido | Miloslav Konopik | Ondrej Prazak
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

This work proposes a new pipeline for leveraging data collected on the Stack Overflow website for pre-training a multimodal model for searching duplicates on question answering websites. Our multimodal model is trained on question descriptions and source codes in multiple programming languages. We design two new learning objectives to improve duplicate detection capabilities. The result of this work is a mature, fine-tuned Multimodal Question Duplicity Detection (MQDD) model, ready to be integrated into a Stack Overflow search system, where it can help users find answers for already answered questions. Alongside the MQDD model, we release two datasets related to the software engineering domain. The first Stack Overflow Dataset (SOD) represents a massive corpus of paired questions and answers. The second Stack Overflow Duplicity Dataset (SODD) contains data for training duplicate detection models.

This paper summarizes the second edition of the shared task on multilingual coreference resolution, held with the CRAC 2023 workshop. Just like last year, participants of the shared task were to create trainable systems that detect mentions and group them based on identity coreference; however, this year’s edition uses a slightly different primary evaluation score, and is also broader in terms of covered languages: version 1.1 of the multilingual collection of harmonized coreference resources CorefUD was used as the source of training and evaluation data this time, with 17 datasets for 12 languages. 7 systems competed in this shared task.

2022

This paper presents an overview of the shared task on multilingual coreference resolution associated with the CRAC 2022 workshop. Shared task participants were supposed to develop trainable systems capable of identifying mentions and clustering them according to identity coreference. The public edition of CorefUD 1.0, which contains 13 datasets for 10 languages, was used as the source of training and evaluation data. The CoNLL score used in previous coreference-oriented shared tasks was used as the main evaluation metric. There were 8 coreference prediction systems submitted by 5 participating teams; in addition, there was a competitive Transformer-based baseline system provided by the organizers at the beginning of the shared task. The winner system outperformed the baseline by 12 percentage points (in terms of the CoNLL scores averaged across all datasets for individual languages).

2021

pdf bib abs
Multilingual Coreference Resolution with Harmonized Annotations
Ondřej Pražák | Miloslav Konopík | Jakub Sido
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

In this paper, we present coreference resolution experiments with a newly created multilingual corpus CorefUD (Nedoluzhko et al.,2021). We focus on the following languages: Czech, Russian, Polish, German, Spanish, and Catalan. In addition to monolingual experiments, we combine the training data in multilingual experiments and train two joined models - for Slavic languages and for all the languages together. We rely on an end-to-end deep learning model that we slightly adapted for the CorefUD corpus. Our results show that we can profit from harmonized annotations, and using joined models helps significantly for the languages with smaller training data.

pdf bib abs
Czert – Czech BERT-like Model for Language Representation
Jakub Sido | Ondřej Pražák | Pavel Přibáň | Jan Pašek | Michal Seják | Miloslav Konopík
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

This paper describes the training process of the first Czech monolingual language representation models based on BERT and ALBERT architectures. We pre-train our models on more than 340K of sentences, which is 50 times more than multilingual models that include Czech data. We outperform the multilingual models on 9 out of 11 datasets. In addition, we establish the new state-of-the-art results on nine datasets. At the end, we discuss properties of monolingual and multilingual models based upon our results. We publish all the pre-trained and fine-tuned models freely for the research community.

2020

pdf bib abs
UWB at SemEval-2020 Task 1: Lexical Semantic Change Detection
Ondřej Pražák | Pavel Přibáň | Stephen Taylor | Jakub Sido
Proceedings of the Fourteenth Workshop on Semantic Evaluation

In this paper, we describe our method for detection of lexical semantic change, i.e., word sense changes over time. We examine semantic differences between specific words in two corpora, chosen from different time periods, for English, German, Latin, and Swedish. Our method was created for the SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection. We ranked 1st in Sub-task 1: binary change detection, and 4th in Sub-task 2: ranked change detection. We present our method which is completely unsupervised and language independent. It consists of preparing a semantic vector space for each corpus, earlier and later; computing a linear transformation between earlier and later spaces, using Canonical Correlation Analysis and orthogonal transformation;and measuring the cosines between the transformed vector for the target word from the earlier corpus and the vector for the target word in the later corpus.

Co-authors