Gergő Szabó


2023

A Question Answering Benchmark Database for Hungarian
Attila Novák | Borbála Novák | Tamás Zombori | Gergő Szabó | Zsolt Szántó | Richárd Farkas
Proceedings of the 17th Linguistic Annotation Workshop (LAW-XVII)

In the research presented in this article, we created a new question answering benchmark database for Hungarian called MILQA. When creating the dataset, we largely followed the principles of the English SQuAD 2.0; however, as in some more recent English question answering datasets, we introduced a number of innovations beyond SQuAD, e.g., yes/no questions, list-like answers consisting of several text spans, long answers, questions requiring calculation, and other question types where the answer cannot simply be copied from the text. For all these non-extractive question types, the pragmatically adequate form of the answer was also added to make the training of generative models possible. We implemented and evaluated a set of baseline retrieval and answer span extraction models on the dataset. BM25 performed better than any vector-based solution for retrieval. Cross-lingual transfer from English significantly improved span extraction models.
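
For readers unfamiliar with the lexical retrieval baseline named in the abstract, the following is a minimal sketch of Okapi BM25 scoring. It is not the authors' implementation: the function name `bm25_scores`, the parameter values `k1=1.5` and `b=0.75`, the whitespace tokenization, and the toy Hungarian sentences are all assumptions made for illustration, and none of the example text comes from MILQA.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25.

    k1 and b are common default values, assumed here; the paper's settings
    are not stated in the abstract.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed, non-negative IDF variant.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy passages (invented for this sketch), split on whitespace.
docs = [
    "a magyar főváros budapest".split(),
    "a duna európa második leghosszabb folyója".split(),
    "budapest a duna két partján fekszik".split(),
]
query = "duna budapest".split()

scores = bm25_scores(query, docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

In this toy setup the third passage ranks first because it is the only one containing both query terms. A realistic Hungarian pipeline would likely need lemmatization or stemming before scoring, since the language is morphologically rich; that step is omitted here.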