Richárd Farkas

Also published as: Richard Farkas

2023

Within the research presented in this article, we created a new question answering benchmark database for Hungarian called MILQA. When creating the dataset, we basically followed the principles of the English SQuAD 2.0, however, like in some more recent English question answering datasets, we introduced a number of innovations beyond SQuAD: e.g., yes/no-questions, list-like answers consisting of several text spans, long answers, questions requiring calculation and other question types where you cannot simply copy the answer from the text. For all these non-extractive question types, the pragmatically adequate form of the answer was also added to make the training of generative models possible. We implemented and evaluated a set of baseline retrieval and answer span extraction models on the dataset. BM25 performed better than any vector-based solution for retrieval. Cross-lingual transfer from English significantly improved span extraction models.

2018

pdf bib
SzegedKoref: A Hungarian Coreference Corpus
Veronika Vincze | Klára Hegedűs | Alex Sliz-Nagy | Richárd Farkas
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

pdf bib abs
Universal Dependencies and Morphology for Hungarian - and on the Price of Universality
Veronika Vincze | Katalin Simkó | Zsolt Szántó | Richárd Farkas
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

In this paper, we present how the principles of universal dependencies and morphology have been adapted to Hungarian. We report the most challenging grammatical phenomena and our solutions to those. On the basis of the adapted guidelines, we have converted and manually corrected 1,800 sentences from the Szeged Treebank to universal dependency format. We also introduce experiments on this manually annotated corpus for evaluating automatic conversion and the added value of language-specific, i.e. non-universal, annotations. Our results reveal that converting to universal dependencies is not necessarily trivial, moreover, using language-specific morphological features may have an impact on overall performance.

2014

pdf bib
Introducing the IMS-Wrocław-Szeged-CIS entry at the SPMRL 2014 Shared Task: Reranking and Morpho-syntax meet Unlabeled Data
Anders Björkelund | Özlem Çetinoğlu | Agnieszka Faleńska | Richárd Farkas | Thomas Mueller | Wolfgang Seeker | Zsolt Szántó
Proceedings of the First Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Languages

pdf bib abs
Szeged Corpus 2.5: Morphological Modifications in a Manually POS-tagged Hungarian Corpus
Veronika Vincze | Viktor Varga | Katalin Ilona Simkó | János Zsibrita | Ágoston Nagy | Richárd Farkas | János Csirik
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The Szeged Corpus is the largest manually annotated database containing the possible morphological analyses and lemmas for each word form. In this work, we present its latest version, Szeged Corpus 2.5, in which the new harmonized morphological coding system of Hungarian has been employed and, on the other hand, the majority of misspelled words have been corrected and tagged with the proper morphological code. New morphological codes are introduced for participles, causative / modal / frequentative verbs, adverbial pronouns and punctuation marks, moreover, the distinction between common and proper nouns is eliminated. We also report some statistical data on the frequency of the new morphological codes. The new version of the corpus made it possible to train magyarlanc, a data-driven POS-tagger of Hungarian on a dataset with the new harmonized codes. According to the results, magyarlanc is able to achieve a state-of-the-art accuracy score on the 2.5 version as well.

pdf bib
An Empirical Evaluation of Automatic Conversion from Constituency to Dependency in Hungarian
Katalin Ilona Simkó | Veronika Vincze | Zsolt Szántó | Richárd Farkas
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

pdf bib
Special Techniques for Constituent Parsing of Morphologically Rich Languages
Zsolt Szántó | Richárd Farkas
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics

pdf bib
Dependency parsing with latent refinements of part-of-speech tags
Thomas Mueller | Richard Farkas | Alex Judea | Helmut Schmid | Hinrich Schuetze
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

pdf bib
SZTE-NLP: Aspect level opinion mining exploiting syntactic cues
Viktor Hangya | Gábor Berend | István Varga | Richárd Farkas
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

pdf bib
SZTE-NLP: Clinical Text Analysis with Named Entity Recognition
Melinda Katona | Richárd Farkas
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

2013

pdf bib
Full-coverage Identification of English Light Verb Constructions
István Nagy T. | Veronika Vincze | Richárd Farkas
Proceedings of the Sixth International Joint Conference on Natural Language Processing

pdf bib
Keyphrase-Driven Document Visualization Tool
Gábor Berend | Richárd Farkas
The Companion Volume of the Proceedings of IJCNLP 2013: System Demonstrations

pdf bib
Knowledge Sources for Constituent Parsing of German, a Morphologically Rich and Less-Configurational Language
Alexander Fraser | Helmut Schmid | Richárd Farkas | Renjing Wang | Hinrich Schütze
Computational Linguistics, Volume 39, Issue 1 - March 2013

pdf bib
SZTE-NLP: Sentiment Detection on Twitter Messages
Viktor Hangya | Gábor Berend | Richárd Farkas
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)

Joint morphological and syntactic analysis has been proposed as a way of improving parsing accuracy for richly inflected languages. Starting from a transition-based model for joint part-of-speech tagging and dependency parsing, we explore different ways of integrating morphological features into the model. We also investigate the use of rule-based morphological analyzers to provide hard or soft lexical constraints and the use of word clusters to tackle the sparsity of lexical features. Evaluation on five morphologically rich languages (Czech, Finnish, German, Hungarian, and Russian) shows consistent improvements in both morphological and syntactic accuracy for joint prediction over a pipeline model, with further improvements thanks to lexical constraints and word clusters. The final results improve the state of the art in dependency parsing for all languages.

pdf bib
magyarlanc: A Tool for Morphological and Dependency Parsing of Hungarian
János Zsibrita | Veronika Vincze | Richárd Farkas
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

pdf bib
Identifying English and Hungarian Light Verb Constructions: A Contrastive Approach
Veronika Vincze | István Nagy T. | Richárd Farkas
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Munich-Edinburgh-Stuttgart Submissions of OSM Systems at WMT13
Nadir Durrani | Alexander Fraser | Helmut Schmid | Hassan Sajjad | Richárd Farkas
Proceedings of the Eighth Workshop on Statistical Machine Translation

pdf bib
LFG-based Features for Noun Number and Article Grammatical Errors
Gábor Berend | Veronika Vincze | Sina Zarrieß | Richárd Farkas
Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task

pdf bib
(Re)ranking Meets Morphosyntax: State-of-the-art Results from the SPMRL 2013 Shared Task
Anders Björkelund | Özlem Çetinoğlu | Richárd Farkas | Thomas Mueller | Wolfgang Seeker
Proceedings of the Fourth Workshop on Statistical Parsing of Morphologically-Rich Languages

To create the first Hungarian WSD corpus, 39 suitable word form samples were selected for the purpose of word sense disambiguation. Among others, selection criteria required the given word form to be frequent in Hungarian language usage, and to have more than one sense considered frequent in usage. HNC and its Heti Világgazdaság subcorpus provided the basis for corpus text selection. This way, each sample has a relevant context (whole article), and information on the lemma, POS-tagging and automatic tokenization is also available. When planning the corpus, 300-500 samples of each word form were to be annotated. This size makes it possible that the subcorpora prepared for the individual word forms can be compared to data available for other languages. However, the finalized database also contains unannotated samples and samples with single annotation, which were annotated only by one of the linguists. The corpus follows the ACLs SensEval/SemEval WSD tasks format. The first version of the corpus was developed within the scope of the project titled The construction Hungarian WordNet Ontology and its application in Information Extraction Systems (Hatvani et al., 2007). The corpus for research and educational purposes is available and can be downloaded free of charge.

2007

pdf bib
GYDER: Maxent Metonymy Resolution
Richárd Farkas | Eszter Simon | György Szarvas | Dániel Varga
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

2006

pdf bib abs
A highly accurate Named Entity corpus for Hungarian
György Szarvas | Richárd Farkas | László Felföldi | András Kocsor | János Csirik
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

A highly accurate Named Entity (NE) corpus for Hungarian that is publicly available for research purposes is introduced in the paper, along with its main properties. The results of experiments that apply various Machine Learning models and classifier combination schemes are also presented to serve as a benchmark for further research based on the corpus. The data is a segment of the Szeged Corpus (Csendes et al., 2004), consisting of short business news articles collected from MTI (Hungarian News Agency, www.mti.hu). The annotation procedure was carried out paying special attention to annotation accuracy. The corpus went through a parallel annotation phase done by two annotators, resulting in a tagging with inter-annotator agreement rate of 99.89%. Controversial taggings were collected and discussed by the two annotators and a linguist with several years of experience in corpus annotation. These examples were tagged following the decision they made together, and finally all entities that had suspicious or dubious annotations were collected and checked for consistency. We consider the result of this correcting process virtually be free of errors. Our best performing Named Entity Recognizer (NER) model attained an accuracy of 92.86% F measure on the corpus.