Veronika Vincze


2023

PARSEME corpus release 1.3
Agata Savary | Cherifa Ben Khelil | Carlos Ramisch | Voula Giouli | Verginica Barbu Mititelu | Najet Hadj Mohamed | Cvetana Krstev | Chaya Liebeskind | Hongzhi Xu | Sara Stymne | Tunga Güngör | Thomas Pickard | Bruno Guillaume | Eduard Bejček | Archna Bhatia | Marie Candito | Polona Gantar | Uxoa Iñurrieta | Albert Gatt | Jolanta Kovalevskaitė | Timm Lichte | Nikola Ljubešić | Johanna Monti | Carla Parra Escartín | Mehrnoush Shamsfard | Ivelina Stoyanova | Veronika Vincze | Abigail Walsh
Proceedings of the 19th Workshop on Multiword Expressions (MWE 2023)

We present version 1.3 of the PARSEME multilingual corpus annotated with verbal multiword expressions. Since the previous version, new languages have joined the undertaking of creating such a resource, some of the already existing corpora have been enriched with new annotated texts, while others have been enhanced in various ways. The PARSEME multilingual corpus now covers 26 languages. All monolingual corpora therein use the Universal Dependencies v.2 tagset. They are (re-)split observing the PARSEME v.1.2 standard, which puts emphasis on unseen VMWEs. With the current iteration, the corpus release process has been detached from shared tasks; instead, a process for continuous improvement and systematic releases has been introduced.

2022

Linguistic Parameters of Spontaneous Speech for Identifying Mild Cognitive Impairment and Alzheimer Disease
Veronika Vincze | Martina Katalin Szabó | Ildikó Hoffmann | László Tóth | Magdolna Pákáski | János Kálmán | Gábor Gosztolya
Computational Linguistics, Volume 48, Issue 1 - March 2022

In this article, we seek to automatically identify Hungarian patients suffering from mild cognitive impairment (MCI) or mild Alzheimer disease (mAD) based on their speech transcripts, focusing only on linguistic features. In addition to the features examined in our earlier study, we introduce syntactic, semantic, and pragmatic features of spontaneous speech that might affect the detection of dementia. In order to ascertain the most useful features for distinguishing healthy controls, MCI patients, and mAD patients, we carry out a statistical analysis of the data and investigate the significance level of the extracted features among various speaker group pairs and for various speaking tasks. In the second part of the article, we use this rich feature set as a basis for an effective discrimination among the three speaker groups. In our machine learning experiments, we analyze the efficacy of each feature group separately. Our model that uses all the features achieves competitive scores, either with or without demographic information (3-class accuracy values: 68%–70%, 2-class accuracy values: 77.3%–80%). We also analyze how different data recording scenarios affect linguistic features and how they can be productively used when distinguishing MCI patients from healthy controls.

2020

Pártélet: A Hungarian Corpus of Propaganda Texts from the Hungarian Socialist Era
Zoltán Kmetty | Veronika Vincze | Dorottya Demszky | Orsolya Ring | Balázs Nagy | Martina Katalin Szabó
Proceedings of the Twelfth Language Resources and Evaluation Conference

In this paper, we present Pártélet, a digitized Hungarian corpus of Communist propaganda texts. Pártélet was the official journal of the governing party during the Hungarian socialist era from 1956 to 1989; hence, it represents the direct political agitation and propaganda of the dictatorial system in question. The paper has a dual purpose: first, to present a general review of the corpus compilation process and the basic statistical data of the corpus, and second, to demonstrate through two case studies what the dataset can be used for. We show that our corpus provides a unique opportunity for conducting research on Hungarian propaganda discourse, as well as for analyzing changes in this discourse over a 35-year period with computer-assisted methods.

apPILcation: an Android-based Tool for Learning Mansi
Gábor Bobály | Csilla Horváth | Veronika Vincze
Proceedings of the Sixth International Workshop on Computational Linguistics of Uralic Languages

Automatic Detection of Hungarian Clickbait and Entertaining Fake News
Veronika Vincze | Martina Katalin Szabó
Proceedings of the 3rd International Workshop on Rumours and Deception in Social Media (RDSM)

Online news does not always come from reliable sources, and it is not always even realistic. The constantly growing amount of online textual data has recently raised the need for detecting deception and bias in texts from different domains. In this paper, we identify different types of unrealistic news (clickbait and fake news written for entertainment purposes) written in Hungarian on the basis of a rich feature set and with the help of machine learning methods. Our tool achieves competitive scores: it is able to classify clickbait, fake news written for entertainment purposes and real news with an accuracy of over 80%. We also highlight that morphological features perform the best in this classification task.

2018

Edition 1.1 of the PARSEME Shared Task on Automatic Identification of Verbal Multiword Expressions
Carlos Ramisch | Silvio Ricardo Cordeiro | Agata Savary | Veronika Vincze | Verginica Barbu Mititelu | Archna Bhatia | Maja Buljan | Marie Candito | Polona Gantar | Voula Giouli | Tunga Güngör | Abdelati Hawwari | Uxoa Iñurrieta | Jolanta Kovalevskaitė | Simon Krek | Timm Lichte | Chaya Liebeskind | Johanna Monti | Carla Parra Escartín | Behrang QasemiZadeh | Renata Ramisch | Nathan Schneider | Ivelina Stoyanova | Ashwini Vaidya | Abigail Walsh
Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018)

This paper describes the PARSEME Shared Task 1.1 on automatic identification of verbal multiword expressions. We present the annotation methodology, focusing on changes from last year’s shared task. Novel aspects include enhanced annotation guidelines, additional annotated data for most languages, corpora for some new languages, and new evaluation settings. Corpora were created for 20 languages, which are also briefly discussed. We report organizational principles behind the shared task and the evaluation metrics employed for ranking. The 17 participating systems, their methods and obtained results are also presented and analysed.

SzegedKoref: A Hungarian Coreference Corpus
Veronika Vincze | Klára Hegedűs | Alex Sliz-Nagy | Richárd Farkas
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

E-magyar – A Digital Language Processing System
Tamás Váradi | Eszter Simon | Bálint Sass | Iván Mittelholcz | Attila Novák | Balázs Indig | Richárd Farkas | Veronika Vincze
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Language technology resources and tools for Mansi: an overview
Csilla Horváth | Norbert Szilágyi | Veronika Vincze | Ágoston Nagy
Proceedings of the Third Workshop on Computational Linguistics for Uralic Languages

Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017)
Stella Markantonatou | Carlos Ramisch | Agata Savary | Veronika Vincze
Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017)

The PARSEME Shared Task on Automatic Identification of Verbal Multiword Expressions
Agata Savary | Carlos Ramisch | Silvio Cordeiro | Federico Sangati | Veronika Vincze | Behrang QasemiZadeh | Marie Candito | Fabienne Cap | Voula Giouli | Ivelina Stoyanova | Antoine Doucet
Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017)

Multiword expressions (MWEs) are known as a “pain in the neck” for NLP due to their idiosyncratic behaviour. While some categories of MWEs have been addressed by many studies, verbal MWEs (VMWEs), such as to take a decision, to break one’s heart or to turn off, have been rarely modelled. This is notably due to their syntactic variability, which hinders treating them as “words with spaces”. We describe an initiative meant to bring about substantial progress in understanding, modelling and processing VMWEs. It is a joint effort, carried out within a European research network, to elaborate universal terminologies and annotation guidelines for 18 languages. Its main outcome is a multilingual 5-million-word annotated corpus which underlies a shared task on automatic identification of VMWEs. This paper presents the corpus annotation methodology and outcome, the shared task organisation and the results of the participating systems.

USzeged: Identifying Verbal Multiword Expressions with POS Tagging and Parsing Techniques
Katalin Ilona Simkó | Viktória Kovács | Veronika Vincze
Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017)

The paper describes our system submitted for the Workshop on Multiword Expressions’ shared task on automatic identification of verbal multiword expressions. It uses POS tagging and dependency parsing to identify single- and multi-token verbal MWEs in text. Our system is language independent and competed on nine of the eighteen languages. Our paper describes how our system works and gives its error analysis for the languages it was submitted for.

Verb-Particle Constructions in Questions
Veronika Vincze
Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017)

In this paper, we investigate the behavior of verb-particle constructions in English questions. We present a small dataset that contains questions and verb-particle construction candidates. We demonstrate that there are significant differences in the distribution of WH-words, verbs and prepositions/particles in sentences that contain VPCs and sentences that contain only verb + prepositional phrase combinations both by statistical means and in machine learning experiments. Hence, VPCs and non-VPCs can be effectively separated from each other by using a rich feature set, containing several novel features.

Hungarian Copula Constructions in Dependency Syntax and Parsing
Katalin Ilona Simkó | Veronika Vincze
Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017)

Universal Dependencies and Morphology for Hungarian - and on the Price of Universality
Veronika Vincze | Katalin Simkó | Zsolt Szántó | Richárd Farkas
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

In this paper, we present how the principles of universal dependencies and morphology have been adapted to Hungarian. We report the most challenging grammatical phenomena and our solutions to those. On the basis of the adapted guidelines, we have converted and manually corrected 1,800 sentences from the Szeged Treebank to universal dependency format. We also introduce experiments on this manually annotated corpus for evaluating automatic conversion and the added value of language-specific, i.e. non-universal, annotations. Our results reveal that converting to universal dependencies is not necessarily trivial; moreover, using language-specific morphological features may have an impact on overall performance.

2016

A Hungarian Sentiment Corpus Manually Annotated at Aspect Level
Martina Katalin Szabó | Veronika Vincze | Katalin Ilona Simkó | Viktor Varga | Viktor Hangya
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper we present a Hungarian sentiment corpus manually annotated at aspect level. Our corpus consists of Hungarian opinion texts written about different types of products. The main aim of creating the corpus was to produce an appropriate database providing possibilities for developing text mining software tools. The corpus is a unique Hungarian database: to the best of our knowledge, no digitized Hungarian sentiment corpus that is annotated on the level of fragments and targets has been made so far. In addition, many language elements of the corpus that are relevant from the point of view of sentiment analysis received distinct types of tags in the annotation. In this paper, on the one hand, we present the method of annotation, and we discuss the difficulties concerning the text annotation process. On the other hand, we provide some quantitative and qualitative data on the corpus. We conclude with a description of the applicability of the corpus.

Universal Morphology for Old Hungarian
Eszter Simon | Veronika Vincze
Proceedings of the 10th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

Detecting Uncertainty Cues in Hungarian Social Media Texts
Veronika Vincze
Proceedings of the Workshop on Extra-Propositional Aspects of Meaning in Computational Linguistics (ExProM)

In this paper, we aim at identifying uncertainty cues in Hungarian social media texts. We present our machine learning based uncertainty detector, which relies on a rich feature set including lexical, morphological, syntactic, semantic and discourse-based features, and we evaluate our system on a small set of manually annotated social media texts. We also carry out cross-domain and domain adaptation experiments using an annotated corpus of standard Hungarian texts and show that domain differences significantly affect machine learning. Furthermore, we argue that differences among uncertainty cue types may also affect the efficiency of uncertainty detection.

Where Bears Have the Eyes of Currant: Towards a Mansi WordNet
Csilla Horváth | Ágoston Nagy | Norbert Szilágyi | Veronika Vincze
Proceedings of the 8th Global WordNet Conference (GWC)

Here we report the construction of a wordnet for Mansi, an endangered minority language spoken in Russia. We will pay special attention to challenges that we encountered during the building process, among which the most important ones are the low number of native speakers, the lack of thesauri and the bear language. We will discuss our solutions to these issues, which might have some theoretical implications for the methodology of wordnet building in general.

Detecting Mild Cognitive Impairment by Exploiting Linguistic Information from Transcripts
Veronika Vincze | Gábor Gosztolya | László Tóth | Ildikó Hoffmann | Gréta Szatlóczki | Zoltán Bánréti | Magdolna Pákáski | János Kálmán
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2014

Szeged Corpus 2.5: Morphological Modifications in a Manually POS-tagged Hungarian Corpus
Veronika Vincze | Viktor Varga | Katalin Ilona Simkó | János Zsibrita | Ágoston Nagy | Richárd Farkas | János Csirik
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The Szeged Corpus is the largest manually annotated database containing the possible morphological analyses and lemmas for each word form. In this work, we present its latest version, Szeged Corpus 2.5, in which, on the one hand, the new harmonized morphological coding system of Hungarian has been employed and, on the other hand, the majority of misspelled words have been corrected and tagged with the proper morphological code. New morphological codes are introduced for participles, causative / modal / frequentative verbs, adverbial pronouns and punctuation marks; moreover, the distinction between common and proper nouns is eliminated. We also report some statistical data on the frequency of the new morphological codes. The new version of the corpus made it possible to train magyarlanc, a data-driven POS-tagger of Hungarian, on a dataset with the new harmonized codes. According to the results, magyarlanc is able to achieve a state-of-the-art accuracy score on the 2.5 version as well.

4FX: Light Verb Constructions in a Multilingual Parallel Corpus
Anita Rácz | István Nagy T. | Veronika Vincze
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper, we describe 4FX, a quadrilingual (English-Spanish-German-Hungarian) parallel corpus annotated for light verb constructions. We present the annotation process, and report statistical data on the frequency of LVCs in each language. We also offer inter-annotator agreement rates and we highlight some interesting facts and tendencies on the basis of comparing multilingual data from the four corpora. According to the frequency of LVC categories and the calculated Kendall’s coefficient for the four corpora, we found that Spanish and German are very similar to each other, Hungarian is also similar to both, but English differs from all these three. The qualitative and quantitative data analysis might prove useful in theoretical linguistic research for all four languages. Moreover, the corpus will be an excellent testbed for the development and evaluation of machine learning based methods aiming at extracting or identifying light verb constructions in these four languages.

Automatic Error Detection concerning the Definite and Indefinite Conjugation in the HunLearner Corpus
Veronika Vincze | János Zsibrita | Péter Durst | Martina Katalin Szabó
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper we present the results of automatic error detection, concerning the definite and indefinite conjugation in the extended version of the HunLearner corpus, the learners’ corpus of the Hungarian language. We present the most typical structures that trigger definite or indefinite conjugation in Hungarian and we also discuss the most frequent types of errors made by language learners in the corpus texts. We also illustrate the error types with sentences taken from the corpus. Our results highlight grammatical structures that might pose problems for learners of Hungarian, which can be fruitfully applied in the teaching and practicing of such constructions from the language teacher’s or learners’ point of view. On the other hand, these results may be exploited in extending the functionalities of a grammar checker, concerning the definiteness of the verb. Our automatic system was able to achieve perfect recall, i.e. it could find all the mismatches between the type of the object and the conjugation of the verb, which is promising for future studies in this area.

Non-Lexicalized Concepts in Wordnets: A Case Study of English and Hungarian
Veronika Vincze | Attila Almási
Proceedings of the Seventh Global Wordnet Conference

VPCTagger: Detecting Verb-Particle Constructions With Syntax-Based Methods
István Nagy T. | Veronika Vincze
Proceedings of the 10th Workshop on Multiword Expressions (MWE)

Annotating Uncertainty in Hungarian Webtext
Veronika Vincze | Katalin Ilona Simkó | Viktor Varga
Proceedings of LAW VIII - The 8th Linguistic Annotation Workshop

An Empirical Evaluation of Automatic Conversion from Constituency to Dependency in Hungarian
Katalin Ilona Simkó | Veronika Vincze | Zsolt Szántó | Richárd Farkas
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

Uncertainty Detection in Hungarian Texts
Veronika Vincze
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

2013

magyarlanc: A Tool for Morphological and Dependency Parsing of Hungarian
János Zsibrita | Veronika Vincze | Richárd Farkas
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

Dependency Parsing for Identifying Hungarian Light Verb Constructions
Veronika Vincze | János Zsibrita | István Nagy T.
Proceedings of the Sixth International Joint Conference on Natural Language Processing

Full-coverage Identification of English Light Verb Constructions
István Nagy T. | Veronika Vincze | Richárd Farkas
Proceedings of the Sixth International Joint Conference on Natural Language Processing

Weasels, Hedges and Peacocks: Discourse-level Uncertainty in Wikipedia Articles
Veronika Vincze
Proceedings of the Sixth International Joint Conference on Natural Language Processing

LFG-based Features for Noun Number and Article Grammatical Errors
Gábor Berend | Veronika Vincze | Sina Zarrieß | Richárd Farkas
Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task

Overview of the SPMRL 2013 Shared Task: A Cross-Framework Evaluation of Parsing Morphologically Rich Languages
Djamé Seddah | Reut Tsarfaty | Sandra Kübler | Marie Candito | Jinho D. Choi | Richárd Farkas | Jennifer Foster | Iakes Goenaga | Koldo Gojenola Galletebeitia | Yoav Goldberg | Spence Green | Nizar Habash | Marco Kuhlmann | Wolfgang Maier | Joakim Nivre | Adam Przepiórkowski | Ryan Roth | Wolfgang Seeker | Yannick Versley | Veronika Vincze | Marcin Woliński | Alina Wróblewska | Eric Villemonte de la Clergerie
Proceedings of the Fourth Workshop on Statistical Parsing of Morphologically-Rich Languages

Identifying English and Hungarian Light Verb Constructions: A Contrastive Approach
Veronika Vincze | István Nagy T. | Richárd Farkas
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2012

Dependency Parsing of Hungarian: Baseline Results and Challenges
Richárd Farkas | Veronika Vincze | Helmut Schmid
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics

How to Evaluate Opinionated Keyphrase Extraction?
Gábor Berend | Veronika Vincze
Proceedings of the 3rd Workshop in Computational Approaches to Subjectivity and Sentiment Analysis

Light Verb Constructions in the SzegedParalellFX English–Hungarian Parallel Corpus
Veronika Vincze
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper, we describe the first English-Hungarian parallel corpus annotated for light verb constructions, which contains 14,261 sentence alignment units. Annotation principles and statistical data on the corpus are also provided, and English and Hungarian data are contrasted. On the basis of corpus data, a database containing pairs of English-Hungarian light verb constructions has been created as well. The corpus and the database can contribute to the automatic detection of light verb constructions and it is also shown how they can enhance performance in several fields of NLP (e.g. parsing, information extraction/retrieval and machine translation).

HunOr: A Hungarian—Russian Parallel Corpus
Martina Katalin Szabó | Veronika Vincze | István Nagy T.
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper, we present HunOr, the first multi-domain Hungarian–Russian parallel corpus. Some of the corpus texts have been manually aligned and split into sentences, and named entities have also been annotated in them, while the other parts are automatically aligned at the sentence level and POS-tagged as well. The corpus contains texts from the domains of literature, official language use and science; however, we would also like to add texts from the news domain to the corpus. In the future, we are planning to carry out a syntactic annotation of the HunOr corpus, which will further enhance the usability of the corpus in various NLP fields such as transfer-based machine translation or cross-lingual information retrieval.

Cross-Genre and Cross-Domain Detection of Semantic Uncertainty
György Szarvas | Veronika Vincze | Richárd Farkas | György Móra | Iryna Gurevych
Computational Linguistics, Volume 38, Issue 2 - June 2012

2011

Noun Compound and Named Entity Recognition and their Usability in Keyphrase Extraction
István Nagy T. | Gábor Berend | Veronika Vincze
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011

Multiword Expressions and Named Entities in the Wiki50 Corpus
Veronika Vincze | István Nagy T. | Gábor Berend
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011

Domain-Dependent Identification of Multiword Expressions
István Nagy T. | Veronika Vincze | Gábor Berend
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011

Domain-Dependent Detection of Light Verb Constructions
István T. Nagy | Gábor Berend | György Móra | Veronika Vincze
Proceedings of the Second Student Research Workshop associated with RANLP 2011

Inter-domain Opinion Phrase Extraction Based on Feature Augmentation
Gábor Berend | István T. Nagy | György Móra | Veronika Vincze
Proceedings of the Second Student Research Workshop associated with RANLP 2011

Detecting Noun Compounds and Light Verb Constructions: a Contrastive Study
Veronika Vincze | István Nagy T. | Gábor Berend
Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World

2010

Hungarian Dependency Treebank
Veronika Vincze | Dóra Szauter | Attila Almási | György Móra | Zoltán Alexin | János Csirik
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Herein, we present the process of developing the first Hungarian Dependency TreeBank. First, short references are made to dependency grammars we considered important in the development of our Treebank. Second, mention is made of existing dependency corpora for other languages. Third, we present the steps of converting the Szeged Treebank into dependency-tree format: from the originally phrase-structured treebank, we produced dependency trees by automatic conversion, then checked and corrected them, thereby creating the first manually annotated dependency corpus for Hungarian. We also go into detail about the two major sets of problems, i.e. coordination and predicative nouns and adjectives. Fourth, we give statistics on the treebank: by now, we have completed the annotation of business news, newspaper articles, legal texts and texts in informatics; at the same time, we are planning to convert the entire corpus into dependency tree format. Finally, we give some hints on the applicability of the system: the present database may be utilized, among others, in information extraction and machine translation as well.

Proceedings of the Fourteenth Conference on Computational Natural Language Learning – Shared Task
Richárd Farkas | Veronika Vincze | György Szarvas | György Móra | János Csirik
Proceedings of the Fourteenth Conference on Computational Natural Language Learning – Shared Task

The CoNLL-2010 Shared Task: Learning to Detect Hedges and their Scope in Natural Language Text
Richárd Farkas | Veronika Vincze | György Móra | János Csirik | György Szarvas
Proceedings of the Fourteenth Conference on Computational Natural Language Learning – Shared Task

Speculation and negation annotation in natural language texts: what the case of BioScope might (not) reveal
Veronika Vincze
Proceedings of the Workshop on Negation and Speculation in Natural Language Processing

Hungarian Corpus of Light Verb Constructions
Veronika Vincze | János Csirik
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

2008

Hungarian Word-Sense Disambiguated Corpus
Veronika Vincze | György Szarvas | Attila Almási | Dóra Szauter | Róbert Ormándi | Richárd Farkas | Csaba Hatvani | János Csirik
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

To create the first Hungarian WSD corpus, 39 suitable word form samples were selected for the purpose of word sense disambiguation. Among others, selection criteria required the given word form to be frequent in Hungarian language usage, and to have more than one sense considered frequent in usage. HNC and its Heti Világgazdaság subcorpus provided the basis for corpus text selection. This way, each sample has a relevant context (whole article), and information on the lemma, POS-tagging and automatic tokenization is also available. When planning the corpus, 300-500 samples of each word form were to be annotated. This size makes it possible to compare the subcorpora prepared for the individual word forms with data available for other languages. However, the finalized database also contains unannotated samples and samples with single annotation, which were annotated by only one of the linguists. The corpus follows the ACL’s SensEval/SemEval WSD tasks format. The first version of the corpus was developed within the scope of the project titled The construction Hungarian WordNet Ontology and its application in Information Extraction Systems (Hatvani et al., 2007). The corpus is available for research and educational purposes and can be downloaded free of charge.

The BioScope corpus: annotation for negation, uncertainty and their scope in biomedical texts
György Szarvas | Veronika Vincze | Richárd Farkas | János Csirik
Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing
