Mojgan Seraji


2016

pdf bib
Universal Dependencies for Persian
Mojgan Seraji | Filip Ginter | Joakim Nivre
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The Persian Universal Dependency Treebank (Persian UD) is a recent effort of treebanking Persian with Universal Dependencies (UD), an ongoing project that designs unified and cross-linguistically valid grammatical representations including part-of-speech tags, morphological features, and dependency relations. The Persian UD is the converted version of the Uppsala Persian Dependency Treebank (UPDT) to the universal dependencies framework and consists of nearly 6,000 sentences and 152,871 word tokens with an average sentence length of 25 words. In addition to the universal dependencies syntactic annotation guidelines, the two treebanks differ in tokenization. All words containing unsegmented clitics (pronominal and copula clitics) annotated with complex labels in the UPDT have been separated from the clitics and appear with distinct labels in the Persian UD. The treebank has its original syntactic annotation scheme based on Stanford Typed Dependencies. In this paper, we present the approaches taken in the development of the Persian UD.

2015

pdf bib
ParsPer: A Dependency Parser for Persian
Mojgan Seraji | Bernd Bohnet | Joakim Nivre
Proceedings of the Third International Conference on Dependency Linguistics (Depling 2015)

2014

pdf bib
A Persian Treebank with Stanford Typed Dependencies
Mojgan Seraji | Carina Jahani | Beáta Megyesi | Joakim Nivre
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We present the Uppsala Persian Dependency Treebank (UPDT) with a syntactic annotation scheme based on Stanford Typed Dependencies. The treebank consists of 6,000 sentences and 151,671 tokens with an average sentence length of 25 words. The data is from different genres, including newspaper articles and fiction, as well as technical descriptions and texts about culture and art, taken from the open source Uppsala Persian Corpus (UPC). The syntactic annotation scheme is extended for Persian to include all syntactic relations that could not be covered by the primary scheme developed for English. In addition, we present open source tools for automatic analysis of Persian containing a text normalizer, a sentence segmenter and tokenizer, a part-of-speech tagger, and a parser. The treebank and the parser have been developed simultaneously in a bootstrapping procedure. The result of a parsing experiment shows an overall labeled attachment score of 82.05% and an unlabeled attachment score of 85.29%. The treebank is freely available as an open source resource.

2012

pdf bib
Dependency Parsers for Persian
Mojgan Seraji | Beata Megyesi | Joakim Nivre
Proceedings of the 10th Workshop on Asian Language Resources

pdf bib
A Basic Language Resource Kit for Persian
Mojgan Seraji | Beáta Megyesi | Joakim Nivre
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Persian with its about 100,000,000 speakers in the world belongs to the group of languages with less developed linguistically annotated resources and tools. The few existing resources and tools are neither open source nor freely available. Thus, our goal is to develop open source resources such as corpora and treebanks, and tools for data-driven linguistic analysis of Persian. We do this by exploring the reusability of existing resources and adapting state-of-the-art methods for the linguistic annotation. We present fully functional tools for text normalization, sentence segmentation, tokenization, part-of-speech tagging, and parsing. As for resources, we describe the Uppsala PErsian Corpus (UPEC) which is a modified version of the Bijankhan corpus with additional sentence segmentation and consistent tokenization modified for more appropriate syntactic annotation. The corpus consists of 2,782,109 tokens and is annotated with parts of speech and morphological features. A treebank is derived from UPEC with an annotation scheme based on Stanford Typed Dependencies and is planned to consist of 10,000 sentences of which 215 have already been annotated. Keywords: BLARK for Persian, PoS tagged corpus, Persian treebank

2011

pdf bib
A Statistical Part-of-Speech Tagger for Persian
Mojgan Seraji
Proceedings of the 18th Nordic Conference of Computational Linguistics (NODALIDA 2011)