Guiyao Ke


2014

Variations on quantitative comparability measures and their evaluations on synthetic French-English comparable corpora
Guiyao Ke | Pierre-Francois Marteau | Gildas Menier
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Following the pioneering work by (CITATION), this paper analyzes a family of quantitative comparability measures dedicated to the construction and evaluation of topical comparable corpora. After recalling the definition of the quantitative comparability measure proposed by (CITATION), we develop several variants of this measure, motivated primarily by the consideration that the occurrence frequencies of lexical entries and the number of their translations are important. We compare the respective advantages and disadvantages of these variants within an evaluation framework based on the progressive degradation of the Europarl parallel corpus. The degradation is obtained by replacing, either deterministically or randomly, a varying number of lines in the blocks that partition the initial Europarl corpus. The impact of bilingual dictionary coverage on these measures is also discussed, and perspectives for future work are presented.
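
The block-wise degradation described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `degrade_blocks`, its parameters, the use of a separate pool of noise lines, and the choice of "the first lines of each block" as the deterministic variant are all assumptions made for illustration.

```python
import random


def degrade_blocks(target_lines, noise_lines, block_size, fraction, rng=None):
    """Return a degraded copy of target_lines (hypothetical helper).

    The lines are cut into consecutive blocks of block_size lines. In each
    block, a share `fraction` of the lines is replaced by lines taken from
    noise_lines. With rng=None the first lines of each block are replaced
    (assumed deterministic variant); otherwise positions are drawn at random.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    if block_size <= 0 or not noise_lines:
        raise ValueError("block_size must be positive and noise_lines non-empty")

    degraded = list(target_lines)
    noise_index = 0
    for start in range(0, len(degraded), block_size):
        block = list(range(start, min(start + block_size, len(degraded))))
        n_replace = round(fraction * len(block))
        if rng is None:
            positions = block[:n_replace]
        else:
            positions = rng.sample(block, n_replace)
        for pos in positions:
            # Cycle through the noise pool so it may be shorter than the corpus.
            degraded[pos] = noise_lines[noise_index % len(noise_lines)]
            noise_index += 1
    return degraded


# Toy usage: one side of a parallel corpus, degraded at a 50% rate.
target = [f"t{i}" for i in range(8)]
noise = ["n0", "n1"]
deterministic = degrade_blocks(target, noise, block_size=4, fraction=0.5)
randomized = degrade_blocks(target, noise, block_size=4, fraction=0.5,
                            rng=random.Random(0))
```

Increasing `fraction` step by step from 0 to 1 would yield a series of corpora ranging from fully parallel to fully unrelated, against which a comparability measure can be checked for the expected decrease.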

Co-clustering of bilingual datasets as a mean for assisting the construction of thematic bilingual comparable corpora
Guiyao Ke | Pierre-Francois Marteau
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper addresses the assisted construction of bilingual thematic comparable corpora by co-clustering bilingual documents collected from raw sources such as the Web. The proposed approach relies on a quantitative comparability measure and a co-clustering method, which together make it possible to combine the similarity measures available in each of the two linguistic spaces with a “thematic” comparability measure that defines a mapping between these two spaces. Building on the resulting improvement in co-clustering (k-medoids) performance, we use a comparability threshold and manual verification to ensure a sound and robust alignment of co-clusters (co-medoids). Finally, from any available raw corpus, we enrich the aligned clusters in order to provide “thematic” comparable corpora of good quality and controlled size. In a case study that exploits raw web data, we show that this approach scales reasonably well and is well suited to the construction of good-quality thematic comparable corpora.

2013

Some variations on quantitative comparability measures and evaluations on synthetic French-English comparable corpora (Quelques variations sur les mesures de comparabilité quantitatives et évaluations sur des corpus comparables Français-Anglais synthétiques) [in French]
Guiyao Ke
Proceedings of RECITAL 2013