Somara Seng


2010

OAL: A NLP Architecture to Improve the Development of Linguistic Resources for NLP
Javier Couto | Helena Blancafort | Somara Seng | Nicolas Kuchmann-Beauger | Anass Talby | Claude de Loupy
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The performance of most NLP applications relies on the quality of linguistic resources. Creating, maintaining and enriching those resources are labour-intensive tasks, especially when no tools are available. In this paper we present OAL, an NLP architecture designed to assist computational linguists throughout the development of resources in an industrial context, from corpora compilation to quality assurance. To make it easier to add new words to the morphosyntactic lexica, we developed a guesser that lemmatizes a new word and assigns it morphosyntactic tags and inflection paradigms. Moreover, different control mechanisms check the coherence and consistency of the resources. Today OAL manages resources in five European languages: French, English, Spanish, Italian and Polish. Chinese and Portuguese are in progress. The development of OAL has followed an incremental strategy. At present, semantic lexica, a named-entity guesser and a named-entity phonetizer are being developed.

A French Human Reference Corpus for Multi-Document Summarization and Sentence Compression
Claude de Loupy | Marie Guégan | Christelle Ayache | Somara Seng | Juan-Manuel Torres Moreno
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper presents two corpora produced within the RPM2 project: a multi-document summarization corpus and a sentence compression corpus. Both corpora are in French; the first is, to our knowledge, the only such corpus in this language. It contains 20 topics with 20 documents each. A first set of 10 documents per topic is summarized, and the second set is then used to produce an update summary (new information). Four annotators were involved and produced a total of 160 abstracts. The second corpus contains all the sentences of the first one; four annotators were asked to compress the 8,432 sentences. This is the largest corpus of compressed sentences we know of, in any language. The paper provides figures for comparing the annotators: compression rates, number of tokens per sentence, percentage of tokens kept according to their POS, position of dropped tokens during sentence compression, etc. These figures show important differences from one annotator to another. The annotators also use different compression strategies depending on sentence length.