Olivier Galibert


2022

Analyzing BERT Cross-lingual Transfer Capabilities in Continual Sequence Labeling
Juan Manuel Coria | Mathilde Veron | Sahar Ghannay | Guillaume Bernard | Hervé Bredin | Olivier Galibert | Sophie Rosset
Proceedings of the First Workshop on Performance and Interpretability Evaluations of Multimodal, Multipurpose, Massive-Scale Models

Knowledge transfer between neural language models is a widely used technique that has proven to improve performance in a multitude of natural language tasks, in particular with the recent rise of large pre-trained language models like BERT. Similarly, high cross-lingual transfer has been shown to occur in multilingual language models. Hence, it is of great importance to better understand this phenomenon as well as its limits. While most studies about cross-lingual transfer focus on training on independent and identically distributed (i.e. i.i.d.) samples, in this paper we study cross-lingual transfer in a continual learning setting on two sequence labeling tasks: slot-filling and named entity recognition. We investigate this by training multilingual BERT on sequences of 9 languages, one language at a time, on the MultiATIS++ and MultiCoNER corpora. Our first findings are that forward transfer between languages is retained although forgetting is present. Additional experiments show that lost performance can be recovered with as little as a single training epoch even if forgetting was high, which can be explained by a progressive shift of model parameters towards a better multilingual initialization. We also find that commonly used metrics might be insufficient to assess continual learning performance.
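
To make the experimental protocol concrete, here is a minimal sketch of language-by-language (continual) fine-tuning with evaluation after every step; the helper functions, language list and training details are illustrative assumptions, not the authors' released code:

from transformers import AutoModelForTokenClassification, AutoTokenizer

def continual_training(languages, num_labels, load_split, train_on_language, evaluate):
    # Hypothetical helpers: load_split(lang, part) returns a dataset,
    # train_on_language fine-tunes the model in place, evaluate returns a score.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    model = AutoModelForTokenClassification.from_pretrained(
        "bert-base-multilingual-cased", num_labels=num_labels)
    history = []
    for lang in languages:
        # Train only on the current language, as in the continual setting.
        train_on_language(model, tokenizer, load_split(lang, "train"))
        # Evaluate on every language after each step:
        #  - drops on previously seen languages quantify forgetting,
        #  - scores on languages not yet seen quantify forward transfer.
        history.append({l: evaluate(model, tokenizer, load_split(l, "test"))
                        for l in languages})
    return history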

Attention Modulation for Zero-Shot Cross-Domain Dialogue State Tracking
Mathilde Veron | Olivier Galibert | Guillaume Bernard | Sophie Rosset
Proceedings of the 3rd Workshop on Computational Approaches to Discourse

Dialogue state tracking (DST) is a core component of task-oriented dialogue systems, aiming to track the user’s current goal during a dialogue. Recently, a special focus has been put on applying existing DST models to new domains, in other words performing zero-shot cross-domain transfer. While recent state-of-the-art models leverage large pre-trained language models, no work has been done on understanding and improving the results of earlier zero-shot models such as SUMBT. In this paper, we therefore propose to improve SUMBT’s zero-shot results on MultiWOZ by using attention modulation during inference. This method significantly improves SUMBT’s zero-shot results on two domains and does not worsen the initial performance, with the great advantage of needing no additional training.

2020

Findings of the First Shared Task on Lifelong Learning Machine Translation
Loïc Barrault | Magdalena Biesialska | Marta R. Costa-jussà | Fethi Bougares | Olivier Galibert
Proceedings of the Fifth Conference on Machine Translation

A lifelong learning system can adapt to new data without forgetting previously acquired knowledge. In this paper, we introduce the first benchmark for lifelong learning machine translation. For this purpose, we provide training, lifelong and test data sets for two language pairs: English-German and English-French. Additionally, we report the results of our baseline systems, which we make available to the public. The goal of this shared task is to encourage research on the emerging topic of lifelong learning machine translation.

Évaluation de systèmes apprenant tout au long de la vie (Evaluation of lifelong learning systems )
Yevhenii Prokopalo | Sylvain Meignier | Olivier Galibert | Loïc Barrault | Anthony Larcher
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 1 : Journées d'Études sur la Parole

Today, intelligent systems achieve excellent performance in many domains when they are trained by machine learning experts. Once these systems are put into production, their performance degrades over time as their real-world environment evolves. Having their models adapted by machine learning experts is possible but very costly, whereas the companies using these systems have domain experts who could support them in lifelong learning. In this article we propose a generic evaluation framework for lifelong learning systems (SATLV). We propose to evaluate human-assisted learning (active or interactive) as well as learning over time.

Evaluation of Lifelong Learning Systems
Yevhenii Prokopalo | Sylvain Meignier | Olivier Galibert | Loïc Barrault | Anthony Larcher
Proceedings of the Twelfth Language Resources and Evaluation Conference

Current intelligent systems need the expensive support of machine learning experts to sustain their performance level when used on a daily basis. To reduce this cost, i.e. to remain free from any machine learning expert, it is reasonable to implement lifelong (or continuous) learning intelligent systems that continuously adapt their model when facing changing execution conditions. In this work, the systems are allowed to refer to human domain experts who can provide the system with relevant knowledge about the task. Nowadays, the fast growth of lifelong learning system development raises the question of their evaluation. In this article we propose a generic evaluation methodology for the specific case of lifelong learning systems. Two steps are considered: first, the evaluation of human-assisted learning (including active and/or interactive learning) outside the context of lifelong learning; second, the evaluation of the system across time, with proposals for how a lifelong learning intelligent system should be evaluated with or without human-assisted learning.

2018

Analyzing Learned Representations of a Deep ASR Performance Prediction Model
Zied Elloumi | Laurent Besacier | Olivier Galibert | Benjamin Lecouteux
Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

This paper addresses a relatively new task: prediction of ASR performance on unseen broadcast programs. In a previous paper, we presented an ASR performance prediction system using CNNs that encode both text (ASR transcript) and speech, in order to predict word error rate. This work is dedicated to the analysis of the speech signal embeddings and text embeddings learnt by the CNN while training our prediction model. We try to better understand which information is captured by the deep model and its relation to different conditioning factors. It is shown that hidden layers convey a clear signal about speech style, accent and broadcast type. We then try to leverage these three types of information at training time through multi-task learning. Our experiments show that this allows us to train slightly more effective ASR performance prediction systems that, in addition, simultaneously tag the analyzed utterances according to their speech style, accent and broadcast program origin.

Prédiction de performance des systèmes de reconnaissance automatique de la parole à l’aide de réseaux de neurones convolutifs [Performance prediction of automatic speech recognition systems using convolutional neural networks]
Zied Elloumi | Benjamin Lecouteux | Olivier Galibert | Laurent Besacier
Traitement Automatique des Langues, Volume 59, Numéro 2 : Apprentissage profond pour le traitement automatique des langues [Deep Learning for natural language processing]

Matics Software Suite: New Tools for Evaluation and Data Exploration
Olivier Galibert | Guillaume Bernard | Agnes Delaborde | Sabrina Lecadre | Juliette Kahn
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2016

Comparaison de listes d’erreurs de transcription automatique de la parole : quelle complémentarité entre les différentes métriques ? (Comparing error lists for ASR systems : contribution of different metrics)
Olivier Galibert | Juliette Kahn | Sophie Rosset
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 1 : JEP

The work presented here falls within the field of evaluating automatic speech recognition systems with a view to their use in a downstream task, here named entity recognition. More broadly, the question we ask is: what can an evaluation metric provide beyond a score? We are particularly interested in system errors, in their analysis, and possibly in exploiting what we know about these errors. In this work we study the ranked error lists generated from different metrics and analyze what emerges from them. We applied the same method to the outputs of different speech recognition systems. Our experiments show that some metrics provide information that is more relevant for a given task and that carries over across systems.

Estimation de la qualité d’un système de reconnaissance de la parole pour une tâche de compréhension (Quality estimation of a Speech Recognition System for a Spoken Language Understanding task)
Olivier Galibert | Nathalie Camelin | Paul Deléglise | Sophie Rosset
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 1 : JEP

We are interested in evaluating the quality of speech recognition systems given a spoken language understanding task. The goal of this work is to provide a tool for selecting the automatic speech recognition system best suited to a given dialogue system. We compare several metrics, in particular WER, NE-WER and ATENE, a metric recently proposed for evaluating speech recognition systems given a named entity recognition task. That metric showed a better correlation with the results of the overall task than all the other metrics tested. Our measurements indicate a very strong correlation with the ATENE measure and a weaker one with WER.

LNE-Visu : a tool to explore and visualize multimedia data
Guillaume Bernard | Juliette Kahn | Olivier Galibert | Rémi Regnier | Séverine Demeyer
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 5 : Démonstrations

LNE-Visu is a tool to explore and visualize multimedia data created for the LNE evaluation campaigns. Three functionalities are available: explore and select data, visualize and listen to data, and apply significance tests.

Generating Task-Pertinent sorted Error Lists for Speech Recognition
Olivier Galibert | Mohamed Ameur Ben Jannet | Juliette Kahn | Sophie Rosset
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Automatic Speech Recognition (ASR) is one of the most widely used components in spoken language processing applications. ASR errors are of varying importance with respect to the application, making error analysis key to improving speech processing applications. Knowing which errors are most serious for the applicative case is critical to building better systems. In the context of ASR used as a first step towards Named Entity Recognition (NER) in speech, error seriousness is usually determined by error frequency, due to the use of WER as the metric to evaluate the ASR output, despite the emergence of more relevant measures in the literature. We propose to use a different evaluation metric from the literature in order to classify ASR errors according to their seriousness for NER. Our results show that ASR errors are ranked differently in importance depending on the evaluation metric used. A more detailed analysis shows that the estimation of error impact given by the ATENE metric is better adapted to the NER task than an estimation based only on the most commonly used frequency metric, WER.
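
For reference, the frequency-based ranking mentioned above derives from the standard word error rate; the following is its usual textbook definition, not a formula specific to this paper:

\[ \mathrm{WER} = \frac{S + D + I}{N} \]

where S, D and I are the numbers of substituted, deleted and inserted words in the alignment of the hypothesis against the reference transcript, and N is the number of words in the reference. Every error counts equally regardless of its impact on a downstream task such as NER, which is precisely the limitation that task-aware metrics like ATENE aim to address.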

2014

The ETAPE speech processing evaluation
Olivier Galibert | Jeremy Leixa | Gilles Adda | Khalid Choukri | Guillaume Gravier
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The ETAPE evaluation is the third evaluation in automatic speech recognition and associated technologies in a series which started with ESTER. This evaluation introduced several new challenges: TV and radio shows with both prepared and spontaneous speech, annotation and evaluation of overlapping speech, a cross-show condition in speaker diarization, and new, complex but very informative named entities in the information extraction task. This paper presents the whole campaign, including the annotated data, the metrics used and the anonymized system results. All the data created in the evaluation, hopefully including system outputs, will be distributed through the ELRA catalogue in the future.

ETER : a new metric for the evaluation of hierarchical named entity recognition
Mohamed Ben Jannet | Martine Adda-Decker | Olivier Galibert | Juliette Kahn | Sophie Rosset
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper addresses the question of hierarchical named entity evaluation. In particular, we focus on metrics to deal with complex named entity structures such as those introduced within the QUAERO project. The intended goal is to propose a smart way of evaluating partially correctly detected complex entities, beyond the scope of traditional metrics. None of the existing metrics are fully adequate to evaluate the proposed QUAERO task involving entity detection, classification and decomposition. We discuss the strong and weak points of the existing metrics. We then introduce a new metric, the Entity Tree Error Rate (ETER), to evaluate hierarchical and structured named entity detection, classification and decomposition. The ETER metric builds upon the commonly accepted SER metric, but takes the complex entity structure into account by measuring errors not only at the slot (or complex entity) level but also at a basic (atomic) entity level. We compare our new metric to the standard one, first on examples and then on a set of real data selected from the ETAPE evaluation results.
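
As background, the Slot Error Rate that ETER builds upon is conventionally computed over reference slots; the following is the generic definition from the literature, not this paper's own formulation:

\[ \mathrm{SER} = \frac{D + I + S}{R} \]

where D, I and S count deleted, inserted and substituted slots (a substitution being a detected slot with a wrong type and/or extent) and R is the number of slots in the reference. ETER, as described above, applies this style of counting both at the level of whole complex entities and at the level of their atomic components.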

2013

Automatic Named Entity Pre-annotation for Out-of-domain Human Annotation
Sophie Rosset | Cyril Grouin | Thomas Lavergne | Mohamed Ben Jannet | Jérémy Leixa | Olivier Galibert | Pierre Zweigenbaum
Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse

2012

Manual Corpus Annotation: Giving Meaning to the Evaluation Metrics
Yann Mathet | Antoine Widlöcher | Karën Fort | Claire François | Olivier Galibert | Cyril Grouin | Juliette Kahn | Sophie Rosset | Pierre Zweigenbaum
Proceedings of COLING 2012: Posters

REPERE : premiers résultats d’un défi autour de la reconnaissance multimodale des personnes (REPERE : preliminary results of a multimodal person recognition challenge) [in French]
Juliette Kahn | Aude Giraudel | Matthieu Carré | Olivier Galibert | Ludovic Quintard
Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 1: JEP

Extended Named Entities Annotation on OCRed Documents: From Corpus Constitution to Evaluation Campaign
Olivier Galibert | Sophie Rosset | Cyril Grouin | Pierre Zweigenbaum | Ludovic Quintard
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Within the framework of the Quaero project, we proposed a new definition of named entities, based upon an extension of both the coverage and the structure of named entities. In this new definition, the extended named entities we propose are both hierarchical and compositional. In this paper, we focus on the annotation of a corpus composed of press archives, OCRed from French newspapers of December 1890. We present the methodology we used to produce the corpus and the characteristics of the corpus in terms of named entity annotation. This annotated corpus has been used in an evaluation campaign. We present this evaluation, the metrics we used and the results obtained by the participants.

The ETAPE corpus for the evaluation of speech-based TV content processing in the French language
Guillaume Gravier | Gilles Adda | Niklas Paulsson | Matthieu Carré | Aude Giraudel | Olivier Galibert
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The paper presents a comprehensive overview of existing data for the evaluation of spoken content processing in a multimedia framework for the French language. We focus on the ETAPE corpus, which will be made publicly available by ELDA in mid-2012, after completion of the evaluation campaign, and recall existing resources resulting from previous evaluation campaigns. The ETAPE corpus consists of 30 hours of TV and radio broadcasts, selected to cover a wide variety of topics and speaking styles, emphasizing spontaneous speech and multiple-speaker areas.

Analyzing the Impact of Prevalence on the Evaluation of a Manual Annotation Campaign
Karën Fort | Claire François | Olivier Galibert | Maha Ghribi
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This article details work aiming at evaluating the quality of the manual annotation of gene renaming couples in scientific abstracts, which generates sparse annotations. To evaluate these annotations, we compare the results obtained using the commonly advocated inter-annotator agreement coefficients such as S, κ and π, the less known R, the weighted coefficients κω and α, as well as the F-measure and the SER. We analyze to what extent they are relevant for our data. We then study the bias introduced by prevalence by changing the way the contingency table is built. We finally propose an original way to synthesize the results by computing distances between categories, based on the produced annotations.
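
For context (standard background, not specific to this article), the chance-corrected coefficients mentioned above share the same general form and differ only in how chance agreement is estimated:

\[ \text{coefficient} = \frac{A_o - A_e}{1 - A_e} \]

where A_o is the observed agreement between annotators and A_e the agreement expected by chance: S assumes a uniform distribution over categories, π estimates a single category distribution pooled over annotators, and κ uses each annotator's individual distribution. When one category strongly dominates (high prevalence), A_e is high, so the coefficient can be low even when observed agreement is high; this is the prevalence effect the article investigates.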

The REPERE Corpus : a multimodal corpus for person recognition
Aude Giraudel | Matthieu Carré | Valérie Mapelli | Juliette Kahn | Olivier Galibert | Ludovic Quintard
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The REPERE Challenge aims to support research on people recognition in multimodal conditions. To assess technology progress, annual evaluation campaigns will be organized from 2012 to 2014. In this context, the REPERE corpus, a French video corpus with multimodal annotation, has been developed. This paper presents the datasets collected for the dry run test that took place at the beginning of 2012. The specific annotation tools and guidelines are described. At the time of writing, 6 hours of data have been collected and annotated. The last section presents analyses of the annotation distribution and of the interaction between modalities in the corpus.

Structured Named Entities in two distinct press corpora: Contemporary Broadcast News and Old Newspapers
Sophie Rosset | Cyril Grouin | Karën Fort | Olivier Galibert | Juliette Kahn | Pierre Zweigenbaum
Proceedings of the Sixth Linguistic Annotation Workshop

2011

Structured and Extended Named Entity Evaluation in Automatic Speech Transcriptions
Olivier Galibert | Sophie Rosset | Cyril Grouin | Pierre Zweigenbaum | Ludovic Quintard
Proceedings of 5th International Joint Conference on Natural Language Processing

Proposal for an Extension of Traditional Named Entities: From Guidelines to Evaluation, an Overview
Cyril Grouin | Sophie Rosset | Pierre Zweigenbaum | Karën Fort | Olivier Galibert | Ludovic Quintard
Proceedings of the 5th Linguistic Annotation Workshop

2010

Hybrid Citation Extraction from Patents
Olivier Galibert | Sophie Rosset | Xavier Tannier | Fanny Grandry
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The Quaero project organized a set of evaluations of Named Entity recognition systems in 2009. One of the sub-tasks consists in extracting citations from English-language patents, i.e. references to other documents, either other patents or general literature. We present in this paper the participation of LIMSI in this evaluation, with a complete system description and the evaluation results. The corpus showed that patent and non-patent citations are very different in nature. We therefore separated references to other patents from references to general literature papers and created a hybrid system. For patent citations, the system uses rule-based expert knowledge in the form of regular expressions. The system for detecting non-patent citations, on the other hand, is purely stochastic (machine learning with CRF++). We then merged both approaches to provide a single output. Four teams participated in this task and our system obtained the best results of this evaluation campaign, even though the difference between the first two systems is only weakly significant.

Question Answering on Web Data: The QA Evaluation in Quæro
Ludovic Quintard | Olivier Galibert | Gilles Adda | Brigitte Grau | Dominique Laurent | Véronique Moriceau | Sophie Rosset | Xavier Tannier | Anne Vilnat
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In the QA and information retrieval domains, progress has been assessed via evaluation campaigns (CLEF, NTCIR, EQueR, TREC). In these evaluations, the systems handle independent questions and should provide one answer to each question, extracted from textual data, for both open and restricted domains. Quæro is a program promoting research and industrial innovation on technologies for automatic analysis and classification of multimedia and multilingual documents. Among the many research areas covered by Quæro, the project organized a series of evaluations of Question Answering on Web Data systems in 2008 and 2009. For each language, English and French, the full corpus has a size of around 20 GB for 2.5M documents. We describe the task and corpora, and especially the methodology used in 2008 to construct the question test set, as well as the new one used in the 2009 campaign. Six types of questions were addressed: factual, non-factual (how, why, what), list and Boolean. A description of the participating systems and the obtained results is provided. We show the difficulty for a question-answering system of working with complex data and questions.

Named and Specific Entity Detection in Varied Data: The Quæro Named Entity Baseline Evaluation
Olivier Galibert | Ludovic Quintard | Sophie Rosset | Pierre Zweigenbaum | Claire Nédellec | Sophie Aubin | Laurent Gillard | Jean-Pierre Raysz | Delphine Pois | Xavier Tannier | Louise Deléger | Dominique Laurent
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The Quæro program promotes research and industrial innovation on technologies for automatic analysis and classification of multimedia and multilingual documents. Within its context, a set of evaluations of Named Entity recognition systems was held in 2009. Four tasks were defined. The first two concerned traditional named entities, in French broadcast news for one (a rerun of ESTER 2) and in OCRed old newspapers for the other. The third was gene and protein name extraction in medical abstracts. The last one was the detection of references in patents. Four different partners participated, giving a total of 16 systems. We provide a synthetic description of all of them, classifying them by the main approach chosen (resource-based, rule-based or statistical), without forgetting the fact that any modern system is at some point hybrid. The metric (the relatively standard Slot Error Rate) and the results are also presented and discussed. Finally, a process is ongoing, with the preliminary agreement of the partners, to ensure the availability to the community of all the corpora used, with the exception of the ESTER 2 corpus, which was not produced within Quæro.

Evaluation Protocol and Tools for Question-Answering on Speech Transcripts
Nicolas Moreau | Olivier Hamon | Djamel Mostefa | Sophie Rosset | Olivier Galibert | Lori Lamel | Jordi Turmo | Pere R. Comas | Paolo Rosso | Davide Buscaldi | Khalid Choukri
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Question Answering (QA) technology aims at providing relevant answers to natural language questions. Most Question Answering research has focused on mining document collections containing written texts to answer written questions. In addition to written sources, a large (and growing) amount of potentially interesting information appears in spoken documents, such as broadcast news, speeches, seminars, meetings or telephone conversations. The QAST track (Question-Answering on Speech Transcripts) was introduced in CLEF to investigate the problem of question answering in such audio documents. This paper describes in detail the evaluation protocol and tools designed and developed for the CLEF-QAST evaluation campaigns that took place between 2007 and 2009. We first recall the data, question sets, and submission procedures that were produced or set up during these three campaigns. As for the evaluation procedure, we describe the interface that was developed to ease the assessors’ work. In addition, this paper introduces a methodology for a semi-automatic evaluation of QAST systems based on time slot comparisons. Finally, the QAST Evaluation Package 2007-2009 resulting from these evaluation campaigns is also introduced.

A Question-answer Distance Measure to Investigate QA System Progress
Guillaume Bernard | Sophie Rosset | Martine Adda-Decker | Olivier Galibert
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The performance of question answering systems is evaluated through successive evaluation campaigns. A set of questions is given to the participating systems, which have to find the correct answers in a collection of documents. The process used to create the questions may change from one evaluation to the next, which may entail an uncontrolled shift in question difficulty. For the QAst 2009 evaluation campaign, a new procedure was adopted to build the questions. Comparing the results of the QAst 2008 and QAst 2009 evaluations, a strong performance loss could be measured in 2009 for French and English, while the Spanish systems globally made progress. The measured loss might be related to this new way of elaborating questions. The general purpose of this paper is to propose a measure to calibrate the difficulty of a question set. In particular, a reasonable measure should output higher values for 2009 than for 2008. The proposed measure relies on a distance between the critical elements of a question and those of the associated correct answer. An increase of the proposed distance measure for the French and English 2009 evaluations compared to 2008 could be established. This increase correlates with the previously observed degraded performances. We conclude on the potential of this evaluation criterion: such a measure is important for the elaboration of new question corpora for question answering systems and provides a tool to control the level of difficulty across successive evaluation campaigns.

2008

Limsi’s Statistical Translation Systems for WMT’08
Daniel Déchelotte | Gilles Adda | Alexandre Allauzen | Hélène Bonneau-Maynard | Olivier Galibert | Jean-Luc Gauvain | Philippe Langlais | François Yvon
Proceedings of the Third Workshop on Statistical Machine Translation

An Evaluation of Spoken and Textual Interaction in the RITEL Interactive Question Answering System
Dave Toney | Sophie Rosset | Aurélien Max | Olivier Galibert | Eric Bilinski
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The RITEL project aims to integrate a spoken language dialogue system and an open-domain information retrieval system in order to enable human users to ask a general question and to refine their search for information interactively. This type of system is often referred to as an Interactive Question Answering (IQA) system. In this paper, we present an evaluation of how the performance of the RITEL system differs when users interact with it using spoken versus textual input and output. Our results indicate that while users do not perceive the two versions to perform significantly differently, many more questions are asked in a typical text-based dialogue.

2005

Ritel : un système de dialogue homme-machine à domaine ouvert (Ritel: an open-domain human-machine dialogue system)
Olivier Galibert | Gabriel Illouz | Sophie Rosset
Actes de la 12ème conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

The goal of the RITEL project is to build a human-machine dialogue system that allows a user to ask questions orally and to interact with an open-domain information retrieval system (for example, searching the Internet for “Who is the President of the Senate?”), and to study its potential. Currently, the RITEL platform is used to collect human-machine dialogue corpora. Users can sometimes obtain a factual answer (Q: who is the president of France; A: Jacques Chirac). This article briefly presents the platform developed and the corpus collected, as well as the questions raised by such a system and some of the first solutions considered.