Djamé Seddah

Also published as: Djame Seddah


2023

pdf bib
Enriching the NArabizi Treebank: A Multifaceted Approach to Supporting an Under-Resourced Language
Arij Riabi | Menel Mahamdi | Djamé Seddah
Proceedings of the 17th Linguistic Annotation Workshop (LAW-XVII)

In this paper, we address the scarcity of annotated data for NArabizi, a Romanized form of North African Arabic used mostly on social media, which poses challenges for Natural Language Processing (NLP). We introduce an enriched version of the NArabizi Treebank (Seddah et al., 2020) with three main contributions: the addition of two novel annotation layers (named entity recognition and offensive language detection) and a re-annotation of the tokenization, morpho-syntactic and syntactic layers that ensures annotation consistency. Our experimental results, using different tokenization schemes, showcase the value of our contributions and highlight the impact of working with non-gold tokenization for NER and dependency parsing. To facilitate future research, we make these annotations publicly available. Our enhanced NArabizi Treebank paves the way for creating sophisticated language models and NLP tools for this under-represented language.

pdf bib
Data-Efficient French Language Modeling with CamemBERTa
Wissam Antoun | Benoît Sagot | Djamé Seddah
Findings of the Association for Computational Linguistics: ACL 2023

Recent advances in NLP have significantly improved the performance of language models on a variety of tasks. While these advances are largely driven by the availability of large amounts of data and computational power, they also benefit from the development of better training methods and architectures. In this paper, we introduce CamemBERTa, a French DeBERTa model that builds upon the DeBERTaV3 architecture and training objective. We evaluate our model’s performance on a variety of French downstream tasks and datasets, including question answering, part-of-speech tagging, dependency parsing, named entity recognition, and the FLUE benchmark, and compare against CamemBERT, the state-of-the-art monolingual model for French. Our results show that, given the same amount of training tokens, our model outperforms BERT-based models trained with MLM on most tasks. Furthermore, our new model reaches similar or superior performance on downstream tasks compared to CamemBERT, despite being trained on only 30% of its total number of input tokens. In addition to our experimental results, we also publicly release the weights and code implementation of CamemBERTa, making it the first publicly available DeBERTaV3 model outside of the original paper and the first openly available implementation of a DeBERTaV3 training objective.
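As a rough illustration of how such a model is typically used downstream, here is a minimal fine-tuning sketch with Hugging Face transformers; the checkpoint identifier and the data file are assumptions for illustration, not taken from the paper.

```python
# Minimal sketch: fine-tuning a French DeBERTaV3-style encoder for text
# classification with Hugging Face transformers. The checkpoint name and
# the CSV file (columns: text, label) are hypothetical placeholders.
from transformers import (AutoModelForSequenceClassification,
                          AutoTokenizer, Trainer, TrainingArguments)
from datasets import load_dataset

CHECKPOINT = "almanach/camemberta-base"  # hypothetical identifier

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)

dataset = load_dataset("csv", data_files={"train": "train.csv"})

def tokenize(batch):
    # pad to a fixed length so the default collator can batch examples
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3),
    train_dataset=tokenized["train"],
)
trainer.train()
```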

pdf bib
Towards a Robust Detection of Language Model-Generated Text: Is ChatGPT that easy to detect?
Wissam Antoun | Virginie Mouilleron | Benoît Sagot | Djamé Seddah
Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 1 : travaux de recherche originaux -- articles longs

Recent advances in natural language processing (NLP) have led to the development of large language models (LLMs) such as ChatGPT. This paper proposes a methodology for developing and evaluating ChatGPT detectors for French text, with a focus on investigating their robustness on out-of-domain data and against common attack schemes. The proposed method involves translating an English dataset into French and training a classifier on the translated data. Results show that the detectors can effectively detect ChatGPT-generated text, with a degree of robustness against basic attack techniques in in-domain settings. However, vulnerabilities are evident in out-of-domain contexts, highlighting the challenge of detecting adversarial text. The study emphasizes caution when applying in-domain testing results to a wider variety of content. We provide our translated datasets and models as open-source resources.
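The robustness probes mentioned above can be illustrated with a toy character-level perturbation. The sketch below implements a generic homoglyph-substitution attack from the detection literature; it is not necessarily one of the attack schemes evaluated in the paper.

```python
# Toy character-level perturbation attack for probing a detector's
# robustness (illustrative only). Randomly swaps a few Latin letters
# for visually similar Cyrillic homoglyphs.
import random

HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e", "c": "\u0441"}

def homoglyph_attack(text: str, rate: float = 0.05, seed: int = 0) -> str:
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch in HOMOGLYPHS and rng.random() < rate:
            out.append(HOMOGLYPHS[ch])  # Cyrillic look-alike
        else:
            out.append(ch)
    return "".join(out)

print(homoglyph_attack("Ce texte a ete genere par un modele de langue."))
```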

pdf bib
Multi-way Variational NMT for UGC: Improving Robustness in Zero-shot Scenarios via Mixture Density Networks
José Rosales Núñez | Djamé Seddah | Guillaume Wisniewski
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)

This work presents a novel Variational Neural Machine Translation (VNMT) architecture with enhanced robustness properties, which we investigate through a detailed case study addressing the translation of noisy French user-generated content (UGC) into English. We show that the proposed model, with results comparable or superior to state-of-the-art VNMT, improves performance on UGC translation in a zero-shot evaluation scenario while keeping optimal translation scores on in-domain test sets. We elaborate on these results by visualizing and explaining how neural learning representations behave when processing UGC noise. In addition, we show that VNMT induces robustness in the learned embeddings, which can later be used for robust transfer learning approaches.
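For readers unfamiliar with Mixture Density Networks, the following PyTorch sketch shows the generic mechanism (a Gaussian-mixture posterior from which a latent vector is sampled); all dimensions and design details are illustrative assumptions, not the paper's architecture.

```python
# Generic Mixture Density Network head: predicts a K-component Gaussian
# mixture over a latent space and draws a reparameterized sample.
import torch
import torch.nn as nn
from torch.distributions import Categorical

class MDNHead(nn.Module):
    def __init__(self, d_in=256, d_latent=64, k=4):
        super().__init__()
        self.k, self.d = k, d_latent
        self.pi = nn.Linear(d_in, k)                    # mixture weights (logits)
        self.mu = nn.Linear(d_in, k * d_latent)         # component means
        self.log_sigma = nn.Linear(d_in, k * d_latent)  # component scales (log)

    def forward(self, h):  # h: (batch, d_in) pooled encoder state
        comp = Categorical(logits=self.pi(h)).sample()  # pick one component
        mu = self.mu(h).view(-1, self.k, self.d)
        sigma = self.log_sigma(h).view(-1, self.k, self.d).exp()
        idx = comp.view(-1, 1, 1).expand(-1, 1, self.d)
        m = mu.gather(1, idx).squeeze(1)
        s = sigma.gather(1, idx).squeeze(1)
        return m + s * torch.randn_like(s)              # sampled latent z

z = MDNHead()(torch.randn(2, 256))
print(z.shape)  # torch.Size([2, 64])
```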

pdf bib
Analyzing Zero-Shot transfer Scenarios across Spanish variants for Hate Speech Detection
Galo Castillo-lópez | Arij Riabi | Djamé Seddah
Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)

Hate speech detection on online platforms has been widely studied in the past. Most of these works were conducted in English and a few rich-resource languages. Recent approaches tailored for low-resource languages have explored the interest of zero-shot cross-lingual transfer learning models in resource-scarce scenarios. However, language variation between geolects, such as American English and British English, or Latin-American Spanish and European Spanish, is still a problem for NLP models that often rely on (latent) lexical information for their classification tasks. More importantly, the cultural aspect, crucial for hate speech detection, is often overlooked. In this work, we present the results of a thorough analysis of hate speech detection model performance on different variants of Spanish, including a new Twitter data set of hate speech toward immigrants that we built to cover these variants. Using mBERT and Beto, a monolingual Spanish BERT-based language model, as the basis of our transfer learning architecture, our results indicate that hate speech detection models for a given Spanish variant are affected when different variations of that language are not considered. Hate speech expressions can vary from region to region where the same language is spoken. Our new dataset, models and guidelines are freely available.

2022

pdf bib
PAGnol: An Extra-Large French Generative Model
Julien Launay | E.l. Tommasone | Baptiste Pannier | François Boniface | Amélie Chatelain | Alessandro Cappelli | Iacopo Poli | Djamé Seddah
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Access to large pre-trained models of varied architectures, in many different languages, is central to the democratization of NLP. We introduce PAGnol, a collection of French GPT models. Using scaling laws, we efficiently train PAGnol-XL (1.5B parameters) with the same computational budget as CamemBERT, a model 13 times smaller. PAGnol-XL is the largest model trained from scratch for the French language. We plan to train increasingly large and better-performing versions of PAGnol, exploring the capabilities of French extreme-scale models. For this first release, we focus on the pre-training and scaling calculations underlying PAGnol. We fit a scaling law for compute for the French language, and compare it with its English counterpart. We find that the pre-training dataset significantly conditions the quality of the outputs, with common datasets such as OSCAR leading to low-quality offensive text. We evaluate our models on discriminative and generative tasks in French, comparing to other state-of-the-art French and multilingual models, and reaching the state of the art on the abstractive summarization task. Our research was conducted on the public GENCI Jean Zay supercomputer, and our models up to the Large version are made publicly available.
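A scaling law of the kind fitted in the paper is, in essence, a power law L(C) = a * C^b, i.e. a straight line in log-log space. A minimal sketch with made-up numbers (not the paper's measurements):

```python
# Sketch: fitting a power-law scaling curve L(C) = a * C^b by linear
# regression in log-log space. The (compute, loss) pairs below are
# hypothetical, for illustration only.
import numpy as np

compute = np.array([1e18, 1e19, 1e20, 1e21])  # hypothetical training FLOPs
loss = np.array([3.9, 3.4, 3.0, 2.7])         # hypothetical validation losses

b, log_a = np.polyfit(np.log(compute), np.log(loss), deg=1)
a = np.exp(log_a)
print(f"L(C) = {a:.2f} * C^{b:.3f}")  # negative b: loss falls with compute
```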

pdf bib
Multilingual Auxiliary Tasks Training: Bridging the Gap between Languages for Zero-Shot Transfer of Hate Speech Detection Models
Syrielle Montariol | Arij Riabi | Djamé Seddah
Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022

Zero-shot cross-lingual transfer learning has been shown to be highly challenging for tasks involving a lot of linguistic specificities or when a cultural gap is present between languages, such as in hate speech detection. In this paper, we highlight this limitation for hate speech detection in several domains and languages using strict experimental settings. Then, we propose to train on multilingual auxiliary tasks – sentiment analysis, named entity recognition, and tasks relying on syntactic information – to improve zero-shot transfer of hate speech detection models across languages. We show how hate speech detection models benefit from a cross-lingual knowledge proxy brought by auxiliary tasks fine-tuning and highlight these tasks’ positive impact on bridging the hate speech linguistic and cultural gap between languages.

pdf bib
Exploiting Inductive Bias in Transformers for Unsupervised Disentanglement of Syntax and Semantics with VAEs
Ghazi Felhi | Joseph Le Roux | Djamé Seddah
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We propose a generative model for text generation, which exhibits disentangled latent representations of syntax and semantics. Contrary to previous work, this model does not need syntactic information such as constituency parses, or semantic information such as paraphrase pairs. Our model relies solely on the inductive bias found in attention-based architectures such as Transformers. In the attention of Transformers, keys handle information selection while values specify what information is conveyed. Our model, dubbed QKVAE, uses Attention in its decoder to read latent variables where one latent variable infers keys while another infers values. We run experiments on latent representations and experiments on syntax/semantics transfer which show that QKVAE displays clear signs of disentangled syntax and semantics. We also show that our model displays competitive syntax transfer capabilities when compared to supervised models and that comparable supervised models need a fairly large amount of data (more than 50K samples) to outperform it on both syntactic and semantic transfer. The code for our experiments is publicly available.
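A minimal PyTorch sketch of the key/value split described above, with one latent variable feeding the keys and another feeding the values of the decoder's cross-attention; the dimensions and projections are illustrative assumptions, not the paper's exact architecture.

```python
# Cross-attention in which keys come from one latent variable and values
# from another, so the two latents can specialize (e.g. syntax vs.
# semantics). Single-head, unbatched-head version for clarity.
import torch
import torch.nn as nn

class KeyValueLatentAttention(nn.Module):
    def __init__(self, d_model=256):
        super().__init__()
        self.query_proj = nn.Linear(d_model, d_model)  # reads decoder states
        self.key_proj = nn.Linear(d_model, d_model)    # reads z_key
        self.value_proj = nn.Linear(d_model, d_model)  # reads z_value

    def forward(self, dec_states, z_key, z_value):
        # dec_states: (batch, tgt_len, d); z_*: (batch, n_latent, d)
        q = self.query_proj(dec_states)
        k = self.key_proj(z_key)
        v = self.value_proj(z_value)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)
        return attn @ v

layer = KeyValueLatentAttention()
out = layer(torch.randn(2, 7, 256), torch.randn(2, 4, 256), torch.randn(2, 4, 256))
print(out.shape)  # torch.Size([2, 7, 256])
```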

pdf bib
Tâches Auxiliaires Multilingues pour le Transfert de Modèles de Détection de Discours Haineux (Multilingual Auxiliary Tasks for Zero-Shot Cross-Lingual Transfer of Hate Speech Detection)
Arij Riabi | Syrielle Montariol | Djamé Seddah
Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale

The task of detecting hateful content is arduous, as it requires deep cultural and contextual knowledge; the knowledge needed varies, among other factors, with the language of the speaker or the target of the content. Yet annotated data for specific domains and languages are often absent or limited. This is where data in other languages can be exploited; but because of these variations, cross-lingual transfer is often difficult. In this article, we highlight this limitation for several domains and languages and show the positive impact of training on multilingual auxiliary tasks - sentiment analysis, named entity recognition, and tasks relying on morpho-syntactic information - on the zero-shot cross-lingual transfer of hate speech detection models, in order to bridge this cultural gap.

pdf bib
Quand être absent de mBERT n’est que le commencement : Gérer de nouvelles langues à l’aide de modèles de langues multilingues (When Being Unseen from mBERT is just the Beginning : Handling New Languages With Multilingual Language Models)
Benjamin Muller | Antonios Anastasopoulos | Benoît Sagot | Djamé Seddah
Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale

Transfer learning based on pretraining language models on large amounts of raw data has become the norm for reaching state-of-the-art performance in NLP. However, how this approach should be applied to unseen languages, which are not covered by any large-scale multilingual language model and for which only a small amount of raw data is usually available, remains unclear. In this work, by comparing multilingual and monolingual models, we show that such models behave in multiple ways on unseen languages. Some languages greatly benefit from transfer learning and behave similarly to closely related high-resource languages, whereas this is clearly not the case for others. Focusing on the latter, we show that this failure to transfer is largely related to the impact of the script these languages use. We show that transliterating these languages considerably improves the potential of large multilingual neural language models on downstream tasks. This result points to a promising direction for making these massively multilingual models useful for a new set of languages absent from their training data.

pdf bib
Fine-tuning and Sampling Strategies for Multimodal Role Labeling of Entities under Class Imbalance
Syrielle Montariol | Étienne Simon | Arij Riabi | Djamé Seddah
Proceedings of the Workshop on Combating Online Hostile Posts in Regional Languages during Emergency Situations

We propose our solution to the multimodal semantic role labeling task from the CONSTRAINT’22 workshop. The task aims at classifying entities in memes into classes such as “hero” and “villain”. We use several pre-trained multi-modal models to jointly encode the text and image of the memes, and implement three systems to classify the role of the entities. We propose dynamic sampling strategies to tackle the issue of class imbalance. Finally, we perform qualitative analysis on the representations of the entities.
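One common way to implement such dynamic sampling is a weighted sampler that oversamples rare classes. The following PyTorch sketch shows the generic mechanism; the paper's exact strategies may differ.

```python
# Oversampling rare classes with a weighted sampler (PyTorch): each
# example is drawn with probability inversely proportional to the size
# of its class, so batches mix classes more evenly.
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

labels = torch.tensor([0, 0, 0, 0, 0, 0, 1, 1, 2])  # imbalanced toy labels
features = torch.randn(len(labels), 8)

class_counts = torch.bincount(labels).float()
weights = 1.0 / class_counts[labels]  # rare classes get larger weights
sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)

loader = DataLoader(TensorDataset(features, labels), batch_size=4, sampler=sampler)
for x, y in loader:
    print(y.tolist())
```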

pdf bib
The MRL 2022 Shared Task on Multilingual Clause-level Morphology
Omer Goldman | Francesco Tinner | Hila Gonen | Benjamin Muller | Victoria Basmov | Shadrack Kirimi | Lydia Nishimwe | Benoît Sagot | Djamé Seddah | Reut Tsarfaty | Duygu Ataman
Proceedings of the 2nd Workshop on Multi-lingual Representation Learning (MRL)

The 2022 Multilingual Representation Learning (MRL) Shared Task was dedicated to clause-level morphology. As the first ever benchmark that defines and evaluates morphology outside its traditional lexical boundaries, the shared task on multilingual clause-level morphology sets the scene for competition across different approaches to morphological modeling, with 3 clause-level sub-tasks: morphological inflection, reinflection and analysis, where systems are required to generate, manipulate or analyze simple sentences centered around a single content lexeme and a set of morphological features characterizing its syntactic clause. This year’s tasks covered eight typologically distinct languages: English, French, German, Hebrew, Russian, Spanish, Swahili and Turkish. The task received submissions of four systems from three teams, which were compared to two baselines implementing prominent multilingual learning methods. The results show that modern NLP models are effective in solving morphological tasks even at the clause level. However, there is still room for improvement, especially in the task of morphological analysis.

2021

pdf bib
When Being Unseen from mBERT is just the Beginning: Handling New Languages With Multilingual Language Models
Benjamin Muller | Antonios Anastasopoulos | Benoît Sagot | Djamé Seddah
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Transfer learning based on pretraining language models on a large amount of raw data has become a new norm to reach state-of-the-art performance in NLP. Still, it remains unclear how this approach should be applied for unseen languages that are not covered by any available large-scale multilingual language model and for which only a small amount of raw data is generally available. In this work, by comparing multilingual and monolingual models, we show that such models behave in multiple ways on unseen languages. Some languages greatly benefit from transfer learning and behave similarly to closely related high resource languages whereas others apparently do not. Focusing on the latter, we show that this failure to transfer is largely related to the impact of the script used to write such languages. We show that transliterating those languages significantly improves the potential of large-scale multilingual language models on downstream tasks. This result provides a promising direction towards making these massively multilingual models useful for a new set of unseen languages.
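The transliteration idea can be sketched generically: map the original script to Latin before tokenization. The snippet below uses the third-party unidecode package as a stand-in; the transliteration schemes studied in the paper are language-specific and more careful.

```python
# Sketch: transliterating a non-Latin-script input to Latin before
# feeding a multilingual model. `unidecode` is a generic stand-in for a
# proper, language-specific transliteration scheme.
from unidecode import unidecode
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

text = "Привет мир"           # toy Cyrillic example
latin = unidecode(text)       # -> "Privet mir"
print(latin, tokenizer.tokenize(latin))
```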

pdf bib
Challenging the Semi-Supervised VAE Framework for Text Classification
Ghazi Felhi | Joseph Le Roux | Djamé Seddah
Proceedings of the Second Workshop on Insights from Negative Results in NLP

Semi-Supervised Variational Autoencoders (SSVAEs) are widely used models for data-efficient learning. In this paper, we question the adequacy of the standard design of sequence SSVAEs for the task of text classification, as we exhibit two sources of overcomplexity for which we provide simplifications. These simplifications to SSVAEs preserve their theoretical soundness while providing a number of practical advantages in the semi-supervised setup where the result of training is a text classifier. These simplifications are the removal of (i) the Kullback-Leibler divergence from its objective and (ii) the fully unobserved latent variable from its probabilistic model. These changes relieve users from choosing a prior for their latent variables, make the model smaller and faster, and allow for a better flow of information into the latent variables. We compare the simplified versions to standard SSVAEs on 4 text classification tasks. On top of the above-mentioned simplifications, experiments show a speed-up of 26% while keeping equivalent classification scores. The code to reproduce our experiments is public.

pdf bib
Proceedings of the 17th International Conference on Parsing Technologies and the IWPT 2021 Shared Task on Parsing into Enhanced Universal Dependencies (IWPT 2021)
Stephan Oepen | Kenji Sagae | Reut Tsarfaty | Gosse Bouma | Djamé Seddah | Daniel Zeman
Proceedings of the 17th International Conference on Parsing Technologies and the IWPT 2021 Shared Task on Parsing into Enhanced Universal Dependencies (IWPT 2021)

pdf bib
From Raw Text to Enhanced Universal Dependencies: The Parsing Shared Task at IWPT 2021
Gosse Bouma | Djamé Seddah | Daniel Zeman
Proceedings of the 17th International Conference on Parsing Technologies and the IWPT 2021 Shared Task on Parsing into Enhanced Universal Dependencies (IWPT 2021)

We describe the second IWPT task on end-to-end parsing from raw text to Enhanced Universal Dependencies. We provide details about the evaluation metrics and the datasets used for training and evaluation. We compare the approaches taken by participating teams and discuss the results of the shared task, also in comparison with the first edition of this task.

pdf bib
First Align, then Predict: Understanding the Cross-Lingual Ability of Multilingual BERT
Benjamin Muller | Yanai Elazar | Benoît Sagot | Djamé Seddah
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Multilingual pretrained language models have demonstrated remarkable zero-shot cross-lingual transfer capabilities. Such transfer emerges by fine-tuning on a task of interest in one language and evaluating on a distinct language, not seen during the fine-tuning. Despite promising results, we still lack a proper understanding of the source of this transfer. Using a novel layer ablation technique and analyses of the model’s internal representations, we show that multilingual BERT, a popular multilingual language model, can be viewed as the stacking of two sub-networks: a multilingual encoder followed by a task-specific language-agnostic predictor. While the encoder is crucial for cross-lingual transfer and remains mostly unchanged during fine-tuning, the task predictor has little importance on the transfer and can be reinitialized during fine-tuning. We present extensive experiments with three distinct tasks, seventeen typologically diverse languages and multiple domains to support our hypothesis.
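The reinitialization experiment this finding suggests can be sketched as follows: reset the parameters of mBERT's upper layers and keep the lower, aligning layers pre-trained before fine-tuning. The layer cut-off below is illustrative, not the paper's setting.

```python
# Sketch: re-initializing the top layers of mBERT while keeping the
# lower layers pre-trained, then fine-tuning as usual.
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-multilingual-cased")

for layer in model.encoder.layer[8:]:  # illustrative cut-off
    layer.apply(model._init_weights)   # reuse the model's own init scheme

# Per the paper's hypothesis, cross-lingual transfer should be largely
# preserved after fine-tuning, since it is carried by the lower layers.
```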

pdf bib
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations
Dimitra Gkatzia | Djamé Seddah
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations

pdf bib
Understanding the Impact of UGC Specificities on Translation Quality
José Carlos Rosales Núñez | Djamé Seddah | Guillaume Wisniewski
Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)

This work takes a critical look at the evaluation of user-generated content automatic translation, the well-known specificities of which raise many challenges for MT. Our analyses show that measuring the average-case performance using a standard metric on a UGC test set falls far short of giving a reliable image of the UGC translation quality. That is why we introduce a new data set for the evaluation of UGC translation in which UGC specificities have been manually annotated using a fine-grained typology. Using this data set, we conduct several experiments to measure the impact of different kinds of UGC specificities on translation quality, more precisely than previously possible.

pdf bib
Noisy UGC Translation at the Character Level: Revisiting Open-Vocabulary Capabilities and Robustness of Char-Based Models
José Carlos Rosales Núñez | Guillaume Wisniewski | Djamé Seddah
Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)

This work explores the capacity of character-based Neural Machine Translation to translate noisy User-Generated Content (UGC), with a strong focus on exploring the limits of such approaches in handling productive UGC phenomena, which, almost by definition, cannot be seen at training time. Within a strict zero-shot scenario, we first study the detrimental impact on translation performance of various user-generated content phenomena on a small annotated dataset we developed, and then show that such models are indeed incapable of handling unknown letters, which leads to catastrophic translation failure once such characters are encountered. We further confirm this behavior with a simple, yet insightful, copy task experiment and highlight the importance of reducing the vocabulary size hyper-parameter to increase the robustness of character-based models for machine translation.

pdf bib
Can Character-based Language Models Improve Downstream Task Performances In Low-Resource And Noisy Language Scenarios?
Arij Riabi | Benoît Sagot | Djamé Seddah
Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)

Recent impressive improvements in NLP, largely based on the success of contextual neural language models, have been mostly demonstrated on at most a couple dozen high-resource languages. Building language models and, more generally, NLP systems for non-standardized and low-resource languages remains a challenging task. In this work, we focus on North-African colloquial dialectal Arabic written using an extension of the Latin script, called NArabizi, found mostly on social media and messaging communication. In this low-resource scenario with data displaying a high level of variability, we compare the downstream performance of a character-based language model on part-of-speech tagging and dependency parsing to that of monolingual and multilingual models. We show that a character-based model trained on only 99k sentences of NArabizi and fine-tuned on a small treebank of this language leads to performance close to that obtained with the same architecture pre-trained on large multilingual and monolingual models. Confirming these results on a much larger dataset of noisy French user-generated content, we argue that such character-based language models can be an asset for NLP in low-resource and high language variability settings.

pdf bib
Synthetic Data Augmentation for Zero-Shot Cross-Lingual Question Answering
Arij Riabi | Thomas Scialom | Rachel Keraron | Benoît Sagot | Djamé Seddah | Jacopo Staiano
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Coupled with the availability of large-scale datasets, deep learning architectures have enabled rapid progress on the Question Answering task. However, most of those datasets are in English, and the performance of state-of-the-art multilingual models is significantly lower when evaluated on non-English data. Due to high data collection costs, it is not realistic to obtain annotated data for each language one desires to support. We propose a method to improve Cross-lingual Question Answering performance without requiring additional annotated data, leveraging Question Generation models to produce synthetic samples in a cross-lingual fashion. We show that the proposed method significantly outperforms the baselines trained on English data only. We report a new state of the art on four datasets: MLQA, XQuAD, SQuAD-it and PIAF (fr).

2020

pdf bib
Proceedings of the 16th International Conference on Parsing Technologies and the IWPT 2020 Shared Task on Parsing into Enhanced Universal Dependencies
Gosse Bouma | Yuji Matsumoto | Stephan Oepen | Kenji Sagae | Djamé Seddah | Weiwei Sun | Anders Søgaard | Reut Tsarfaty | Dan Zeman
Proceedings of the 16th International Conference on Parsing Technologies and the IWPT 2020 Shared Task on Parsing into Enhanced Universal Dependencies

pdf bib
Overview of the IWPT 2020 Shared Task on Parsing into Enhanced Universal Dependencies
Gosse Bouma | Djamé Seddah | Daniel Zeman
Proceedings of the 16th International Conference on Parsing Technologies and the IWPT 2020 Shared Task on Parsing into Enhanced Universal Dependencies

This overview introduces the task of parsing into enhanced universal dependencies, describes the datasets used for training and evaluation, and presents the evaluation metrics. We outline the various approaches and discuss the results of the shared task.

pdf bib
Les modèles de langue contextuels Camembert pour le français : impact de la taille et de l’hétérogénéité des données d’entrainement (CamemBERT Contextual Language Models for French: Impact of Training Data Size and Heterogeneity)
Louis Martin | Benjamin Muller | Pedro Javier Ortiz Suárez | Yoann Dupont | Laurent Romary | Éric Villemonte de la Clergerie | Benoît Sagot | Djamé Seddah
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 2 : Traitement Automatique des Langues Naturelles

Contextual neural language models are now ubiquitous in natural language processing. Until recently, most available models were trained either on English data or on the concatenation of data in several languages. The practical use of these models, in all languages but English, was therefore limited. The recent release of several monolingual models based on BERT (Devlin et al., 2019), notably for French, demonstrated the interest of these models by improving the state of the art for all evaluated tasks. In this article, based on experiments conducted on CamemBERT (Martin et al., 2019), we show that using data with high variability is preferable to more uniform data. More surprisingly, we show that using a relatively small set of web data (4GB) gives results as good as those obtained from datasets larger by two orders of magnitude (138GB).

pdf bib
Treebanking User-Generated Content: A Proposal for a Unified Representation in Universal Dependencies
Manuela Sanguinetti | Cristina Bosco | Lauren Cassidy | Özlem Çetinoğlu | Alessandra Teresa Cignarella | Teresa Lynn | Ines Rehbein | Josef Ruppenhofer | Djamé Seddah | Amir Zeldes
Proceedings of the Twelfth Language Resources and Evaluation Conference

The paper presents a discussion on the main linguistic phenomena of user-generated texts found in web and social media, and proposes a set of annotation guidelines for their treatment within the Universal Dependencies (UD) framework. Given on the one hand the increasing number of treebanks featuring user-generated content, and its somewhat inconsistent treatment in these resources on the other, the aim of this paper is twofold: (1) to provide a short, though comprehensive, overview of such treebanks - based on available literature - along with their main features and a comparative analysis of their annotation criteria, and (2) to propose a set of tentative UD-based annotation guidelines, to promote consistent treatment of the particular phenomena found in these types of texts. The main goal of this paper is to provide a common framework for those teams interested in developing similar resources in UD, thus enabling cross-linguistic consistency, which is a principle that has always been in the spirit of UD.

pdf bib
Proceedings of the 19th International Workshop on Treebanks and Linguistic Theories
Kilian Evang | Laura Kallmeyer | Rafael Ehren | Simon Petitjean | Esther Seyffarth | Djamé Seddah
Proceedings of the 19th International Workshop on Treebanks and Linguistic Theories

pdf bib
Simple, Interpretable and Stable Method for Detecting Words with Usage Change across Corpora
Hila Gonen | Ganesh Jawahar | Djamé Seddah | Yoav Goldberg
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

The problem of comparing two bodies of text and searching for words that differ in their usage between them arises often in digital humanities and computational social science. This is commonly approached by training word embeddings on each corpus, aligning the vector spaces, and looking for words whose cosine distance in the aligned space is large. However, these methods often require extensive filtering of the vocabulary to perform well, and - as we show in this work - result in unstable, and hence less reliable, results. We propose an alternative approach that does not use vector space alignment, and instead considers the neighbors of each word. The method is simple, interpretable and stable. We demonstrate its effectiveness in 9 different setups, considering different corpus splitting criteria (age, gender and profession of tweet authors, time of tweet) and different languages (English, French and Hebrew).
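A minimal sketch of the neighbor-overlap idea with gensim: train one embedding space per corpus, then rank words shared by both vocabularies by how little their top-k neighbor sets agree. File names and hyperparameters below are placeholders, not the paper's setup.

```python
# Neighbor-based usage-change detection: small overlap between a word's
# nearest-neighbor sets in the two corpora suggests a usage change.
from gensim.models import Word2Vec

def top_k_neighbors(model, word, k=100):
    return {w for w, _ in model.wv.most_similar(word, topn=k)}

corpus_a = [line.split() for line in open("corpus_a.txt", encoding="utf-8")]
corpus_b = [line.split() for line in open("corpus_b.txt", encoding="utf-8")]
model_a = Word2Vec(corpus_a, vector_size=100, min_count=20)
model_b = Word2Vec(corpus_b, vector_size=100, min_count=20)

shared = set(model_a.wv.key_to_index) & set(model_b.wv.key_to_index)
scores = {w: len(top_k_neighbors(model_a, w) & top_k_neighbors(model_b, w))
          for w in shared}

# Words with the smallest neighbor overlap are usage-change candidates.
for w in sorted(scores, key=scores.get)[:20]:
    print(w, scores[w])
```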

pdf bib
Building a User-Generated Content North-African Arabizi Treebank: Tackling Hell
Djamé Seddah | Farah Essaidi | Amal Fethi | Matthieu Futeral | Benjamin Muller | Pedro Javier Ortiz Suárez | Benoît Sagot | Abhishek Srivastava
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We introduce the first treebank for a romanized user-generated content variety of Algerian, a North-African Arabic dialect known for its frequent usage of code-switching. Made of 1500 sentences, fully annotated in morpho-syntax and Universal Dependency syntax, with full translation at both the word and the sentence levels, this treebank is made freely available. It is supplemented with 50k unlabeled sentences collected from Common Crawl and web-crawled data using intensive data-mining techniques. Preliminary experiments demonstrate its usefulness for POS tagging and dependency parsing. We believe that what we present in this paper is useful beyond the low-resource language community. This is the first time that enough unlabeled and annotated data is provided for an emerging user-generated content dialectal language with rich morphology and code-switching, making it a challenging test-bed for the most recent NLP approaches.

pdf bib
CamemBERT: a Tasty French Language Model
Louis Martin | Benjamin Muller | Pedro Javier Ortiz Suárez | Yoann Dupont | Laurent Romary | Éric de la Clergerie | Djamé Seddah | Benoît Sagot
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success, most available models have either been trained on English data or on the concatenation of data in multiple languages. This makes practical use of such models –in all languages except English– very limited. In this paper, we investigate the feasibility of training monolingual Transformer-based language models for other languages, taking French as an example and evaluating our language models on part-of-speech tagging, dependency parsing, named entity recognition and natural language inference tasks. We show that the use of web crawled data is preferable to the use of Wikipedia data. More surprisingly, we show that a relatively small web crawled dataset (4GB) leads to results that are as good as those obtained using larger datasets (130+GB). Our best performing model CamemBERT reaches or improves the state of the art in all four downstream tasks.
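For reference, the released model can be queried in a few lines through the Hugging Face fill-mask pipeline; the camembert-base checkpoint is publicly available, and the example sentence is illustrative.

```python
# Querying CamemBERT through the fill-mask pipeline.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="camembert-base")
for pred in fill_mask("Le camembert est <mask> !"):
    print(pred["token_str"], round(pred["score"], 3))
```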

2019

pdf bib
Contextualized Diachronic Word Representations
Ganesh Jawahar | Djamé Seddah
Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change

Diachronic word embeddings play a key role in capturing interesting patterns about how language evolves over time. Most of the existing work focuses on studying corpora spanning across several decades, which is understandably still not a possibility when working on social media-based user-generated content. In this work, we address the problem of studying semantic changes in a large Twitter corpus collected over five years, a much shorter period than what is usually the norm in diachronic studies. We devise a novel attentional model, based on Bernoulli word embeddings, that are conditioned on contextual extra-linguistic (social) features such as network, spatial and socio-economic variables, which are associated with Twitter users, as well as topic-based features. We posit that these social features provide an inductive bias that helps our model to overcome the narrow time-span regime problem. Our extensive experiments reveal that our proposed model is able to capture subtle semantic shifts without being biased towards frequency cues and also works well when certain contextual features are absent. Our model fits the data better than current state-of-the-art dynamic word embedding models and therefore is a promising tool to study diachronic semantic changes over small time periods.

pdf bib
Comparison between NMT and PBSMT Performance for Translating Noisy User-Generated Content
José Carlos Rosales Núñez | Djamé Seddah | Guillaume Wisniewski
Proceedings of the 22nd Nordic Conference on Computational Linguistics

This work compares the performance achieved by Phrase-Based Statistical Machine Translation systems (PBSMT) and attention-based Neural Machine Translation systems (NMT) when translating User Generated Content (UGC), as encountered on social media, from French to English. We show that, contrary to what could be expected, PBSMT outperforms NMT when translating non-canonical inputs. Our error analysis uncovers the specificities of UGC that are problematic for sequential NMT architectures and suggests new avenues for improving NMT models.

pdf bib
Proceedings of the 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019)
Marie Candito | Kilian Evang | Stephan Oepen | Djamé Seddah
Proceedings of the 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019)

pdf bib
Enhancing BERT for Lexical Normalization
Benjamin Muller | Benoit Sagot | Djamé Seddah
Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)

Language model-based pre-trained representations have become ubiquitous in natural language processing. They have been shown to significantly improve the performance of neural models on a great variety of tasks. However, it remains unclear how useful those general models can be in handling non-canonical text. In this article, focusing on User Generated Content (UGC), we study the ability of BERT to perform lexical normalisation. Our contribution is simple: by framing lexical normalisation as a token prediction task, by enhancing its architecture and by carefully fine-tuning it, we show that BERT can be a competitive lexical normalisation model without the need for any UGC resources aside from 3,000 training sentences. To the best of our knowledge, this is the first work on adapting and analysing the ability of this model to handle noisy UGC data.

pdf bib
Phonetic Normalization for Machine Translation of User Generated Content
José Carlos Rosales Núñez | Djamé Seddah | Guillaume Wisniewski
Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)

We present an approach to correct noisy User Generated Content (UGC) in French, aiming to produce a pretreatment pipeline that improves Machine Translation for this kind of non-canonical corpora. To do so, we have implemented a character-based neural phonetizer that produces IPA pronunciations of words. In this way, we intend to correct the grammar, vocabulary and accentuation errors often present in noisy UGC corpora. Our method leverages the fact that some errors are due to confusion induced by words with similar pronunciations, which can be corrected using a phonetic look-up table to produce normalization candidates. These potential corrections are then encoded in a lattice and ranked using a language model to output the most probable corrected phrase. Compared to other phonetizers, our method boosts a transformer-based machine translation system on UGC.
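A toy sketch of the candidate-generation-plus-ranking idea: a hypothetical phonetic table proposes homophone spellings, and a stand-in unigram scorer replaces the language model used to rank the lattice paths in the paper.

```python
# Toy phonetic normalization: expand each token into homophone
# candidates, then keep the rewriting with the best (stand-in) LM score.
import itertools

PHONETIC_TABLE = {"ses": ["ses", "ces", "c'est", "s'est"]}  # hypothetical
UNIGRAM_LOGPROB = {"c'est": -2.0, "ses": -4.0, "ces": -4.5, "s'est": -5.0}

def lm_score(sentence):
    # stand-in for a real language model: sum of unigram log-probs
    return sum(UNIGRAM_LOGPROB.get(tok, -10.0) for tok in sentence.split())

def normalize(tokens):
    options = [PHONETIC_TABLE.get(t, [t]) for t in tokens]
    candidates = [" ".join(c) for c in itertools.product(*options)]
    return max(candidates, key=lm_score)  # most probable rewriting

print(normalize("ses pas grave".split()))  # -> "c'est pas grave"
```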

pdf bib
What Does BERT Learn about the Structure of Language?
Ganesh Jawahar | Benoît Sagot | Djamé Seddah
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

BERT is a recent language representation model that has surprisingly performed well in diverse language understanding benchmarks. This result indicates the possibility that BERT networks capture structural information about language. In this work, we provide novel support for this claim by performing a series of experiments to unpack the elements of English language structure learned by BERT. Our findings are fourfold. BERT’s phrasal representation captures phrase-level information in the lower layers. The intermediate layers of BERT compose a rich hierarchy of linguistic information, starting with surface features at the bottom, syntactic features in the middle, followed by semantic features at the top. BERT requires deeper layers to track subject-verb agreement and handle the long-distance dependency problem. Finally, the compositional scheme underlying BERT mimics classical, tree-like structures.

2018

pdf bib
ELMoLex: Connecting ELMo and Lexicon Features for Dependency Parsing
Ganesh Jawahar | Benjamin Muller | Amal Fethi | Louis Martin | Éric Villemonte de la Clergerie | Benoît Sagot | Djamé Seddah
Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

In this paper, we present the details of the neural dependency parser and the neural tagger submitted by our team ‘ParisNLP’ to the CoNLL 2018 Shared Task on parsing from raw text to Universal Dependencies. We augment the deep Biaffine (BiAF) parser (Dozat and Manning, 2016) with novel features to perform competitively: we utilize an in-domain version of ELMo features (Peters et al., 2018), which provide context-dependent word representations, and we utilize disambiguated, embedded, morphosyntactic features from lexicons (Sagot, 2018), which complement the existing feature set. Henceforth, we call our system ‘ELMoLex’. In addition to incorporating character embeddings, ELMoLex benefits from pre-trained word vectors, ELMo and morphosyntactic features (whenever available) to correctly handle rare or unknown words, which are prevalent in languages with complex morphology. ELMoLex ranked 11th by the Labeled Attachment Score metric (70.64%) and the Morphology-aware LAS metric (55.74%), and ranked 9th by the Bilexical dependency metric (60.70%).

pdf bib
CoNLL-UL: Universal Morphological Lattices for Universal Dependency Parsing
Amir More | Özlem Çetinoğlu | Çağrı Çöltekin | Nizar Habash | Benoît Sagot | Djamé Seddah | Dima Taji | Reut Tsarfaty
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Cheating a Parser to Death: Data-driven Cross-Treebank Annotation Transfer
Djamé Seddah | Eric de la Clergerie | Benoît Sagot | Héctor Martínez Alonso | Marie Candito
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

pdf bib
Enhanced UD Dependencies with Neutralized Diathesis Alternation
Marie Candito | Bruno Guillaume | Guy Perrier | Djamé Seddah
Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017)

pdf bib
The ParisNLP entry at the ConLL UD Shared Task 2017: A Tale of a #ParsingTragedy
Éric de La Clergerie | Benoît Sagot | Djamé Seddah
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

We present the ParisNLP entry at the UD CoNLL 2017 parsing shared task. In addition to the UDpipe models provided, we built our own data-driven tokenization models, sentence segmenter and lexicon-based morphological analyzers. All of these were used with a range of different parsing models (neural or not, feature-rich or not, transition or graph-based, etc.) and the best combination for each language was selected. Unfortunately, a glitch in the shared task’s Matrix led our model selector to run generic, weakly lexicalized models, tailored for surprise languages, instead of our dataset-specific models. Because of this #ParsingTragedy, we officially ranked 27th, whereas our real models finally unofficially ranked 6th.

2016

pdf bib
Hard Time Parsing Questions: Building a QuestionBank for French
Djamé Seddah | Marie Candito
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present the French Question Bank, a treebank of 2600 questions. We show that classical parsing model performance drops when facing out-of-domain data with strong structural divergences, while the inclusion of this data set in training is highly beneficial and does not harm the parsing of non-question data. With two thirds of the questions aligned with the QB (Judge et al., 2006), and being freely available, this treebank will prove useful for building robust NLP systems.

pdf bib
Accurate Deep Syntactic Parsing of Graphs: The Case of French
Corentin Ribeyre | Eric Villemonte de la Clergerie | Djamé Seddah
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Parsing predicate-argument structures in a deep syntax framework requires graphs to be predicted. Argument structures represent a higher level of abstraction than syntactic ones and are thus more difficult to predict, even for parsing models that are highly accurate on surface syntax. In this paper we investigate deep syntax parsing, using a French data set (Ribeyre et al., 2014a). We demonstrate that the use of topologically different types of syntactic features, such as dependencies, tree fragments, spines or syntactic paths, brings much-needed context to the parser. Our higher-order parsing model, thus gaining up to 4 points, establishes the state of the art for parsing French deep syntactic structures.

pdf bib
From Noisy Questions to Minecraft Texts: Annotation Challenges in Extreme Syntax Scenario
Héctor Martínez Alonso | Djamé Seddah | Benoît Sagot
Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)

User-generated content presents many challenges for its automatic processing. While many of them come from out-of-vocabulary effects, others stem from different linguistic phenomena such as unusual syntax. In this work we present a French three-domain data set made up of question headlines from a cooking forum, game chat logs and associated forums from two popular online games (Minecraft & League of Legends). We chose these domains because they encompass different degrees of lexical and syntactic compliance with canonical language. We conduct an automatic and manual evaluation of the difficulties of processing these domains for part-of-speech prediction, and introduce a pilot study to determine whether dependency analysis lends itself well to annotating these data. We also discuss the development cost of our data set.

2015

pdf bib
Because Syntax Does Matter: Improving Predicate-Argument Structures Parsing with Syntactic Features
Corentin Ribeyre | Eric Villemonte de la Clergerie | Djamé Seddah
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2014

pdf bib
Introducing the SPMRL 2014 Shared Task on Parsing Morphologically-rich Languages
Djamé Seddah | Sandra Kübler | Reut Tsarfaty
Proceedings of the First Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Languages

pdf bib
Alpage: Transition-based Semantic Graph Parsing with Syntactic Features
Corentin Ribeyre | Eric Villemonte de la Clergerie | Djamé Seddah
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

pdf bib
Annotation scheme for deep dependency syntax of French (Un schéma d’annotation en dépendances syntaxiques profondes pour le français) [in French]
Guy Perrier | Marie Candito | Bruno Guillaume | Corentin Ribeyre | Karën Fort | Djamé Seddah
Proceedings of TALN 2014 (Volume 2: Short Papers)

pdf bib
Deep Syntax Annotation of the Sequoia French Treebank
Marie Candito | Guy Perrier | Bruno Guillaume | Corentin Ribeyre | Karën Fort | Djamé Seddah | Éric de la Clergerie
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We define a deep syntactic representation scheme for French, which abstracts away from surface syntactic variation and diathesis alternations, and describe the annotation of deep syntactic representations on top of the surface dependency trees of the Sequoia corpus. The resulting deep-annotated corpus, named deep-sequoia, is freely available, and hopefully useful for corpus linguistics studies and for training deep analyzers to prepare semantic analysis.

2013

pdf bib
The LIGM-Alpage architecture for the SPMRL 2013 Shared Task: Multiword Expression Analysis and Dependency Parsing
Matthieu Constant | Marie Candito | Djamé Seddah
Proceedings of the Fourth Workshop on Statistical Parsing of Morphologically-Rich Languages

pdf bib
Overview of the SPMRL 2013 Shared Task: A Cross-Framework Evaluation of Parsing Morphologically Rich Languages
Djamé Seddah | Reut Tsarfaty | Sandra Kübler | Marie Candito | Jinho D. Choi | Richárd Farkas | Jennifer Foster | Iakes Goenaga | Koldo Gojenola Galletebeitia | Yoav Goldberg | Spence Green | Nizar Habash | Marco Kuhlmann | Wolfgang Maier | Joakim Nivre | Adam Przepiórkowski | Ryan Roth | Wolfgang Seeker | Yannick Versley | Veronika Vincze | Marcin Woliński | Alina Wróblewska | Eric Villemonte de la Clergerie
Proceedings of the Fourth Workshop on Statistical Parsing of Morphologically-Rich Languages

pdf bib
Parsing Morphologically Rich Languages: Introduction to the Special Issue
Reut Tsarfaty | Djamé Seddah | Sandra Kübler | Joakim Nivre
Computational Linguistics, Volume 39, Issue 1 - March 2013

2012

pdf bib
The French Social Media Bank: a Treebank of Noisy User Generated Content
Djamé Seddah | Benoit Sagot | Marie Candito | Virginie Mouilleron | Vanessa Combet
Proceedings of COLING 2012

pdf bib
Le corpus Sequoia : annotation syntaxique et exploitation pour l’adaptation d’analyseur par pont lexical (The Sequoia Corpus : Syntactic Annotation and Use for a Parser Lexical Domain Adaptation Method) [in French]
Marie Candito | Djamé Seddah
Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 2: TALN

pdf bib
Ubiquitous Usage of a Broad Coverage French Corpus: Processing the Est Republicain corpus
Djamé Seddah | Marie Candito | Benoit Crabbé | Enrique Henestroza Anguiano
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper, we introduce a set of resources that we have derived from the Est Républicain corpus, a large, freely-available collection of regional newspaper articles in French, totaling 150 million words. Our resources are the result of a full NLP treatment of the Est Républicain corpus: handling of multi-word expressions, lemmatization, part-of-speech tagging, and syntactic parsing. Processing of the corpus is carried out using statistical machine-learning approaches - a joint model of data-driven lemmatization and part-of-speech tagging, PCFG-LA and dependency-based models for parsing - that have been shown to achieve state-of-the-art performance when evaluated on the French Treebank. Our derived resources are made freely available, and released according to the original Creative Commons license for the Est Républicain corpus. We additionally provide an overview of the use of these resources in various applications, in particular the use of word clusters generated from the corpus to alleviate lexical data sparseness for statistical parsing.

pdf bib
Proceedings of the ACL 2012 Joint Workshop on Statistical Parsing and Semantic Processing of Morphologically Rich Languages
Marianna Apidianaki | Ido Dagan | Jennifer Foster | Yuval Marton | Djamé Seddah | Reut Tsarfaty
Proceedings of the ACL 2012 Joint Workshop on Statistical Parsing and Semantic Processing of Morphologically Rich Languages

pdf bib
Statistical Parsing of Spanish and Data Driven Lemmatization
Joseph Le Roux | Benoît Sagot | Djamé Seddah
Proceedings of the ACL 2012 Joint Workshop on Statistical Parsing and Semantic Processing of Morphologically Rich Languages

pdf bib
A linguistically-motivated 2-stage Tree to Graph Transformation
Corentin Ribeyre | Djamé Seddah | Eric Villemonte de la Clergerie
Proceedings of the 11th International Workshop on Tree Adjoining Grammars and Related Formalisms (TAG+11)

2011

pdf bib
A Word Clustering Approach to Domain Adaptation: Effective Parsing of Biomedical Texts
Marie Candito | Enrique Henestroza Anguiano | Djamé Seddah
Proceedings of the 12th International Conference on Parsing Technologies

pdf bib
Proceedings of the Second Workshop on Statistical Parsing of Morphologically Rich Languages
Djamé Seddah | Reut Tsarfaty | Jennifer Foster
Proceedings of the Second Workshop on Statistical Parsing of Morphologically Rich Languages

2010

pdf bib
Proceedings of the NAACL HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages
Djame Seddah | Sandra Koebler | Reut Tsarfaty
Proceedings of the NAACL HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages

pdf bib
Statistical Parsing of Morphologically Rich Languages (SPMRL) What, How and Whither
Reut Tsarfaty | Djamé Seddah | Yoav Goldberg | Sandra Kuebler | Yannick Versley | Marie Candito | Jennifer Foster | Ines Rehbein | Lamia Tounsi
Proceedings of the NAACL HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages

pdf bib
Parsing Word Clusters
Marie Candito | Djamé Seddah
Proceedings of the NAACL HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages

pdf bib
Lemmatization and Lexicalized Statistical Parsing of Morphologically-Rich Languages: the Case of French
Djamé Seddah | Grzegorz Chrupała | Özlem Çetinoğlu | Josef van Genabith | Marie Candito
Proceedings of the NAACL HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages

pdf bib
Control Verb, Argument Cluster Coordination and Multi Component TAG
Djamé Seddah | Benoit Sagot | Laurence Danlos
Proceedings of the 10th International Workshop on Tree Adjoining Grammar and Related Frameworks (TAG+10)

pdf bib
Exploring the Spinal-STIG Model for Parsing French
Djamé Seddah
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We evaluate statistical parsing of French using two probabilistic models derived from the Tree Adjoining Grammar framework: a Stochastic Tree Insertion Grammar (STIG) model and a specific instance of this formalism, called the Spinal Tree Insertion Grammar model, which exhibits interesting properties with regard to the data sparseness issues common to small treebanks such as the Paris 7 French Treebank. Using David Chiang’s STIG parser (Chiang, 2003), we present the results of various experiments we conducted to explore those models for French parsing. The grammar induction makes use of a head percolation table tailored for the French Treebank, which is provided in this paper. Using two evaluation metrics, we found that the parsing performance of a STIG model is tied to the size of the underlying Tree Insertion Grammar, with a more compact grammar, a spinal STIG, outperforming a genuine STIG. We finally note that a “spinal” framework seems to be emerging in the literature. Indeed, the use of vertical grammars such as Spinal STIG instead of horizontal grammars such as PCFGs, afflicted with well-known data sparseness issues, seems to be a promising path toward better parsing performance.

2009

bib
On Statistical Parsing of French with Supervised and Semi-Supervised Strategies
Marie Candito | Benoit Crabbé | Djamé Seddah
Proceedings of the EACL 2009 Workshop on Computational Linguistic Aspects of Grammatical Inference

pdf bib
Cross parser evaluation : a French Treebanks study
Djamé Seddah | Marie Candito | Benoît Crabbé
Proceedings of the 11th International Conference on Parsing Technologies (IWPT’09)

pdf bib
Adaptation de parsers statistiques lexicalisés pour le français : Une évaluation complète sur corpus arborés
Djamé Seddah | Marie Candito | Benoît Crabbé
Actes de la 16ème conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

This article presents the results of an exhaustive evaluation of the main so-called “lexicalized” probabilistic parsers, originally designed for English, adapted to French and evaluated on the French Treebank (Abeillé et al., 2003) and the Modified French Treebank (Schluter & van Genabith, 2007). Confirming the results of (Crabbé & Candito, 2008), we show that lexicalized models, namely Charniak’s models (Charniak, 2000), Collins’ models (Collins, 1999) and the Stochastic TIG model (Chiang, 2000), underperform compared to a PCFG parser with latent annotations (Petrov et al., 2006). Moreover, we show that the choice of the annotation scheme of one treebank or the other strongly shapes evaluation results, in both constituency and unlabeled dependency evaluation. Compared to (Schluter & van Genabith, 2008; Arun & Keller, 2005), all our results are state-of-the-art and refute the hypothesis that French would present a particular difficulty for probabilistic parsing, whatever the data source.

2008

pdf bib
The use of MCTAG to Process Elliptic Coordination
Djamé Seddah
Proceedings of the Ninth International Workshop on Tree Adjoining Grammar and Related Frameworks (TAG+9)

2007

pdf bib
Adapting WSJ-Trained Parsers to the British National Corpus using In-Domain Self-Training
Jennifer Foster | Joachim Wagner | Djamé Seddah | Josef van Genabith
Proceedings of the Tenth International Conference on Parsing Technologies

2006

pdf bib
Modeling and Analysis of Elliptic Coordination by Dynamic Exploitation of Derivation Forests in LTAG Parsing
Djamé Seddah | Benoît Sagot
Proceedings of the Eighth International Workshop on Tree Adjoining Grammar and Related Formalisms

pdf bib
Modélisation et analyse des coordinations elliptiques par l’exploitation dynamique des forêts de dérivation
Djamé Seddah | Benoît Sagot
Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles. Posters

In this article we present a general approach for modeling and parsing elliptic coordination. We show that elided lexemes can be replaced, during parsing, with information coming from the other conjunct of the coordination, used as a guide at the derivation level. Moreover, we show how this approach can be effectively implemented through a slight extension of Lexicalized Tree Adjoining Grammars (LTAG) via a so-called fusion operation. We describe the derivation algorithms needed for parsing coordinated constructions that may contain any number of ellipses.

2005

pdf bib
Des arbres de dérivation aux forêts de dépendance : un chemin via les forêts partagées
Djamé Seddah | Bertrand Gaiffe
Actes de la 12ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

The goal of this article is to show how to build a representation structure close to a dependency graph using the two canonical representation structures provided by Lexicalized Tree Adjoining Grammars. To illustrate this approach, we describe how to use these two structures starting from a shared forest.

2002

pdf bib
Conceptualisation d’un système d’informations lexicales, une interface paramétrable pour le T.A.L
Djamé Seddah | Evelyne Jacquey
Actes de la 9ème conférence sur le Traitement Automatique des Langues Naturelles. REncontres jeunes Chercheurs en Informatique pour le Traitement Automatique des Langues

The need for normalized, publicly available lexical resources is well established in the NLP field. This article aims to show how, on the basis of a part of the MULTEXT lexicon available on the ABU server, it would be possible to build an architecture allowing both access to the resources with different expectations (lemmatizer, parser, information extraction, prediction, etc.) and the updating of these resources by a restricted group. This updating consists in the integration and modification, automatic or manual, of existing data. To do so, we seek to take into account both the needs and the accessible data. This model is first evaluated conceptually with respect to the systems used in our team: a TAG parser, a TAG grammar builder, and an information extractor.

2000

pdf bib
Practical aspects in compiling tabular TAG parsers
Miguel A. Alonso | Djamé Seddah | Éric Villemonte de la Clergerie
Proceedings of the Fifth International Workshop on Tree Adjoining Grammar and Related Frameworks (TAG+5)
