Michele Banko


2021

Practical Transformer-based Multilingual Text Classification
Cindy Wang | Michele Banko
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Papers

Transformer-based methods are appealing for multilingual text classification, but common research benchmarks like XNLI (Conneau et al., 2018) do not reflect the data availability and task variety of industry applications. We present an empirical comparison of transformer-based text classification models in a variety of practical monolingual and multilingual pretraining and fine-tuning settings. We evaluate these methods on two distinct tasks in five different languages. Departing from prior work, our results show that multilingual language models can outperform monolingual ones in some downstream tasks and target languages. We additionally show that practical modifications such as task- and domain-adaptive pretraining and data augmentation can improve classification performance without the need for additional labeled data.

2020

A Unified Taxonomy of Harmful Content
Michele Banko | Brendon MacKeen | Laurie Ray
Proceedings of the Fourth Workshop on Online Abuse and Harms

The ability to recognize harmful content within online communities has come into focus for researchers, engineers and policy makers seeking to protect users from abuse. While the number of datasets aiming to capture forms of abuse has grown in recent years, the community has not standardized around how various harmful behaviors are defined, creating challenges for reliable moderation, modeling and evaluation. As a step towards attaining shared understanding of how online abuse may be modeled, we synthesize the most common types of abuse described by industry, policy, community and health experts into a unified typology of harmful content, with detailed criteria and exceptions for each type of abuse.

2019

Keeping Notes: Conditional Natural Language Generation with a Scratchpad Encoder
Ryan Benmalek | Madian Khabsa | Suma Desu | Claire Cardie | Michele Banko
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We introduce the Scratchpad Mechanism, a novel addition to the sequence-to-sequence (seq2seq) neural network architecture and demonstrate its effectiveness in improving the overall fluency of seq2seq models for natural language generation tasks. By enabling the decoder at each time step to write to all of the encoder output layers, Scratchpad can employ the encoder as a “scratchpad” memory to keep track of what has been generated so far and thereby guide future generation. We evaluate Scratchpad in the context of three well-studied natural language generation tasks — Machine Translation, Question Generation, and Text Summarization — and obtain state-of-the-art or comparable performance on standard datasets for each task. Qualitative assessments in the form of human judgements (question generation), attention visualization (MT), and sample output (summarization) provide further evidence of the ability of Scratchpad to generate fluent and expressive output.

Improving Knowledge Base Construction from Robust Infobox Extraction
Boya Peng | Yejin Huh | Xiao Ling | Michele Banko
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Industry Papers)

A capable, automatic Question Answering (QA) system can provide more complete and accurate answers using a comprehensive knowledge base (KB). One important approach to constructing a comprehensive knowledge base is to extract information from Wikipedia infobox tables to populate an existing KB. Despite previous successes in the Infobox Extraction (IBE) problem (e.g., DBpedia), three major challenges remain: 1) Deterministic extraction patterns used in DBpedia are vulnerable to template changes; 2) Over-trusting Wikipedia anchor links can lead to entity disambiguation errors; 3) Heuristic-based extraction of unlinkable entities yields low precision, hurting both accuracy and completeness of the final KB. This paper presents a robust approach that tackles all three challenges. We build probabilistic models to predict relations between entity mentions directly from the infobox tables in HTML. The entity mentions are linked to identifiers in an existing KB if possible. The unlinkable ones are also parsed and preserved in the final output. Training data for both the relation extraction and the entity linking models are automatically generated using distant supervision. We demonstrate the empirical effectiveness of the proposed method in both precision and recall compared to a strong IBE baseline, DBpedia, with an absolute improvement of 41.3% in average F1. We also show that our extraction makes the final KB significantly more complete, improving the completeness score of list-value relation types by 61.4%.

2008

The Tradeoffs Between Open and Traditional Relation Extraction
Michele Banko | Oren Etzioni
Proceedings of ACL-08: HLT

2007

TextRunner: Open Information Extraction on the Web
Alexander Yates | Michele Banko | Matthew Broadhead | Michael Cafarella | Oren Etzioni | Stephen Soderland
Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT)

2004

Using N-Grams To Understand the Nature of Summaries
Michele Banko | Lucy Vanderwende
Proceedings of HLT-NAACL 2004: Short Papers

Part-of-Speech Tagging in Context
Michele Banko | Robert C. Moore
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

2002

An Analysis of the AskMSR Question-Answering System
Eric Brill | Susan Dumais | Michele Banko
Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002)

2001

Scaling to Very Very Large Corpora for Natural Language Disambiguation
Michele Banko | Eric Brill
Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics

Mitigating the Paucity-of-Data Problem: Exploring the Effect of Training Corpus Size on Classifier Performance for Natural Language Processing
Michele Banko | Eric Brill
Proceedings of the First International Conference on Human Language Technology Research

2000

Headline Generation Based on Statistical Translation
Michele Banko | Vibhu O. Mittal | Michael J. Witbrock
Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics