Frederick Reiss

Also published as: Frederick R. Reiss


2021

pdf bib
Data Cleaning Tools for Token Classification Tasks
Karthik Muthuraman | Frederick Reiss | Hong Xu | Bryan Cutler | Zachary Eichenberger
Proceedings of the Second Workshop on Data Science with Human in the Loop: Language Advances

Human-in-the-loop systems for cleaning NLP training data rely on automated sieves to isolate potentially-incorrect labels for manual review. We have developed a novel technique for flagging potentially-incorrect labels with high sensitivity in named entity recognition corpora. We incorporated our sieve into an end-to-end system for cleaning NLP corpora, implemented as a modular collection of Jupyter notebooks built on extensions to the Pandas DataFrame library. We used this system to identify incorrect labels in the CoNLL-2003 corpus for English-language named entity recognition (NER), one of the most influential corpora for NER model research. Unlike previous work that only looked at a subset of the corpus’s validation fold, our automated sieve enabled us to examine the entire corpus in depth. Across the entire CoNLL-2003 corpus, we identified over 1300 incorrect labels (out of 35089 in the corpus). We have published our corrections, along with the code we used in our experiments. We are developing a repeatable version of the process we used on the CoNLL-2003 corpus as an open-source library.

2020

pdf bib
Identifying Incorrect Labels in the CoNLL-2003 Corpus
Frederick Reiss | Hong Xu | Bryan Cutler | Karthik Muthuraman | Zachary Eichenberger
Proceedings of the 24th Conference on Computational Natural Language Learning

The CoNLL-2003 corpus for English-language named entity recognition (NER) is one of the most influential corpora for NER model research. A large number of publications, including many landmark works, have used this corpus as a source of ground truth for NER tasks. In this paper, we examine this corpus and identify over 1300 incorrect labels (out of 35089 in the corpus). In particular, the number of incorrect labels in the test fold is comparable to the number of errors that state-of-the-art models make when running inference over this corpus. We describe the process by which we identified these incorrect labels, using novel variants of techniques from semi-supervised learning. We also summarize the types of errors that we found, and we revisit several recent results in NER in light of the corrected data. Finally, we show experimentally that our corrections to the corpus have a positive impact on three state-of-the-art models.

2018

pdf bib
SystemT: Declarative Text Understanding for Enterprise
Laura Chiticariu | Marina Danilevsky | Yunyao Li | Frederick Reiss | Huaiyu Zhu
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers)

The rise of enterprise applications over unstructured and semi-structured documents poses new challenges to text understanding systems across multiple dimensions. We present SystemT, a declarative text understanding system that addresses these challenges and has been deployed in a wide range of enterprise applications. We highlight the design considerations and decisions behind SystemT in addressing the needs of the enterprise setting. We also summarize the impact of SystemT on business and education.

2015

bib
Transparent Machine Learning for Information Extraction: State-of-the-Art and the Future
Laura Chiticariu | Yunyao Li | Frederick Reiss
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts

The rise of Big Data analytics over unstructured text has led to renewed interest in information extraction (IE). These applications need effective IE as a first step towards solving end-to-end real world problems (e.g. biology, medicine, finance, media and entertainment, etc). Much recent NLP research has focused on addressing specific IE problems using a pipeline of multiple machine learning techniques. This approach requires an analyst with the expertise to answer questions such as: “What ML techniques should I combine to solve this problem?”; “What features will be useful for the composite pipeline?”; and “Why is my model giving the wrong answer on this document?”. The need for this expertise creates problems in real world applications. It is very difficult in practice to find an analyst who both understands the real world problem and has deep knowledge of applied machine learning. As a result, the real impact by current IE research does not match up to the abundant opportunities available.In this tutorial, we introduce the concept of transparent machine learning. A transparent ML technique is one that:- produces models that a typical real world use can read and understand;- uses algorithms that a typical real world user can understand; and- allows a real world user to adapt models to new domains.The tutorial is aimed at IE researchers in both the academic and industry communities who are interested in developing and applying transparent ML.

2013

pdf bib
Rule-Based Information Extraction is Dead! Long Live Rule-Based Information Extraction Systems!
Laura Chiticariu | Yunyao Li | Frederick R. Reiss
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

2012

pdf bib
WizIE: A Best Practices Guided Development Environment for Information Extraction
Yunyao Li | Laura Chiticariu | Huahai Yang | Frederick Reiss | Arnaldo Carreno-fuentes
Proceedings of the ACL 2012 System Demonstrations

2011

pdf bib
SystemT: A Declarative Information Extraction System
Yunyao Li | Frederick Reiss | Laura Chiticariu
Proceedings of the ACL-HLT 2011 System Demonstrations

2010

pdf bib
Domain Adaptation of Rule-Based Annotators for Named-Entity Recognition Tasks
Laura Chiticariu | Rajasekar Krishnamurthy | Yunyao Li | Frederick Reiss | Shivakumar Vaithyanathan
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

pdf bib
SystemT: An Algebraic Approach to Declarative Information Extraction
Laura Chiticariu | Rajasekar Krishnamurthy | Yunyao Li | Sriram Raghavan | Frederick Reiss | Shivakumar Vaithyanathan
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics