Christopher Cieri

Also published as: Chris Cieri

2022

pdf bib abs
Reflections on 30 Years of Language Resource Development and Sharing
Christopher Cieri | Mark Liberman | Sunghye Cho | Stephanie Strassel | James Fiumara | Jonathan Wright
Proceedings of the Thirteenth Language Resources and Evaluation Conference

The Linguistic Data Consortium was founded in 1992 to solve the problem that limitations in access to shareable data were impeding progress in Human Language Technology research and development. At the time, DARPA had adopted the common task research management paradigm to impose additional rigor on their programs by also providing shared objectives, data and evaluation methods. Early successes underscored the promise of this paradigm but also the need for a standing infrastructure to host and distribute the shared data. During LDC’s initial five-year grant, it became clear that the demand for linguistic data could not easily be met by the existing providers and that a dedicated data center could add capacity first for data collection and shortly thereafter for annotation. The expanding purview required expansions of LDC’s technical infrastructure including systems support and software development. An open question for the center would be its role in other kinds of research beyond data development. Over its 30-year history, LDC has performed multiple roles ranging from neutral, independent data provider to multisite programs, to creator of exploratory data in tight collaboration with system developers, to research group focused on data intensive investigations.

pdf bib
Proceedings of the 2nd Workshop on Novel Incentives in Data Collection from People: models, implementations, challenges and results within LREC 2022
Chris Callison-Burch | Christopher Cieri | James Fiumara | Mark Liberman
Proceedings of the 2nd Workshop on Novel Incentives in Data Collection from People: models, implementations, challenges and results within LREC 2022

pdf bib abs
The NIEUW Project: Developing Language Resources through Novel Incentives
James Fiumara | Christopher Cieri | Mark Liberman | Chris Callison-Burch | Jonathan Wright | Robert Parker
Proceedings of the 2nd Workshop on Novel Incentives in Data Collection from People: models, implementations, challenges and results within LREC 2022

This paper provides an overview and update on the Linguistic Data Consortium’s (LDC) NIEUW (Novel Incentives and Workflows) project supported by the National Science Foundation and part of LDC’s larger goal of improving the cost, variety, scale, and quality of language resources available for education, research, and technology development. NIEUW leverages the power of novel incentives to elicit linguistic data and annotations from a wide variety of contributors including citizen scientists, game players, and language students and professionals. In order to align appropriate incentives with the various contributors, LDC has created three distinct web portals to bring together researchers and other language professionals with participants best suited to their project needs. These portals include LanguageARC designed for citizen scientists, Machina Pro Linguistica designed for students and language professionals, and LingoBoingo designed for game players. The design, interface, and underlying tools for each web portal were developed to appeal to the different incentives and motivations of their respective target audiences.

pdf bib abs
Using Mixed Incentives to Document Xi’an Guanzhong
Juhong Zhan | Yue Jiang | Christopher Cieri | Mark Liberman | Jiahong Yuan | Yiya Chen | Odette Scharenborg
Proceedings of the 2nd Workshop on Novel Incentives in Data Collection from People: models, implementations, challenges and results within LREC 2022

This paper describes our use of mixed incentives and the citizen science portal LanguageARC to prepare, collect and quality control a large corpus of object namings for the purpose of providing speech data to document the under-represented Guanzhong dialect of Chinese spoken in the Shaanxi province in the environs of Xi’an.

This study examined differences in linguistic features produced by autistic and neurotypical (NT) children during brief picture descriptions, and assessed feature stability over time. Weekly speech samples from well-characterized participants were collected using a telephony system designed to improve access for geographically isolated and historically marginalized communities. Results showed stable group differences in certain acoustic features, some of which may potentially serve as key outcome measures in future treatment studies. These results highlight the importance of eliciting semi-structured speech samples in a variety of contexts over time, and add to a growing body of research showing that fine-grained naturalistic communication features hold promise for intervention research.

2020

pdf bib
Proceedings of the LREC 2020 Workshop on "Citizen Linguistics in Language Resource Development"
James Fiumara | Christopher Cieri | Mark Liberman | Chris Callison-Burch
Proceedings of the LREC 2020 Workshop on "Citizen Linguistics in Language Resource Development"

pdf bib abs
LanguageARC: Developing Language Resources Through Citizen Linguistics
James Fiumara | Christopher Cieri | Jonathan Wright | Mark Liberman
Proceedings of the LREC 2020 Workshop on "Citizen Linguistics in Language Resource Development"

This paper introduces the citizen science platform, LanguageARC, developed within the NIEUW (Novel Incentives and Workflows) project supported by the National Science Foundation under Grant No. 1730377. LanguageARC is a community-oriented online platform bringing together researchers and “citizen linguists” with the shared goal of contributing to linguistic research and language technology development. Like other Citizen Science platforms and projects, LanguageARC harnesses the power and efforts of volunteers who are motivated by the incentives of contributing to science, learning and discovery, and belonging to a community dedicated to social improvement. Citizen linguists contribute language data and judgments by participating in research tasks such as classifying regional accents from audio clips, recording audio of picture descriptions and answering personality questionnaires to create baseline data for NLP research into autism and neurodegenerative conditions. Researchers can create projects on LanguageARC without any coding or HTML required using our Project Builder Toolkit.

pdf bib abs
LanguageARC - a tutorial
Christopher Cieri | James Fiumara
Proceedings of the LREC 2020 Workshop on "Citizen Linguistics in Language Resource Development"

LanguageARC is a portal that offers citizen linguists opportunities to contribute to language related research. It also provides researchers with infrastructure for easily creating data collection and annotation tasks on the portal and potentially connecting with contributors. This document describes LanguageARC’s main features and operation for researchers interested in creating new projects and/or using the resulting data.

pdf bib abs
Stretching Disciplinary Boundaries in Language Resource Development and Use: a Linguistic Data Consortium Position Paper
Christopher Cieri
Proceedings of the Workshop about Language Resources for the SSH Cloud

Given the persistent gap between demand and supply, the impetus to reuse language resources is great. Researchers benefit from building upon the work of others including reusing data, tools and methodology. Such reuse should always consider the original intent of the language resource and how that impacts potential reanalysis. When the reuse crosses disciplinary boundaries, the re-user also needs to consider how research standards that differ between social science and humanities on the one hand and human language technologies on the other might lead to differences in unspoken assumptions. Data centers that aim to support multiple research communities have a responsibility to build bridges across disciplinary divides by sharing data in all directions, encouraging re-use and re-sharing and engaging directly in research that improves methodologies.

pdf bib abs
Related Works in the Linguistic Data Consortium Catalog
Daniel Jaquette | Christopher Cieri | Denise DiPersio
Proceedings of the Twelfth Language Resources and Evaluation Conference

Defining relations between language resources provides an archive with the ability to better serve its users. This paper covers the development and implementation of a Related Works addition to the Linguistic Data Consortium’s (LDC) catalog. The authors go step-by-step through the development of the Related Works schema, implementation of the software and database changes, and data entry of the relations. The Related Works schema involved developing a set of controlled terms for relations based on previous work and other schema. Software and database changes consisted of both front and back end interface additions, along with modification and additions to the LDC Catalog database tables. Data entry consisted of two parts: seed data from previous work and 2019 language resources, and ongoing legacy population. Previous work in this area is discussed as well as overview information about the LDC Catalog. A list of the full LDC Related Works terms is included with brief explanations.

pdf bib abs
A Progress Report on Activities at the Linguistic Data Consortium Benefitting the LREC Community
Christopher Cieri | James Fiumara | Stephanie Strassel | Jonathan Wright | Denise DiPersio | Mark Liberman
Proceedings of the Twelfth Language Resources and Evaluation Conference

This latest in a series of Linguistic Data Consortium (LDC) progress reports to the LREC community does not describe any single language resource, evaluation campaign or technology but sketches the activities, since the last report, of a data center devoted to supporting the work of LREC attendees among other research communities. Specifically, we describe 96 new corpora released in 2018-2020 to date, a new technology evaluation campaign, ongoing activities to support multiple common task human language technology programs, and innovations to advance the methodology of language data collection and annotation.

2018

pdf bib
Introducing NIEUW: Novel Incentives and Workflows for Eliciting Linguistic Data
Christopher Cieri | James Fiumara | Mark Liberman | Chris Callison-Burch | Jonathan Wright
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
From ‘Solved Problems’ to New Challenges: A Report on LDC Activities
Christopher Cieri | Mark Liberman | Stephanie Strassel | Denise DiPersio | Jonathan Wright | Andrea Mazzucchi
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2016

pdf bib abs
The Language Application Grid and Galaxy
Nancy Ide | Keith Suderman | James Pustejovsky | Marc Verhagen | Christopher Cieri
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The NSF-SI2-funded LAPPS Grid project is a collaborative effort among Brandeis University, Vassar College, Carnegie-Mellon University (CMU), and the Linguistic Data Consortium (LDC), which has developed an open, web-based infrastructure through which resources can be easily accessed and within which tailored language services can be efficiently composed, evaluated, disseminated and consumed by researchers, developers, and students across a wide variety of disciplines. The LAPPS Grid project recently adopted Galaxy (Giardine et al., 2005), a robust, well-developed, and well-supported front end for workflow configuration, management, and persistence. Galaxy allows data inputs and processing steps to be selected from graphical menus, and results are displayed in intuitive plots and summaries that encourage interactive workflows and the exploration of hypotheses. The Galaxy workflow engine provides significant advantages for deploying pipelines of LAPPS Grid web services, including not only the means to create and deploy locally-run and even customized versions of the LAPPS Grid and to run the LAPPS Grid in the cloud, but also access to a huge array of statistical and visualization tools that have been developed for use in genomics research.

pdf bib abs
Trends in HLT Research: A Survey of LDC’s Data Scholarship Program
Denise DiPersio | Christopher Cieri
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Since its inception in 2010, the Linguistic Data Consortium’s data scholarship program has awarded no-cost grants in data to 64 recipients from 26 countries. A survey of the twelve cycles to date (two awards each in the Fall and Spring semesters from Fall 2010 through Spring 2016) yields an interesting view into graduate program research trends in human language technology and related fields and the particular data sets deemed important to support that research. The survey also reveals regions in which such activity appears to be on the rise, including in Arabic-speaking regions and portions of the Americas and Asia.

pdf bib abs
Building Language Resources for Exploring Autism Spectrum Disorders
Julia Parish-Morris | Christopher Cieri | Mark Liberman | Leila Bateman | Emily Ferguson | Robert T. Schultz
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Autism spectrum disorder (ASD) is a complex neurodevelopmental condition that would benefit from low-cost and reliable improvements to screening and diagnosis. Human language technologies (HLTs) provide one possible route to automating a series of subjective decisions that currently inform “Gold Standard” diagnosis based on clinical judgment. In this paper, we describe a new resource to support this goal, comprised of 100 20-minute semi-structured English language samples labeled with child age, sex, IQ, autism symptom severity, and diagnostic classification. We assess the feasibility of digitizing and processing sensitive clinical samples for data sharing, and identify areas of difficulty. Using the methods described here, we propose to join forces with researchers and clinicians throughout the world to establish an international repository of annotated language samples from individuals with ASD and related disorders. This project has the potential to improve the lives of individuals with ASD and their families by identifying linguistic features that could improve remote screening, inform personalized intervention, and promote advancements in clinically-oriented HLTs.

pdf bib abs
Data Management Plans and Data Centers
Denise DiPersio | Christopher Cieri | Daniel Jaquette
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Data management plans, data sharing plans and the like are now required by funders worldwide as part of research proposals. Concerned with promoting the notion of open scientific data, funders view such plans as the framework for satisfying the generally accepted requirements for data generated in funded research projects, among them that it be accessible, usable, standardized to the degree possible, secure and stable. This paper examines the origins of data management plans, their requirements and issues they raise for data centers and HLT resource development in general.

pdf bib abs
Selection Criteria for Low Resource Language Programs
Christopher Cieri | Mike Maxwell | Stephanie Strassel | Jennifer Tracey
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper documents and describes the criteria used to select languages for study within programs that include low resource languages whether given that label or another similar one. It focuses on five US common task, Human Language Technology research and development programs in which the authors have provided information or consulting related to the choice of language. The paper does not describe the actual selection process which is the responsibility of program management and highly specific to a program’s individual goals and context. Instead it concentrates on the data and criteria that have been considered relevant previously with the thought that future program managers and their consultants may adapt these and apply them with different prioritization to future programs.

2014

pdf bib
Intellectual Property Rights Management with Web Service Grids
Christopher Cieri | Denise DiPersio
Proceedings of the Workshop on Open Infrastructures and Analysis Frameworks for HLT

pdf bib abs
New Directions for Language Resource Development and Distribution
Christopher Cieri | Denise DiPersio | Mark Liberman | Andrea Mazzucchi | Stephanie Strassel | Jonathan Wright
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Despite the growth in the number of linguistic data centers around the world, their accomplishments and expansions and the advances they have helped enable, the language resources that exist are a small fraction of those required to meet the goals of Human Language Technologies (HLT) for the world's languages and the promises they offer: broad access to knowledge, direct communication across language boundaries and engagement in a global community. Using the Linguistic Data Consortium as a focus case, this paper sketches the progress of data centers, summarizes recent activities and then turns to several issues that have received inadequate attention and proposes some new approaches to their resolution.

The Language Application (LAPPS) Grid project is establishing a framework that enables language service discovery, composition, and reuse and promotes sustainability, manageability, usability, and interoperability of natural language processing (NLP) components. It is based on the service-oriented architecture (SOA), a more recent, web-oriented version of the pipeline architecture that has long been used in NLP for sequencing loosely-coupled linguistic analyses. The LAPPS Grid provides access to basic NLP processing tools and resources and enables pipelining such tools to create custom NLP applications, as well as composite services such as question answering and machine translation together with language resources such as mono- and multi-lingual corpora and lexicons that support NLP. The transformative aspect of the LAPPS Grid is that it orchestrates access to and deployment of language resources and processing functions available from servers around the globe and enables users to add their own language resources, services, and even service grids to satisfy their particular needs.

pdf bib abs
Facing the Identification Problem in Language-Related Scientific Data Analysis.
Joseph Mariani | Christopher Cieri | Gil Francopoulo | Patrick Paroubek | Marine Delaborde
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper describes the problems that must be addressed when studying large amounts of data over time which require entity normalization applied not to the usual genres of news or political speech, but to the genre of academic discourse about language resources, technologies and sciences. It reports on the normalization processes that had to be applied to produce data usable for computing statistics in three past studies on the LRE Map, the ISCA Archive and the LDC Bibliography. It shows the need for human expertise during normalization and the necessity to adapt the work to the study objectives. It investigates possible improvements for reducing the workload necessary to produce comparable results. Through this paper, we show the necessity to define and agree on international persistent and unique identifiers.

pdf bib abs
Developing a Framework for Describing Relations among Language Resources
Penny Labropoulou | Christopher Cieri | Maria Gavrilidou
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper, we study relations holding between language resources as implemented in activities concerned with their documentation. We envision the term language resources with an inclusive definition covering datasets (corpora, lexica, ontologies, grammars, etc.), tools (including web services, workflows, platforms etc.), related publications and documentation, specifications and guidelines. However, the scope of the paper is limited to relations holding for datasets and tools. The study focuses on the META-SHARE infrastructure and the Linguistic Data Consortium and takes into account the ISOcat DCR relations. Based on this study, we propose a taxonomy of relations, discuss their semantics and provide specifications for their use in order to cater for semantic interoperability. Issues of granularity, redundancy in codification, naming conventions and semantics of the relations are presented.

2012

pdf bib abs
LDC Language Resource Database: Building a Bibliographic Database
Eleftheria Ahtaridis | Christopher Cieri | Denise DiPersio
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The Linguistic Data Consortium (LDC) creates and provides language resources (LRs) including data, tools and specifications. In order to assess the impact of these LRs and to support both LR users and authors, LDC is collecting metadata about and URLs for research papers that introduce, describe, critique, extend or rely upon LDC LRs. Current collection efforts focus on papers published in journals and conference proceedings that are available online. To date, nearly 300, or over half of the LRs LDC distributes, have been searched for extensively and almost 8000 research papers about these LRs have been documented. This paper discusses the issues with collecting references and includes preliminary analysis of those results. The remaining goals of the project are also outlined.

pdf bib abs
Twenty Years of Language Resource Development and Distribution: A Progress Report on LDC Activities
Christopher Cieri | Marian Reed | Denise DiPersio | Mark Liberman
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

On the Linguistic Data Consortium's (LDC) 20th anniversary, this paper describes the changes to the language resource landscape over the past two decades, how LDC has adjusted its practice to adapt to them and how the business model continues to grow. Specifically, we will discuss LDC's evolving roles and changes in the sizes and types of LDC language resources (LR) as well as the data they include and the annotations of that data. We will also discuss adaptations of the LDC business model and the sponsored projects it supports.

2010

In this paper, we present the design, collection, transcription and analysis of a Mandarin Chinese Broadcast Collection of over 3000 hours. The data was collected by Hong Kong University of Science and Technology (HKUST) in China on a cable TV and satellite transmission platform established in support of the DARPA Global Autonomous Language Exploitation (GALE) program. The collection includes broadcast news (BN) and broadcast conversation (BC) including talk shows, roundtable discussions, call-in shows, editorials and other conversational programs that focus on news and current events. HKUST also collects detailed information about all recorded programs. A subset of BC and BN recordings are manually transcribed with standard Chinese characters in UTF-8 encoding, using specific mark-ups for a small set of spontaneous and conversational speech phenomena. The collection is among the largest and first of its kind for Mandarin Chinese Broadcast speech, providing abundant and diverse samples for Mandarin speech recognition and other application-dependent tasks, such as spontaneous speech processing and recognition, topic detection, information retrieval, and speaker recognition. HKUST's acoustic analysis of 500 hours of the speech and transcripts demonstrates the positive impact this data could have on system performance.

The Greybeard Project was designed to enable research in speaker recognition using data that have been collected over a long period of time. Since 1994, LDC has been collecting speech samples for use in research and evaluations. By mining our earlier collections we assembled a list of subjects who had participated in multiple studies. These participants were then contacted and asked to take part in the Greybeard Project. The only constraint was that the participants must have made numerous calls in prior studies and the calls had to be a minimum of two years old. The archived data was sorted by participant and subsequent calls were added to their files. This is the first longitudinal study of its kind. The resulting corpus contains multiple calls for each participant that span anywhere from two to 12 years in time. It is our hope that these data will enable speaker recognition researchers to explore the effects of aging on voice.

Linguistic Data Consortium's Human Subjects Data Collection lab conducts multi-modal speech collections to develop corpora for use in speech, speaker and language research and evaluations. The Mixer collections have evolved over the years to best accommodate the ever changing needs of the research community and to hopefully keep one step ahead by providing increasingly challenging data. Over the years Mixer collections have grown to include socio-linguistic interviews, a wide variety of telephone conditions and multiple languages, recording conditions, channels and speech acts. Mixer 6 was the most recent collection. This paper describes the Mixer 6 Phase 1 project. Mixer 6 Phase 1 was a study supporting linguistic research, technology development and education. The object of this study was to record speech in a variety of situations that vary formality and model multiple naturally occurring interactions as well as a variety of channel conditions.

LRs remain expensive to create and thus rare relative to demand across languages and technology types. The accidental re-creation of an LR that already exists is a nearly unforgivable waste of scarce resources that is unfortunately not so easy to avoid. The number of catalogs the HLT researcher must search, with their different formats, makes it possible to overlook an existing resource. This paper sketches the sources of this problem and outlines a proposal to rectify it, along with a new vision of LR cataloging that will facilitate the documentation and exploitation of a much wider range of LRs than previously considered.

pdf bib abs
Adapting to Trends in Language Resource Development: A Progress Report on LDC Activities
Christopher Cieri | Mark Liberman
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper describes changing needs among the communities that exploit language resources and recent LDC activities and publications that support those needs by providing greater volumes of data and associated resources in a growing inventory of languages with ever more sophisticated annotation. Specifically, it covers the evolving role of data centers with specific emphasis on the LDC, the publications released by the LDC in the two years since our last report and the sponsored research programs that provide LRs initially to participants in those programs but eventually to the larger HLT research communities and beyond.

2009

pdf bib
Basic Language Resources for Diverse Asian Languages: A Streamlined Approach for Resource Creation
Heather Simpson | Kazuaki Maeda | Christopher Cieri
Proceedings of the 7th Workshop on Asian Language Resources (ALR7)

2008

pdf bib abs
The Linguistic Data Consortium Member Survey: Purpose, Execution and Results
Marian Reed | Denise DiPersio | Christopher Cieri
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The Linguistic Data Consortium (LDC) seeks to provide its members with quality linguistic resources and services. In order to pursue these ideals and to remain current, LDC monitors the needs and sentiments of its communities. One mechanism LDC uses to generate feedback on consortium and resource issues is the LDC Member Survey. The survey allows LDC Members and nonmembers to provide LDC with valuable insight into their own unique circumstances, their current and future data needs and their views on LDC's role in meeting them. When the 2006 Survey was found to be a useful tool for communicating with the Consortium membership, a 2007 Survey was organized and administered. As a result of the surveys, LDC has confirmed that it has made a positive impact on the community and has identified ways to improve the quality of service and the diversity of monthly offerings. Many respondents recommended ways to improve LDC's functions, ordering mechanism and webpage. Some of these comments have inspired changes to LDC's operation and strategy.

pdf bib abs
15 Years of Language Resource Creation and Sharing: a Progress Report on LDC Activities
Christopher Cieri | Mark Liberman
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper, the fifth in a series of biennial progress reports, reviews the activities of the Linguistic Data Consortium with particular emphasis on general trends in the language resource landscape and on changes that distinguish the two years since LDC's last report at LREC from the preceding 8 years. After providing a perspective on the current landscape of language resources, the paper goes on to describe our vision of the role of LDC within the research communities it serves before sketching briefly specific publications and resource creation projects that have been the focus of our attention since the last report.

pdf bib abs
Bridging the Gap between Linguists and Technology Developers: Large-Scale, Sociolinguistic Annotation for Dialect and Speaker Recognition
Christopher Cieri | Stephanie Strassel | Meghan Glenn | Reva Schwartz | Wade Shen | Joseph Campbell
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Recent years have seen increased interest within the speaker recognition community in high-level features including, for example, lexical choice, idiomatic expressions or syntactic structures. The promise of speaker recognition in forensic applications drives development toward systems robust to channel differences by selecting features inherently robust to channel difference. Within the language recognition community, there is growing interest in differentiating not only languages but also mutually intelligible dialects of a single language. Decades of research in dialectology suggest that high-level features can enable systems to cluster speakers according to the dialects they speak. The Phanotics (Phonetic Annotation of Typicality in Conversational Speech) project seeks to identify high-level features characteristic of American dialects, annotate a corpus for these features, use the data to develop dialect recognition systems and also use the categorization to create better models for speaker recognition. The data, once published, should be useful to other developers of speaker and dialect recognition systems and to dialectologists and sociolinguists. We expect the methods will generalize well beyond the speakers, dialects, and languages discussed here and should, if successful, provide a model for how linguists and technology developers can collaborate in the future for the benefit of both groups and toward a deeper understanding of how languages vary and change.

pdf bib abs
Speaker Recognition: Building the Mixer 4 and 5 Corpora
Linda Brandschain | Christopher Cieri | David Graff | Abby Neely | Kevin Walker
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The original Mixer corpus was designed to satisfy developing commercial and forensic needs. The resulting Mixer corpora, Phases 1 through 5, have evolved to support an increasing variety of research tasks, including multilingual and cross-channel recognition. The Mixer Phases 4 and 5 corpora feature a wider variety of channels and greater variation in the situations under which the speech is recorded. This paper focuses on the plans, progress and results of Mixer 4 and 5.

2007

bib
Linguistic resources in support of various evaluation metrics
Christopher Cieri | Stephanie Strassel | Meghan Lammie Glenn | Lauren Friedman
Proceedings of the Workshop on Automatic procedures in MT evaluation

2006

This paper describes the planning and creation of the Mixer and Transcript Reading corpora, their properties and yields, and reports on the lessons learned during their development.

pdf bib abs
More Data and Tools for More Languages and Research Areas: A Progress Report on LDC Activities
Christopher Cieri | Mark Liberman
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This presentation reports on recent progress the Linguistic Data Consortium has made in addressing the needs of multiple research communities by collecting, annotating and distributing data, simplifying access, and developing standards and tools. Specifically, it describes new trends in publication, a sample of recent projects and significant improvements to LDC Online that ease access to LDC data, especially for those with limited computing support.

The Linguistic Data Consortium has recently embarked on an effort to create integrated linguistic resources and related infrastructure for language exploitation technologies within the DARPA GALE (Global Autonomous Language Exploitation) Program. GALE targets an end-to-end system consisting of three major engines: Transcription, Translation and Distillation. Multilingual speech or text from a variety of genres is taken as input and English text is given as output, with information of interest presented in an integrated and consolidated fashion to the end user. GALE's goals require a quantum leap in the performance of human language technology, while also demanding solutions that are more intelligent, more robust, more adaptable, more efficient and more integrated. LDC has responded to this challenge with a comprehensive approach to linguistic resource development designed to support GALE's research and evaluation needs and to provide lasting resources for the larger Human Language Technology community.

pdf bib abs
Corpus Support for Machine Translation at LDC
Xiaoyi Ma | Christopher Cieri
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This paper describes LDC's efforts in collecting, creating and processing different types of linguistic data, including lexicons, parallel text, multiple translation corpora, and human assessment of translation quality, to support research and development in Machine Translation. Through a combination of different procedures and core technologies, LDC was able to create very large, high-quality, and cost-efficient corpora, which have contributed significantly to recent advances in Machine Translation. Multiple translation corpora and human assessment together facilitate, validate and improve automatic evaluation metrics, which are vital to the development of MT systems. The Bilingual Internet Text Search (BITS) and Champollion sentence aligner enable the finding and processing of large quantities of parallel text. All specifications and tools used by LDC and described in the paper are or will be available to the general public.

pdf bib abs
Low-cost Customized Speech Corpus Creation for Speech Technology Applications
Kazuaki Maeda | Christopher Cieri | Kevin Walker
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Speech technology applications, such as speech recognition, speech synthesis, and speech dialog systems, often require corpora based on highly customized specifications. Existing corpora available to the community, such as TIMIT and other corpora distributed by LDC and ELDA, do not always meet the requirements of such applications. In such cases, developers need to create their own corpora. The creation of a highly customized speech corpus, however, can be a very expensive and time-consuming task, especially for small organizations. It requires multidisciplinary expertise in linguistics, management and engineering, as it involves subtasks such as corpus design, human subject recruitment, recording, quality assurance, and in some cases, segmentation, transcription and annotation. This paper describes LDC's recent involvement in the creation of a low-cost yet highly customized speech corpus for a commercial organization under a novel data creation and licensing model, which benefits both the particular data requester and the general linguistic data user community.