Tony McEnery


2014

Sublanguage Corpus Analysis Toolkit: A tool for assessing the representativeness and sublanguage characteristics of corpora
Irina Temnikova | William A. Baumgartner Jr. | Negacy D. Hailu | Ivelina Nikolova | Tony McEnery | Adam Kilgarriff | Galia Angelova | K. Bretonnel Cohen
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14)

Sublanguages are varieties of language that form “subsets” of the general language, typically exhibiting particular types of lexical, semantic, and other restrictions and deviance. SubCAT, the Sublanguage Corpus Analysis Toolkit, assesses the representativeness and closure properties of corpora to analyze the extent to which they are either sublanguages or representative samples of the general language. The current version of SubCAT contains scripts and applications for assessing lexical closure, morphological closure, sentence type closure, over-represented words, and syntactic deviance. Its operation is illustrated with three case studies concerning scientific journal articles, patents, and clinical records. Materials from two language families are analyzed: English (Germanic) and Bulgarian (Slavic). The software is available at sublanguage.sourceforge.net under a liberal Open Source license.

2004

Evaluating Lexical Resources for a Semantic Tagger
Scott S. L. Piao | Paul Rayson | Dawn Archer | Tony McEnery
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

Semantic lexical resources play an important part in both linguistic study and natural language engineering. In Lancaster, a large semantic lexical resource has been built over the past 14 years, which provides a knowledge base for the USAS semantic tagger. Capturing semantic lexicological theory and empirical lexical usage information extracted from corpora, the Lancaster semantic lexicon provides a valuable resource for the corpus research and NLP community. In this paper, we evaluate the lexical coverage of the semantic lexicon both in terms of genres and time periods. We conducted the evaluation on test corpora including the BNC sampler, the METER Corpus of law/court journalism reports, and some corpora of Newsbooks, prose and fictional works published between the 17th and 19th centuries. In the evaluation, the semantic lexicon achieved a lexical coverage of 98.49% on the BNC sampler, 95.38% on the METER Corpus and 92.76% to 97.29% on the historical data. Our evaluation reveals that the Lancaster semantic lexicon has remarkably high lexical coverage of the modern English lexicon, but needs expansion with domain-specific terms and historical words. Our evaluation also shows that, in order to make claims about the lexical coverage of annotation systems as well as to render them ‘future proof’, we need to evaluate their potential both synchronically and diachronically across genres.

2003

Extracting Multiword Expressions with A Semantic Tagger
Scott S. L. Piao | Paul Rayson | Dawn Archer | Andrew Wilson | Tony McEnery
Proceedings of the ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment

2002

A Unicode-based Environment for Creation and Use of Language Resources
Valentin Tablan | Cristian Ursu | Kalina Bontcheva | Hamish Cunningham | Diana Maynard | Oana Hamza | Tony McEnery | Paul Baker | Mark Leisher
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

EMILLE, A 67-Million Word Corpus of Indic Languages: Data Collection, Mark-up and Harmonisation
Paul Baker | Andrew Hardie | Tony McEnery | Hamish Cunningham | Rob Gaizauskas
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

Ethical and legal issues in corpus construction
Tony McEnery
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

2000

Corpus Resources and Minority Language Engineering
Tony McEnery | Paul Baker | Lou Burnard
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)

1997

Corpus annotation and reference resolution
Tony McEnery | Izumi Tanaka | Simon Botley
Operational Factors in Practical, Robust Anaphora Resolution for Unrestricted Texts