Michael Maxwell


2022

You’ve translated it, now what?
Michael Maxwell | Shabnam Tafreshi | Aquia Richburg | Balaji Kodali | Kymani Brown
Proceedings of the 15th Biennial Conference of the Association for Machine Translation in the Americas (Volume 2: Users and Providers Track and Government Track)

Humans use document formatting to discover document and section titles, and important phrases. But when machines process a paper (especially documents OCRed from images), these cues are often invisible to downstream processes: words in footnotes or body text are treated as just as important as words in titles. It would be better for indexing and summarization tools to be guided by implicit document structure. In an ODNI-sponsored project, ARLIS looked at discovering formatting in OCRed text as a way to infer document structure. Most OCR engines output results as hOCR (an XML format), giving bounding boxes around characters. In theory, this also provides style information such as bolding and italicization, but in practice, this capability is limited. For example, the Tesseract OCR tool provides bounding boxes, but does not attempt to detect bold text (relevant to author emphasis and to specialized fields in, e.g., print dictionaries), and its discrimination of italicization is poor. Our project inferred font size from hOCR bounding boxes, and using that and other cues (e.g. the fact that titles tend to be short) determined which text constituted section titles; from this, a document outline can be created. We also experimented with algorithms for detecting bold text. Our best algorithm has much improved recall and precision, although the exact numbers are font-dependent. The next step is to incorporate inferred structure into the output of machine translation. One way is to embed XML tags for inferred structure into the text extracted from the imaged document, and to either pass the strings enclosed by XML tags to the MT engine individually, or pass the tags through the MT engine without modification. This structural information can guide downstream bulk processing tasks such as summarization and search, and also enables building tables of contents for human users examining individual documents.
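
The idea of inferring titles from bounding boxes can be illustrated with a minimal sketch. This is not the project's algorithm, which the abstract does not specify. It assumes each OCR line comes with an hOCR `title` attribute containing `bbox x0 y0 x1 y1`, uses line height as a rough proxy for font size, and flags lines that are both taller than the median and short. The function names and the thresholds `size_ratio` and `max_words` are hypothetical choices made for illustration.

```python
import re
from statistics import median

# hOCR stores geometry in the title attribute, e.g. "bbox 100 100 400 140; baseline 0 -5".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def parse_bbox(title):
    """Extract (x0, y0, x1, y1) from an hOCR title attribute."""
    m = BBOX_RE.search(title)
    if m is None:
        raise ValueError(f"no bbox in {title!r}")
    return tuple(int(g) for g in m.groups())


def likely_titles(lines, size_ratio=1.3, max_words=10):
    """Return the texts of lines that look like section titles.

    lines: list of (text, hOCR title attribute) pairs, one per OCR line.
    A line qualifies if its bounding box is at least size_ratio times
    the median line height (assumed proxy for font size) and it has
    at most max_words words (titles tend to be short).
    """
    heights = []
    for _, title in lines:
        _, y0, _, y1 = parse_bbox(title)
        heights.append(y1 - y0)
    body_height = median(heights)
    return [
        text
        for (text, _), height in zip(lines, heights)
        if height >= size_ratio * body_height and len(text.split()) <= max_words
    ]
```

Line height is only an approximation of font size, since it varies with the ascenders and descenders present on a given line; a real system would likely need additional cues.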

2017

STREAMLInED Challenges: Aligning Research Interests with Shared Tasks
Gina-Anne Levow | Emily M. Bender | Patrick Littell | Kristen Howell | Shobhana Chelliah | Joshua Crowgey | Dan Garrette | Jeff Good | Sharon Hargus | David Inman | Michael Maxwell | Michael Tjalve | Fei Xia
Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages

Endangered Data for Endangered Languages: Digitizing Print dictionaries
Michael Maxwell | Aric Bills
Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages

2016

Did You Mean...? and Dictionary Repair: from Science to Engineering
Michael Maxwell | Petra Bradley
Conferences of the Association for Machine Translation in the Americas: MT Users' Track

2015

Accounting for Allomorphy in Finite-state Transducers
Michael Maxwell
Proceedings of the 12th International Conference on Finite-State Methods and Natural Language Processing 2015 (FSMNLP 2015 Düsseldorf)

2008

Lexicon Schemas and Related Data Models: when Standards Meet Users
Thorsten Trippel | Michael Maxwell | Greville Corbett | Cambell Prince | Christopher Manning | Stephen Grimes | Steve Moran
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Lexicon schemas and their use are discussed in this paper from the perspective of lexicographers and field linguists. A variety of lexicon schemas have been developed, with goals ranging from computational lexicography (DATR) through archiving (LIFT, TEI) to standardization (LMF, FSR). A number of requirements for lexicon schemas are given. The lexicon schemas are introduced and compared to each other in terms of conversion and usability for this particular user group, using a common lexicon entry and providing examples for each schema under consideration. The formats are assessed and the final recommendation is given for the potential users, namely to request standard compliance from the developers of the tools used. This paper should foster a discussion between authors of standards, lexicographers and field linguists.

Joint Grammar Development by Linguists and Computer Scientists
Michael Maxwell | Anne David
Proceedings of the IJCNLP-08 Workshop on NLP for Less Privileged Languages

2004

Morphological Interfaces to Dictionaries
Michael Maxwell | William Poser
Proceedings of the Workshop on Enhancing and Using Electronic Dictionaries

2000

Book Reviews: A Grammar Writer’s Cookbook
Michael Maxwell
Computational Linguistics, Volume 26, Number 2, June 2000

1994

Parsing Using Linearly Ordered Phonological Rules
Michael Maxwell
Computational Phonology

1991

Phonological Analysis and Opaque Rule Orders
Michael Maxwell
Proceedings of the Second International Workshop on Parsing Technologies

General morphological/phonological analysis using ordered phonological rules has appeared to be computationally expensive, because ambiguities in feature values arising when phonological rules are “un-applied” multiply with additional rules. But in fact those ambiguities can be largely ignored until lexical lookup, since the underlying values of altered features are needed only in the case of rare opaque rule orderings, and not always then.
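
As a toy illustration of deferring ambiguity to lexical lookup (a sketch, not the paper's parser), consider a single hypothetical rule of word-final devoicing. Instead of enumerating every underlying form that "un-applying" the rule could yield, the code below leaves the voicing of the final segment unresolved and lets the lexicon decide. The rule, the `VOICED` mapping, and the function names are assumptions made for this example.

```python
# Hypothetical toy rule: word-final obstruents are devoiced (d -> t, b -> p, g -> k).
# Maps each voiceless surface segment to its possible voiced underlying source.
VOICED = {"t": "d", "p": "b", "k": "g"}


def matches(surface, underlying):
    """True if `underlying` could surface as `surface` under final devoicing."""
    if not surface or len(surface) != len(underlying):
        return False
    # Everything before the final segment is unaffected by the rule.
    if surface[:-1] != underlying[:-1]:
        return False
    s, u = surface[-1], underlying[-1]
    # The final segment's voicing is left unresolved: accept either value.
    return u == s or VOICED.get(s) == u


def lookup(surface, lexicon):
    """Return the lexical entries compatible with the surface form."""
    return [entry for entry in lexicon if matches(surface, entry)]
```

For example, `lookup("rat", ["rad", "rat", "rot", "bad"])` keeps both `"rad"` and `"rat"` as candidates without ever generating a list of un-applied forms; with several interacting rules, this avoids multiplying candidates before the lexicon is consulted.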