The Computers in the Humanities Users Group and the Brown University Library present:
Infrastructure for Digital Humanities: Challenges for Computational Linguistics
in Mining Million Book Collections
David Smith
Department of Computer Science
University of Massachusetts, Amherst
2:00 PM Tuesday, March 15
Bopp Room, John Hay Library

Concerted scanning projects are making significant amounts of data —
historical data in particular — increasingly available to readers and
researchers in many disciplines. To make this data useful, researchers
at UMass Amherst are working on improving OCR, language modeling,
multiple-version alignment, syntactic analysis, information
extraction, and information retrieval. I will focus in particular on
inferring the relational structure latent in books: which books or
passages quote, translate, paraphrase, and cite each other? This
research requires improvements in modeling translation and other forms
of similarity, as well as improvements in efficiently comparing large
numbers of passages.

David Smith is a Research Assistant Professor in the Computer Science
Department at the University of Massachusetts, Amherst, where he is
affiliated with the Center for Intelligent Information Retrieval. He
holds a Ph.D. in computer science from Johns Hopkins and an A.B. in
classics from Harvard.