Brown University
Center for Digital Scholarship

The Computers in the Humanities Users Group and the Brown University Library present:
Infrastructure for Digital Humanities: Challenges for Computational Linguistics
in Mining Million Book Collections

David Smith
Department of Computer Science
University of Massachusetts, Amherst

2:00 PM Tuesday, March 15
Bopp Room, John Hay Library

Concerted scanning projects are making significant amounts of data —
historical data in particular — increasingly available to readers and
researchers in many disciplines. To make this data useful, researchers
at UMass Amherst are working on improving OCR, language modeling,
multiple-version alignment, syntactic analysis, information
extraction, and information retrieval. I will focus in particular on
inferring the relational structure latent in books: which books or
passages quote, translate, paraphrase, and cite each other? This
research requires improvements in modeling translation and other forms
of similarity, as well as improvements in efficiently comparing large
numbers of passages.

David Smith is a Research Assistant Professor in the Computer Science
Department at the University of Massachusetts, Amherst, where he is
affiliated with the Center for Intelligent Information Retrieval. He
holds a Ph.D. in computer science from Johns Hopkins and an A.B. in
classics from Harvard.