Email Preservation

Email preservation is one of many new initiatives I’ve taken on at Brown. As often happens, this work was set in motion by necessity. Last year Brown took in the collection of a well-known Second Amendment activist who had recently passed away. The collection included his blog and his email.

Although email has been around for a while, preservation strategies at libraries and archives are still emerging. Some archives simply print out emails and save them as they would any other paper record. This is better than no strategy at all, but it does not honor the rich complexity of email’s original digital form.

Most email clients, as well as export services such as Google Takeout, can export mailboxes as MBOX files. MBOX files are stable for preservation purposes in that they are essentially (very) large blocks of text. They are not, however, well suited to processing or access (see Figure 1).

(Figure 1: An encoded email attachment in an MBOX file)
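
To make the point about processing concrete, here is a minimal sketch of reading an MBOX file with Python's standard-library mailbox module rather than scrolling through the raw text. It is not part of any tool mentioned in this post; the function name summarize_mbox and the choice of fields are my own assumptions for illustration.

```python
import mailbox


def summarize_mbox(path):
    """Return (sender, subject, attachment_count) for each message in an MBOX file.

    Hypothetical helper for illustration only.
    """
    summaries = []
    for msg in mailbox.mbox(path):
        # Walk every MIME part and count the ones flagged as attachments.
        attachments = [
            part for part in msg.walk()
            if part.get_content_disposition() == "attachment"
        ]
        summaries.append(
            (str(msg.get("From", "")), str(msg.get("Subject", "")), len(attachments))
        )
    return summaries
```

Even a summary this small shows why the raw file is awkward to work with: the encoded attachment in Figure 1 only becomes a countable, nameable object once something parses the MIME structure around it.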

A few years ago I worked on an email preservation project that used the National Archives of Australia’s Xena to parse MBOX files into XML. One MBOX file can contain multiple messages, and Xena parsed those messages into individual XML files that were readable in a web browser. This is a step in the right direction, but the text still isn’t very searchable or discoverable.
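
The general idea of splitting one MBOX file into one XML file per message can be sketched with the Python standard library. This is not Xena's code or its XML schema; the element names, the file naming pattern, and the function mbox_to_xml_files are all assumptions made for illustration.

```python
import mailbox
import xml.etree.ElementTree as ET
from pathlib import Path


def mbox_to_xml_files(mbox_path, out_dir):
    """Write one simple XML file per message; return the list of paths written.

    Hypothetical sketch: the <message> layout here is invented, not Xena's.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for index, msg in enumerate(mailbox.mbox(mbox_path), start=1):
        root = ET.Element("message")
        for header in ("From", "To", "Date", "Subject"):
            ET.SubElement(root, header.lower()).text = str(msg.get(header, ""))

        # Use the first plain-text part that is not an attachment as the body.
        body = ""
        for part in msg.walk():
            if (part.get_content_type() == "text/plain"
                    and part.get_content_disposition() != "attachment"):
                payload = part.get_payload(decode=True) or b""
                charset = part.get_content_charset() or "utf-8"
                try:
                    body = payload.decode(charset, errors="replace")
                except LookupError:
                    # Unknown charset label: fall back to UTF-8.
                    body = payload.decode("utf-8", errors="replace")
                break
        ET.SubElement(root, "body").text = body

        target = out / f"message-{index:05d}.xml"
        ET.ElementTree(root).write(str(target), encoding="utf-8", xml_declaration=True)
        written.append(target)
    return written
```

A sketch like this also shows the limitation noted above: each message becomes readable on its own, but nothing connects the files to one another or makes them searchable as a collection.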

By the time this particular email collection came into Brown’s possession, Stanford’s ePADD project was new on the scene. ePADD uses natural language processing to build a searchable index of topics, correspondents, and places to aid in processing and access. The name stands for its four essential functions: Processing, Appraisal, Discovery, and Delivery. The first two modules allow archivists to sift through email by correspondent, attachment, topic, etc. to redact or restrict messages as needed. Once the collection has been processed, archivists can either deliver the collection to researchers on a local machine or make the collection discoverable online.
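
To give a rough sense of what browsing by correspondent involves, here is a small sketch that groups message subjects by every address found in the From, To, and Cc headers. It is not how ePADD works internally and involves no natural language processing; it reads headers only, and the function name index_by_correspondent is an assumption for illustration.

```python
import mailbox
from collections import defaultdict
from email.utils import getaddresses


def index_by_correspondent(mbox_path):
    """Map each email address to the subjects of the messages it appears on.

    Hypothetical helper: a header-only stand-in for a real correspondent index.
    """
    index = defaultdict(list)
    for msg in mailbox.mbox(mbox_path):
        headers = []
        for name in ("From", "To", "Cc"):
            headers.extend(str(value) for value in msg.get_all(name, []))
        # Deduplicate so a message is listed once per address.
        addresses = {addr.lower() for _, addr in getaddresses(headers) if addr}
        for addr in addresses:
            index[addr].append(str(msg.get("Subject", "")))
    return dict(index)
```

An index of this shape is the kind of structure that lets an archivist pull up every message involving a given person and then decide, as a group, whether those messages should be restricted.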

Our foray into email preservation is another great example of the essential partnership between my office in Digital Technologies and staff in Archives and Special Collections. I can make heads or tails of the tool’s features, but I rely on my colleagues to guide an implementation that is in keeping with their principles and policies. In this case, I’ve been instructed that this collection’s scope does not include personal correspondence between family members. I can use the appraisal and processing modules to sort through emails and restrict messages based on this directive.