
Email Preservation

Email Preservation is one of many new initiatives I’ve taken on at Brown. As often happens, this work was set in motion by necessity. Last year Brown took in the collection of a well-known 2nd Amendment activist who had recently passed away. The collection included his blog and his email.

Although email has been around for a while, preservation strategies at libraries and archives are still emerging. Some archives simply print out emails and save them as they would any other paper record. This is a better strategy than no strategy at all, but it does not really honor the rich complexity of email’s original digital form.

Most email clients, as well as export services such as Google Takeout, will export mailboxes as MBOX files. MBOX files are stable for preservation purposes in that they are essentially (very) large blocks of text. They are, however, not good for processing or access purposes (see Figure 1).

(Figure 1: An encoded email attachment in an MBOX file)

A few years ago I worked on an email preservation project that used the National Archives of Australia’s Xena to parse MBOX files into XML. One MBOX file can contain multiple messages, and Xena parsed those messages into individual XML files that were readable in a web browser. This was a step in the right direction, but the text still wasn’t very searchable or discoverable.
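To make the idea of splitting one MBOX file into individual messages more concrete, here is a minimal sketch using Python's standard-library mailbox module. It is not how Xena or ePADD work internally. The function names (build_sample_mbox, summarize_mbox, demo) and the example addresses are invented for illustration, and the sketch assumes that any message part carrying a filename is an attachment.

```python
import mailbox
import tempfile
from email.message import EmailMessage
from pathlib import Path


def build_sample_mbox(path):
    """Write a tiny two-message MBOX file (illustrative data only)."""
    box = mailbox.mbox(str(path))
    try:
        first = EmailMessage()
        first["From"] = "alice@example.org"
        first["To"] = "bob@example.org"
        first["Subject"] = "Meeting notes"
        first.set_content("See attached.")
        first.add_attachment(b"hello", maintype="application",
                             subtype="octet-stream", filename="notes.bin")
        box.add(first)

        second = EmailMessage()
        second["From"] = "bob@example.org"
        second["To"] = "alice@example.org"
        second["Subject"] = "Re: Meeting notes"
        second.set_content("Thanks!")
        box.add(second)
        box.flush()
    finally:
        box.close()


def summarize_mbox(path):
    """Return one dict per message: sender, subject, attachment filenames."""
    box = mailbox.mbox(str(path))
    try:
        rows = []
        for msg in box:
            # Assumption: a part with a filename is treated as an attachment.
            names = [p.get_filename() for p in msg.walk() if p.get_filename()]
            rows.append({"from": msg["From"],
                         "subject": msg["Subject"],
                         "attachments": names})
        return rows
    finally:
        box.close()


def demo():
    """Build the sample file in a temporary folder and summarize it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "sample.mbox"
        build_sample_mbox(path)
        return summarize_mbox(path)
```

A listing like this is only a first step toward access. It tells you what is in the file, but it does nothing about search, topics, or restrictions.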

By the time this particular email collection came into Brown’s possession, Stanford’s ePADD project was new on the scene. ePADD uses Natural Language Processing to build a searchable index of topics, correspondents, and places to aid in processing and access. The name stands for its four essential functions: Processing, Appraisal, Discovery, and Delivery. The first two modules allow archivists to sift through email by correspondent, attachment, topic, etc. to redact or restrict messages as needed. Once the collection has been processed, archivists can either deliver the collection to researchers on a local machine or make the collection discoverable online.

Our foray into email preservation is another great example of the essential partnership between my office in Digital Technologies and staff in Archives and Special Collections. I make heads or tails of the tool’s features, but I rely on my colleagues to guide an implementation that is in keeping with their principles and policies. In this case, I’ve been instructed that this collection’s scope does not include personal correspondence between family members. I can use the appraisal and processing modules to sort through emails and restrict messages based on this directive.
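In ePADD this kind of restriction is done through the tool's own interface. Purely to illustrate the logic of the directive, here is a small sketch in Python that flags a message when everyone on it, other than the collection's creator, is a known family member. The addresses, the FAMILY_ADDRESSES set, and the function names are all hypothetical, and a real appraisal would still need an archivist's judgment.

```python
from email.message import EmailMessage
from email.utils import getaddresses

# Hypothetical addresses, used only for this example.
OWNER_ADDRESS = "owner@example.org"
FAMILY_ADDRESSES = {"parent@example.org", "sibling@example.org"}


def participants(msg):
    """Collect every address in the From, To, and Cc headers, lowercased."""
    headers = []
    for name in ("From", "To", "Cc"):
        headers.extend(str(value) for value in msg.get_all(name, []))
    return {addr.lower() for _, addr in getaddresses(headers) if addr}


def is_family_correspondence(msg, owner=OWNER_ADDRESS, family=FAMILY_ADDRESSES):
    """True when everyone other than the owner is a listed family member."""
    others = participants(msg) - {owner}
    return bool(others) and others <= family


def make_message(sender, recipients):
    """Build a bare message for the example."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipients
    msg["Subject"] = "Example"
    msg.set_content("Example body")
    return msg


family_only = make_message(OWNER_ADDRESS, "Sibling <sibling@example.org>")
mixed = make_message(OWNER_ADDRESS, "sibling@example.org, reporter@example.com")
```

Here family_only would be flagged for restriction, while mixed would not, because one recipient falls outside the family list.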

In Progress: The Mark Baumer Digital Collection

This is a guest post by Brown University Library’s Web Archiving Intern, Christina Cahoon. Christina is currently finishing her Master of Library and Information Science degree at the University of Rhode Island.

After the recent passing of Brown University alumnus and Library staff member Mark Baumer MFA ’11, the Brown University Library tasked itself with preserving his prolific web presence. I’m working towards that goal with Digital Preservation Librarian Kevin Powell. Baumer was a poet and environmental activist who worked within the Digital Technologies Department as Web Content Specialist. This past October, Baumer began his Barefoot Across America campaign, with plans to walk barefoot from Rhode Island to California in an effort to raise money for environmental preservation and to support the FANG Collective. Tragically, this journey was cut short on January 21, 2017, when Baumer was struck by a vehicle and killed while walking along a highway in Florida.

Baumer was an avid social media user who posted on several platforms multiple times a day.  As such, the task of recording and archiving Baumer’s web presence is quite large and not free from complications.  Currently, we are using Archive-It to crawl Baumer’s social media accounts and news sites containing coverage of Baumer’s campaign, including notices of his passing. While Archive-It does a fairly decent job recording news sites, it encounters various issues when attempting to capture social media content, including content embedded in news articles.  As you can imagine, this is causing difficulties capturing the bulk of Baumer’s presence on the web.

Archive-It’s help center has multiple suggestions for capturing social media sites, and these proved useful for Baumer’s Twitter feed. For other social media sites like YouTube, Instagram, and Medium, however, the suggestions have either not been helpful or do not exist. The issues we have faced in crawling these websites range from capturing far too much information, as is the case with YouTube, where our tests captured every referred video file from every video in the playlist, to capturing only the first few pages of dynamically loading content, as is the case with Instagram and Medium. We are reconfiguring our approach to YouTube after viewing Archive-It’s recently held Archiving Video webinar, but unfortunately the software does not have solutions for Instagram and Medium at this time.

These issues have caused us to re-evaluate our options for best methods to capture Baumer’s work.  We have tested how WebRecorder works in capturing sites like Flickr and Instagram and we are still encountering problems where images and videos are not being captured.  It seems as though there will not be one solution to our problem and we will have to use multiple services to sufficiently capture all of Baumer’s social media accounts.

The problems encountered in this instance are not rare in the field of digital preservation.  Ultimately, we must continue testing different preservation methods in order to find what works best in this situation.  It is likely we will need to use multiple services in order to capture everything necessary to build this collection.  As for now, the task remains of discovering the best methods to properly capture Baumer’s work.

What I Learned in Milwaukee

From Monday to Wednesday I had the great privilege of attending the annual Digital Library Federation Forum. DLF is the brand new host of the National Digital Stewardship Alliance, which convened its first annual meeting under new leadership immediately following the Forum. Four days, two conferences, and one election later, I’d like to share some reflections about my work and the work of others in my field. A bit of a disclaimer: between this year’s election results and two powerful keynote addresses, the week was emotionally charged for me and many of my colleagues. We continually came back to the idea of care and inclusivity in our profession and the danger of idealizing neutrality. I’d like to clearly state that any opinions I express in this post, no matter how obliquely, do not necessarily reflect the official stance or policies of Brown University or the Library.

The DLF Forum and NDSA meeting certainly saw their share of tool and workflow chatter. Two librarians from the University of Miami spoke on creating rights statements for 52,000 objects in their CONTENTdm repository. It was not a full-time project for either of them, so their presentation doubled as a master class in project management. They diligently built a matrix for assessing each object, then used that tool over the course of a year to assign statements. Project management came up over and over. Two librarians, one from the University of Iowa and another from Emerson College, detailed their experiences in digital initiatives and shared their own PM techniques. They underscored the importance of relationships in the workplace, especially when managing and encouraging the work of coworkers they did not supervise.

It was surprising to me how many people were willing to get vulnerable about their work without openly eliciting pity. The most affecting presentation in that regard came from an archivist who attempted to accession the email archive of a defunct non-profit organization. Although an authority figure from that organization encouraged the accession, former employees were troubled by the idea. Apparently a select number of employees used a secret listserv to correspond privately about confidential matters. Employees had also used their professional email for delicate personal matters. The presenter tried to adjust the scope of the accession, but ultimately abandoned the project. We often learn about our colleagues’ most notable successes at conferences, but hearing a story of failure was empowering and educational. Hindsight is always 20/20, and her willingness to share some of those lessons meant a lot.

Vulnerability, in a lot of ways, feels like the antithesis of professionalism. We’re supposed to stay neutral and focus on the work, which should, itself, be neutral. But as the two keynotes I saw so clearly outlined, we are affected by our work and our work affects others. On Monday morning, Stacie Williams’s DLF keynote outlined how work and care are seen as separate acts, when often they are inextricably bound. Preserving information and delivering it to those who seek its edification, she argued, is an act of care. It’s impossible to talk about inclusion and diversity in our workforce or collections without recognizing the care involved with that work. Williams’s talk was in direct dialogue with the NDSA keynote by Bergis Jules, who spoke on Wednesday.

Jules’s words came the afternoon after the U.S. election, which had visibly affected the crowd. He spoke on care in libraries and archives and insisted that historical erasure is an act of violence. Jules played an interview with Reina Gossett, a Black trans artist and activist, who spoke on the historical isolation she felt from other trans women of color. This isolation led her to the archives and motivated her to make the movie “Happy Birthday Marsha!” about trans pioneer Marsha P. Johnson. Jules drew a direct line from Gossett’s historical isolation to the epidemic of Black trans murders in 2016.

“We have to ask ourselves, what do we owe these victims and the trans community, as fellow humans, as archivists, as culture keepers, and as people who’ve charged ourselves with deciding who gets remembered and who doesn’t? What do we owe communities that are constantly victimized because of erasure and by erasure?”

Saving these legacies, Jules said, is made complicated by prioritizing standards and technology over human relationships. Although standards are important, they can lead to elitism and exclusion.

“The more selective and specialized space of digital collections, prioritizes professionalism, technical expertise, and standards, over a critical interrogation of the cultural character of our records. So this is certainly an appropriate venue to ask questions about the diversity represented in our historical records. Because for digital collections, who gets represented is closely tied to who writes the software, who builds the tools, who produces the technical standards, and who provides the funding or other resources for that work.”

Our profession tells itself we should remain neutral and that #AllLivesMatter, but without active collecting of marginalized communities, how can we ensure that collecting around a white, straight, cismale paradigm won’t persist?

Many of the people gathered there (myself included) were already feeling dread that the election outcome made vulnerable populations more vulnerable, and so Jules’s words were especially profound. After his talk, a librarian stood up and expressed concern that his brand new green card, his brand new husband, and the cultural heritage job he loves so much would all be taken away from him. It was a deeply affecting moment, and it was heartening to see the care Williams and Jules spoke about shown to him inside and outside of the ballroom.

So what now? I’m inspired by Samantha Abrams’s work at the University of Wisconsin and the emerging Doc Now project to rethink my role at Brown and in the broader community. Now that we’re done talking, it’s time to get to work.

PASIG 2016 – Goodbye, Vine

So, Vine announced its shutdown while I was attending a digital preservation conference.

The Preservation and Archiving Special Interest Group meeting covered a lot of topics, with the first two days leaning towards technical work. We learned about data preservation initiatives, the infrastructure of the Smithsonian’s very own enterprise DAMS (!), and how MoMA is looking to digitize its film collection which, in total at 4K, would equal 80 petabytes (!!!). Right before lunch on the second day, the news of Vine’s demise circulated all over the #pasignyc hashtag. I participated in a conversation on curation (“Do we need to save *all* the Vines? How do we choose which ones? Is ‘virality’ an elitist metric?”). Someone even estimated the size of a complete Vine archive. At the same time, non-archivists I pay attention to were reeling from the news. Sam Sanders, a political reporter at NPR, wrote in a series of tweets that Black users of Vine innovated and dominated the medium, and did so without suffering much of the abuse seen on other platforms like Twitter. Towards the end of this series, he said:

This ended up being completely relevant to the meeting’s third day. Archivists talked about working to include marginalized voices in digital preservation initiatives. One presenter talked about how “scrubbing” language of its diacritics ruined the context of a collection and showed a Western bias. Another discussed transnational partnerships between Western and non-Western institutions then recommended strategies for mindful and respectful collaborations.

I couldn’t help but think about Sanders’ tweet during these presentations. We as librarians and archivists are the experts in preservation, yes, but it’s important to remember that humans use computers, not the other way around (even if it feels that way sometimes). Mindful stewardship isn’t simply committing to long-term preservation of objects.

A Note on Policy

My biggest priority as Brown’s first Digital Preservation Librarian is the implementation of a Digital Curation Policy Framework. The workflows and tools I wrote about last week are certainly very important, but without a policy framework, we as a Library will continue to ask the same questions over and over. Without a framework, projects seem ad hoc, and each one feels like reinventing the wheel. During the past year and a half, I’ve asked (and been asked) the same questions over and over again: “what are our access priorities?”, “what level of preservation are we committed to?”, “what are the standards we strive to maintain?”, etc. A policy framework asks those questions ahead of time and supplies a ruler for assessing the viability or progress of a project.

I assumed framework implementation would need a seven-person committee from the very beginning; we would workshop an initial draft as a group and pass it around the Library for feedback. Thankfully, some colleagues of mine also recognized the need for policy and suggested a different path. Rather than assemble a large committee as step 1, three of us put together the initial draft using the DPM Workshop’s Model Document as a guide. This way we could get something in black and white that people could comment on and revise. We could avoid having too many cooks in the kitchen as we put something together from scratch.

I’m glad we built the initial draft this way. Once we had something on paper, the three of us went through the document section by section and got a clearer view of its breadth. We noted specific decisions outside the purview of our individual jobs, and we listed a series of specific questions that will be useful conversation starters as we pass the draft around. We’re now in the very early stages of soliciting external feedback and plotting a path forward for further revisions, but I’m hopeful that, once implemented, the policy will live as an elastic document that bolsters and informs decision-making across the Library and University.

Workflows and Tools

Digital preservation is simultaneously a new and old topic. So many libraries and archives are only now dipping their toes into these complicated waters, even though the long-term preservation of our born-digital and digitized holdings has been a concern for a while now. I think it is often forgotten that trustworthy standard-bearers, like the Digital Preservation Management Workshop and The Open Archival Information System (OAIS) Model, have been around for over a decade. The OAIS Reference Model in particular is a great resource, but it can be intimidating. Full implementation requires a specific set of resources, which not all institutions have. In this way, comparing one’s own program to another which is further along in an attempt to emulate their progress is often a frustrating endeavor.

I’ve witnessed this disparity most notably at conferences. Conferences, unconferences, and colloquia can be really helpful in that people are (thankfully) very open with their workflows and documentation. It’s one of my favorite things about working in a library; there aren’t trade secrets, and there isn’t an attitude of competition. We celebrate each other’s successes and want to help one another. With that said, some of the conversations at these events are diluted with tool comparison and institution-specific jargon. The disparity of resources can make these conversations frustrating. How can I compare our fledgling web archiving initiative with other institutions that have entire web archiving teams? Brown has a robust and well-supported Fedora repository, but what about institutions that are in the early stages of implementing a system like that? How do we share and develop ideas about what our tools should be doing if our conversations center around the tools themselves?

For our digital accession workflow, I’ve taken a different approach than what came naturally at first. I initially planned our workflow around the implementation of Artefactual’s Archivematica, but I could never get a test instance installed adequately. This, of course, did not stop the flow of digitized and born-digital material in need of processing. I realized I was trying to plan around the tool when I wasn’t even sure what I needed the tool to do. Technology will inevitably change, and unless we have a basis for why a tool was implemented, it will be very difficult to navigate that change.

(Figure: Born-digital accessioning workflow diagram)

For this reason, I’ve been working on a high-level born-digital accessioning workflow where I can insert or take out tools as needed (see above). This workflow outlines the basic procedures of stabilizing, documenting, and packaging content for long-term storage. It has also been a good point of discussion among both internal and external colleagues. For example, after I shared this diagram on Twitter, someone suggested creating an inventory before running a virus scan. When I talked about this in our daily stand-up meeting, one of the Library’s developers mentioned that compressed folders may in fact strengthen that argument. Unless both the inventory and the virus scan account for items within a compressed folder, there is actually a risk that the scan might miss something. This is one example of the kind of conversation I’d like to be having. It’s great to know which tools are available, but focusing strictly on tool implementation keeps us from asking some hard questions.
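The point about compressed folders can be sketched in a few lines of Python. This is only an illustration of the idea, not the Library's actual workflow or any particular tool: it walks a folder, records a SHA-256 checksum for every file, and also lists the files inside any ZIP archive so they appear in the inventory. The function names, the "!/" separator for members of an archive, and the sample files are all assumptions made for the example. Nested archives and formats other than ZIP are not handled, and ZIP-based formats such as .docx would be expanded too.

```python
import hashlib
import tempfile
import zipfile
from pathlib import Path


def sha256_of(stream):
    """Checksum a binary stream in chunks so large files stay out of memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(65536), b""):
        digest.update(chunk)
    return digest.hexdigest()


def build_inventory(root):
    """Return (relative path, checksum) pairs, including ZIP members."""
    root = Path(root)
    rows = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root).as_posix()
        with path.open("rb") as handle:
            rows.append((rel, sha256_of(handle)))
        # Look one level inside ZIP archives so their contents are listed too.
        if zipfile.is_zipfile(path):
            with zipfile.ZipFile(path) as archive:
                for info in archive.infolist():
                    if info.is_dir():
                        continue
                    with archive.open(info) as member:
                        rows.append((f"{rel}!/{info.filename}", sha256_of(member)))
    return rows


def demo():
    """Inventory a temporary folder holding one loose file and one ZIP."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "letter.txt").write_bytes(b"hello")
        with zipfile.ZipFile(root / "photos.zip", "w") as archive:
            archive.writestr("inside.txt", b"hello")
        return build_inventory(root)
```

With an inventory like this in hand before the virus scan, it becomes possible to check whether the scanner's report covers every listed item, including those inside the archive.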