
Hydra Connect 2016

Last week I attended Hydra Connect 2016 in Boston, with a team of three others from the Brown University Library. Our team consisted of a Repository Developer, Discovery Systems Developer, Metadata Specialist, and Repository Manager. Here are some notes and thoughts related to the conference from my perspective as a repository programmer.

IPFS

There was a poster about IPFS, which is a peer-to-peer hypermedia protocol for creating a distributed web. It’s an interesting idea, and I’d like to look into it more.

APIs and Architecture

There was a lot of discussion about the architecture of Hydra, and Tom Cramer mentioned APIs specifically in his keynote address. In the Brown Digital Repository (BDR), we use a set of APIs that clients can access from any programming language. This architecture lets us define layers in the repository: the innermost layer is Fedora and Solr, the next layer is our set of APIs, and the outer layer is the Studio UI, image/book viewers, and custom websites built on BDR data. There is some overlap in our layers (e.g., the Studio UI does hit Solr directly instead of going through the APIs), but I still think it improves the architecture to think about these layers and try not to cross multiple boundaries. Besides supporting clients written in Python, Ruby, and PHP, this API layer may be useful when we migrate to Fedora 4: we can use our APIs to communicate with both Fedora 3 and Fedora 4, and any client that only hits the APIs wouldn't need to change to handle content in Fedora 4.
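A minimal sketch of the layered idea (the endpoint paths and base URL here are hypothetical, not the BDR's actual routes): clients build every request against the API layer, never against Fedora or Solr directly.

```python
from urllib.parse import urlencode


class BDRClient:
    """Thin client for a repository API layer (illustrative sketch)."""

    def __init__(self, base_url):
        self.base_url = base_url.rstrip("/")

    def item_url(self, pid):
        # URL for one object's metadata; the API layer decides whether
        # Fedora 3 or Fedora 4 actually holds the content
        return f"{self.base_url}/items/{pid}/"

    def search_url(self, query):
        # URL for a search; the API layer hides the fact that Solr answers it
        return f"{self.base_url}/search/?{urlencode({'q': query})}"
```

Because every client goes through URLs like these, swapping the storage layer behind the APIs would not require changes to client code.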

I would be interested in seeing a similar architecture in Hydra-land (note: this is an outsider’s perspective – I don’t currently work on CurationConcerns, Sufia, or other Hydra gems). A clear boundary between “business logic” or processing and the User Interface or data presentation seems like good architecture to me.

Data Modeling and PCDM

Monday was workshop day at Hydra Connect 2016, and I went to the Data Modeling workshop in the morning and the PCDM In-depth workshop in the afternoon. In the morning session, someone mentioned that we shouldn't have data modeling differences without good reason (i.e., does a book at one institution really have to be modeled differently from a book at another institution?). I think that's a good point: if we can model our data the same way, that would help with interoperability. PCDM, as a standard for how our data objects are modeled, might be a great way to promote interoperability between applications and institutions. In the BDR, we could start using PCDM vocabulary and modeling techniques, even while our data is in Fedora 3 and our code is written in Python. I also think it would be helpful to define and document what interoperability should look like between institutions, or between different applications at the same institution.
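As a hedged sketch of what PCDM-style modeling looks like (the object identifiers are invented for illustration): a book is a pcdm:Object, and its pages are member Objects linked with pcdm:hasMember, expressed here as plain triples.

```python
# The PCDM namespace, as published at pcdm.org
PCDM = "http://pcdm.org/models#"


def describe_book(book_id, page_ids):
    """Return triples modeling a book and its pages the PCDM way."""
    triples = [(book_id, "rdf:type", PCDM + "Object")]
    for page_id in page_ids:
        triples.append((page_id, "rdf:type", PCDM + "Object"))
        triples.append((book_id, PCDM + "hasMember", page_id))
    return triples
```

The point of sameness: if two institutions both emit triples shaped like this, a harvester can walk either repository's books without institution-specific code.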

Imitate IIIF?

It seems like the IIIF community has a good solution to image interoperability. The IIIF community has defined a set of APIs, and then it lists various clients and servers that implement those APIs. I wonder if the Hydra community would benefit from more of a focus on APIs and specifications, so that there could be various "Hydra-compliant" servers and clients. Of course, the Hydra community should continue to work on code as well, but a well-defined specification and API might improve the Hydra code and allow the development of other Hydra-compliant code (e.g., code in other programming languages, different UIs using the same API, …).
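To make the IIIF comparison concrete, the IIIF Image API fixes a URI template, `{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}`, so any compliant client can request images from any compliant server. A sketch (the base URL and identifier are example values, not a real server):

```python
def iiif_image_url(base, identifier, region="full", size="full",
                   rotation="0", quality="default", fmt="jpg"):
    """Build an image request following the IIIF Image API URI template."""
    return f"{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{fmt}"
```

A shared, testable contract like this is exactly what "Hydra-compliant" servers and clients would need.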

Researchers@Brown Ranks #1 in SEO Analysis

At the 2016 VIVO national conference in August, Anirvan Chatterjee from the University of California, San Francisco gave a presentation on Search Engine Optimization (SEO) — strategies for increasing a site's ranking in search results.  He analyzed 90 Research Networking Systems (RNS) to determine the proportion of faculty profile pages appearing among the top 3 search results on Google.  His analysis ranked Researchers@Brown (vivo.brown.edu) #1 out of the 90 sites tested.

Chatterjee’s talk was entitled “The SEO State of the Union 2016: 5 Data-Driven Steps to Make Your Research Networking Sites Discoverable by Real Users, Based on Real Google Results.”

The full report, “RNS SEO 2016: How 90 research networking sites perform on Google,” is available here: https://bitly.com/vivoseo

A Note on Policy

My biggest priority as Brown’s first Digital Preservation Librarian is the implementation of a Digital Curation Policy Framework. The workflows and tools I wrote about last week are certainly very important, but without a policy framework, we as a Library will continue to ask the same questions over and over. Without a framework, projects feel ad hoc, like we are repeatedly re-inventing the wheel. During the past year and a half, I’ve asked (and been asked) the same questions again and again: “what are our access priorities?”, “what level of preservation are we committed to?”, “what are the standards we strive to maintain?”, etc. A policy framework asks those questions ahead of time and supplies a ruler for assessing the viability or progress of a project.

I assumed framework implementation would need a seven person committee from the very beginning; we would workshop an initial draft as a group and pass it around the Library for feedback. Thankfully, some colleagues of mine also recognized the need for policy and suggested a different path. Rather than assemble a large committee as step 1, three of us put together the initial draft using the DPM Workshop’s Model Document as a guide. This way we could get something in black and white that people could comment on and revise. We could avoid having too many cooks in the kitchen as we put something together from scratch.

I’m glad we built the initial draft this way. Once we had something on paper, the three of us went through the document section by section and got a clearer view of its breadth. We noted specific decisions outside the purview of our individual jobs, and we listed a series of specific questions that will be useful conversation starters as we pass the draft around. We’re now in the very early stages of soliciting external feedback and plotting a path forward for further revisions, but I’m hopeful that, once implemented, the policy will live as an elastic document that bolsters and informs decision-making across the Library and University.

Workflows and Tools

Digital preservation is simultaneously a new and old topic. Many libraries and archives are only now dipping their toes into these complicated waters, even though the long-term preservation of our born-digital and digitized holdings has been a concern for a while now. I think it is often forgotten that trustworthy standard-bearers, like the Digital Preservation Management Workshop and the Open Archival Information System (OAIS) Reference Model, have been around for over a decade. The OAIS Reference Model in particular is a great resource, but it can be intimidating. Full implementation requires a specific set of resources, which not all institutions have. As a result, comparing one’s own program to another that is further along, in an attempt to emulate its progress, is often a frustrating endeavor.

I’ve witnessed this disparity most notably at conferences. Conferences, unconferences, and colloquia can be really helpful in that people are (thankfully) very open with their workflows and documentation. It’s one of my favorite things about working in a library; there aren’t trade secrets, and there isn’t an attitude of competition. We celebrate each other’s successes and want to help one another. That said, conversations at these events are often diluted with tool comparisons and institution-specific jargon, and the disparity of resources can make them frustrating. How can I compare our fledgling web archiving initiative with other institutions that have entire web archiving teams? Brown has a robust and well-supported Fedora repository, but what about institutions that are in the early stages of implementing a system like that? How do we share and develop ideas about what our tools should be doing if our conversations center on the tools themselves?

For our digital accession workflow, I’ve taken a different approach than what came naturally at first. I initially planned our workflow around the implementation of Artefactual’s Archivematica, but I could never get a test instance installed adequately. This, of course, did not stop the flow of digitized and born-digital material in need of processing. I realized I was trying to plan around the tool when I wasn’t even sure what I needed the tool to do. Technology will inevitably change, and unless we have a basis for why a tool was implemented, it will be very difficult to navigate that change.

[Diagram: born-digital accessioning workflow]

For this reason, I’ve been working on a high-level born-digital accessioning workflow where I can insert or take out tools as needed (see above). This workflow outlines the basic procedures of stabilizing, documenting, and packaging content for long-term storage. It has also been a good point of discussion among both internal and external colleagues. For example, after sharing this diagram on Twitter, someone suggested creating an inventory before running a virus scan. When I talked about this in our daily stand-up meeting, one of the Library’s developers pointed out that compressed folders may in fact strengthen the argument for that ordering: unless both the inventory and the virus scan account for items within a compressed folder, there is a risk that the scan might miss something. This is one example of the type of conversations I’d like to be having. It’s great to know which tools are available, but focusing strictly on tool implementation keeps us from asking some hard questions.
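The compressed-folder concern can be sketched in code: an inventory step that also lists the members of zip archives, so the inventory and a later virus scan cover the same set of items (the file layout and naming conventions here are illustrative, not part of our actual workflow).

```python
import zipfile
from pathlib import Path


def inventory(root):
    """List every file under root; for zip archives, also list their
    members, so nothing hidden inside an archive escapes the inventory."""
    root = Path(root)
    records = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        rel = str(path.relative_to(root))
        records.append(rel)
        if zipfile.is_zipfile(path):
            with zipfile.ZipFile(path) as zf:
                # record archive members as "archive!member"
                records.extend(f"{rel}!{member}" for member in zf.namelist())
    return records
```

A virus scanner run afterward can be checked against this list; any archive member the scanner did not touch shows up as a gap.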

New Theses & Dissertations site

Last week, we went live with a new site for Electronic Theses and Dissertations.

[Screenshot: new Electronic Theses and Dissertations site]

My part in the planning and coding of the site started back in January, and it was nice to see the site go into production (although we do have more work to do with the new site and shutting down the old one).

Old Site

[Screenshot: old Electronic Theses and Dissertations site]

The old site was written in PHP and only allowed PhD dissertations to be uploaded. Ingesting the dissertations into the BDR was a multi-step process: run a PHP script to grab the information from the database and turn it into MODS, split and massage the MODS data as needed, map the MODS data files to the corresponding PDFs, and run a script to ingest each dissertation into the BDR. The process worked, but it could be improved.

New Site

The new site is written in Python and Django. It now allows Master’s theses as well as PhD dissertations to be uploaded. Ingesting the theses and dissertations into the BDR is now a simple process: select the ready theses/dissertations in the Django admin and run the ingest admin action; the site knows how to ingest them into the BDR in the correct format.
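A hedged sketch of what such a bulk ingest action looks like (the model fields, status values, and `ingest` helper are invented for illustration, not the site's actual code). Django admin actions are plain functions taking `(modeladmin, request, queryset)`, so the shape can be shown without the framework:

```python
class Thesis:
    """Stand-in for a Django model, just enough for the sketch."""

    def __init__(self, title, status="accepted"):
        self.title = title
        self.status = status
        self.pid = None

    def save(self):
        pass  # Django would persist the change here


def ingest(thesis):
    # placeholder: real code would build MODS + PDF and POST to the BDR API,
    # which returns the new object's pid
    return f"bdr:{abs(hash(thesis.title)) % 100000}"


def ingest_to_bdr(modeladmin, request, queryset):
    """Admin action: ingest every selected, accepted thesis into the BDR."""
    for thesis in queryset:
        if thesis.status == "accepted":
            thesis.pid = ingest(thesis)
            thesis.status = "ingested"
            thesis.save()


ingest_to_bdr.short_description = "Ingest selected theses into the BDR"
```

Registered on the admin class, this turns ingest into a checkbox-and-dropdown operation instead of a chain of scripts.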

ORCID: Unique IDs for Brown Researchers

The Library is coordinating an effort to introduce ORCID identifiers to the campus. ORCID is an open, non-profit initiative founded by academic institutions, professional bodies, funding agencies, and publishers to resolve authorship confusion in scholarly work. The ORCID repository of unique scholar identification numbers aims to reliably identify and link scholars in all disciplines with their work, analogous to the way ISBN and DOI identify books and articles.

Brown is an institutional member of ORCID, which allows the University to create ORCID records on behalf of faculty and to integrate ORCID identifiers into the Brown Identity Management System, Researchers@Brown profiles, grant application processes, and other systems that facilitate identification of faculty and their works.

Please go to https://library.brown.edu/orcid to obtain an ORCID identifier OR, if you already have an ORCID, to link it to your Brown identity.

Please contact researchers@brown.edu if you have questions or feedback.

New ORCID Integrations

  • MIT Libraries have created an ORCID integration that allows their faculty to link an existing ORCID iD to their MIT profile or create a new ORCID record, which is then populated with information about their employment at MIT.
  • The University of Pittsburgh is generating ORCID records for its researchers and adding their University of Pittsburgh affiliation.


ORCID and the Humanities

ORCID recently announced integration with the MLA International Bibliography.

We are delighted to announce that, as of June 17, the Modern Language Association’s prestigious MLA International Bibliography connects to ORCID.  The Bibliography joins other repositories in supporting discoverability through use of digital identifiers, and is the first primarily focused on the humanities to integrate ORCID.

Search relevancy tests

We are creating a set of relevancy tests for the library’s Blacklight implementation.  These tests use predetermined phrases to search Solr, Blacklight’s backend, mimicking the results a user would retrieve.  This provides useful data that can be systematically analyzed.  We use the results of these tests to verify that users will get the results we, as application managers and librarians, expect.  It also will help us protect against regressions, or new, unexpected problems, when we make changes over time to Solr indexing schema or term weighting.
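One of these checks can be sketched as follows (the field names follow Solr's standard JSON response shape; the document ids are invented examples, and our actual tests use the Ruby gem mentioned below):

```python
def top_ids(solr_response, n=3):
    """Ids of the first n documents in a Solr JSON response."""
    return [doc["id"] for doc in solr_response["response"]["docs"][:n]]


def relevancy_ok(solr_response, expected_id, n=3):
    """True if the expected document appears in the top n results."""
    return expected_id in top_ids(solr_response, n)
```

Run against a predetermined search phrase, an assertion like `relevancy_ok(response, "b123")` fails loudly when a schema or weighting change pushes an expected record out of the top results.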

This work is heavily influenced by colleagues at Stanford who have both written about their (much more thorough at this point) relevancy tests and developed a Ruby Gem to assist others with doing similar work.

We are still working to identify common and troublesome searches but have already seen benefits of this approach and used it to identify (and resolve) deficiencies in title weighting and searching by common identifiers, among other issues.  Our test code and test searches are available on GitHub for others to use as an example or to fork and apply to their own project.

Brown library staff who have examples of searches not producing expected results, please pass them on to Jeanette Norris or Ted Lawless.

— Jeanette Norris and Ted Lawless

Best bets for library search

The library has added “best bets” to the new easySearch tool.  Best bets are commonly searched-for library resources.  Examples include JSTOR, PubMed, and Web of Science.  Searches for these phrases (as well as known alternate names and misspellings) will return a best bet highlighted at the top of the search results.

[Screenshot: a best bet highlighted at the top of easySearch results]

To get started, 64 resources have been selected as best bets and are available now via easySearch.  As we would like to know how useful this feature is, please leave us feedback.

Thanks to colleagues at North Carolina State University for leading the way in adding best bets to library search and writing about their efforts.

Technical details

Library staff analyzed search logs to find commonly used search terms and matched those terms to appropriate resources.  The name, URL, and description for each resource are entered into a shared Google Spreadsheet.  A script runs regularly to convert the spreadsheet data into Solr documents and posts the updates to a separate Solr core.  The Blacklight application searches for best bet matches when users enter a search into the default search box.
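The spreadsheet-to-Solr conversion step might look roughly like this (the column names, field names, and alias-cell format are assumptions for illustration, not the library's actual schema):

```python
def rows_to_solr_docs(rows):
    """Turn best-bet spreadsheet rows into Solr documents."""
    docs = []
    for row in rows:
        docs.append({
            # derive a stable id from the resource name
            "id": row["name"].lower().replace(" ", "-"),
            "name": row["name"],
            "url": row["url"],
            "description": row.get("description", ""),
            # alternate names and misspellings, split from one cell
            "alternate_names": [a.strip()
                                for a in row.get("aliases", "").split(";")
                                if a.strip()],
        })
    return docs
```

The resulting documents would then be posted to the separate Solr core that Blacklight queries for best-bet matches.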

Since the library maintains a database of e-resources, in many cases only the identifier for a resource is needed to populate the best bets index.  The indexing script is able to retrieve the resource from the database and use that information to create the best bet.  This eliminates maintaining data about the resources in multiple places.