
Migration: Fedora 3 to OCFL

A previous post described the current storage setup of the Brown Digital Repository. However, until recently BDR storage was handled by Fedora 3. This post will describe how we migrated over one million objects from Fedora 3 to OCFL, without taking down the repository.

The first step was to isolate the storage backend behind our BDR APIs and a Django storage service (this idea wasn’t new to the migration – we’ve been working on our API layer for years, long before the migration started). So, users and client applications did not hit Fedora directly – they went through the storage service or the APIs for reads and writes. This let us contain the storage changes to just the APIs and storage service, without needing to update the various other applications that interacted with the BDR.

For our migration, we decided to set up the new OCFL system while Fedora 3 was still running, and run the two in parallel during the migration. This minimized downtime: the BDR did not have to be unavailable or read-only for the days or weeks it took the migration script to migrate our ~1 million Fedora 3 objects. We set up our OCFL HTTP service layer, and updated our APIs to post new objects to OCFL and to update objects in either Fedora or OCFL. We also updated our storage service to check for an object in OCFL first; if the object hadn’t been migrated to OCFL yet, the storage service would fall back to reading from Fedora. Once these changes were enabled, new objects were posted to OCFL and updated there, while existing Fedora objects continued to be updated in Fedora. At that point, files in existing Fedora objects could still change, but no new objects were being added to Fedora, so we had a static set of objects to migrate.
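
In pseudocode terms, the read path in the storage service during the migration looked something like the following sketch (the paths and function names here are hypothetical, not our actual code):

import os

# hypothetical mount points -- the real code uses our bdrocfl package and Fedora layout code
OCFL_ROOT = '/ocfl_root'
FEDORA_ROOT = '/fedora_data'

def object_in_ocfl(pid):
    # True if the object has been migrated (or newly created) in OCFL;
    # a flat layout is assumed here for simplicity -- our real layout splits objects into directory trees
    return os.path.isdir(os.path.join(OCFL_ROOT, pid))

def open_file(pid, ocfl_relative_path, fedora_path):
    # open a file for an object, preferring OCFL and falling back to Fedora 3
    if object_in_ocfl(pid):
        return open(os.path.join(OCFL_ROOT, pid, ocfl_relative_path), 'rb')
    # not migrated yet -- read from the (read-only) Fedora 3 data instead
    return open(fedora_path, 'rb')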

The general plan for migrating was to divide the objects into batches and migrate each batch individually. We mounted our Fedora storage a second time on the server as read-only, so the migration process would not be able to write to the Fedora data. We used a small custom script to walk the whole Fedora filesystem and list all the object pids in 12 batch files. For each batch, we used our fork of the Fedora community’s migration-utils program to migrate the Fedora data over to OCFL. However, we migrated to plain OCFL instead of creating Fedora 6 objects. We also chose to migrate the whole Fedora 3 FOXML file, rather than storing the object and datastream properties in small RDF files. If an object was marked as deleted (state ‘D’) in Fedora, we marked it as deleted in OCFL by deleting all the files in the final version of the object. After each batch was migrated, we checked for errors.
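
The pid-listing step was simple in concept; a minimal sketch of that kind of walk (with a hypothetical mount path, and assuming each FOXML filename encodes the object pid, which depends on the Fedora configuration) might look like this:

import os
from urllib.parse import unquote

FEDORA_OBJECTS_DIR = '/fedora_data/objects'   # hypothetical read-only mount
NUM_BATCHES = 12

def list_pids(objects_dir):
    # walk the Fedora objects directory and yield a pid for each FOXML file;
    # this assumes the filename is the URL-encoded pid, which may differ per Fedora setup
    for dirpath, dirnames, filenames in os.walk(objects_dir):
        for name in filenames:
            yield unquote(name)

def write_batches(pids, num_batches=NUM_BATCHES):
    # split the pids round-robin into batch files: batch_00.txt, batch_01.txt, ...
    batch_files = [open('batch_%02d.txt' % i, 'w') for i in range(num_batches)]
    try:
        for i, pid in enumerate(pids):
            batch_files[i % num_batches].write(pid + '\n')
    finally:
        for f in batch_files:
            f.close()

write_batches(list_pids(FEDORA_OBJECTS_DIR))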

One issue we ran into was slowness – one batch of 100,000 objects could take days to finish. This was much slower than a dev server migration, where we migrated over 30,000 objects in ~1.25 hours. We could have sped up the process by turning off fixity checking, but we wanted to make sure the data was being migrated correctly. We added memory to our server, but that didn’t help much. Eventually, we used four temporary servers to run multiple migration batches in parallel, which helped us finish the process faster.

We had to deal with another kind of issue where objects were updated in Fedora during a migration batch run (because we kept Fedora read-write during the migration). In one batch, 112 of the objects were updated in Fedora while the batch was running. The migration of those objects failed outright, so we just needed to add their PIDs to a cleanup batch, and then they were migrated successfully.

The migration script failed to migrate some objects because the Fedora data was corrupt – i.e. file versions listed in the FOXML were not on disk, or file versions were listed out of order in the FOXML. We used a custom migration script for these objects, so we could still migrate the existing Fedora filesystem files over to OCFL.

Besides the fixity checking that the migration script performed as it ran, we also ran some verification checks after the migration. From our API logs, we verified that the final object update in Fedora was on 2021-05-20. On 2021-06-22, we kicked off a script that took all the objects in the Fedora storage and verified that each object’s Fedora FOXML file was identical to the FOXML file in OCFL (except for some objects that didn’t need to be migrated). Verifying all the FOXML files showed that the migration process was working correctly, that every object had been migrated, and that there were no missed updates to the Fedora objects – because any Fedora object update would change the FOXML. Starting on 2021-06-30, we took the lists of objects that had errored during the migration and used a custom script to verify that each of those objects’ files on the Fedora filesystem was also in OCFL and that the checksums matched.
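
The FOXML comparison itself was just a byte-for-byte check; a simplified sketch (the code that locates the two FOXML paths in our Fedora and OCFL layouts is omitted here) could look like this:

import hashlib

def sha512(path):
    # hash the file in chunks so large files don't have to fit in memory
    h = hashlib.sha512()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b''):
            h.update(chunk)
    return h.hexdigest()

def foxml_matches(fedora_foxml_path, ocfl_foxml_path):
    # True if the FOXML in Fedora is identical to the copy migrated into OCFL
    return sha512(fedora_foxml_path) == sha512(ocfl_foxml_path)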

Once all the objects were migrated to OCFL, we could start shutting down the Fedora system and removing code for handling both systems. We updated the APIs and the storage service to remove Fedora-related code, and we were able to update our indexer, storage service, and Cantaloupe IIIF server to read all objects directly from the OCFL storage. We shut down Fedora 3, and did some cleanup on the Fedora files. We also saved the migration artifacts (notes, data, scripts, and logs) in the BDR to be preserved.

BDR Storage Architecture

We recently migrated the Brown Digital Repository (BDR) storage from Fedora 3 to OCFL. In this blog post, I’ll describe our current setup, and then a future post will discuss the process we used to migrate.

Our BDR data is currently stored in an OCFL repository [1]. We like having a standardized, specified layout for the files in our repository – we can use any software written for OCFL, or we can write it ourselves. Using the OCFL standard should also help us minimize data migrations in the future, as we won’t need to switch from one application’s custom file layout to a new application’s custom layout. OCFL repositories can be understood just from the files on disk, and databases or indexes can be rebuilt from those files. Backing up the repository only requires backing up the filesystem – there’s no metadata stored in a separate database. OCFL also has versioning and checksums built in for every file in the repository. OCFL gives us an algorithm to find any object on disk (and all the object files are contained in that one object directory), which is much nicer than our previous Fedora storage where objects and files were hard to find because they were spread out in various directories based on the date they were added to the BDR.

In the BDR, we’re storing the data on shared enterprise storage, accessed over NFS. We use an OCFL storage layout extension that splits the objects into directory trees and encapsulates each object’s files in a directory named based on the object ID. We wrote an HTTP service on top of the ocfl-java client library (the same library used by Fedora 6). We use this HTTP service for writing new objects and updates to the repository – this service is the only process that needs read-write access to the data.

We use processes with read-only access (either run by a different user, or on a different server with a read-only mount) to provide additional functionality. Our fixity checking script walks part of the BDR each night and verifies the checksums listed in each object’s OCFL inventory.json file. Our indexer process reads the files in an object, extracts data, and posts the index data to Solr. Our Django-based storage service reads files from the repository to serve the content to users. Each of these services uses our bdrocfl package, which is not a general OCFL client – it contains code for reading our specific repository, with our storage layout, and for pulling the information we need from our files. We also run the Cantaloupe IIIF image server, and we added a custom jruby delegate with some code that knows how to find an object and then a file within the object.
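
As an illustration of the fixity checking, here is a minimal sketch of verifying a single object against its OCFL inventory.json manifest (our real script also walks the storage root and only covers part of the repository each night):

import hashlib
import json
import os

def check_object_fixity(object_root):
    # verify every file listed in the object's inventory.json manifest;
    # returns a list of (content_path, expected_digest) tuples for any mismatches
    with open(os.path.join(object_root, 'inventory.json')) as f:
        inventory = json.load(f)
    algorithm = inventory['digestAlgorithm']   # 'sha512' for our repository
    errors = []
    for expected_digest, content_paths in inventory['manifest'].items():
        for content_path in content_paths:
            h = hashlib.new(algorithm)
            with open(os.path.join(object_root, content_path), 'rb') as content:
                for chunk in iter(lambda: content.read(1024 * 1024), b''):
                    h.update(chunk)
            if h.hexdigest().lower() != expected_digest.lower():
                errors.append((content_path, expected_digest))
    return errors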

We could add other read-only processes to the BDR in the future. For example, we could add a backup process that crawls the repository, saving each object version to a different storage location. OCFL versions are immutable, and that would simplify the backup process because we would only have to back up new version directories for each object.

[1] Collection information like name, description, … is actually stored in a separate database, but hopefully we will migrate that to the OCFL repository soon.

PyPI packages

Recently, we published two Python packages to PyPI: bdrxml and bdrcmodels. As far as I know, no one else is using those packages, and it takes some effort to put them up there, but there are benefits to publishing them anyway.

Putting a package on PyPI makes it easier for other code we package up to depend on bdrxml. For our indexing package, we can switch from this:

'bdrxml @ https://github.com/Brown-University-Library/bdrxml/archive/v1.0a1.zip#sha1=5802ed82ee80a9627657cbb222fe9c056f73ad2c',

to this:

'bdrxml>=1.0',

in setup.py, which is simpler. This also lets us use Python’s version specifiers instead of pinning bdrxml to a single version, which is helpful when we embed the indexing package in another project that may use a different version of bdrxml.
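
For context, that dependency line just lives in the install_requires list of the indexing package’s setup.py; a stripped-down sketch (the package name and version here are placeholders, not our actual metadata):

from setuptools import setup, find_packages

setup(
    name='bdr_indexer',        # placeholder name
    version='1.0',
    packages=find_packages(),
    install_requires=[
        'bdrxml>=1.0',         # installed from PyPI instead of a GitHub archive URL
    ],
)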

Publishing these first two packages also gave us experience, which will help if we publish more packages to PyPI.

Exporting Django data

We recently had a couple cases where we wanted to dump the data out of a Django database. In the first case (“tracker”), we were shutting down a legacy application, but needed to preserve the data in a different form for users. In the second case (“deposits”), we were backing up some obsolete data before removing it from the database. We handled the processes in two different ways.

Tracker

For the tracker, we used an export script to extract the data. Here’s a modified version of the script:

import datetime
import os

from tracker import models  # the app's models module (import path shown here is illustrative)


def export_data():
    now = datetime.datetime.now()
    dir_name = 'data_%s_%s_%s' % (now.year, now.month, now.day)
    os.mkdir(dir_name)
    file_name = os.path.join(dir_name, 'tracker_items.dat')
    with open(file_name, 'wb') as f:
        # header row -- fields are separated by the Unit Separator symbol (U+241F)
        f.write(u'\u241f'.join([
                    'project name',
                    'container identifier',
                    'container name',
                    'identifier',
                    'name',
                    'dimensions',
                    'note',
                    'create digital surrogate',
                    'qc digital surrogate',
                    'create metadata record',
                    'qc metadata record',
                    'create submission package']).encode('utf8'))
        # records are terminated by the Record Separator symbol (U+241E)
        f.write('\u241e'.encode('utf8'))
        for project in models.Project.objects.all():
            for container in project.container_set.all():
                print(container)  # simple progress output
                for item in container.item_set.all():
                    data = u'\u241f'.join([
                        project.name.strip(),
                        container.identifier.strip(),
                        container.name.strip(),
                        item.identifier.strip(),
                        item.name.strip(),
                        item.dimensions.strip(),
                        item.note.strip()
                    ])
                    item_actions = u'\u241f'.join([str(item_action) for item_action in item.itemaction_set.all().order_by('id')])
                    line_data = u'%s\u241f%s\u241e' % (data, item_actions)
                    f.write(line_data.encode('utf8'))

As you can see, we looped through the different Django models and pulled out fields, writing everything to a file. We used the Unicode Record and Unit Separator symbols as delimiters. One advantage of using those is that the data can contain commas, tabs, newlines, … and it won’t matter; you don’t have to quote or escape anything.

Then we converted the data to a spreadsheet that users can view and search:

import openpyxl

workbook = openpyxl.Workbook()
worksheet = workbook.active

with open('tracker_items.dat', 'rb') as f:
    data = f.read()
    lines = data.decode('utf8').split('\u241e')
    print(len(lines))
    print(lines[0])
    print(lines[-1])
    for line in lines:
        fields = line.split('\u241f')
        worksheet.append(fields)
workbook.save('tracker_items.xlsx')

Deposits

For the deposits project, we just used the built-in Django dumpdata command:

python manage.py dumpdata -o data_20180727.dat

That output file could be used to load data back into a database if needed.

Thumbnail cache

The BDR provides thumbnails for image and PDF objects. The thumbnail service is set up to check for a thumbnail in storage, then try to generate one from the IIIF image server, and fall back to an icon if needed. Thumbnails are cached by the thumbnail service for 30 days, up to a maximum of 5000 thumbnails. In practice, the maximum count was the limiting factor: thumbnails weren’t being purged from the cache because they were older than 30 days, but because the cache filled up.

We recently needed to purge some thumbnails because we had updated the images for some objects. We decided to update the thumbnail caching so that each cache entry is timestamped. We already use our API to check permissions on an object before displaying the thumbnail, and we added an API check for when the object in storage was last modified. If the object was modified more recently than the cache timestamp, the cache entry is stale and we grab an updated thumbnail. This should keep the thumbnail cache from serving stale images.
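
The staleness check itself is a simple comparison; here is a sketch of the logic with hypothetical names (our real code gets the last-modified value from the API response):

import datetime

def thumbnail_is_stale(cached_at, object_last_modified, max_age_days=30):
    # assumes timezone-aware datetimes;
    # stale if the object changed after the thumbnail was cached, or the cache entry expired
    if object_last_modified > cached_at:
        return True
    age = datetime.datetime.now(datetime.timezone.utc) - cached_at
    return age > datetime.timedelta(days=max_age_days)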


Automating Streaming Objects

In the BDR, we can provide streaming versions for audio or video content, in addition to (or instead of) file download. We used Wowza for streaming content in the past, but now we use Panopto.

The process for getting content to Panopto has been manual: download the file from the BDR, upload it to Panopto, set the correct session name in Panopto, and associate the Panopto ID with the BDR record. I’ve been working on automating this process, though.

Here are the steps that the automation code performs:

  • download audio or video file from BDR to a temporary file
  • hit the Panopto API to upload file
    • create the session and upload in Panopto
    • use the Amazon S3 protocol to upload the file
    • mark the upload as complete so Panopto starts processing the file
  • create a streaming object in the BDR with the correct Panopto session ID

We want to make sure that the process can handle large files without running out of memory or taxing the server too much. So, we stream the content to the temporary file in chunks. Then, when we upload the file to Panopto, we’d like to do that in chunks as well, so we’re never reading the whole file into memory – unfortunately, we’re currently running into an error with the multipart upload.
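
The download side is the straightforward part. Here is a sketch of streaming a BDR file to a temporary file in chunks with requests (the URL would be the object’s download link; nothing here is Panopto-specific):

import tempfile

import requests

def download_to_temp_file(url, chunk_size=1024 * 1024):
    # stream the response so large files are never read into memory all at once
    tmp = tempfile.NamedTemporaryFile(delete=False)
    with requests.get(url, stream=True) as response:
        response.raise_for_status()
        for chunk in response.iter_content(chunk_size=chunk_size):
            tmp.write(chunk)
    tmp.close()
    return tmp.name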

This automation will reduce the amount of manual work we do for streaming content, and could open the door to creating streaming objects automatically on request from non-BDR staff (or even users).

Python 2 => 3

We’ve recently been migrating our code from Python 2 to Python 3. There is a lot of documentation about the changes between the two versions, but here are the changes we had to make in our code.

Print

First, the print statement had to be changed to the print function:

print 'message'

became

print('message')

Text and bytes

Python 3 changed bytes and unicode text handling, so here are some changes related to that:

json.dumps required a unicode string, instead of bytes, so

json.dumps(xml.serialize())

became

json.dumps(xml.serialize().decode('utf8'))

basestring was removed, so

isinstance("", basestring)

became

isinstance("", str)

This change to explicit unicode and bytes handling affected the way we opened files. In Python 2, we could open and use a binary file, without specifying that it was binary:

open('file.zip')

In Python 3, we have to specify that it’s a binary file:

open('file.zip', 'rb')

Some functions couldn’t handle unicode in Python 2, so in Python 3 we don’t have to encode the unicode as bytes:

urllib.quote(u'tëst'.encode('utf8'))

became

urllib.quote('tëst')

Of course, Python 3 reorganized parts of the standard library, so the last line would actually be:

urllib.parse.quote('tëst')

Dicts

There were also some changes to Python dicts. The keys() method now returns a view object, so

dict.keys()

became

list(dict.keys())

and dict.iteritems()

became

dict.items()

Virtual environments

Python 3 has virtual environments built in, which means we don’t need to install virtualenv anymore. There’s no activate_this.py in Python 3 environments, though, so we switched to using django-dotenv instead.
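
For example, the environment is created with the built-in venv module (python3 -m venv env), and the .env file is loaded near the top of manage.py. This is a rough sketch assuming django-dotenv’s read_dotenv() entry point and a hypothetical settings module name:

import os
import sys

import dotenv  # django-dotenv

if __name__ == '__main__':
    dotenv.read_dotenv()  # load KEY=value pairs from a .env file into os.environ
    os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'project.settings')  # hypothetical settings module
    from django.core.management import execute_from_command_line
    execute_from_command_line(sys.argv)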

Miscellaneous

Some more changes we made include imports:

from base import * => from .base import *

function names:

func.func_name => func.__name__

and exceptions:

exception.message => str(exception)
except Exception, e => except Exception as e

Optional

Finally, there were optional changes we made. Python 3 uses UTF-8 encoding for source files by default, so we could remove the encoding line from the top of our files. Also, the u'' prefix for unicode strings is allowed in Python 3, but not necessary.

Testing HTTP calls in Python

Many applications make calls to external services, or other services that are part of the application. Testing those HTTP calls can be challenging, but there are some different options available in Python.

Mocking

One option for testing your HTTP calls is to mock out your function that makes the HTTP call. This way, your function doesn’t make the HTTP call, since it’s replaced by a mock function that just returns whatever you want it to.

Here’s an example of mocking out your HTTP call:

import requests

class SomeClass:

  def __init__(self):
    self.data = self._fetch_data()

  def _fetch_data(self):
    r = requests.get('https://repository.library.brown.edu/api/collections/')
    return r.json()

  def get_collection_ids(self):
    return [c['id'] for c in self.data['collections']]

from unittest.mock import patch
MOCK_DATA = {'collections': [{'id': 1}, {'id': 2}]}

with patch.object(SomeClass, '_fetch_data', return_value=MOCK_DATA) as mock_method:
  thing = SomeClass()
  assert thing.get_collection_ids() == [1, 2]

Another mocking option is the responses package. Responses mocks out the requests library specifically, so if you’re using requests, you can tell the responses package what you want each requests call to return.

Here’s an example using the responses package (SomeClass is defined the same way as in the first example):

import responses
import json
MOCK_JSON_DATA = json.dumps({'collections': [{'id': 1}, {'id': 2}]})

@responses.activate
def test_some_class():
  responses.add(
    responses.GET,
    'https://repository.library.brown.edu/api/collections/',
    body=MOCK_JSON_DATA,
    status=200,
    content_type='application/json',
  )
  thing = SomeClass()
  assert thing.get_collection_ids() == [1, 2]

test_some_class()

Record & Replay Data

A different type of solution is to use a package to record the responses from your HTTP calls, and then replay those responses automatically for you.

  • VCR.py – VCR.py is a Python version of the Ruby VCR library, and it supports various HTTP clients, including requests.

Here’s a VCR.py example, again using SomeClass from the first example:

import vcr 
IDS = [674, 278, 280, 282, 719, 300, 715, 659, 468, 720, 716, 687, 286, 288, 290, 296, 298, 671, 733, 672, 334, 328, 622, 318, 330, 332, 625, 740, 626, 336, 340, 338, 725, 724, 342, 549, 284, 457, 344, 346, 370, 350, 656, 352, 354, 356, 358, 406, 663, 710, 624, 362, 721, 700, 661, 364, 660, 718, 744, 702, 688, 366, 667]

with vcr.use_cassette('vcr_cassettes/cassette.yaml'):
  thing = SomeClass()
  fetched_ids = thing.get_collection_ids()
  assert sorted(fetched_ids) == sorted(IDS)

  • betamax – From the documentation: “Betamax is a VCR imitation for requests.” Note that it is more limited than VCR.py, since it only works with the requests package.

Here’s a betamax example (note: I modified the code in order to test it – maybe there’s a way to test the code with betamax without modifying it?):

import requests

class SomeClass:
    def __init__(self, session=None):
        self.data = self._fetch_data(session)

    def _fetch_data(self, session=None):
         if session:
             r = session.get('https://repository.library.brown.edu/api/collections/')
         else:
             r = requests.get('https://repository.library.brown.edu/api/collections/')
         return r.json()

    def get_collection_ids(self):
        return [c['id'] for c in self.data['collections']]


import betamax
CASSETTE_LIBRARY_DIR = 'betamax_cassettes'
IDS = [674, 278, 280, 282, 719, 300, 715, 659, 468, 720, 716, 687, 286, 288, 290, 296, 298, 671, 733, 672, 334, 328, 622, 318, 330, 332, 625, 740, 626, 336, 340, 338, 725, 724, 342, 549, 284, 457, 344, 346, 370, 350, 656, 352, 354, 356, 358, 406, 663, 710, 624, 362, 721, 700, 661, 364, 660, 718, 744, 702, 688, 366, 667]

session = requests.Session()
recorder = betamax.Betamax(
    session, cassette_library_dir=CASSETTE_LIBRARY_DIR
)

with recorder.use_cassette('our-first-recorded-session', record='none'):
    thing = SomeClass(session)
    fetched_ids = thing.get_collection_ids()
    assert sorted(fetched_ids) == sorted(IDS)

Integration Test

Note that with all the solutions I listed above, it’s probably safest to cover the HTTP calls with an integration test that interacts with the real service, in addition to whatever you do in your unit tests.

Another possible solution is to test as much as possible with unit tests without testing the HTTP call, and then just rely on the integration test(s) to test the HTTP call. If you’ve constructed your application so that the HTTP call is only a small, isolated part of the code, this may be a reasonable option.

Here’s an example where the class fetches the data if needed, but the data can easily be put into the class for testing the rest of the functionality (without any mocking or external packages):

import requests

class SomeClass:

    def __init__(self):
        self._data = None

    @property
    def data(self):
        if not self._data:
            r = requests.get('https://repository.library.brown.edu/api/collections/')
            self._data = r.json()
        return self._data

    def get_collection_ids(self):
        return [c['id'] for c in self.data['collections']]


MOCK_DATA = {'collections': [{'id': 1}, {'id': 2}]}

def test_some_class():
    thing = SomeClass()
    thing._data = MOCK_DATA
    assert thing.get_collection_ids() == [1, 2]

test_some_class()

Ivy Plus Discovery Day

On June 4-5, 2017, the Library will host the third annual Ivy Plus Discovery Day. “DiscoDay”, as we like to call it, is an opportunity for staff who work on discovery systems (like Blacklight Josiah) to share updates on their work in progress and discuss common issues.

On Sunday, June 4, we will have a hackathon on these two topics:

  • StackLife — integrating virtual browse in discovery systems
  • Linked Data Authorities — leveraging authorities to provide users with another robust method for exploring our data and finding materials of interest

On Monday, June 5 there will be a full day of sharing and unconference discussion sessions.

We expect about 40 staff from the 13 Ivy Plus Libraries. We’ve initially limited participation to three staff from each institution, and we hope to have a good mix of developers, metadata specialists, user experience librarians, and others whose work is closely tied to their institution’s discovery system.

For more information about Discovery Day see: https://library.brown.edu/create/discoveryday/

Updating Young Mindsets: Technology in the Library

In late September I saw an announcement about a computer science talk which referred to an acronym I wasn’t familiar with: CS4RI. A bit of googling led to the organization ‘CS4RI: Computer Science for Rhode Island’ [1]. Its website included a link to their ‘CS4RI Summit 2016’ [2], which stated as a goal: “The CS4RI Summit aims to inspire the next generation of computer scientists, entrepreneurs, and engaged tech sector employees… let’s excite students with the many educational and career opportunities that result from studying CS…”

Our Digital Technologies department is a vibrant place to work. We combine focused productivity with a culture of learning about new technologies and practices. Libraries at our peer institutions are also known to be terrific places to work.

It occurred to me that while middle and high-school CS students might suspect that robotics or game-design organizations could be interesting places to work, they’d very likely never think that Libraries could be worth considering. A few of us set out to remedy that; we reserved an exhibit table at the CS4RI Summit held at the University of Rhode Island on December 14, 2016.

It was a wonderful event. All sorts of interesting tech companies and organizations exhibited for some 1,500 students. At our exhibit, we talked about how Amazon and Google set the bar for making things easy to find, and easy to get — and how Libraries have worked hard to improve the discoverability and accessibility of our services. We shared that we’ve hired and are continuing to hire people with computer-science and other technology backgrounds. We noted that our Digital Technologies team gets to partner with researchers working on all sorts of interesting issues and technologies. And we let students know we work with, and contribute to, open-source technologies so that our work benefits not only our users, but a much wider community of learners.

A few hundred middle and high-schoolers now have some awareness that Libraries are worth considering for working with technology in a variety of creative, dynamic, and rewarding ways.

(For making this possible and successful, thanks to Bruce Boucek, Hector Correa, Jean Rainwater, Kerri Hicks, Patrick Rashleigh, and Shashi Mishra.)

[1] <http://www.cs4ri.org>
[2] <http://www.cs4ri.org/summit>