Migration: Fedora 3 to OCFL

A previous post described the current storage setup of the Brown Digital Repository. However, until recently BDR storage was handled by Fedora 3. This post will describe how we migrated over one million objects from Fedora 3 to OCFL, without taking down the repository.

The first step was to isolate the storage backend behind our BDR APIs and a Django storage service (this idea wasn't new to the migration – we've been working on our API layer for years, long before the migration started). Users and client applications did not hit Fedora directly – they went through the storage service or the APIs for reads and writes. This let us contain the storage changes to just the APIs and storage service, without needing to update the various other applications that interact with the BDR.

For our migration, we decided to set up the new OCFL system while Fedora 3 was still running, and run the two in parallel during the migration. This minimized downtime – the BDR did not have to be unavailable or read-only for the days or weeks it took the migration script to migrate our ~1 million Fedora 3 objects. We set up our OCFL HTTP service layer, and updated our APIs to post new objects to OCFL and to update objects in either Fedora or OCFL. We also updated our storage service to check for an object in OCFL first; if the object hadn't been migrated to OCFL yet, the storage service would fall back to reading from Fedora. Once these changes were live, new objects were created and updated in OCFL, while existing Fedora objects were still updated in Fedora. Files within Fedora objects could still change, but the set of objects we needed to migrate was now fixed.
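
Conceptually, the storage service's read path during the migration window looked something like this sketch (the ocfl_storage and fedora_storage interfaces are hypothetical stand-ins, not our actual code):

def read_file(pid, filename, ocfl_storage, fedora_storage):
    """Read a file for `pid`, preferring OCFL, falling back to Fedora 3."""
    if ocfl_storage.object_exists(pid):
        # object was created in, or has already been migrated to, OCFL
        return ocfl_storage.read_file(pid, filename)
    # not migrated yet - read from the legacy Fedora 3 storage
    return fedora_storage.read_file(pid, filename)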

The general plan was to divide the objects into batches and migrate each batch individually. We mounted our Fedora storage a second time on the server, read-only, so the migration process could not write to the Fedora data. We used a small custom script to walk the whole Fedora filesystem, listing all the object pids in 12 batch files. For each batch, we used our fork of the Fedora community's migration-utils program to migrate the Fedora data over to OCFL. However, we migrated to plain OCFL instead of creating Fedora 6 objects. We also chose to migrate each whole Fedora 3 FOXML file, rather than storing the object and datastream properties in small RDF files. If an object was marked as 'D'eleted in Fedora, we marked it as deleted in OCFL by deleting all the files in the final version of the object. After each batch was migrated, we checked for errors.
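
The pid-listing script was roughly in the spirit of this sketch (the object-store layout and the pid-decoding details here are assumptions – Fedora 3 storage configurations vary):

import os
from urllib.parse import unquote

def write_batch_files(fedora_objects_dir, num_batches=12):
    # Fedora 3 object stores commonly name each FOXML file with the
    # URL-encoded object URI, e.g. 'info%3Afedora%2Fbdr%3A123' -
    # an assumption here; adjust for the actual storage configuration
    batch_files = [open('batch_%02d.txt' % i, 'w') for i in range(num_batches)]
    count = 0
    for dir_path, dir_names, file_names in os.walk(fedora_objects_dir):
        for name in file_names:
            pid = unquote(name).replace('info:fedora/', '')
            # distribute the pids round-robin across the batch files
            batch_files[count % num_batches].write(pid + '\n')
            count += 1
    for f in batch_files:
        f.close()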

One issue we ran into was slowness – one batch of 100,000 objects could take days to finish. This was much slower than a dev server migration, where we migrated over 30,000 objects in ~1.25 hours. We could have sped up the process by turning off fixity checking, but we wanted to make sure the data was being migrated correctly. We added memory to our server, but that didn’t help much. Eventually, we used four temporary servers to run multiple migration batches in parallel, which helped us finish the process faster.

We had to deal with another kind of issue, where objects were updated in Fedora during a migration batch run (because we kept Fedora read-write during the migration). In one batch, 112 of the batch objects were updated in Fedora while it ran. The migration of those objects failed outright rather than partially succeeding, so we only needed to add their PIDs to a cleanup batch, and then they were migrated successfully.

The migration script failed to migrate some objects because the Fedora data was corrupt – i.e. file versions listed in the FOXML were not on disk, or file versions were listed out of order in the FOXML. We used a custom migration script for these objects, so we could still migrate the existing Fedora filesystem files over to OCFL.

Besides the fixity checking that the migration script performed as it ran, we also ran some verification checks after the migration. From our API logs, we verified that the final object update in Fedora happened on 2021-05-20. On 2021-06-22, we kicked off a script that took all the objects in the Fedora storage and verified that each object's Fedora FOXML file was identical to the FOXML file in OCFL (except for some objects that didn't need to be migrated). Verifying all the FOXML files showed that the migration process worked correctly, that every object had been migrated, and that there were no missed updates to the Fedora objects – any update to a Fedora object would have changed its FOXML. Starting on 2021-06-30, we took the lists of objects that had an error during the migration and used a custom script to verify that each of that object's files on the Fedora filesystem was also in OCFL, and that the checksums matched.
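
At its core, the FOXML check was a byte-for-byte comparison, along these lines (locating the migrated FOXML copy inside each object's OCFL directory is layout-specific and assumed away here):

def foxml_matches(fedora_foxml_path, ocfl_foxml_path):
    # FOXML files are small, so comparing them in memory is fine
    with open(fedora_foxml_path, 'rb') as f1, open(ocfl_foxml_path, 'rb') as f2:
        return f1.read() == f2.read()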

Once all the objects were migrated to OCFL, we could start shutting down the Fedora system and removing code for handling both systems. We updated the APIs and the storage service to remove Fedora-related code, and we were able to update our indexer, storage service, and Cantaloupe IIIF server to read all objects directly from the OCFL storage. We shut down Fedora 3, and did some cleanup on the Fedora files. We also saved the migration artifacts (notes, data, scripts, and logs) in the BDR to be preserved.

BDR Storage Architecture

We recently migrated the Brown Digital Repository (BDR) storage from Fedora 3 to OCFL. In this blog post, I’ll describe our current setup, and then a future post will discuss the process we used to migrate.

Our BDR data is currently stored in an OCFL repository¹. We like having a standardized, specified layout for the files in our repository – we can use any software written for OCFL, or we can write it ourselves. Using the OCFL standard should also help us minimize data migrations in the future, as we won't need to switch from one application's custom file layout to a new application's custom layout. OCFL repositories can be understood just from the files on disk, and databases or indexes can be rebuilt from those files. Backing up the repository only requires backing up the filesystem – there's no metadata stored in a separate database. OCFL also has versioning and checksums built in for every file in the repository. Finally, OCFL gives us an algorithm to find any object on disk, with all of an object's files contained in that one object directory. This is much nicer than our previous Fedora storage, where objects and files were hard to find because they were spread across various directories based on the date they were added to the BDR.

In the BDR, we store the data on shared enterprise storage, accessed over NFS. We use an OCFL storage layout extension that splits the objects into directory trees and encapsulates each object's files in a directory named based on the object ID. We wrote an HTTP service around ocfl-java, the OCFL client library that Fedora 6 uses. We use this HTTP service for writing new objects and updates to the repository – this service is the only process that needs read-write access to the data.
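
The path computation works along these lines – an illustrative sketch in the spirit of OCFL's hashed-n-tuple layout extensions, not necessarily the exact extension or parameters we use:

import hashlib

def object_path(object_id, tuple_size=3, num_tuples=3):
    # the first few n-tuples of the ID's digest form a directory tree...
    digest = hashlib.sha256(object_id.encode('utf8')).hexdigest()
    tuples = [digest[i * tuple_size:(i + 1) * tuple_size]
              for i in range(num_tuples)]
    # ...and the object is encapsulated in a directory named for the
    # (encoded) object ID; this percent-encoding is simplistic
    encapsulation = object_id.replace(':', '%3a')
    return '/'.join(tuples + [encapsulation])

print(object_path('bdr:123'))  # something like 'abc/012/def/bdr%3a123'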

We use processes with read-only access (either run by a different user, or on a different server with a read-only mount) to provide additional functionality. Our fixity checking script walks part of the BDR each night and verifies the checksums listed in each object's OCFL inventory.json file. Our indexer process reads the files in an object, extracts data, and posts the index data to Solr. Our Django-based storage service reads files from the repository to serve content to users. Each of these services uses our bdrocfl package, which is not a general OCFL client – it contains code for reading our specific repository, with our storage layout, and for pulling the information we need from our files. We also run the Cantaloupe IIIF image server, and we added a custom jruby delegate with some code that knows how to find an object and then a file within the object.
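
A simplified version of that nightly fixity check might look like this (the real script handles scheduling, reporting, and our layout; this sketch just verifies one object's manifest):

import hashlib
import json
import os

def check_object_fixity(object_root):
    # an OCFL inventory maps each digest to the content paths (relative
    # to the object root) that should have that digest
    with open(os.path.join(object_root, 'inventory.json')) as f:
        inventory = json.load(f)
    algorithm = inventory['digestAlgorithm']  # e.g. 'sha512'
    mismatches = []
    for expected_digest, content_paths in inventory['manifest'].items():
        for content_path in content_paths:
            h = hashlib.new(algorithm)
            with open(os.path.join(object_root, content_path), 'rb') as cf:
                for chunk in iter(lambda: cf.read(1024 * 1024), b''):
                    h.update(chunk)
            if h.hexdigest() != expected_digest.lower():
                mismatches.append((content_path, expected_digest, h.hexdigest()))
    return mismatches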

We could add other read-only processes to the BDR in the future. For example, we could add a backup process that crawls the repository, saving each object version to a different storage location. OCFL versions are immutable, and that would simplify the backup process because we would only have to back up new version directories for each object.
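
Since versions never change once written, such a crawler could be as simple as this sketch (the paths and copy mechanism are placeholders; a real backup process would batch and verify its copies):

import os
import shutil

def backup_new_versions(object_root, backup_root):
    # copy any version directory (v1, v2, ...) we haven't backed up yet;
    # immutability means directories already copied never change
    for name in sorted(os.listdir(object_root)):
        src = os.path.join(object_root, name)
        if name.startswith('v') and os.path.isdir(src):
            dest = os.path.join(backup_root, name)
            if not os.path.exists(dest):
                shutil.copytree(src, dest)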

1. Collection information like name, description, … is actually stored in a separate database, but hopefully we will migrate that to the OCFL repository soon.

PyPI packages

Recently, we published two Python packages to PyPI: bdrxml and bdrcmodels. No one else is using those packages, as far as I know, and it takes some effort to put them up there, but there are benefits from publishing them.

Putting a package on PyPI makes it easier for other code we package up to depend on bdrxml. For our indexing package, we can switch from this:

'bdrxml @ https://github.com/Brown-University-Library/bdrxml/archive/v1.0a1.zip#sha1=5802ed82ee80a9627657cbb222fe9c056f73ad2c',

to this:

'bdrxml>=1.0',

in setup.py, which is simpler. This also lets us use Python's package version checking instead of pinning bdrxml to just one version, which is helpful when we embed the indexing package in another project that may use a different version of bdrxml.

Publishing these first two packages also gave us experience, which will help if we publish more packages to PyPI.

Deploying with shiv

I recently watched a talk called “Containerless Django – Deploying without Docker”, by Peter Baumgartner. Peter lists some benefits of Docker: that it gives you a pipeline for getting code tested and deployed, the container adds some security to the app, state can be isolated in the container, and it lets you run the exact same code in development and production.

Peter also lists some drawbacks to Docker: it's a lot of code that could slow things down or have bugs, Docker artifacts can be relatively large, and it adds extra abstractions to the system (e.g. filesystem, network). He argues that an ideal deployment would be downloading a binary, creating a configuration file, and running it (as one can do with compiled C or Go programs).

Peter describes a process of deploying Django apps by creating a zipapp using shiv and goodconf, and deploying it with systemd constraints that add to the security. He argues that this process achieves most of the benefits of Docker, but more simply, and that there's a sweet spot of application size where this type of deployment is a good solution.

I decided to try using shiv with our image server Loris. I ran the shiv command "shiv -o loris.pyz .", and I got the following error:

User “loris” and or group “loris” do(es) not exist.
Please create this user, e.g.:
`useradd -d /var/www/loris -s /sbin/false loris`

The issue is that in the Loris setup.py file, the install process not only checks for the loris user (as shown in the error), but also sets up directories on the filesystem (including setting their owner and permissions, which requires root privileges). I submitted a PR to remove the filesystem setup from the Python package installation (and put it in a script the user can run), so hopefully in the future it will be easier to package up Loris and deploy it in different ways.

Checksums

In the BDR, we calculate checksums automatically on ingest (Fedora 3 provides that functionality for us), so all new content binaries going into the BDR get a checksum, which we can go back and check later as needed.

We can also pass checksums into the BDR API, and then verify that Fedora calculates the same checksum for the ingested file, which shows that the content wasn't modified after the first checksum was calculated. So far we have only been able to use MD5 checksums, but we want to support more checksum types. That isn't a problem for Fedora, which can calculate multiple checksum types, such as MD5, SHA1, SHA256, and SHA512.
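
For example, a client could compute several checksum types in a single pass over a file before sending it to our API; a minimal sketch:

import hashlib

def file_checksums(path, algorithms=('md5', 'sha1', 'sha256', 'sha512')):
    # feed each chunk to every hasher so the file is only read once
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b''):
            for h in hashers.values():
                h.update(chunk)
    return {name: h.hexdigest() for name, h in hashers.items()}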

However, there is a complicating factor – if Fedora gets a checksum mismatch, by default it returns a 500 response with no message, so we can't tell whether the failure was a checksum mismatch or some other server error. Thanks to Ben Armintor, though, we found that we can update our Fedora configuration so it returns the checksum mismatch information.

Another issue in this process is that we use eulfedora (which doesn't seem to be maintained anymore). If a checksum mismatch happens, it raises a generic DigitalObjectSaveFailure, but we want to know specifically that there was a checksum mismatch. We forked eulfedora and exposed the checksum mismatch information. Now we can remove some extra code from our APIs, since more of the functionality is handled in Fedora/eulfedora, and we can use multiple checksum types.

Exporting Django data

We recently had a couple cases where we wanted to dump the data out of a Django database. In the first case (“tracker”), we were shutting down a legacy application, but needed to preserve the data in a different form for users. In the second case (“deposits”), we were backing up some obsolete data before removing it from the database. We handled the processes in two different ways.

Tracker

For the tracker, we used an export script to extract the data. Here’s a modified version of the script:

import datetime
import os

from tracker import models  # app-specific import; adjust for the project layout


def export_data():
    now = datetime.datetime.now()
    dir_name = 'data_%s_%s_%s' % (now.year, now.month, now.day)
    os.mkdir(dir_name)
    file_name = os.path.join(dir_name, 'tracker_items.dat')
    with open(file_name, 'wb') as f:
        # header row - fields are joined with the Unicode Unit Separator (U+241F)
        f.write(u'\u241f'.join([
                    'project name',
                    'container identifier',
                    'container name',
                    'identifier',
                    'name',
                    'dimensions',
                    'note',
                    'create digital surrogate',
                    'qc digital surrogate',
                    'create metadata record',
                    'qc metadata record',
                    'create submission package']).encode('utf8'))
        # each record is terminated with the Unicode Record Separator (U+241E)
        f.write(u'\u241e'.encode('utf8'))
        for project in models.Project.objects.all():
            for container in project.container_set.all():
                print(container)  # progress output
                for item in container.item_set.all():
                    data = u'\u241f'.join([
                        project.name.strip(),
                        container.identifier.strip(),
                        container.name.strip(),
                        item.identifier.strip(),
                        item.name.strip(),
                        item.dimensions.strip(),
                        item.note.strip()
                    ])
                    item_actions = u'\u241f'.join([str(item_action) for item_action in item.itemaction_set.all().order_by('id')])
                    line_data = u'%s\u241f%s\u241e' % (data, item_actions)
                    f.write(line_data.encode('utf8'))

As you can see, we looped through the Django models and pulled out fields, writing everything to a file. We used the Unicode Record Separator and Unit Separator characters as delimiters. One advantage of using those is that the data can contain commas, tabs, newlines, … and it won't matter – there's no need to quote or escape anything.

Then we converted the data to a spreadsheet that users can view and search:

import openpyxl

workbook = openpyxl.Workbook()
worksheet = workbook.active

with open('tracker_items.dat', 'rb') as f:
    data = f.read()
    # records end with U+241E, so the final split leaves a trailing empty
    # string - skip it rather than appending an empty row
    lines = data.decode('utf8').split('\u241e')
    for line in lines:
        if not line:
            continue
        fields = line.split('\u241f')
        worksheet.append(fields)
workbook.save('tracker_items.xlsx')

Deposits

For the deposits project, we just used the built-in Django dumpdata command:

python manage.py dumpdata -o data_20180727.dat

That output file (JSON, by default) could be used with Django's loaddata command to load the data back into a database if needed.

Looking at the Oxford Common Filesystem Layout (OCFL)

Currently, the BDR contains about 34TB of content. The storage layer is Fedora 3, with the data stored internally by Fedora (instead of externally). However, Fedora 3 is end-of-life, which means we either maintain it ourselves or migrate to something else. But we don't want to migrate 34TB, and then have to migrate it all again the next time we change software. We'd like to be able to change our software without migrating all our data.

This is where the Oxford Common Filesystem Layout (OCFL) work is interesting. OCFL is an effort to define how repository objects should be laid out on the filesystem. OCFL is still very much a work-in-progress, but the “Need” section of the specification speaks directly to what I described above. If we set up our data using OCFL, hopefully we can upgrade and change our software as necessary without having to move all the data around.

Another benefit of the OCFL effort is that it’s work being done by people from multiple institutions, building on other work and experience in this area, to define a good, well-thought-out layout for repository objects.

Finally, using a common specification for the filesystem layout of our repository means that there’s a better chance that other software will understand how to interact with our files on disk. The more people using the same filesystem layout, the more potential collaborators and applications for implementing the OCFL specification – safely creating, updating, and serving out content for the repository.

Thumbnail cache

The BDR provides thumbnails for image and PDF objects. The thumbnail service is set up to check for a thumbnail in storage, then try to generate one from the IIIF image server, and fall back to an icon if needed. Thumbnails are cached by the thumbnail service for 30 days, up to a maximum of 5000 thumbnails. In practice, the maximum count was the limiting factor – thumbnails weren't being purged for being older than 30 days, but because the cache filled up.

We recently needed to purge some thumbnails, because we had updated the images for some objects. We decided to update the thumbnail caching so that each cache entry is timestamped. We already use our API to check permissions on an object before displaying the thumbnail, and we added an API check for when the object in storage was last modified. If the object was modified more recently than the cache timestamp, the cache entry is stale and we grab an updated thumbnail. This should keep the thumbnail cache from serving stale images.
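
The lookup logic amounts to something like this sketch (the cache and api objects are hypothetical stand-ins for our thumbnail cache and BDR API client):

def get_thumbnail(pid, cache, api):
    cached = cache.get(pid)  # returns (image_bytes, cached_at) or None
    if cached:
        image, cached_at = cached
        # the entry is fresh unless the object changed after we cached it
        if api.last_modified(pid) <= cached_at:
            return image
    image = api.fetch_thumbnail(pid)  # IIIF server, or the icon fallback
    cache.set(pid, image)
    return image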

Automating Streaming Objects

In the BDR, we can provide streaming versions for audio or video content, in addition to (or instead of) file download. We used Wowza for streaming content in the past, but now we use Panopto.

The process for getting content to Panopto has been manual: download the file from the BDR, upload it to Panopto, set the correct session name in Panopto, and associate the Panopto ID with the BDR record. I’ve been working on automating this process, though.

Here are the steps that the automation code performs:

  • download audio or video file from BDR to a temporary file
  • hit the Panopto API to upload file
    • create the session and upload in Panopto
    • use the Amazon S3 protocol to upload the file
    • mark the upload as complete so Panopto starts processing the file
  • create a streaming object in the BDR with the correct Panopto session ID

We want to make sure that the process can handle large files without running out of memory or taxing the server too much. So, we stream the content to the temporary file in chunks. Then, when we upload the file to Panopto, we’d like to do that in chunks as well, so we’re never reading the whole file into memory – unfortunately, we’re currently running into an error with the multipart upload.
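
The download step uses the streaming pattern the requests library supports; a sketch along those lines (URL handling and cleanup omitted, and the chunk size is arbitrary):

import tempfile

import requests

def download_to_tempfile(url, chunk_size=1024 * 1024):
    # stream the response body to disk so large files never have to fit
    # in memory all at once
    tmp = tempfile.NamedTemporaryFile(delete=False)
    with requests.get(url, stream=True) as response:
        response.raise_for_status()
        for chunk in response.iter_content(chunk_size=chunk_size):
            tmp.write(chunk)
    tmp.close()
    return tmp.name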

This automation will reduce the amount of manual work we do for streaming content, and could open the door to creating streaming objects automatically on request from non-BDR staff (or even users).

MySQL 5.7 migration

We recently migrated the BDR databases from MySQL version 5.5 to 5.7. Here are a couple benefits for us as application developers:

Stricter Data Handling

By default, MySQL 5.7 uses stricter data handling than 5.5, so we don’t have to manually put MySQL into strict mode.
MySQL 5.5's loose data handling bit us last summer. We have an application where files can be uploaded, and the file names are stored in the database. A user started getting errors trying to upload new files, because the file names were duplicates (all the file names in the database are required to be unique). It turned out that the file names were too long for the field, so MySQL was silently truncating them and inserting them into the table anyway. Duplicate errors were then thrown whenever a new file name truncated to the same value as another already-truncated file name. After that, we put MySQL into strict mode for some of our databases, but now it will be that way by default.
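
For reference, forcing strict mode from a Django project uses the documented init_command database option; a sketch (the database name is a placeholder):

# settings.py
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'bdr_app',  # placeholder name
        'OPTIONS': {
            # run on every new connection; unnecessary once MySQL 5.7
            # makes strict mode the default
            'init_command': "SET sql_mode='STRICT_TRANS_TABLES'",
        },
    }
}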

Support

The second benefit is that Django 2.1 won’t support 5.5 anymore, and MySQL 5.5 will be End-of-Life this year, so this migration gets us on a better-supported version of MySQL.

Now, if only 'UTF-8' in MySQL actually meant UTF-8 (MySQL's utf8 charset is really a three-byte-per-character subset)… MySQL 8.0 was recently released, and it looks like it uses UTF8MB4 (i.e. real UTF-8) by default, so that may be helpful in the future when we move to 8.0.