
Migration: Fedora 3 to OCFL

A previous post described the current storage setup of the Brown Digital Repository. However, until recently BDR storage was handled by Fedora 3. This post will describe how we migrated over one million objects from Fedora 3 to OCFL, without taking down the repository.

The first step was to isolate the storage backend behind our BDR APIs and a Django storage service (this idea wasn’t new to the migration – we’ve been working on our API layer for years, long before the migration started). So, users and client applications did not hit Fedora directly – they went through the storage service or the APIs for reads and writes. This let us contain the storage changes to just the APIs and storage service, without needing to update the other various applications that interacted with the BDR.

For our migration, we decided to set up the new OCFL system while Fedora 3 was still running, and run them both in parallel during the migration. This would minimize downtime: the BDR would not have to be unavailable or read-only for days or weeks while the migration script worked through our ~1 million Fedora 3 objects. We set up our OCFL HTTP service layer, and updated our APIs to be able to post new objects to OCFL and update objects either in Fedora or OCFL. We also updated our storage service to check for an object in OCFL, and if the object hadn’t been migrated to OCFL yet, the storage service would fall back to reading from Fedora. Once these changes were enabled, new objects were posted to OCFL and updated there, and old objects in Fedora were updated in Fedora. At this point, the files of existing objects could still change in Fedora, but no new objects were being added there, so we had a fixed set of Fedora objects to migrate.
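The read fallback described above can be sketched roughly as follows. This is a minimal illustration, not the actual storage service code: `ObjectNotFound`, `read_with_fallback`, and the two reader callables are hypothetical names, and the real service presumably deals with HTTP responses and individual files rather than raw bytes.

```python
from typing import Callable


class ObjectNotFound(Exception):
    """Raised by a backend reader when it has no object with the given PID."""


def read_with_fallback(
    pid: str,
    read_from_ocfl: Callable[[str], bytes],
    read_from_fedora: Callable[[str], bytes],
) -> bytes:
    """Try OCFL first; if the object has not been migrated yet, read from Fedora.

    Both reader arguments are hypothetical callables that return the object's
    content or raise ObjectNotFound.
    """
    try:
        return read_from_ocfl(pid)
    except ObjectNotFound:
        # Not in OCFL yet, so the object must still live in Fedora.
        return read_from_fedora(pid)
```

Because the OCFL lookup always comes first, an object starts being served from OCFL as soon as its migration finishes, with no per-object switch to flip.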

The general plan for migrating was to divide the objects into batches, and migrate each batch individually. We mounted our Fedora storage a second time on the server as read-only, so the migration process would not be able to write to the Fedora data. We used a small custom script to walk the whole Fedora filesystem, listing all the object pids in 12 batch files. For each batch, we used our fork of the Fedora community’s migration-utils program to migrate the Fedora data over to OCFL. We migrated to plain OCFL, however, instead of creating Fedora 6 objects. We also chose to migrate the whole Fedora 3 FOXML file, and not store the object and datastream properties in small RDF files. If the object was marked as ‘D’eleted in Fedora, we marked it as deleted in OCFL by deleting all the files in the final version of the object. After the batch was migrated, we checked for errors.
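As an illustration of the batching step, here is a minimal sketch of a script that walks an object store and writes the PIDs into 12 batch files. It is not the script we used. It assumes the object files are named with the URL-encoded form of `info:fedora/<pid>`, and the function names, batch file names, and round-robin split are all made up for the example.

```python
import os
from pathlib import Path
from typing import List
from urllib.parse import unquote

PREFIX = "info:fedora/"  # assumption: files are named like info%3Afedora%2Ftest%3A123


def list_pids(object_store: Path) -> List[str]:
    """Walk the (read-only) object store and return every PID found, sorted."""
    pids = []
    for _dirpath, _dirnames, filenames in os.walk(object_store):
        for name in filenames:
            decoded = unquote(name)
            if decoded.startswith(PREFIX):
                pids.append(decoded[len(PREFIX):])
    return sorted(pids)


def write_batches(pids: List[str], out_dir: Path, batch_count: int = 12) -> List[Path]:
    """Split the PIDs round-robin into batch_count files, one PID per line."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(batch_count):
        batch = pids[i::batch_count]
        path = out_dir / f"batch_{i + 1:02d}.txt"  # hypothetical file naming
        path.write_text("".join(f"{pid}\n" for pid in batch))
        paths.append(path)
    return paths
```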

One issue we ran into was slowness – one batch of 100,000 objects could take days to finish. This was much slower than a dev server migration, where we migrated over 30,000 objects in ~1.25 hours. We could have sped up the process by turning off fixity checking, but we wanted to make sure the data was being migrated correctly. We added memory to our server, but that didn’t help much. Eventually, we used four temporary servers to run multiple migration batches in parallel, which helped us finish the process faster.

Another issue was that objects were sometimes updated in Fedora while a migration batch was running (because we kept Fedora read-write during the migration). In one batch, 112 of the batch objects were updated in Fedora. The migration of these objects failed completely, so we just needed to add the object PIDs to a cleanup batch, and then they were successfully migrated.

The migration script failed to migrate some objects because the Fedora data was corrupt – i.e., file versions listed in the FOXML were not on disk, or file versions were listed out of order in the FOXML. We used a custom migration script for these objects, so we could still migrate the existing Fedora filesystem files over to OCFL.

Besides the fixity checking that the migration script performed as it ran, we also ran some verification checks after the migration. From our API logs, we verified that the final object update in Fedora was on 2021-05-20. On 2021-06-22, we kicked off a script that took all the objects in the Fedora storage and verified that each object’s Fedora FOXML file was identical to the FOXML file in OCFL (except for some objects that didn’t need to be migrated). Verifying all the FOXML files showed that the migration process had worked correctly, that every object had been migrated, and that there were no missed updates to the Fedora objects – because any Fedora object update would change the FOXML. Starting on 2021-06-30, we took the lists of objects that had an error during the migration and used a custom script to verify that each of the files for that object on the Fedora filesystem was also in OCFL, and that the checksums matched.
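The FOXML comparison can be sketched like this. It is only an illustration of the idea: how the two FOXML paths for a PID are located is left to the caller, the function names are hypothetical, and SHA-512 is simply one reasonable way to compare file contents.

```python
import hashlib
from pathlib import Path
from typing import Dict, List, Tuple


def sha512_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so that large files do not need to fit in memory."""
    digest = hashlib.sha512()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(pairs: Dict[str, Tuple[Path, Path]]) -> List[str]:
    """Return the PIDs whose OCFL FOXML is missing or differs from the Fedora copy.

    pairs maps a PID to (fedora_foxml_path, ocfl_foxml_path); building that
    mapping depends on the repository layout and is not shown here.
    """
    mismatched = []
    for pid, (fedora_foxml, ocfl_foxml) in sorted(pairs.items()):
        if not ocfl_foxml.exists() or sha512_of(fedora_foxml) != sha512_of(ocfl_foxml):
            mismatched.append(pid)
    return mismatched
```

An empty result means every FOXML file matched, which covers both "was it migrated" and "was it updated afterwards" in a single pass.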

Once all the objects were migrated to OCFL, we could start shutting down the Fedora system and removing code for handling both systems. We updated the APIs and the storage service to remove Fedora-related code, and we were able to update our indexer, storage service, and Cantaloupe IIIF server to read all objects directly from the OCFL storage. We shut down Fedora 3, and did some cleanup on the Fedora files. We also saved the migration artifacts (notes, data, scripts, and logs) in the BDR to be preserved.

BDR Storage Architecture

We recently migrated the Brown Digital Repository (BDR) storage from Fedora 3 to OCFL. In this blog post, I’ll describe our current setup, and then a future post will discuss the process we used to migrate.

Our BDR data is currently stored in an OCFL repository [1]. We like having a standardized, specified layout for the files in our repository – we can use any software written for OCFL, or we can write it ourselves. Using the OCFL standard should also help us minimize data migrations in the future, as we won’t need to switch from one application’s custom file layout to a new application’s custom layout. OCFL repositories can be understood just from the files on disk, and databases or indexes can be rebuilt from those files. Backing up the repository only requires backing up the filesystem – there’s no metadata stored in a separate database. OCFL also has versioning and checksums built in for every file in the repository. OCFL gives us an algorithm to find any object on disk (and all the object files are contained in that one object directory), which is much nicer than our previous Fedora storage where objects and files were hard to find because they were spread out in various directories based on the date they were added to the BDR.

In the BDR, we’re storing the data on shared enterprise storage, accessed over NFS. We use an OCFL storage layout extension that splits the objects into directory trees, and encapsulates the object files in a directory with a name based on the object ID. We wrote an HTTP service for the OCFL-java client used by Fedora 6. We use this HTTP service for writing new objects and updates to the repository – this service is the only process that needs read-write access to the data.
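To give a feel for this kind of layout, here is a simplified sketch of a hashed n-tuple layout with an encapsulating directory named after the object ID. The details are assumptions made for illustration (SHA-256, three tuples of three characters, lowercase percent-encoding) and not necessarily the exact parameters of our repository. It also skips the handling of very long IDs that a full implementation of a layout extension would need.

```python
import hashlib

# Characters kept as-is in the directory name; everything else is percent-encoded.
SAFE = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-_")


def encode_id(object_id: str) -> str:
    """Percent-encode an object ID so it can be used as a directory name."""
    out = []
    for ch in object_id:
        if ch in SAFE:
            out.append(ch)
        else:
            out.extend(f"%{byte:02x}" for byte in ch.encode("utf-8"))
    return "".join(out)


def object_path(object_id: str, tuple_size: int = 3, number_of_tuples: int = 3) -> str:
    """Map an object ID to a path relative to the storage root.

    The leading tuples come from a hash of the ID, which spreads objects evenly
    across directories; the final directory is the encoded ID itself.
    """
    digest = hashlib.sha256(object_id.encode("utf-8")).hexdigest()
    tuples = [digest[i * tuple_size:(i + 1) * tuple_size] for i in range(number_of_tuples)]
    return "/".join(tuples + [encode_id(object_id)])
```

Since the path is a pure function of the object ID, any process can locate an object without consulting a database or index.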

We use processes with read-only access (either run by a different user, or on a different server with a read-only mount) to provide additional functionality. Our fixity checking script walks part of the BDR each night and verifies the checksums listed in the OCFL inventory.json file. Our indexer process reads the files in an object, extracts data, and posts the index data to Solr. Our Django-based storage service reads files from the repository to serve the content to users. Each of these services uses our bdrocfl package, which is not a general OCFL client – it contains code for reading our specific repository, with our storage layout, and for extracting the information we need from our files. We also run the Cantaloupe IIIF image server, and we added a custom jruby delegate with some code that knows how to find an object and then a file within the object.
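A fixity check against an OCFL inventory can be sketched as below. This is not our actual script, just a minimal version of the idea: it relies on the `digestAlgorithm` and `manifest` fields of `inventory.json`, where the manifest maps each digest to the content paths (relative to the object root) that should have that digest. The function name and the format of the problem messages are made up.

```python
import hashlib
import json
from pathlib import Path
from typing import List


def check_object_fixity(object_root: Path) -> List[str]:
    """Verify every file listed in the object's inventory.json manifest.

    Returns a list of problem descriptions; an empty list means all files
    are present and match their recorded digests.
    """
    inventory = json.loads((object_root / "inventory.json").read_text(encoding="utf-8"))
    algorithm = inventory["digestAlgorithm"]  # e.g. "sha512"
    problems = []
    for expected, content_paths in inventory["manifest"].items():
        for content_path in content_paths:
            file_path = object_root / content_path
            if not file_path.is_file():
                problems.append(f"missing: {content_path}")
                continue
            digest = hashlib.new(algorithm)
            with file_path.open("rb") as f:
                for chunk in iter(lambda: f.read(1024 * 1024), b""):
                    digest.update(chunk)
            # Hex digests are compared case-insensitively.
            if digest.hexdigest() != expected.lower():
                problems.append(f"digest mismatch: {content_path}")
    return problems
```

The check only reads files, so it fits the read-only access model described above.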

We could add other read-only processes to the BDR in the future. For example, we could add a backup process that crawls the repository, saving each object version to a different storage location. OCFL versions are immutable, and that would simplify the backup process because we would only have to back up new version directories for each object.
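Such a backup process does not exist yet, but the core of it might look something like this sketch. It assumes version directories are named `v1`, `v2`, and so on, and that the backup location mirrors the object directory; the function name and return value are hypothetical. Files in the object root (such as `inventory.json`) are re-copied on every run, because they change whenever a new version is added.

```python
import re
import shutil
from pathlib import Path
from typing import List

VERSION_DIR = re.compile(r"^v\d+$")


def backup_new_versions(object_root: Path, backup_root: Path) -> List[str]:
    """Copy version directories that are not in the backup yet.

    Returns the names of the version directories copied in this run.
    """
    backup_root.mkdir(parents=True, exist_ok=True)
    versions = sorted(
        (p for p in object_root.iterdir() if p.is_dir() and VERSION_DIR.match(p.name)),
        key=lambda p: int(p.name[1:]),
    )
    copied = []
    for version_dir in versions:
        target = backup_root / version_dir.name
        if not target.exists():
            # Existing versions are immutable, so only missing ones need copying.
            shutil.copytree(version_dir, target)
            copied.append(version_dir.name)
    # Root-level files are updated with each new version, so always refresh them.
    for item in object_root.iterdir():
        if item.is_file():
            shutil.copy2(item, backup_root / item.name)
    return copied
```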

1. Collection information like name, description, … is actually stored in a separate database, but hopefully we will migrate that to the OCFL repository soon.