We recently migrated the Brown Digital Repository (BDR) storage from Fedora 3 to OCFL. In this blog post, I’ll describe our current setup, and then a future post will discuss the process we used to migrate.
Our BDR data is currently stored in an OCFL repository1. We like having a standardized, specified layout for the files in our repository – we can use any software written for OCFL, or we can write it ourselves. Using the OCFL standard should also help us minimize data migrations in the future, as we won’t need to switch from one application’s custom file layout to a new application’s custom layout. OCFL repositories can be understood just from the files on disk, and databases or indexes can be rebuilt from those files. Backing up the repository only requires backing up the filesystem – there’s no metadata stored in a separate database. OCFL also has versioning and checksums built in for every file in the repository. OCFL gives us an algorithm to find any object on disk (and all the object files are contained in that one object directory), which is much nicer than our previous Fedora storage where objects and files were hard to find because they were spread out in various directories based on the date they were added to the BDR.
In the BDR, we’re storing the data on shared enterprise storage, accessed over NFS. We use an OCFL storage layout extension that splits the objects into directory trees, and encapsulates the object files in a directory with a name based on the object ID. We wrote an HTTP service for the OCFL-java client used by Fedora 6. We use this HTTP service for writing new objects and updates to the repository – this service is the only process that needs read-write access to the data.
We use processes with read-only access (either run by a different user, or on a different server with a read-only mount) to provide additional functionality. Our fixity checking script walks part of the BDR each night and verifies the checksums listed in the OCFL inventory.json file. Our indexer process reads the files in a object, extracts data, and posts the index data to Solr. Our Django-based storage service reads files from the repository to serve the content to users. Each of these services uses our bdrocfl package, which is not a general OCFL client – it contains code for reading our specific repository, with our storage layout and reading the information we need from our files. We also run the Cantaloupe IIIF image server, and we added a custom jruby delegate with some code that knows how to find an object and then a file within the object.
We could add other read-only processes to the BDR in the future. For example, we could add a backup process that crawls the repository, saving each object version to a different storage location. OCFL versions are immutable, and that would simplify the backup process because we would only have to back up new version directories for each object.
1. Collection information like name, description, … is actually stored in a separate database, but hopefully we will migrate that to the OCFL repository soon.