
Migration: Fedora 3 to OCFL

A previous post described the current storage setup of the Brown Digital Repository. However, until recently BDR storage was handled by Fedora 3. This post will describe how we migrated over one million objects from Fedora 3 to OCFL, without taking down the repository.

The first step was to isolate the storage backend behind our BDR APIs and a Django storage service (this idea wasn’t new to the migration – we’ve been working on our API layer for years, long before the migration started). So, users and client applications did not hit Fedora directly – they went through the storage service or the APIs for reads and writes. This let us contain the storage changes to just the APIs and storage service, without needing to update the other various applications that interacted with the BDR.

For our migration, we decided to set up the new OCFL system while Fedora 3 was still running, and run the two in parallel during the migration. This minimized downtime: the BDR did not have to be unavailable or read-only for the days or weeks it took the migration script to process our ~1 million Fedora 3 objects. We set up our OCFL HTTP service layer, and updated our APIs to post new objects to OCFL and to update objects in either Fedora or OCFL. We also updated our storage service to check for an object in OCFL first; if the object hadn’t been migrated to OCFL yet, the storage service would fall back to reading from Fedora. Once these changes were enabled, new objects were posted to OCFL and updated there, while old objects in Fedora were updated in Fedora. At this point, object files could still change in Fedora, but we had a static set of Fedora objects to migrate.

The general plan for migrating was to divide the objects into batches, and migrate each batch individually. We mounted our Fedora storage a second time on the server as read-only, so the migration process would not be able to write to the Fedora data. We used a small custom script to walk the whole Fedora filesystem, listing all the object PIDs in 12 batch files. For each batch, we used our fork of the Fedora community’s migration-utils program to migrate the Fedora data over to OCFL. However, we migrated to plain OCFL instead of creating Fedora 6 objects. We also chose to migrate the whole Fedora 3 FOXML file, rather than storing the object and datastream properties in small RDF files. If an object was marked as ‘D’eleted in Fedora, we marked it as deleted in OCFL by deleting all the files in the final version of the object. After each batch was migrated, we checked for errors.
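The filesystem walk and batching can be sketched roughly like this (a minimal sketch, assuming Fedora 3’s default objectStore layout where each filename is the percent-encoded object pid; the real script handled the specifics of our own layout):

```python
import os
from urllib.parse import unquote

def list_pids(fedora_root):
    """Walk the Fedora objectStore and yield one pid per FOXML file.
    Fedora 3 encodes the pid into the filename, e.g. 'info%3Afedora%2Fdemo%3A1'."""
    for dirpath, _dirnames, filenames in os.walk(fedora_root):
        for name in filenames:
            decoded = unquote(name)  # -> 'info:fedora/demo:1'
            if decoded.startswith('info:fedora/'):
                yield decoded[len('info:fedora/'):]

def write_batches(pids, batch_count=12, prefix='batch'):
    """Round-robin the pids into batch files: batch_0.txt ... batch_11.txt."""
    batches = [[] for _ in range(batch_count)]
    for i, pid in enumerate(pids):
        batches[i % batch_count].append(pid)
    for n, batch in enumerate(batches):
        with open('%s_%d.txt' % (prefix, n), 'w') as f:
            f.write('\n'.join(batch))
```

Each batch file then became the input for one run of the migration tool.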

One issue we ran into was slowness – one batch of 100,000 objects could take days to finish. This was much slower than a dev server migration, where we migrated over 30,000 objects in ~1.25 hours. We could have sped up the process by turning off fixity checking, but we wanted to make sure the data was being migrated correctly. We added memory to our server, but that didn’t help much. Eventually, we used four temporary servers to run multiple migration batches in parallel, which helped us finish the process faster.

We had to deal with another kind of issue, where objects were updated in Fedora during a migration batch run (because we kept Fedora read-write during the migration). In one batch, 112 of the objects were updated in Fedora. The migration of those objects failed completely, so we just added their PIDs to a cleanup batch, and they were migrated successfully there.

The migration script failed to migrate some objects because the Fedora data was corrupt – i.e., file versions listed in the FOXML were not on disk, or file versions were listed out of order in the FOXML. We used a custom migration script for these objects, so we could still migrate the existing Fedora filesystem files over to OCFL.

Besides the fixity checking that the migration script performed as it ran, we also ran some verification checks after the migration. From our API logs, we verified that the final object update in Fedora happened on 2021-05-20. On 2021-06-22, we kicked off a script that walked all the objects in the Fedora storage and verified that each object’s Fedora FOXML file was identical to the FOXML file in OCFL (except for some objects that didn’t need to be migrated). Verifying all the FOXML files showed that the migration process worked correctly, that every object had been migrated, and that there were no missed updates to the Fedora objects – because any Fedora object update would change the FOXML. Starting on 2021-06-30, we took the lists of objects that had an error during the migration and used a custom script to verify that each of that object’s files on the Fedora filesystem was also in OCFL, and that the checksums matched.
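The FOXML comparison can be sketched like this (a simplified sketch: it assumes the Fedora and OCFL paths to the two FOXML copies are already known, which in practice depends on our storage layouts):

```python
import hashlib

def file_sha512(path, chunk_size=1024 * 1024):
    """Stream a file and return its hex SHA-512 digest."""
    h = hashlib.sha512()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

def foxml_matches(fedora_foxml_path, ocfl_foxml_path):
    """True if the two FOXML copies are byte-identical."""
    return file_sha512(fedora_foxml_path) == file_sha512(ocfl_foxml_path)
```

Comparing digests rather than loading both files into memory keeps the check cheap even for large FOXML files.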

Once all the objects were migrated to OCFL, we could start shutting down the Fedora system and removing code for handling both systems. We updated the APIs and the storage service to remove Fedora-related code, and we were able to update our indexer, storage service, and Cantaloupe IIIF server to read all objects directly from the OCFL storage. We shut down Fedora 3, and did some cleanup on the Fedora files. We also saved the migration artifacts (notes, data, scripts, and logs) in the BDR to be preserved.

BDR Storage Architecture

We recently migrated the Brown Digital Repository (BDR) storage from Fedora 3 to OCFL. In this blog post, I’ll describe our current setup, and then a future post will discuss the process we used to migrate.

Our BDR data is currently stored in an OCFL repository [1]. We like having a standardized, specified layout for the files in our repository – we can use any software written for OCFL, or we can write it ourselves. Using the OCFL standard should also help us minimize data migrations in the future, as we won’t need to switch from one application’s custom file layout to a new application’s custom layout. OCFL repositories can be understood just from the files on disk, and databases or indexes can be rebuilt from those files. Backing up the repository only requires backing up the filesystem – there’s no metadata stored in a separate database. OCFL also has versioning and checksums built in for every file in the repository. OCFL gives us an algorithm to find any object on disk (and all the object files are contained in that one object directory), which is much nicer than our previous Fedora storage where objects and files were hard to find because they were spread out in various directories based on the date they were added to the BDR.

In the BDR, we’re storing the data on shared enterprise storage, accessed over NFS. We use an OCFL storage layout extension that splits the objects into directory trees, and encapsulates each object’s files in a directory with a name based on the object ID. We wrote an HTTP service layer on top of the ocfl-java client used by Fedora 6. We use this HTTP service for writing new objects and updates to the repository – this service is the only process that needs read-write access to the data.
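The kind of layout logic involved can be sketched like this (illustrative only – the digest algorithm, tuple sizes, and encapsulation encoding here are assumptions for the sketch, not the exact parameters of the extension we use):

```python
import hashlib
from urllib.parse import quote

def object_path(object_id, tuple_size=3, tuple_count=3):
    """Map an object ID to a storage path: short n-tuples taken from a
    digest of the ID spread objects across a directory tree, and the
    (percent-encoded) ID itself is the encapsulation directory."""
    digest = hashlib.sha256(object_id.encode('utf8')).hexdigest()
    tuples = [digest[i * tuple_size:(i + 1) * tuple_size]
              for i in range(tuple_count)]
    encapsulation = quote(object_id, safe='')
    return '/'.join(tuples + [encapsulation])
```

The point of a layout like this is that any process can compute an object’s directory directly from its ID, with no database lookup.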

We use processes with read-only access (either run by a different user, or on a different server with a read-only mount) to provide additional functionality. Our fixity checking script walks part of the BDR each night and verifies the checksums listed in each OCFL inventory.json file. Our indexer process reads the files in an object, extracts data, and posts the index data to Solr. Our Django-based storage service reads files from the repository to serve the content to users. Each of these services uses our bdrocfl package, which is not a general OCFL client – it contains code for reading our specific repository, with our storage layout, and for extracting the information we need from our files. We also run the Cantaloupe IIIF image server, and we added a custom jruby delegate with some code that knows how to find an object and then a file within the object.
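The nightly fixity check can be sketched like this (a minimal sketch against a standard OCFL inventory.json; the real script walks only part of the repository each night and reports any mismatches):

```python
import hashlib
import json
import os

def check_object_fixity(object_root):
    """Verify every content file listed in an OCFL object's inventory.json.
    Returns a list of (content_path, expected, actual) mismatches."""
    with open(os.path.join(object_root, 'inventory.json')) as f:
        inventory = json.load(f)
    mismatches = []
    # The manifest maps each digest to the content paths with that digest.
    for expected_digest, content_paths in inventory['manifest'].items():
        for content_path in content_paths:
            h = hashlib.new(inventory['digestAlgorithm'])
            with open(os.path.join(object_root, content_path), 'rb') as cf:
                for chunk in iter(lambda: cf.read(1024 * 1024), b''):
                    h.update(chunk)
            if h.hexdigest() != expected_digest:
                mismatches.append((content_path, expected_digest, h.hexdigest()))
    return mismatches
```

Because everything needed is on disk, this check requires only read-only access, which fits the separation of read-write and read-only processes described above.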

We could add other read-only processes to the BDR in the future. For example, we could add a backup process that crawls the repository, saving each object version to a different storage location. OCFL versions are immutable, and that would simplify the backup process because we would only have to back up new version directories for each object.

[1] Collection information (name, description, …) is actually stored in a separate database, but hopefully we will migrate that to the OCFL repository soon.

Bundler 2.1.4 and homeless accounts

This week we upgraded a couple of our applications to Ruby 2.7 and Bundler 2.1.4 and one of the changes that we noticed was that Bundler was complaining about not being able to write to the /opt/local directory.

Turns out this problem shows up because the account that we use to run our application is a system account that does not have a home folder.

This is how the problem shows up:

$ su - system_account
$ pwd
/opt/local

$ mkdir test_app
$ cd test_app
$ pwd
/opt/local/test_app

$ gem install bundler -v 2.1.4
$ bundler --version
`/opt/local` is not writable.
Bundler will use `/tmp/bundler20200731-59360-174h3lz59360' as your home directory temporarily.
Bundler version 2.1.4

Notice that Bundler complains about the /opt/local directory not being writable. That’s because this user has no home directory; in fact, echo $HOME outputs /opt/local rather than the typical /home/username.

Although Bundler is smart enough to use a temporary folder instead and continue, the net result is that if we set a configuration value for Bundler in one execution and try to use it in the next execution, Bundler won’t find the value we set the first time (my guess is because it was saved in a temporary folder).

Below is an example of this. Notice how we set the path value to vendor/bundle in the first command, but then when we inspect the configuration in the second command the configuration does not report the value that we just set:

# First - set the path value
$ bundle config set path 'vendor/bundle'
`/opt/local` is not writable.
Bundler will use `/tmp/bundler20200731-60203-16okmcg60203' as your home directory temporarily.

# Then - inspect the configuration
$ bundle config
`/opt/local` is not writable.
Bundler will use `/tmp/bundler20200731-60292-1r50oed60292' as your home directory temporarily.
Settings are listed in order of priority. The top value will be used.

Ideally the call to bundle config would report the vendor/bundle path that we set, but it does not in this case. In fact, if we run bundle install next, Bundler will install the gems in $GEM_PATH rather than in the custom vendor/bundle directory that we indicated.

Working around the issue

One way to work around this issue is to tell Bundler that the HOME directory is the one from where we are running bundler (i.e. /opt/local/test_app) in our case.

# First - set the path value 
# (no warning is reported)
$ HOME=/opt/local/test_app/ bundle config set path 'vendor/bundle'

# Then - inspect the configuration
$ bundle config
`/opt/local` is not writable.
Bundler will use `/tmp/bundler20200731-63230-11dmgcb63230' as your home directory temporarily.
Settings are listed in order of priority. The top value will be used.
path
Set for your local app (/opt/local/test_app/.bundle/config): "vendor/bundle"

Notice that we didn’t get a warning in the first command (since we indicated a HOME directory) and then, even though we didn’t pass a HOME directory to the second command, our value was picked up and shows the correct value for the path setting (vendor/bundle).

So it seems to me that when HOME is set to a non-writable directory (/opt/local in our case) Bundler picks up the values from .bundle/config if it is available, even as it complains about /opt/local not being writable.

If we were to run bundle install now, it would install the gems in our local vendor/bundle directory. This is good for us: Bundler is using the value that we configured for the path setting (even though it still complains that it cannot write to /opt/local).

We could avoid the warning in the second command if we pass the HOME value here too:

$ HOME=/opt/local/test_app/ bundle config
Settings are listed in order of priority. The top value will be used.
path
Set for your local app (/opt/local/test_app/.bundle/config): "vendor/bundle"

But the fact that Bundler picks up the correct values from .bundle/config when HOME is set to a non-writable directory was important for us, because it meant that the app would also work when running under Apache/Passenger. This is more or less what the configuration for our apps in httpd.conf looks like; notice that we are not setting the HOME value.

<Location />  
  PassengerBaseURI /test-app
  PassengerUser system_account
  PassengerRuby /opt/local/rubies/ruby-2.7.1/bin/ruby
  PassengerAppRoot /opt/local/test-app
  SetEnv GEM_PATH /opt/local/.gem/ruby/2.7.1/
</Location>

Some final thoughts

Perhaps a better solution would be to set a HOME directory for our system_account, but we have not tried that – we didn’t want to make such a wide-reaching change to our environment just to please Bundler. Plus, this might be problematic on our development servers, where we share the same system_account for multiple applications (this is not a problem on our production servers).

We have no idea when this change took effect in Bundler. We went from Bundler 1.17.1 (released in October 2018) to Bundler 2.1.4 (released in January 2020), and there were many releases in between. Perhaps this was documented somewhere and we missed it.

In our particular situation we noticed this issue because one of our gems needed very specific parameters during bundle install. We set those values via a call to bundle config build.mysql2 --with-mysql-dir=xxx --with-mysql-lib=yyy, and those values were lost by the time we ran bundle install, so the installation kept failing. Luckily we found a workaround and were able to install the gem with the specific parameters.

Upgrading from Solr 4 to Solr 7

A few weeks ago we upgraded the version of Solr that we use in our Discovery layer, going from Solr 4.9 to Solr 7.5. Although we have been using Solr 7.x in other areas of the library, this was a significant upgrade for us because searching is the raison d’être of our Discovery layer, and we wanted to make sure that the search results did not change in unexpected ways with the new field and server configurations in Solr. All in all, the process went smoothly for our users. This blog post elaborates on some of the things that we had to do in order to upgrade.

Managed Schema

This is the first Solr instance that we set up to use the managed-schema feature. This allows us to define field types and fields via the Schema API rather than by editing XML files. All in all this was a good decision: it allows us to recreate our Solr instances by running a shell script rather than by copying XML files. This was very handy during testing, when we needed to recreate our Solr core multiple times. You can see the script that we use to recreate our Solr core in GitHub.

We are still tweaking how we manage updates to our schema. For now we are using a low-tech approach in which we create small scripts to add fields to the schema that is conceptually similar to what Rails does with database migrations, but our approach is still very manual.
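A migration step in this low-tech approach might look something like the following sketch (hedged: the field name and core URL are placeholders, and the only Solr-specific assumption is the standard Schema API “add-field” command):

```python
import json
from urllib.request import Request, urlopen

def add_field_command(name, field_type, multi_valued=False, stored=True):
    """Build a Schema API 'add-field' command for one new field."""
    return {'add-field': {'name': name, 'type': field_type,
                          'multiValued': multi_valued, 'stored': stored}}

def apply_migration(solr_core_url, command):
    """POST one schema command to Solr's Schema API endpoint."""
    body = json.dumps(command).encode('utf8')
    req = Request(solr_core_url + '/schema', data=body,
                  headers={'Content-type': 'application/json'})
    with urlopen(req) as response:
        return json.load(response)

# Example migration "step" (placeholder field name and URL):
# apply_migration('http://localhost:8983/solr/mycore',
#                 add_field_command('title_display', 'string'))
```

Each migration script adds its fields and is kept in version control, so a core can be rebuilt by replaying the scripts in order, much like Rails database migrations.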

Default Field Definitions

The default field definitions in Solr 7 are different from those in Solr 4. This is not surprising given that we skipped two major versions of Solr, but it was one of the hardest things to reconcile. Our Solr 4 instance was set up and configured many years ago, and the upgrade forced us to look very closely at exactly what kind of transformations we were doing to our data and to decide what should be modified in Solr 7 to preserve the Solr 4 behavior versus what should be updated to use new Solr 7 features.

Our first approach was to manually inspect the schema.xml in Solr 4 and compare it with the managed-schema file in Solr 7 (which is also an XML file). We soon found that this was too cumbersome and error prone. However, we found the output of the LukeRequestHandler to be much more concise and easier to compare between the two versions of Solr – and, luckily for us, the output of the LukeRequestHandler is identical in both versions!

Using the LukeRequestHandler we dumped our Solr schemas to XML files and compared those files with a traditional file compare tool; we used the built-in file compare option in VS Code, but any file compare tool would do.

These are the commands that we used to dump the schema to XML files:

curl "http://solr-4-url/admin/luke?numTerms=0" > luke4.xml
curl "http://solr-7-url/admin/luke?numTerms=0" > luke7.xml

The output of the LukeRequestHandler includes both the type of field (e.g. string) and the schema definition (single-value vs. multi-value, indexed, tokenized, et cetera):

<lst name="title_display">
  <str name="type">string</str>
  <str name="schema">--SD------------l</str>
</lst>

Another benefit of using the LukeRequestHandler instead of going by the fields defined in schema.xml is that the LukeRequestHandler only outputs fields that are indeed used in the Solr core, whereas schema.xml lists fields that were used at one point even if we don’t use them anymore.

ICUFoldingFilter

In Solr 4, a few of the default field types used the ICUFoldingFilter, which handles diacritics so that a word like “México” is equivalent to “Mexico”. This filter used to be available by default in a Solr 4 installation, but that is not the case anymore. In Solr 7, ICUFoldingFilter is not enabled by default; you must edit your solrconfig.xml as indicated in the documentation (see previous link) to enable it:

<lib dir="../../../contrib/analysis-extras/lib" regex="icu4j.*\.jar" />
<lib dir="../../../contrib/analysis-extras/lucene-libs" regex="lucene-analyzers-icu.*\.jar" />

and then you can use it in a field type by adding it as a filter:

curl -X POST -H 'Content-type:application/json' --data-binary '{ "add-field-type" : {
    "name":"text_search",
    "class":"solr.TextField",
    "analyzer" : {
       "tokenizer":{"class":"solr.StandardTokenizerFactory"},
       "filters":[
         {"class":"solr.ICUFoldingFilterFactory"},
         ...
     ]
   }
 }
}' $SOLR_CORE_URL/schema

Handle Select

handleSelect is a parameter defined in solrconfig.xml; in previous versions of Solr it defaulted to true, but starting in Solr 7 it defaults to false. The version of Blacklight that we are using (5.19) expects this value to be true.

This parameter is what allows Blacklight to use a request handler like “search” (without a leading slash) instead of “/search”. Enabling handleSelect is easy – just edit the requestDispatcher setting in solrconfig.xml:

<requestDispatcher handleSelect="true">

LocalParams and Dereferencing

Our current version of Blacklight uses LocalParams and Dereferencing heavily, and support for these two features changed drastically in Solr 7.2. This is a good enhancement in Solr, but it caught us by surprise.

The gist of the problem is that if solrconfig.xml sets the query parser to DisMax or eDisMax, then Solr will not recognize a query like this:

{!qf=$title_qf}

We tried several workarounds and settled on setting the default parser (defType) in solrconfig.xml to Lucene and requesting eDisMax explicitly from the client application:

{!type=dismax qf=$title_qff}Coffee&df=id

It’s worth noting that passing defType as a normal query string parameter to change the parser did not work for us for queries using LocalParams and Dereferencing.

Stop words

One of the settings that we changed in our new field definitions was the use of stop words: we are no longer using stop words when indexing title fields. This was one of the benefits of doing a full review of each of our field types and tweaking them during the upgrade. The result is that searches for titles made up entirely of stop words (like “There there”) now return the expected results.

Validating Results

To validate that our new field definitions and server-side configuration in Solr 7 were compatible with what we had in Solr 4, we ran several kinds of tests, some manual and some automated.

We have a small suite of unit tests that Jeanette Norris and Ted Lawless created years ago and that we still use to validate some well-known scenarios that we want to support. You can see those “relevancy” tests in our GitHub repository.

We also captured thousands of live searches from our Discovery layer using Solr 4 and replayed them with Solr 7 to make sure that the results of both systems were compatible. To determine that results were compatible, we counted how many of the top 10, top 5, and top 1 results were included in the results of both Solr instances. The following picture shows an example of what the results look like.

Search results comparison

The code that we used to run the searches on both Solr and generate the table is on our GitHub repo.
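The overlap counting described above can be sketched like this (a minimal sketch; ids_a and ids_b stand for the ranked document IDs returned by the two Solr instances for the same search):

```python
def overlap_counts(ids_a, ids_b, cutoffs=(1, 5, 10)):
    """For each cutoff n, count how many of the top-n ids from system A
    also appear in the top-n ids from system B."""
    counts = {}
    for n in cutoffs:
        top_a, top_b = ids_a[:n], set(ids_b[:n])
        counts[n] = sum(1 for doc_id in top_a if doc_id in top_b)
    return counts
```

Aggregating these counts over thousands of replayed searches gives a simple compatibility score for each cutoff.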

CJK Searches

The main reason for us to upgrade from Solr 4 to Solr 7 was to add support for Chinese, Japanese, and Korean (CJK) searches. The way our Solr 4 index was created we did not support searches in these languages. In our Solr 7 core we are using the built-in CJK fields definitions and our results are much better. This will be the subject of future blog post. Stay tuned.

PyPI packages

Recently, we published two Python packages to PyPI: bdrxml and bdrcmodels. No one else is using those packages, as far as I know, and it takes some effort to put them up there, but there are benefits from publishing them.

Putting a package on PyPI makes it easier for other code we package up to depend on bdrxml. For our indexing package, we can switch from this:

'bdrxml @ https://github.com/Brown-University-Library/bdrxml/archive/v1.0a1.zip#sha1=5802ed82ee80a9627657cbb222fe9c056f73ad2c',

to this:

'bdrxml>=1.0',

in setup.py, which is simpler. This also lets us use Python’s package version checking instead of pinning bdrxml to just one version, which is helpful when we embed the indexing package in another project that may use a different version of bdrxml.

Publishing these first two packages also gave us experience, which will help if we publish more packages to PyPI.

New RIAMCO website

A few days ago we released a new version of the Rhode Island Archival and Manuscript Collections Online (RIAMCO) website. The new version is a brand new codebase. This post describes a few of the new features that we implemented as part of the rewrite and how we designed the system to support them.

The RIAMCO website hosts information about archival and manuscript collections in Rhode Island. These collections (also known as finding aids) are stored as XML files using the Encoded Archival Description (EAD) standard and indexed into Solr to allow for full text searching and filtering.

Look and feel

The overall look and feel of the RIAMCO site is heavily influenced by the work that the folks at the NYU Libraries did on their site. Like NYU’s site and Brown’s Discovery tool, the RIAMCO site uses the typical facets-on-the-left, content-on-the-right style that is common in many library and archive websites.

Below is a screenshot of the main search page:

Search results

Architecture

Our previous site was put together over many years and involved several separate applications written in different languages: the frontend was written in PHP, the indexer in Java, and the admin tool in Python/Django. During this rewrite we bundled the code for the frontend and the indexer into a single application written in Ruby on Rails. [As of September 13th, 2019 the Rails application also provides the admin interface.]

You can view a diagram of this architecture and a few more notes about it in this document.

Indexing

Like the previous version of the site, we are using Solr to power the search feature of the site. However, in the previous version each collection was indexed as a single Solr document whereas in the new version we are splitting each collection into many Solr documents: one document to store the main collection information (scope, biographical info, call number, et cetera), plus one document for each item in the inventory of the collection.

This new indexing strategy significantly increased the number of Solr documents that we store. We went from 1,100+ Solr documents (one for each collection) to 300,000+ Solr documents (one for each item in the inventory of those collections).

The advantage of this approach is that now we can search and find items at a much more granular level than before. For example, we can tell a user that we found a match on “Box HE-4 Folder 354” of the Harris Ephemera collection for their search on blue moon, rather than just telling them that there is a match somewhere in the 25 boxes (3,000 folders) of the “Harris Ephemera” collection.

In order to keep the relationship between all the Solr documents for a given collection, we use an extra ead_id_s field to store the ID of the collection that each document belongs to. If we have a collection “A” with three items in its inventory, they will have the following information in Solr:

{id: "A", ead_id_s: "A"} // the main collection record
{id: "A-1", ead_id_s: "A"} // item 1 in the inventory
{id: "A-2", ead_id_s: "A"} // item 2 in the inventory
{id: "A-3", ead_id_s: "A"} // item 3 in the inventory

This structure allows us to use the Result Grouping feature in Solr to group the results of a search by collection. With this structure in place we can show the results grouped by collection, as you can see in the previous screenshot.
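The grouped query can be sketched like this (a sketch of the kind of parameters involved; the group.limit value and the exact parameters we send in production may differ):

```python
from urllib.parse import urlencode

def grouped_search_params(query, rows=10):
    """Build the query string for a search grouped by collection (ead_id_s)."""
    return urlencode({
        'q': query,
        'group': 'true',            # turn on Solr Result Grouping
        'group.field': 'ead_id_s',  # one group per collection
        'group.limit': 3,           # top matching items within each collection
        'rows': rows,               # number of groups to return
    })
```

The resulting query string would be appended to the core’s select URL; Solr then returns one group per collection, each containing its top matching inventory items.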

The code to index our EAD files into Solr is on the Ead class.

We had to add some extra logic to handle cases where a match is found only on a Solr document for an inventory item (but not on the main collection), so that we can also display the main collection information alongside the inventory information in the search results. The code for this is in the search_grouped() function of the Search class.

Hit highlighting

Another feature that we implemented on the new site is hit highlighting. Although this is a feature that Solr supports out of the box, there is some extra coding that we had to do to structure the information in a way that makes sense to our users. In particular, things get tricky when the hit was found in a multi-value field, or when Solr only returns a snippet of the original value in the highlight results. The logic that we wrote to handle this is in the SearchItem class.

Advanced Search

We also did an overhaul to the Advanced Search feature. The layout of the page is very typical (it follows the style used in most Blacklight applications) but the code behind it allows us to implement several new features. For example, we allow the user to select any value from the facets (not only one of the first 10 values for that facet) and to select more than one value from those facets.

We also added a “Check” button to show the user what kind of Boolean expression would be generated for the query that they have entered. Below is a screenshot of the results of the check syntax for a sample query.

advanced search

There are several tweaks and optimizations that we would like to do on this page. For example, opening the facet by Format is quite slow and could be optimized. Also, the code to parse the expression could be rewritten to use a more standard tokenizer/parser structure. We’ll get to that later on… hopefully : )

Individual finding aids

Like in the previous version of the site, the rendering of individual finding aids is done by applying XSLT transformations to the XML finding aid data. We made a few tweaks to the XSLT to integrate the output into the new site, but the vast majority of the transformations came as-is from the previous site. You can see the XSLT files in our GitHub repo.

It’s interesting that GitHub reports that half of the code for the new site is XSLT: 49% XSLT, 24% HTML, and 24% Ruby. Keep in mind that these numbers do not take into account the Ruby on Rails framework code (which is massive).

GitHub code stats

Source code

The source code for the new application is available in GitHub.

Acknowledgements

Although I wrote the code for the new site, there were plenty of people that helped me along the way in this implementation, in particular Karen Eberhart and Joe Mancino. Karen provided the specs for the new site, answered my many questions about the structure of EAD files, and suggested many improvements and tweaks to make the site better. Joe helped me find the code for the original site and indexer, and setup the environment for the new one.

Deploying with shiv

I recently watched a talk called “Containerless Django – Deploying without Docker”, by Peter Baumgartner. Peter lists some benefits of Docker: that it gives you a pipeline for getting code tested and deployed, the container adds some security to the app, state can be isolated in the container, and it lets you run the exact same code in development and production.

Peter also lists some drawbacks to Docker: it’s a lot of code that could slow things down or have bugs, Docker artifacts can be relatively large, and it adds extra abstractions to the system (e.g. filesystem, network). He argues that an ideal deployment would consist of downloading a binary, creating a configuration file, and running it (like one can do with compiled C or Go programs).

Peter describes a process of deploying Django apps by creating a zipapp using shiv and goodconf, and deploying it with systemd constraints that add to the security. He argues that this process achieves most of the benefits of  Docker, but more simply, and that there’s a sweet spot for application size where this type of deploy is a good solution.

I decided to try using shiv with our image server Loris. I ran the shiv command “shiv -o loris.pyz .”, and I got the following error:

User “loris” and or group “loris” do(es) not exist.
Please create this user, e.g.:
`useradd -d /var/www/loris -s /sbin/false loris`

The issue is that in the Loris setup.py file, the install process not only checks for the loris user (as shown in the error), but also sets up directories on the filesystem (including setting owners and permissions, which requires root). I submitted a PR to remove the filesystem setup from the Python package installation (and put it in a script the user can run), and hopefully in the future it will be easier to package up Loris and deploy it in different ways.

Checksums

In the BDR, we calculate checksums automatically on ingest (Fedora 3 provides that functionality for us), so all new content binaries going into the BDR get a checksum, which we can go back and check later as needed.

We can also pass checksums into the BDR API, and then we verify that Fedora calculates the same checksum for the ingested file, which shows that the content wasn’t modified since the first checksum was calculated. We have only been able to use MD5 checksums, but we want to be able to use more checksum types. This isn’t a problem for Fedora, which can calculate multiple checksum types, such as MD5, SHA1, SHA256, and SHA512.
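Computing several checksum types client-side before calling the API can be sketched like this (a generic sketch, not our actual API code):

```python
import hashlib

def file_checksums(path, algorithms=('md5', 'sha1', 'sha256', 'sha512')):
    """Compute several checksums for a file in a single read pass."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b''):
            for h in hashers.values():
                h.update(chunk)
    return {name: h.hexdigest() for name, h in hashers.items()}
```

The client would then send one of these digests (and its type) along with the file, and the repository recomputes it on ingest to confirm the content arrived unmodified.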

However, there is a complicating factor – if Fedora gets a checksum mismatch, by default it returns a 500 response code with no message, so we can’t tell whether it was a checksum mismatch or some other server error. Thanks to Ben Armintor, though, we found that we can update our Fedora configuration so it returns the Checksum Mismatch information.

Another issue in this process is that we use eulfedora (which doesn’t seem to be maintained anymore). If a checksum mismatch happens, it raises a DigitalObjectSaveFailure, but we want to know that there was a checksum mismatch. We forked eulfedora and exposed the checksum mismatch information. Now we can remove some extra code that we had in our APIs, since more functionality is handled in Fedora/eulfedora, and we can use multiple checksum types.

Exporting Django data

We recently had a couple cases where we wanted to dump the data out of a Django database. In the first case (“tracker”), we were shutting down a legacy application, but needed to preserve the data in a different form for users. In the second case (“deposits”), we were backing up some obsolete data before removing it from the database. We handled the processes in two different ways.

Tracker

For the tracker, we used an export script to extract the data. Here’s a modified version of the script:

import datetime
import os

from tracker import models  # import path for the app's models; adjust for your project


def export_data():
    now = datetime.datetime.now()
    dir_name = 'data_%s_%s_%s' % (now.year, now.month, now.day)
    os.mkdir(dir_name)
    file_name = os.path.join(dir_name, 'tracker_items.dat')
    with open(file_name, 'wb') as f:
        # header row: field names joined with the Unit Separator (U+241F)
        f.write(u'\u241f'.join([
                    'project name',
                    'container identifier',
                    'container name',
                    'identifier',
                    'name',
                    'dimensions',
                    'note',
                    'create digital surrogate',
                    'qc digital surrogate',
                    'create metadata record',
                    'qc metadata record',
                    'create submission package']).encode('utf8'))
        # each record ends with the Record Separator (U+241E)
        f.write(u'\u241e'.encode('utf8'))
        for project in models.Project.objects.all():
            for container in project.container_set.all():
                print(container)  # progress output
                for item in container.item_set.all():
                    data = u'\u241f'.join([
                        project.name.strip(),
                        container.identifier.strip(),
                        container.name.strip(),
                        item.identifier.strip(),
                        item.name.strip(),
                        item.dimensions.strip(),
                        item.note.strip()
                    ])
                    item_actions = u'\u241f'.join([str(item_action) for item_action in item.itemaction_set.all().order_by('id')])
                    line_data = u'%s\u241f%s\u241e' % (data, item_actions)
                    f.write(line_data.encode('utf8'))

As you can see, we looped through different Django models and pulled out fields, writing everything to a file. We used the Unicode Record and Unit Separators as delimiters. One advantage of using those is that your data can contain commas, tabs, newlines, and so on, and you don’t have to quote or escape anything.
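A minimal round-trip shows why those separators are convenient (the sample field values here are made up):

```python
# U+241F (unit separator) and U+241E (record separator) as field/record delimiters
FIELD_SEP = '\u241f'
RECORD_SEP = '\u241e'

# field values containing commas, tabs, and newlines need no quoting or escaping
records = [
    ['Box 1, folder 2', 'note with\na newline'],
    ['Item\twith a tab', 'plain value'],
]
blob = RECORD_SEP.join(FIELD_SEP.join(fields) for fields in records)

# splitting reverses the process exactly
round_tripped = [rec.split(FIELD_SEP) for rec in blob.split(RECORD_SEP)]
assert round_tripped == records
```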

Then we converted the data to a spreadsheet that users can view and search:

import openpyxl

workbook = openpyxl.Workbook()
worksheet = workbook.active

with open('tracker_items.dat', 'rb') as f:
    data = f.read()
lines = data.decode('utf8').split('\u241e')
for line in lines:
    if not line:
        continue  # skip the empty string after the trailing record separator
    fields = line.split('\u241f')
    worksheet.append(fields)
workbook.save('tracker_items.xlsx')

Deposits

For the deposits project, we just used the built-in Django dumpdata command:

python manage.py dumpdata -o data_20180727.dat

That output file could be used to load the data back into a database (via Django’s loaddata command) if needed.

Searching for hierarchical data in Solr

Recently I had to index a dataset into Solr in which the original items had a hierarchical relationship among them. In processing this data I took some time to look into the ancestor_path and descendent_path features that Solr provides out of the box, and to see whether and how they could help with searches based on the hierarchy of the data. This post elaborates on what I learned in the process.

Let’s start with some sample hierarchical data to illustrate the kind of relationship that I am describing in this post. Below is a short list of databases and programming languages organized by type.

Databases
  ├─ Relational
  │   ├─ MySQL
  │   └─ PostgreSQL
  └─ Document
      ├─ Solr
      └─ MongoDB
Programming Languages
  └─ Object Oriented
      ├─ Ruby
      └─ Python

For the purposes of this post I am going to index each individual item shown in the hierarchy, not just the leaf items. In other words I am going to create 11 Solr documents: one for “Databases”, another for “Relational”, another for “MySQL”, and so on.

Each document is saved with an id, a title, and a path. For example, the document for “Databases” is saved as:

{ 
  "id": "001", 
  "title_s": "Databases",
  "x_ancestor_path": "db",
  "x_descendent_path": "db" }

and the one for “MySQL” is saved as:

{ 
  "id": "003", 
  "title_s": "MySQL",
  "x_ancestor_path": "db/rel/mysql",
  "x_descendent_path": "db/rel/mysql" }

The x_ancestor_path and x_descendent_path fields in the JSON data represent the path for each of these documents in the hierarchy. For example, the top-level “Databases” document uses the path “db”, while the lowest-level document “MySQL” uses “db/rel/mysql”. I am storing the exact same value in both fields so that later on we can see how each of them provides different features and addresses different use cases.
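As a sketch, the eleven documents can be generated programmatically. The slugs for the “Programming Languages” branch below are my invention (only the “db” branch slugs appear in the examples above):

```python
# (path, title) pairs in id order; the 'langs/...' slugs are illustrative
ITEMS = [
    ('db', 'Databases'),
    ('db/rel', 'Relational'),
    ('db/rel/mysql', 'MySQL'),
    ('db/rel/pg', 'PostgreSQL'),
    ('db/doc', 'Document'),
    ('db/doc/mongo', 'MongoDB'),
    ('db/doc/solr', 'Solr'),
    ('langs', 'Programming Languages'),
    ('langs/oo', 'Object Oriented'),
    ('langs/oo/ruby', 'Ruby'),
    ('langs/oo/python', 'Python'),
]

def build_docs(items):
    # the same path goes into both fields, as described above
    return [{'id': '%03d' % (i + 1),
             'title_s': title,
             'x_ancestor_path': path,
             'x_descendent_path': path}
            for i, (path, title) in enumerate(items)]
```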

ancestor_path and descendent_path

The ancestor_path and descendent_path field types come predefined in Solr. Below is the definition of the descendent_path in a standard Solr 7 core:

$ curl http://localhost:8983/solr/your-core/schema/fieldtypes/descendent_path
{
  ...
  "indexAnalyzer":{
    "tokenizer":{ 
      "class":"solr.PathHierarchyTokenizerFactory", "delimiter":"/"}},
  "queryAnalyzer":{
    "tokenizer":{ 
      "class":"solr.KeywordTokenizerFactory"}}}}

Notice how it uses the PathHierarchyTokenizerFactory tokenizer when indexing values of this type and that it sets the delimiter property to /. This means that when values are indexed they will be split into individual tokens by this delimiter. For example the value “db/rel/mysql” will be split into “db”, “db/rel”, and “db/rel/mysql”. You can validate this in the Analysis Screen in the Solr Admin tool.
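Outside of Solr, the same splitting is easy to approximate in a few lines of Python. This is a sketch of the tokenizer’s behavior, not Solr’s actual implementation:

```python
def path_hierarchy_tokens(path, delimiter='/'):
    """Emit the cumulative prefixes of a path, as PathHierarchyTokenizerFactory does."""
    parts = path.split(delimiter)
    return [delimiter.join(parts[:i + 1]) for i in range(len(parts))]

print(path_hierarchy_tokens('db/rel/mysql'))
# ['db', 'db/rel', 'db/rel/mysql']
```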

The ancestor_path field is the exact opposite: it uses the PathHierarchyTokenizerFactory at query time and the KeywordTokenizerFactory at index time.

There are also two dynamic field definitions *_descendent_path and *_ancestor_path that automatically create fields with these types. Hence the wonky x_descendent_path and x_ancestor_path field names that I am using in this demo.

Finding descendants

The descendent_path field definition in Solr can be used to find all the descendant documents in the hierarchy for a given path. For example, if I query for all documents where the descendant path is “db” (q=x_descendent_path:db) I should get all the documents in the “Databases” hierarchy, but not the ones under “Programming Languages”:

$ curl "http://localhost:8983/solr/your-core/select?q=x_descendent_path:db&fl=id,title_s,x_descendent_path"
{
  ...
  "response":{"numFound":7,"start":0,"docs":[
  {
    "id":"001",
    "title_s":"Databases",
    "x_descendent_path":"db"},
  {
    "id":"002",
    "title_s":"Relational",
    "x_descendent_path":"db/rel"},
  {
    "id":"003",
    "title_s":"MySQL",
    "x_descendent_path":"db/rel/mysql"},
  {
    "id":"004",
    "title_s":"PostgreSQL",
    "x_descendent_path":"db/rel/pg"},
  {
    "id":"005",
    "title_s":"Document",
    "x_descendent_path":"db/doc"},
  {
    "id":"006",
    "title_s":"MongoDB",
    "x_descendent_path":"db/doc/mongo"},
  {
    "id":"007",
    "title_s":"Solr",
    "x_descendent_path":"db/doc/solr"}]
}}

Finding ancestors

The ancestor_path field, not surprisingly, can be used to achieve the reverse. Given the path of a document, we can query Solr to find all its ancestors in the hierarchy. For example, if I query Solr for the documents where x_ancestor_path is “db/doc/solr” (q=x_ancestor_path:db/doc/solr) I should get “Databases”, “Document”, and “Solr” as shown below:

$ curl "http://localhost:8983/solr/your-core/select?q=x_ancestor_path:db/doc/solr&fl=id,title_s,x_ancestor_path"
{
  ...
  "response":{"numFound":3,"start":0,"docs":[
  {
    "id":"001",
    "title_s":"Databases",
    "x_ancestor_path":"db"},
  {
    "id":"005",
    "title_s":"Document",
    "x_ancestor_path":"db/doc"},
  {
    "id":"007",
    "title_s":"Solr",
    "x_ancestor_path":"db/doc/solr"}]
}}

If you are curious how this works internally, you could issue a query with debugQuery=true and look at how the query value “db/doc/solr” was parsed. Notice how Solr splits the query value by the / delimiter and uses something called SynonymQuery() to handle the individual values as synonyms:

$ curl "http://localhost:8983/solr/your-core/select?q=x_ancestor_path:db/doc/solr&debugQuery=true"
{
  ...
  "debug":{
    "rawquerystring":"x_ancestor_path:db/doc/solr",
    "parsedquery":"SynonymQuery(Synonym(x_ancestor_path:db x_ancestor_path:db/doc x_ancestor_path:db/doc/solr))",
...
}

One little gotcha

Given that Solr splits the path values by the / delimiter, and that we can see those tokens in the Analysis Screen (or when passing debugQuery=true), we might expect to be able to fetch them from the document somehow. But that is not the case: the individual tokens are indexed, not stored, so there is no way for us to fetch the individual “db”, “db/doc”, and “db/doc/solr” values when fetching document id “007”. In hindsight this is standard Solr behavior, but it threw me off initially.