
Upgrading from Solr 4 to Solr 7

A few weeks ago we upgraded the version of Solr that we use in our Discovery layer from Solr 4.9 to Solr 7.5. Although we have been using Solr 7.x in other areas of the library, this was a significant upgrade for us because searching is the raison d’être of our Discovery layer and we wanted to make sure that the search results did not change in unexpected ways with the new field and server configurations in Solr. All in all the process went smoothly for our users. This blog post elaborates on some of the things that we had to do in order to upgrade.

Managed Schema

This is the first Solr installation that we set up to use the managed-schema feature, which allows us to define field types and fields via the Schema API rather than by editing XML files. All in all this was a good decision: we can now recreate our Solr instances by running a shell script rather than by copying XML files. This was very handy during testing, when we needed to recreate our Solr core multiple times. You can see the script that we use to recreate our Solr core in GitHub.

We are still tweaking how we manage updates to our schema. For now we are using a low-tech approach in which we create small scripts to add fields to the schema, conceptually similar to what Rails does with database migrations, although our approach is still very manual.
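A minimal sketch of one such "migration" script, assuming a hypothetical field name and core URL (neither is from our actual schema), is a single curl call against the Schema API. Here the payload is printed first and only sent when APPLY is set, so the script doubles as its own log:

```shell
#!/bin/sh
# One "migration": add a single field via the Schema API.
# The field name and core URL below are illustrative placeholders.
SOLR_CORE_URL="${SOLR_CORE_URL:-http://localhost:8983/solr/blacklight-core}"
PAYLOAD='{"add-field": {"name": "example_display", "type": "string", "stored": true, "indexed": false}}'

echo "$PAYLOAD"   # log what this migration will apply
if [ -n "${APPLY:-}" ]; then
  # actually apply the change (requires a live core)
  curl -X POST -H 'Content-type:application/json' --data-binary "$PAYLOAD" "$SOLR_CORE_URL/schema"
fi
```

Keeping one script per change, named with a date prefix, preserves the order in which the schema evolved, which is the part of the Rails-migrations idea that matters most here.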

Default Field Definitions

The default field definitions in Solr 7 are different from those in Solr 4. This is not surprising given that we skipped two major versions of Solr, but it was one of the hardest things to reconcile. Our Solr 4 instance was set up and configured many years ago, and the upgrade forced us to look very closely at exactly what kind of transformations we were applying to our data and decide what should be modified in Solr 7 to preserve the Solr 4 behavior versus what should be updated to use new Solr 7 features.

Our first approach was to manually inspect the “schema.xml” in Solr 4 and compare it with the “managed-schema” file in Solr 7, which is also an XML file. We soon found that this was too cumbersome and error prone. The output of the LukeRequestHandler turned out to be much more concise and easier to compare between the two versions of Solr, and, luckily for us, its output is identical in both versions!

Using the LukeRequestHandler we dumped each Solr schema to an XML file and compared the two files with a traditional file compare tool; we used the built-in file compare option in VS Code, but any such tool would do.

These are the commands that we used to dump the schema to XML files:

curl "http://solr-4-url/admin/luke?numTerms=0" > luke4.xml
curl "http://solr-7-url/admin/luke?numTerms=0" > luke7.xml

The output of the LukeRequestHandler includes both the type of each field (e.g. string) and its schema definition (single vs. multi-valued, indexed, tokenized, et cetera):

<lst name="title_display">
  <str name="type">string</str>
  <str name="schema">--SD------------l</str>
</lst>

Another benefit of using the LukeRequestHandler rather than reading the fields defined in schema.xml is that the LukeRequestHandler only outputs fields that are actually used in the Solr core, whereas schema.xml also lists fields that were used at one point but that we no longer use.
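The Luke dump also reduces nicely to one field name per line, which makes a line-by-line diff of the two versions trivial. A sketch, using a tiny inline sample in place of the real luke4.xml:

```shell
# Reduce a Luke dump to one field name per line so two dumps can be diffed.
# A small inline sample stands in for the real luke4.xml here.
cat > luke-sample.xml <<'EOF'
<lst name="fields">
  <lst name="title_display">
    <str name="type">string</str>
    <str name="schema">--SD------------l</str>
  </lst>
  <lst name="author_display">
    <str name="type">string</str>
    <str name="schema">--SD------------l</str>
  </lst>
</lst>
EOF
FIELDS=$(sed -n 's/.*<lst name="\([^"]*\)">.*/\1/p' luke-sample.xml)
echo "$FIELDS"
# For the real comparison: run the same sed over luke4.xml and luke7.xml
# and diff the two outputs (or diff the full files in any compare tool).
```

This is only a convenience on top of the full-file compare; the flag strings themselves (e.g. --SD------------l) still need to be read side by side in the complete dumps.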

ICUFoldingFilter

In Solr 4 a few of the default field types used the ICUFoldingFilter, which folds diacritics so that a word like “México” matches “Mexico”. This filter used to be available out of the box in a Solr 4 installation, but that is no longer the case. In Solr 7 the ICUFoldingFilter is not enabled by default and you must edit your solrconfig.xml as indicated in the documentation to enable it (see previous link):

<lib dir="../../../contrib/analysis-extras/lib" regex="icu4j.*\.jar" />
<lib dir="../../../contrib/analysis-extras/lucene-libs" regex="lucene-analyzers-icu.*\.jar" />

and then you can use it in a field type by adding it as a filter:

curl -X POST -H 'Content-type:application/json' --data-binary '{ "add-field-type" : {
    "name":"text_search",
    "class":"solr.TextField",
    "analyzer" : {
       "tokenizer":{"class":"solr.StandardTokenizerFactory"},
       "filters":[
         {"class":"solr.ICUFoldingFilterFactory"},
         ...
     ]
   }
 }
}' $SOLR_CORE_URL/schema
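A quick way to verify that the filter is active is Solr’s field-analysis endpoint, which shows how a value is tokenized by a given field type. The core URL below is a placeholder, and the URL is printed here rather than requested:

```shell
# Ask the analysis endpoint how the new type processes "México" (URL-encoded).
# With ICUFoldingFilter active the final token should come out as "mexico".
SOLR_CORE_URL="${SOLR_CORE_URL:-http://localhost:8983/solr/blacklight-core}"
URL="$SOLR_CORE_URL/analysis/field?analysis.fieldtype=text_search&analysis.fieldvalue=M%C3%A9xico"
echo "$URL"   # curl this against a live core and inspect the token stream
```

The Analysis screen in the Solr admin UI runs the same check interactively if you prefer a visual view of each filter’s output.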

Handle Select

handleSelect is a parameter defined in solrconfig.xml; in previous versions of Solr it defaulted to true, but starting in Solr 7 it defaults to false. The version of Blacklight that we are using (5.19) expects this value to be true.

This parameter is what allows Blacklight to use a request handler name like “search” (without a leading slash) instead of “/search”. Enabling handleSelect is easy: edit the requestDispatcher setting in solrconfig.xml.

<requestDispatcher handleSelect="true">
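To illustrate the difference (host and core name below are made up): with the flag on, a qt-style request to /select is routed to the named handler, which is the form Blacklight 5.x sends; the path-style form instead requires a handler registered with a leading slash:

```shell
# Illustrative URLs only; host and core name are placeholders.
SOLR="${SOLR:-http://localhost:8983/solr/blacklight-core}"
QT_STYLE="$SOLR/select?qt=search&q=coffee"   # requires handleSelect="true"
PATH_STYLE="$SOLR/search?q=coffee"           # requires a handler named "/search"
echo "$QT_STYLE"
echo "$PATH_STYLE"
```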

LocalParams and Dereferencing

Our current version of Blacklight uses LocalParams and dereferencing heavily, and support for these two features changed drastically in Solr 7.2. This is a good enhancement in Solr, but it caught us by surprise.

The gist of the problem is that if solrconfig.xml sets the default query parser to DisMax or eDisMax, then Solr will no longer recognize a query like this:

{!qf=$title_qf}

We tried several workarounds and settled on setting the default parser (defType) in solrconfig.xml to lucene and requesting the DisMax parser explicitly from the client application:

{!type=dismax qf=$title_qf}Coffee&df=id

It’s worth noting that passing defType as a normal query string parameter to change the parser did not work for us for queries using LocalParams and dereferencing.
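From the command line, a query like the one above can be sent with curl’s --data-urlencode, which takes care of escaping the {!...} characters. The core URL and the qf contents below are illustrative, and the command is printed rather than executed:

```shell
# Build (but do not run) the request; URL and field boosts are placeholders.
SOLR_CORE_URL="${SOLR_CORE_URL:-http://localhost:8983/solr/blacklight-core}"
CMD="curl --get '$SOLR_CORE_URL/select' --data-urlencode 'q={!type=dismax qf=\$title_qf}Coffee' --data-urlencode 'title_qf=title_display^100 subtitle_display^50'"
echo "$CMD"   # run the printed command against a live core
```

The second --data-urlencode is what the dereferenced $title_qf resolves to; in our setup that list lives in solrconfig.xml rather than in the request.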

Stop words

One of the settings that we changed in our new field definitions was the use of stop words: we no longer use stop words when indexing title fields. This was one of the benefits of doing a full review of each of our field types and tweaking them during the upgrade. The result is that searches for titles made up entirely of stop words (like “There there”) now return the expected results.
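For example, a title field type along these lines (the name and filter chain are illustrative, not our exact schema) simply omits solr.StopFilterFactory from its analyzer:

```json
{ "add-field-type": {
    "name": "text_title_search",
    "class": "solr.TextField",
    "analyzer": {
      "tokenizer": { "class": "solr.StandardTokenizerFactory" },
      "filters": [
        { "class": "solr.ICUFoldingFilterFactory" }
      ]
    }
  }
}
```

Because no stop filter is present, every token in “There there” is indexed and searchable rather than being discarded at index time.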

Validating Results

To validate that our new field definitions and server-side configuration in Solr 7 were compatible with what we had in Solr 4 we ran several kinds of tests, some manual and others automated.

We have a small suite of unit tests that Jeanette Norris and Ted Lawless created years ago and that we still use to validate some well-known scenarios that we want to support. You can see those “relevancy” tests in our GitHub repository.

We also captured thousands of live searches from our Discovery layer running Solr 4 and replayed them against Solr 7 to make sure that the results of both systems were compatible. To determine compatibility we counted how many of the top 10, top 5, and top 1 results were included in the results of both Solr instances. The following picture shows an example of what the results look like.
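The overlap check itself is simple. A toy version with made-up document ids (the real script does this for the top 10, 5, and 1 of every captured search):

```shell
# Toy overlap check with made-up ids: how many of the first N ids from the
# Solr 4 results also appear among the first N from Solr 7?
solr4="id1 id2 id3 id4 id5"
solr7="id2 id1 id9 id4 id8"

overlap() {
  n=$1; shared=0
  for a in $(echo $solr4 | cut -d' ' -f1-$n); do
    case " $(echo $solr7 | cut -d' ' -f1-$n) " in
      *" $a "*) shared=$((shared + 1)) ;;
    esac
  done
  echo "$shared"
}

echo "top 5: $(overlap 5) shared"   # 3 of Solr 4's top 5 are in Solr 7's top 5
echo "top 1: $(overlap 1) shared"   # the #1 result differs in this toy data
```

Counting membership rather than exact rank positions is deliberate: small reorderings within the top results are acceptable to us, while documents dropping out of the top N entirely are what we wanted to catch.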

Search results comparison

The code that we used to run the searches on both Solr instances and generate the table is in our GitHub repo.

CJK Searches

The main reason for us to upgrade from Solr 4 to Solr 7 was to add support for Chinese, Japanese, and Korean (CJK) searches. The way our Solr 4 index was built, we did not support searches in these languages. In our Solr 7 core we are using the built-in CJK field definitions and our results are much better. This will be the subject of a future blog post. Stay tuned.