
Upgrading from Solr 4 to Solr 7

A few weeks ago we upgraded the version of Solr that we use in our Discovery layer, going from Solr 4.9 to Solr 7.5. Although we have been using Solr 7.x in other areas of the library, this was a significant upgrade for us because searching is the raison d’être of our Discovery layer and we wanted to make sure that the search results did not change in unexpected ways under the new field and server configurations. All in all the process went smoothly for our users. This blog post elaborates on some of the things that we had to do in order to upgrade.

Managed Schema

This is the first Solr core that we set up to use the managed-schema feature, which allows us to define field types and fields via the Schema API rather than by editing XML files. This was a good decision: it lets us recreate our Solr instances by running a shell script rather than by copying XML files, which was very handy when we needed to recreate our Solr core multiple times during testing. You can see the script that we use to recreate our Solr core in GitHub.

We are still tweaking how we manage updates to our schema. For now we are using a low-tech approach in which we create small scripts to add fields to the schema. Conceptually this is similar to what Rails does with database migrations, but our approach is still very manual.
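To give an idea of what these scripts look like, here is a minimal sketch of one: a single curl call against the Schema API (the field name below is made up for the example, it is not one of our real fields).

curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field": {
    "name":"new_note_display",
    "type":"string",
    "indexed":true,
    "stored":true,
    "multiValued":true
  }
}' $SOLR_CORE_URL/schema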

Default Field Definitions

The default field definitions in Solr 7 are different from those in Solr 4. This is not surprising given that we skipped two major versions of Solr, but it was one of the hardest things to reconcile. Our Solr 4 core was set up and configured many years ago, and the upgrade forced us to look very closely at exactly what kind of transformations we were doing to our data and decide what should be modified in Solr 7 to preserve the Solr 4 behavior versus what should be updated to use new Solr 7 features.

Our first approach was to manually inspect the “schema.xml” in Solr 4 and compare it with the “managed-schema” file in Solr 7, which is also an XML file. We soon found that this was too cumbersome and error prone. The output of the LukeRequestHandler, however, turned out to be much more concise and easier to compare between the two versions of Solr, and luckily for us, its output is identical in both versions.

Using the LukeRequestHandler we dumped our Solr schemas to XML files and compared those files with a traditional file compare tool. We used the built-in file compare option in VS Code, but any file compare tool would do.

These are the commands that we used to dump the schema to XML files:

curl http://solr-4-url/admin/luke?numTerms=0 > luke4.xml
curl http://solr-7-url/admin/luke?numTerms=0 > luke7.xml

The output of the LukeRequestHandler includes both the type of each field (e.g. string) and its schema definition (single value vs multi-value, indexed, tokenized, et cetera):

<lst name="title_display">
  <str name="type">string</str>
  <str name="schema">--SD------------l</str>
</lst>

Another benefit of using the LukeRequestHandler instead of going by the fields defined in schema.xml is that the LukeRequestHandler only outputs fields that are indeed used in the Solr core, whereas schema.xml lists fields that were used at one point even if we don’t use them anymore.

ICUFoldingFilter

In Solr 4 a few of the default field types used the ICUFoldingFilter, which handles diacritics so that a word like “México” is treated as equivalent to “Mexico”. This filter used to be available by default in a Solr 4 installation but that is no longer the case: in Solr 7 the ICUFoldingFilter is not enabled by default and you must edit your solrconfig.xml as indicated in the documentation to enable it (see previous link).

<lib dir="../../../contrib/analysis-extras/lib" regex="icu4j.*\.jar" />
<lib dir="../../../contrib/analysis-extras/lucene-libs" regex="lucene-analyzers-icu.*\.jar" />

and then you can use it in a field type by adding it as a filter:

curl -X POST -H 'Content-type:application/json' --data-binary '{ "add-field-type" : {
    "name":"text_search",
    "class":"solr.TextField",
    "analyzer" : {
       "tokenizer":{"class":"solr.StandardTokenizerFactory"},
       "filters":[
         {"class":"solr.ICUFoldingFilterFactory"},
         ...
     ]
   }
 }
}' $SOLR_CORE_URL/schema

Handle Select

The handleSelect parameter is defined in solrconfig.xml; in previous versions of Solr it defaulted to true, but starting with Solr 7 it defaults to false. The version of Blacklight that we are using (5.19) expects this value to be true.

This parameter is what allows Blacklight to use a request handler name like “search” (without a leading slash) instead of “/search”. Enabling handleSelect is easy: just edit the requestDispatcher setting in solrconfig.xml:

<requestDispatcher handleSelect="true">

LocalParams and Dereferencing

Our current version of Blacklight uses LocalParams and Dereferencing heavily, and support for these two features changed drastically in Solr 7.2. This is a good enhancement in Solr, but it caught us by surprise.

The gist of the problem is that if the solrconfig.xml sets the query parser to DisMax or eDisMax then Solr will not recognize a query like this: 

{!qf=$title_qf}

We tried several workarounds and settled on setting the default parser (defType) in solrconfig.xml to Lucene and requesting eDisMax explicitly from the client application:

{!type=dismax qf=$title_qf}Coffee&df=id

It’s worth noting that passing defType as a normal query string parameter to change the parser did not work for us for queries that use LocalParams and Dereferencing.
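For reference, here is a sketch of how such a request can be sent with curl; the core URL is a placeholder and title_qf is assumed to be defined in solrconfig.xml:

# send the LocalParams query; the $title_qf reference is dereferenced by Solr, not by the shell
curl -G "http://localhost:8983/solr/your-core/select" \
  --data-urlencode 'q={!type=dismax qf=$title_qf}Coffee' \
  --data-urlencode 'df=id'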

Stop words

One of the settings that we changed in our new field definitions was the use of stop words: we no longer use stop words when indexing title fields. This was one of the benefits of doing a full review of each of our field types and tweaking them during the upgrade. The result is that searches for titles made up entirely of stop words (like “There there”) now return the expected results.
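As a rough sketch of the kind of change this involved, a title field type with no StopFilterFactory in its analysis chain can be created via the Schema API like this (the type name and filter list are illustrative, not our exact production definition):

curl -X POST -H 'Content-type:application/json' --data-binary '{ "add-field-type" : {
    "name":"text_title_search",
    "class":"solr.TextField",
    "analyzer" : {
       "tokenizer":{"class":"solr.StandardTokenizerFactory"},
       "filters":[
         {"class":"solr.ICUFoldingFilterFactory"}
       ]
    }
  }
}' $SOLR_CORE_URL/schema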

Validating Results

To validate that our new field definitions and server-side configuration in Solr 7 were compatible with what we had in Solr 4 we ran several kinds of tests, some of them manual and others automated.

We have a small suite of unit tests that Jeanette Norris and Ted Lawless created years ago and that we still use to validate some well-known scenarios that we want to support. You can see those “relevancy” tests in our GitHub repository.

We also captured thousands of live searches from our Discovery layer running Solr 4 and replayed them against Solr 7 to make sure that the results of both systems were compatible. To determine that results were compatible we counted how many of the top 10, top 5, and top 1 results were included in the results of both Solr instances. The following picture shows an example of what the results look like.

Search results comparison

The code that we used to run the searches on both Solr instances and generate the table is on our GitHub repo.
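The gist of the comparison can be sketched with a couple of curl calls. This simplified version assumes jq is installed, uses a made-up query, and only checks the top 10; our actual script does this for thousands of captured searches:

Q="virginia woolf"
curl -s -G "http://solr-4-url/select" --data-urlencode "q=$Q" -d "rows=10&fl=id&wt=json" \
  | jq -r '.response.docs[].id' | sort > ids4.txt
curl -s -G "http://solr-7-url/select" --data-urlencode "q=$Q" -d "rows=10&fl=id&wt=json" \
  | jq -r '.response.docs[].id' | sort > ids7.txt
# number of ids that appear in the top 10 of both Solr instances
comm -12 ids4.txt ids7.txt | wc -l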

CJK Searches

The main reason for us to upgrade from Solr 4 to Solr 7 was to add support for Chinese, Japanese, and Korean (CJK) searches. The way our Solr 4 index was built, we did not support searches in these languages. In our Solr 7 core we are using the built-in CJK field definitions and our results are much better. This will be the subject of a future blog post. Stay tuned.

New RIAMCO website

A few days ago we released a new version of the Rhode Island Archival and Manuscript Collections Online (RIAMCO) website. The new version is a brand new codebase. This post describes a few of the new features that we implemented as part of the rewrite and how we designed the system to support them.

The RIAMCO website hosts information about archival and manuscript collections in Rhode Island. These collections (also known as finding aids) are stored as XML files using the Encoded Archival Description (EAD) standard and indexed into Solr to allow for full text searching and filtering.

Look and feel

The overall look and feel of the RIAMCO site is heavily influenced by the work that the folks at the NYU Libraries did on their site. Like NYU’s site and Brown’s Discovery tool, the RIAMCO site uses the typical facets-on-the-left, content-on-the-right layout that is common in many library and archive websites.

Below is a screenshot of the main search page:

Search results

Architecture

Our previous site was put together over many years and involved several separate applications written in different languages: the frontend was written in PHP, the indexer in Java, and the admin tool in Python/Django. During this rewrite we bundled the code for the frontend and the indexer into a single application written in Ruby on Rails. [As of September 13th, 2019 the Rails application also provides the admin interface.]

You can view a diagram of this architecture and a few more notes about it in this document.

Indexing

As in the previous version of the site, we are using Solr to power the search feature. However, in the previous version each collection was indexed as a single Solr document, whereas in the new version we split each collection into many Solr documents: one document to store the main collection information (scope, biographical info, call number, et cetera), plus one document for each item in the collection’s inventory.

This new indexing strategy significantly increased the number of Solr documents that we store. We went from 1,100+ Solr documents (one for each collection) to 300,000+ Solr documents (one for each item in the inventory of those collections).

The advantage of this approach is that we can now search and find items at a much more granular level than we did before. For example, we can tell a user that we found a match on “Box HE-4 Folder 354” of the Harris Ephemera collection for their search on “blue moon”, rather than just telling them that there is a match somewhere in the 25 boxes (3,000 folders) of the “Harris Ephemera” collection.

In order to keep track of the relationship between all the Solr documents for a given collection we use an extra ead_id_s field to store the id of the collection that each document belongs to. If we have a collection “A” with three items in its inventory, they will have the following information in Solr:

{id: "A", ead_id_s: "A"} // the main collection record
{id: "A-1", ead_id_s: "A"} // item 1 in the inventory
{id: "A-2", ead_id_s: "A"} // item 2 in the inventory
{id: "A-3", ead_id_s: "A"} // item 3 in the inventory

This structure allows us to use the Result Grouping feature in Solr to group the results of a search by the collection they belong to. With this structure in place we can show the results grouped by collection, as you can see in the previous screenshot.
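A grouped query along these lines is what makes that possible; this is a hedged sketch rather than our exact request (the fl list and group.limit value are made up for illustration):

# group the matching documents by collection and return the top 3 matches per collection
curl -G "http://localhost:8983/solr/your-core/select" \
  --data-urlencode 'q=blue moon' \
  -d 'group=true&group.field=ead_id_s&group.limit=3&fl=id,ead_id_s'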

The code to index our EAD files into Solr is in the Ead class.

We had to add some extra logic to handle cases where a match is found only on a Solr document for an inventory item (but not on the main collection) so that we can also display the main collection information alongside the inventory information in the search results. The code for this is in the search_grouped() function of the Search class.

Hit highlighting

Another feature that we implemented on the new site is hit highlighting. Although this is a feature that Solr supports out of the box, we had to do some extra coding to structure the information in a way that makes sense to our users. In particular, things get tricky when the hit is found in a multi-value field or when Solr only returns a snippet of the original value in the highlighting results. The logic that we wrote to handle this is in the SearchItem class.
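At the Solr level this is the standard highlighting syntax; the sketch below uses made-up field names just to show the shape of the request, and the interesting work happens afterwards when we map the returned snippets back to the documents:

# ask Solr for up to 3 highlighted snippets per document from two (illustrative) fields
curl -G "http://localhost:8983/solr/your-core/select" \
  --data-urlencode 'q=blue moon' \
  -d 'hl=true&hl.fl=title_txt,scope_txt&hl.snippets=3&hl.fragsize=120'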

Advanced Search

We also did an overhaul of the Advanced Search feature. The layout of the page is very typical (it follows the style used in most Blacklight applications) but the code behind it allowed us to implement several new features. For example, we let the user select any value from the facets (not only one of the first 10 values for that facet) and select more than one value from those facets.
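One way to get the complete list of values for a facet (rather than only the first few) is to set facet.limit to -1; a request along these lines, with an example facet field name, illustrates the idea:

# rows=0 skips the documents themselves; facet.limit=-1 returns every value of the facet
curl -G "http://localhost:8983/solr/your-core/select" \
  -d 'q=*:*&rows=0&facet=true&facet.field=format_sim&facet.limit=-1&facet.sort=index'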

We also added a “Check” button to show the user what kind of Boolean expression would be generated for the query that they have entered. Below is a screenshot of the results of the check syntax for a sample query.

advanced search

There are several tweaks and optimizations that we would like to do on this page. For example, opening the facet by Format is quite slow and could be optimized, and the code to parse the expression could be rewritten to use a more standard tokenizer/parser structure. We’ll get to that later on… hopefully : )

Individual finding aids

Like on the previous version of the site, the rendering of individual finding aids is done by applying XSLT transformations to the XML with the finding aid data. We made a few tweaks to the XSLT to integrate them into the new site, but the vast majority of the transformations came as-is from the previous site. You can see the XSLT files in our GitHub repo.

It’s interesting that GitHub reports that half of the code for the new site is XSLT: 49% XSLT, 24% HTML, and 24% Ruby. Keep in mind that these numbers do not take into account the Ruby on Rails code (which is massive).

GitHub code stats

Source code

The source code for the new application is available in GitHub.

Acknowledgements

Although I wrote the code for the new site, plenty of people helped me along the way, in particular Karen Eberhart and Joe Mancino. Karen provided the specs for the new site, answered my many questions about the structure of EAD files, and suggested many improvements and tweaks to make the site better. Joe helped me find the code for the original site and indexer, and set up the environment for the new one.

Searching for hierarchical data in Solr

Recently I had to index into Solr a dataset in which the original items had a hierarchical relationship among them. While processing this data I took some time to look into the ancestor_path and descendent_path features that Solr provides out of the box, to see if, and how, they could help to issue searches based on the hierarchy of the data. This post elaborates on what I learned in the process.

Let’s start with some sample hierarchical data to illustrate the kind of relationship that I am describing in this post. Below is a short list of databases and programming languages organized by type.

Databases
  ├─ Relational
  │   ├─ MySQL
  │   └─ PostgreSQL
  └─ Document
      ├─ Solr
      └─ MongoDB
Programming Languages
  └─ Object Oriented
      ├─ Ruby
      └─ Python

For the purposes of this post I am going to index each individual item shown in the hierarchy, not just the leaf items. In other words I am going to create 11 Solr documents: one for “Databases”, another for “Relational”, another for “MySQL”, and so on.

Each document is saved with an id, a title, and a path. For example, the document for “Databases” is saved as:

{ 
  "id": "001", 
  "title_s": "Databases",
  "x_ancestor_path": "db",
  "x_descendent_path": "db" }

and the one for “MySQL” is saved as:

{ 
  "id": "003", 
  "title_s": "MySQL",
  "x_ancestor_path": "db/rel/mysql",
  "x_descendent_path": "db/rel/mysql" }

The x_ancestor_path and x_descendent_path fields in the JSON data represent the path of each of these documents in the hierarchy. For example, the top-level “Databases” document uses the path “db” whereas the lowest-level document “MySQL” uses “db/rel/mysql”. I am storing the exact same value in both fields so that later on we can see how each of them provides different features and addresses different use cases.
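For completeness, this is roughly how those sample documents can be posted to Solr (the core name and port are placeholders); only two of the eleven documents are shown:

$ curl -X POST -H 'Content-type:application/json' \
  "http://localhost:8983/solr/your-core/update?commit=true" --data-binary '[
  { "id": "001", "title_s": "Databases",
    "x_ancestor_path": "db", "x_descendent_path": "db" },
  { "id": "003", "title_s": "MySQL",
    "x_ancestor_path": "db/rel/mysql", "x_descendent_path": "db/rel/mysql" }
]'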

ancestor_path and descendent_path

The ancestor_path and descendent_path field types come predefined in Solr. Below is the definition of the descendent_path in a standard Solr 7 core:

$ curl http://localhost:8983/solr/your-core/schema/fieldtypes/descendent_path
{
  ...
  "indexAnalyzer":{
    "tokenizer":{ 
      "class":"solr.PathHierarchyTokenizerFactory", "delimiter":"/"}},
  "queryAnalyzer":{
    "tokenizer":{ 
      "class":"solr.KeywordTokenizerFactory"}}}}

Notice how it uses the PathHierarchyTokenizerFactory tokenizer when indexing values of this type and that it sets the delimiter property to /. This means that when values are indexed they will be split into individual tokens by this delimiter. For example the value “db/rel/mysql” will be split into “db”, “db/rel”, and “db/rel/mysql”. You can validate this in the Analysis Screen in the Solr Admin tool.
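If you prefer the command line over the Admin UI, the field analysis endpoint returns the same token breakdown, for example:

$ curl "http://localhost:8983/solr/your-core/analysis/field?analysis.fieldtype=descendent_path&analysis.fieldvalue=db/rel/mysql"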

The ancestor_path field type is the exact opposite: it uses the PathHierarchyTokenizerFactory at query time and the KeywordTokenizerFactory at index time.

There are also two dynamic field definitions *_descendent_path and *_ancestor_path that automatically create fields with these types. Hence the wonky x_descendent_path and x_ancestor_path field names that I am using in this demo.

Finding descendants

The descendent_path field definition in Solr can be used to find all the descendant documents in the hierarchy for a given path. For example, if I query for all documents where the descendant path is “db” (q=x_descendent_path:db) I should get all documents in the “Databases” hierarchy, but not the ones under “Programming Languages”:

$ curl "http://localhost:8983/solr/your-core/select?q=x_descendent_path:db&fl=id,title_s,x_descendent_path"
{
  ...
  "response":{"numFound":7,"start":0,"docs":[
  {
    "id":"001",
    "title_s":"Databases",
    "x_descendent_path":"db"},
  {
    "id":"002",
    "title_s":"Relational",
    "x_descendent_path":"db/rel"},
  {
    "id":"003",
    "title_s":"MySQL",
    "x_descendent_path":"db/rel/mysql"},
  {
    "id":"004",
    "title_s":"PostgreSQL",
    "x_descendent_path":"db/rel/pg"},
  {
    "id":"005",
    "title_s":"Document",
    "x_descendent_path":"db/doc"},
  {
    "id":"006",
    "title_s":"MongoDB",
    "x_descendent_path":"db/doc/mongo"},
  {
    "id":"007",
    "title_s":"Solr",
    "x_descendent_path":"db/doc/solr"}]
}}

Finding ancestors

The ancestor_path, not surprisingly, can be used to achieve the reverse: given the path of a document, we can query Solr to find all its ancestors in the hierarchy. For example, if I query Solr for the documents where x_ancestor_path is “db/doc/solr” (q=x_ancestor_path:db/doc/solr) I should get “Databases”, “Document”, and “Solr”, as shown below:

$ curl "http://localhost:8983/solr/your-core/select?q=x_ancestor_path:db/doc/solr&fl=id,title_s,x_ancestor_path"
{
  ...
  "response":{"numFound":3,"start":0,"docs":[
  {
    "id":"001",
    "title_s":"Databases",
    "x_ancestor_path":"db"},
  {
    "id":"005",
    "title_s":"Document",
    "x_ancestor_path":"db/doc"},
  {
    "id":"007",
    "title_s":"Solr",
    "x_ancestor_path":"db/doc/solr"}]
}}

If you are curious how this works internally, you can issue a query with debugQuery=true and look at how the query value “db/doc/solr” was parsed. Notice how Solr splits the query value by the / delimiter and uses something called SynonymQuery() to handle the individual values as synonyms:

$ curl "http://localhost:8983/solr/your-core/select?q=x_ancestor_path:db/doc/solr&debugQuery=true"
{
  ...
  "debug":{
    "rawquerystring":"x_ancestor_path:db/doc/solr",
    "parsedquery":"SynonymQuery(Synonym(x_ancestor_path:db x_ancestor_path:db/doc x_ancestor_path:db/doc/solr))",
...
}

One little gotcha

Given that Solr is splitting the path values by the / delimiter, and that we can see those values in the Analysis Screen (or when passing debugQuery=true), we might expect to be able to fetch those values from the document somehow. But that is not the case: the individual tokens are not stored in a way that lets you fetch them. There is no way for us to fetch the individual “db”, “db/doc”, and “db/doc/solr” values when fetching document id “007”. In hindsight this is standard Solr behavior, but it threw me off initially.

Understanding scoring of documents in Solr

During the development of our new Researchers@Brown front-end I spent a fair amount of time looking at the results that Solr returns when users execute searches. Although I have always known that Solr uses a sophisticated algorithm to determine why a particular document matches a search and why one document ranks higher in the search results than another, I had never looked very closely into the details of how this works.

This post is an introduction on how to interpret the ranking (scoring) that Solr reports for documents returned in a search.

Requesting Solr “explain” information

When submitting a search request to Solr it is possible to request debug information that clarifies how Solr interpreted the request, including how the score for each document was calculated for the given search terms.

To request this information you just need to pass debugQuery=true as a query string parameter to a normal Solr search request. For example:

$ curl "http://someserver/solr/collection1/select?q=alcohol&wt=json&debugQuery=true"

The response for this query will include debug information with a property called explain where the ranking of each of the documents is explained.

"debug": {
  ...
  "explain": {
    "id:1": "a lot of text here",
    "id:2": "a lot of text here",
    "id:3": "a lot of text here",
    "id:4": "a lot of text here",
    ...
  }
}

Raw “explain” output

Although Solr gives information about how the score for each document was calculated, the format that it uses to provide this information is horrendous. This is an example of what the explain information for a given document looks like:

"id:http://vivo.brown.edu/individual/12345":"\n
1.1542457 = (MATCH) max of:\n
  4.409502E-5 = (MATCH) weight(ALLTEXTUNSTEMMED:alcohol in 831) [DefaultSimilarity], result of:\n
    4.409502E-5 = score(doc=831,freq=6.0 = termFreq=6.0\n
), product of:\n
      2.2283042E-4 = queryWeight, product of:\n
        5.170344 = idf(docFreq=60, maxDocs=3949)\n
        4.3097796E-5 = queryNorm\n
      0.197886 = fieldWeight in 831, product of:\n
        2.4494898 = tf(freq=6.0), with freq of:\n
          6.0 = termFreq=6.0\n
        5.170344 = idf(docFreq=60, maxDocs=3949)\n
        0.015625 = fieldNorm(doc=831)\n
  4.27615E-5 = (MATCH) weight(ALLTEXT:alcohol in 831) [DefaultSimilarity], result of:\n
    4.27615E-5 = score(doc=831,freq=6.0 = termFreq=6.0\n), product of:\n
      2.1943514E-4 = queryWeight, product of:\n
        5.0915627 = idf(docFreq=65, maxDocs=3949)\n
        4.3097796E-5 = queryNorm\n
      0.1948708 = fieldWeight in 831, product of:\n
        2.4494898 = tf(freq=6.0), with freq of:\n
          6.0 = termFreq=6.0\n
        5.0915627 = idf(docFreq=65, maxDocs=3949)\n
        0.015625 = fieldNorm(doc=831)\n
  1.1542457 = (MATCH) weight(research_areas:alcohol^400.0 in 831) [DefaultSimilarity], result of:\n
    1.1542457 = score(doc=831,freq=1.0 = termFreq=1.0\n
), product of:\n
      0.1410609 = queryWeight, product of:\n
        400.0 = boost\n
        8.182606 = idf(docFreq=2, maxDocs=3949)\n
        4.3097796E-5 = queryNorm\n
      8.182606 = fieldWeight in 831, product of:\n
        1.0 = tf(freq=1.0), with freq of:\n
          1.0 = termFreq=1.0\n
        8.182606 = idf(docFreq=2, maxDocs=3949)\n
        1.0 = fieldNorm(doc=831)\n",

It’s unfortunate that the information comes in a format that is not easy to parse, but since it’s plain text we can read it and analyze it.

Explaining “explain” information

If you look closely at the information in the previous text you’ll notice that Solr reports the score of a document as the maximum value (max of) of a set of other scores. For example, below is a simplified version of the text (the ellipses represent text that I suppressed):

"id:http://vivo.brown.edu/individual/12345":"\n
1.1542457 = (MATCH) max of:\n
  4.409502E-5 = (MATCH) weight(ALLTEXTUNSTEMMED:alcohol in 831) ...
  ...
  4.27615E-5 = (MATCH) weight(ALLTEXT:alcohol in 831) ...
  ...
  1.1542457 = (MATCH) weight(research_areas:alcohol^400.0 in 831) ...
  ..."

In this example, the score of 1.1542457 for the document with id http://vivo.brown.edu/individual/12345 was the maximum of three scores (4.409502E-5, 4.27615E-5, 1.1542457). Notice that the scores are in E-notation. If you look closely you’ll see that each of those scores is associated with a different field in Solr where the search term, alcohol, was found.

From the text above we can determine that the term alcohol was found in the ALLTEXTUNSTEMMED, ALLTEXT, and research_areas fields. Even more, we can also tell that for this particular search we are giving the research_areas field a boost of 400, which explains why that particular score was much higher than the rest.

The information that I omitted in the previous example provides a more granular explanation of how each of those individual field scores was calculated. For example, below is the detail for the research_areas field:

1.1542457 = (MATCH) weight(research_areas:alcohol^400.0 in 831) [DefaultSimilarity], result of:\n
  1.1542457 = score(doc=831,freq=1.0 = termFreq=1.0\n), product of:\n
    0.1410609 = queryWeight, product of:\n
      400.0 = boost\n
      8.182606 = idf(docFreq=2, maxDocs=3949)\n
      4.3097796E-5 = queryNorm\n
    8.182606 = fieldWeight in 831, product of:\n
      1.0 = tf(freq=1.0), with freq of:\n
        1.0 = termFreq=1.0\n
      8.182606 = idf(docFreq=2, maxDocs=3949)\n
      1.0 = fieldNorm(doc=831)\n",

Again, if we look closely at this text we see that the score of 1.1542457 for the research_areas field was the product of two other factors (0.1410609 x 8.182606). There is even information about how these individual factors were calculated. I will not go into the details in this blog post, but if you are interested this is a good place to start.

Another interesting clue that Solr provides in the explain information is what values were searched for in each field. For example, if I search for the word alcoholism (instead of alcohol) the Solr explain result would show that in one of the fields it used the stemmed version of the search term and in another it used the original text. In our example, this would look more or less like this:

"id:http://vivo.brown.edu/individual/12345":"\n
1.1542457 = (MATCH) max of:\n
  4.409502E-5 = (MATCH) weight(ALLTEXTUNSTEMMED:alcoholism in 831) ...
  ...
  4.27615E-5 = (MATCH) weight(ALLTEXT:alcohol in 831) ...
  ...

Notice how in the unstemmed field (ALLTEXTUNSTEMMED) Solr used the exact word searched (alcoholism) whereas in the stemmed field (ALLTEXT) it used the stemmed version (alcohol). This is very useful to know if you are wondering why a value was (or was not) found in a given field. Likewise, if you are using (query time) synonyms, those will show up in the Solr explain results.

Live examples

In our new Researchers@Brown site we have an option to show the explain information from Solr. This option is intended for developers to troubleshoot tricky queries, not for the average user.

For example, if you pass explain=text to a search URL in the site you’ll get the text of the Solr explain output formatted for each of the results (scroll to the very bottom of the page to see the explain results).

Likewise, if you pass explain=matches to a search URL the response will include only the values of the matches that Solr evaluated (along with the field and boost value) for each document.

Source code

If you are interested in the code that we use to parse the Solr explain results you can find it in our GitHub repo. The code for this lives in two classes: Explainer and ExplainerEntry.

Explainer takes a Solr response and creates an array with the explain information for each result. This array is made up of ExplainEntry objects that in turn parse each of the results to make the match information easily accessible. Keep in mind that this code does mostly string parsing and is therefore rather brittle. For example, the code to extract the matches for a given document is as follows:

class ExplainEntry
  ...
  def get_matches(text)
    lines = text.split("\n")
    lines.select {|l| l.include?("(MATCH)") || l.include?("coord(")}
  end
end

As you can imagine, if Solr changes the structure of the text that it returns in the explain results this code will break. I get the impression that this format (as ugly as it is) has been stable across many versions of Solr, so hopefully we won’t have many issues with this implementation.

Using synonyms in Solr

A few days ago somebody reported that our catalog returns different results if a user searches for “music for the hundred years war” than if the user searches for “music for the 100 years war”.

To handle this issue I decided to use the synonyms feature in Solr. My thought was to tell Solr that “100” and “hundred” are synonyms and should be treated as such. I had seen a synonyms.txt file in the Solr configuration folder and I thought it was just a matter of adding a few lines to this file and voilà, synonyms would kick in. It turns out using synonyms in Solr is a bit more complicated than that: not too complicated, but not as straightforward as I had thought.

Configuring synonyms in Solr

To configure Solr to use synonyms you need to add a filter to the field type where you want synonyms to be used. For example, to enable synonyms for the text field in Solr I added a filter using the SynonymFilterFactory in our schema.xml:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
 <analyzer type="index">
   <tokenizer class="solr.StandardTokenizerFactory"/>
   <filter class="solr.ICUFoldingFilterFactory" />
   <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
   <filter class="solr.SnowballPorterFilterFactory" language="English" />
 </analyzer>
 <analyzer type="query">
   <tokenizer class="solr.StandardTokenizerFactory"/>
   <filter class="solr.ICUFoldingFilterFactory" />
   <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
   <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
   <filter class="solr.SnowballPorterFilterFactory" language="English" />
 </analyzer>
</fieldType>

You can add this filter for indexing, for querying, or both. In the example above I am only configuring the use of synonyms at query time.

Notice how the SynonymFilterFactory references a synonyms.txt file. This text file is where the synonyms are defined. Notice also the expand="true" setting.

The synonyms.txt file accepts synonyms in two formats. The first format is just a comma-separated list of words that are considered synonyms, for example:

 100,hundred

With this format, every time Solr sees “100” or “hundred” in a value it will automatically expand the value to include both terms. For example, if we search for “music for the hundred years war” Solr will actually search for “music for the 100 hundred years war”; notice how it now includes both variations (100 and hundred) in the text to search. The same is true if we search for “music for the 100 years war”: Solr will search for both variations.

A second format we can use to configure synonyms uses the => operator to map one or more terms to a different term, for example:

 100 => hundred

With this format, every time Solr sees “100” it will replace it with “hundred”. For example, if we search for “music for the 100 years war” Solr will search for “music for the hundred years war”. Notice that in this case Solr includes “hundred” but drops “100”. The => syntax in synonyms.txt overrides the expand="true" setting and replaces the values on the left with the values on the right side.

Testing synonym matching in Solr

To see how synonyms are applied you can use the “Analysis” option available on the Solr dashboard page.

The following picture shows how this tool can be used to verify how Solr is handling synonyms at index time. Notice, in the highlighted rectangle, how “hundred” was indexed as both “hundred” and “100”.

Solr analysis screen (index)

We can also use this tool to see how values are handled at query time. The following picture shows how a query for “music for the 100 years war” is handled and matched to an original text of “music for the hundred years war”. In this particular case synonyms are enabled in the Solr configuration only at query time, which explains why the indexed value (on the left side) only has “hundred” but the value used at query time has been expanded to include both “100” and “hundred”, which results in a match.

Solr analysis screen (query)

Index vs Query time

When configuring synonyms in Solr it is important to consider the advantages and disadvantages of applying them at index time, query time, or both.

Using synonyms at query time is easy because you don’t have to change your index to add or remove synonyms. You just add/remove lines from the synonyms.txt file, restart your Solr core, and the synonyms are applied in subsequent searches.
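Reloading the core via the CoreAdmin API is one way to pick up changes to synonyms.txt without restarting Solr itself (the core name below is a placeholder):

# reload the core so analyzers re-read synonyms.txt
curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=your-core"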

However, there are some benefits to applying synonyms at index time, particularly when you want to handle multi-term synonyms. This blog post by John Berryman and this page in the Apache Solr documentation give a good explanation of why multi-term synonyms are tricky and why applying synonyms at index time might be a good idea. An obvious disadvantage of applying synonyms at index time is that you need to reindex your data for changes to synonyms.txt to take effect.

Solr LocalParams and dereferencing

A few months ago, at the Blacklight Summit, I learned that Blacklight defines certain settings in solrconfig.xml to serve as shortcuts for a group of fields with different boost values. For example, in our Blacklight installation we have a setting for author_qf that references four specific author fields with different boost values.

<str name="author_qf">
  author_unstem_search^200
  author_addl_unstem_search^50
  author_t^20
  author_addl_t
</str>

In this case author_qf is a shortcut that we use when issuing searches by author. By referencing author_qf in our request to Solr we don’t have to list all four author fields (author_unstem_search, author_addl_unstem_search, author_t, and author_addl_t) and their boost values; Solr is smart enough to use those four fields when it sees author_qf in the query. You can see the exact definition of this field in our GitHub repository.

Although the Blacklight project talks about this feature in its documentation page, and our Blacklight instance takes advantage of it via the Blacklight Advanced Search plugin, I had never quite understood how this works internally in Solr.

LocalParams

Turns out Blacklight takes advantage of a feature in Solr called LocalParams. This feature allows us to customize individual values for a parameter on each request:

LocalParams stands for local parameters: they provide a way to “localize” information about a specific argument that is being sent to Solr. In other words, LocalParams provide a way to add meta-data to certain argument types such as query strings. https://wiki.apache.org/solr/LocalParams

The syntax for LocalParams is p={! k=v } where p is the parameter to localize, k is the setting to customize, and v is the value for the setting. For example, the following

q={! qf=author}jane

uses LocalParams to customize the q parameter of a search. In this case it forces the query fields (qf) parameter to use the author field when searching for “jane”.

Dereferencing

When using LocalParams you can also use dereferencing to tell the parser to use an already defined value as the value for a local parameter. The following example shows how to use the already defined author_qf value as the value for qf in the LocalParams. Notice how the value is prefixed with a dollar sign to indicate dereferencing:

q={! qf=$author_qf}jane

When Solr sees the $author_qf it replaces it with the four author fields that we defined for it and sets the qf parameter to use the four author fields.

You can see how Solr handles dereferencing if you pass debugQuery=true to your Solr query and inspect the debug.parsedquery in the response. The previous query would return something along the lines of

(+DisjunctionMaxQuery(
    (
    author_t:jane^20.0 |
    author_addl_t:jane |
    author_addl_unstem_search:jane^50.0 |
    author_unstem_search:jane^200.0
    )~0.01
  )
)/no_coord

Notice how Solr dereferenced (i.e. expanded) author_qf to the four author fields that we have configured in our solrconfig.xml with the corresponding boost values.

It’s worth noting that dereferencing only works if you use the eDisMax parser in Solr.

There are several advantages to using this Solr feature. One is that your queries are a bit shorter, since we pass an alias (author_qf) rather than all four fields and their boost values, which makes the query easier to read. The second advantage is that you can change the definition of author_qf on the server (say, to include a new author field in your Solr index) and client applications will automatically use the new definition when they reference author_qf.