Understanding scoring of documents in Solr – Brown University Library Digital Technologies

During the development of our new Researchers@Brown front-end I spent a fair amount of time looking at the results that Solr gives when users execute searches. Although I have always known that Solr uses a sophisticated algorithm to determine why a particular document matches a search and why it ranks higher in the search results than others, I had never looked closely at the details of how this works.

This post is an introduction to interpreting the ranking (scoring) information that Solr reports for the documents returned in a search.

Requesting Solr “explain” information

When submitting a search request to Solr, it is possible to ask for debug information that clarifies how Solr interpreted the client request and how the score for each document was calculated for the given search terms.

To request this information you just need to pass debugQuery=true as a query string parameter to a normal Solr search request. For example:

$ curl "http://someserver/solr/collection1/select?q=alcohol&wt=json&debugQuery=true"
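The same request can be put together from Ruby. This is only a minimal sketch using the standard library: the helper name solr_debug_url is my own invention, and the host and collection are the placeholder values from the curl example above.

```ruby
require "uri"
require "net/http"

# Hypothetical helper (not part of any Solr client library): builds a
# select URL with debugQuery turned on. base_url is the select endpoint
# of the collection; the host used below is a placeholder.
def solr_debug_url(base_url, query)
  uri = URI(base_url)
  # encode_www_form escapes the search terms for use in a query string.
  uri.query = URI.encode_www_form(q: query, wt: "json", debugQuery: "true")
  uri
end

uri = solr_debug_url("http://someserver/solr/collection1/select", "alcohol")
puts uri

# To actually run the search (requires a reachable Solr server):
# body = Net::HTTP.get(uri)
```

Building the query string with URI.encode_www_form rather than by hand keeps multi-word searches safe to send.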

The response for this query will include debug information with a property called explain where the ranking of each of the documents is explained.

"debug": {
  ...
  "explain": {
    "id:1": "a lot of text here",
    "id:2": "a lot of text here",
    "id:3": "a lot of text here",
    "id:4": "a lot of text here",
    ...
  }
}
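Once the response is parsed, the explain text for each document is just a string keyed by document id. A rough sketch of pulling it out in Ruby follows; the method name explain_by_document and the abbreviated sample body are assumptions made for illustration.

```ruby
require "json"

# Returns the "explain" hash (document id => explain text) from a Solr
# JSON response body, or an empty hash when no debug section is present.
def explain_by_document(response_body)
  JSON.parse(response_body).dig("debug", "explain") || {}
end

# Abbreviated stand-in for a real response body.
sample = '{"debug": {"explain": {"id:1": "a lot of text here", "id:2": "a lot of text here"}}}'

explain_by_document(sample).each do |doc_id, text|
  puts "#{doc_id}: #{text.length} characters of explain text"
end
```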

Raw “explain” output

Although Solr gives information about how the score for each document was calculated, the format that it uses to provide this information is horrendous. This is an example of what the explain information for a given document looks like:

"id:http://vivo.brown.edu/individual/12345":"\n
1.1542457 = (MATCH) max of:\n
  4.409502E-5 = (MATCH) weight(ALLTEXTUNSTEMMED:alcohol in 831) [DefaultSimilarity], result of:\n
    4.409502E-5 = score(doc=831,freq=6.0 = termFreq=6.0\n), product of:\n
      2.2283042E-4 = queryWeight, product of:\n
        5.170344 = idf(docFreq=60, maxDocs=3949)\n
        4.3097796E-5 = queryNorm\n
      0.197886 = fieldWeight in 831, product of:\n
        2.4494898 = tf(freq=6.0), with freq of:\n
          6.0 = termFreq=6.0\n
        5.170344 = idf(docFreq=60, maxDocs=3949)\n
        0.015625 = fieldNorm(doc=831)\n
  4.27615E-5 = (MATCH) weight(ALLTEXT:alcohol in 831) [DefaultSimilarity], result of:\n
    4.27615E-5 = score(doc=831,freq=6.0 = termFreq=6.0\n), product of:\n
      2.1943514E-4 = queryWeight, product of:\n
        5.0915627 = idf(docFreq=65, maxDocs=3949)\n
        4.3097796E-5 = queryNorm\n
      0.1948708 = fieldWeight in 831, product of:\n
        2.4494898 = tf(freq=6.0), with freq of:\n
          6.0 = termFreq=6.0\n
        5.0915627 = idf(docFreq=65, maxDocs=3949)\n
        0.015625 = fieldNorm(doc=831)\n
  1.1542457 = (MATCH) weight(research_areas:alcohol^400.0 in 831) [DefaultSimilarity], result of:\n
    1.1542457 = score(doc=831,freq=1.0 = termFreq=1.0\n), product of:\n
      0.1410609 = queryWeight, product of:\n
        400.0 = boost\n
        8.182606 = idf(docFreq=2, maxDocs=3949)\n
        4.3097796E-5 = queryNorm\n
      8.182606 = fieldWeight in 831, product of:\n
        1.0 = tf(freq=1.0), with freq of:\n
          1.0 = termFreq=1.0\n
        8.182606 = idf(docFreq=2, maxDocs=3949)\n
        1.0 = fieldNorm(doc=831)\n",

It’s unfortunate that the information comes in a format that is not easy to parse, but since it’s plain text, we can read it and analyze it.

Explaining “explain” information

If you look closely at the previous text you’ll notice that Solr reports the score of a document as the maximum value (max of) from a set of other scores. For example, below is a simplified version of that text (the ellipses represent text that I suppressed):

"id:http://vivo.brown.edu/individual/12345":"\n
1.1542457 = (MATCH) max of:\n
  4.409502E-5 = (MATCH) weight(ALLTEXTUNSTEMMED:alcohol in 831) ...
  ...
  4.27615E-5 = (MATCH) weight(ALLTEXT:alcohol in 831) ...
  ...
  1.1542457 = (MATCH) weight(research_areas:alcohol^400.0 in 831) ...
  ..."

In this example, the score of 1.1542457 for the document with id http://vivo.brown.edu/individual/12345 was the maximum of three scores (4.409502E-5, 4.27615E-5, 1.1542457). Notice that the two smaller scores are in E-notation (4.409502E-5 is 0.00004409502). If you look closely you’ll see that each of those scores is associated with a different field in Solr where the search term, alcohol, was found.

From the text above we can determine that the term alcohol was found in the ALLTEXTUNSTEMMED, ALLTEXT, and research_areas fields. We can also tell that for this particular search we are giving the research_areas field a boost of 400, which explains why that score was so much higher than the rest.
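To make the "max of" reading concrete, here is a small Ruby sketch that pulls the per-field scores out of the explain text and picks the winner. The regular expression and the names WEIGHT_LINE and field_scores are my own; they assume the single-term weight(field:term^boost in doc) layout shown above and nothing more.

```ruby
# Matches lines such as:
#   1.1542457 = (MATCH) weight(research_areas:alcohol^400.0 in 831) ...
# capturing the score, the field, the term, and the optional boost.
WEIGHT_LINE = /^\s*(\S+) = \(MATCH\) weight\((\w+):([^\s^]+)(?:\^([\d.]+))? in \d+\)/

# Returns one hash per field-level match found in the explain text.
# boost is nil when the line does not show one.
def field_scores(explain_text)
  explain_text.each_line.map do |line|
    m = WEIGHT_LINE.match(line)
    next unless m
    { score: Float(m[1]), field: m[2], term: m[3], boost: m[4] && Float(m[4]) }
  end.compact
end

# Simplified explain text, with real newlines as found after JSON parsing.
sample = <<~TEXT
  1.1542457 = (MATCH) max of:
    4.409502E-5 = (MATCH) weight(ALLTEXTUNSTEMMED:alcohol in 831) [DefaultSimilarity], result of:
    4.27615E-5 = (MATCH) weight(ALLTEXT:alcohol in 831) [DefaultSimilarity], result of:
    1.1542457 = (MATCH) weight(research_areas:alcohol^400.0 in 831) [DefaultSimilarity], result of:
TEXT

best = field_scores(sample).max_by { |match| match[:score] }
puts "#{best[:field]} wins with #{best[:score]} (boost: #{best[:boost]})"
```

Float() understands E-notation, so the small scores compare correctly against the large one.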

The information that I omitted in the previous example provides a more granular explanation of how each of those individual field scores was calculated. For example, below is the detail for the research_areas field:

1.1542457 = (MATCH) weight(research_areas:alcohol^400.0 in 831) [DefaultSimilarity], result of:\n
  1.1542457 = score(doc=831,freq=1.0 = termFreq=1.0\n), product of:\n
    0.1410609 = queryWeight, product of:\n
      400.0 = boost\n
      8.182606 = idf(docFreq=2, maxDocs=3949)\n
      4.3097796E-5 = queryNorm\n
    8.182606 = fieldWeight in 831, product of:\n
      1.0 = tf(freq=1.0), with freq of:\n
        1.0 = termFreq=1.0\n
      8.182606 = idf(docFreq=2, maxDocs=3949)\n
      1.0 = fieldNorm(doc=831)\n",

Again, if we look closely at this text we see that the score of 1.1542457 for the research_areas field was the product of two other factors, queryWeight and fieldWeight (0.1410609 x 8.182606). There is even information about how these individual factors were calculated. I will not go into details on them in this blog post but if you are interested this is a good place to start.
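The arithmetic can be checked by hand. The sketch below multiplies the factors exactly as the explain text nests them, using the values copied from the research_areas output. The classic_tf and classic_idf helpers are my assumption of how Lucene's classic TF-IDF similarity derives those two factors (square root of the term frequency, and 1 + ln(maxDocs / (docFreq + 1))); they are not taken from the explain output itself, although they do reproduce the numbers in it.

```ruby
# queryWeight is reported as the product of boost, idf and queryNorm.
def query_weight(boost, idf, query_norm)
  boost * idf * query_norm
end

# fieldWeight is reported as the product of tf, idf and fieldNorm.
def field_weight(tf, idf, field_norm)
  tf * idf * field_norm
end

# Assumption: classic Lucene TF-IDF definitions of tf and idf.
def classic_tf(freq)
  Math.sqrt(freq)
end

def classic_idf(doc_freq, max_docs)
  1.0 + Math.log(max_docs.to_f / (doc_freq + 1))
end

qw = query_weight(400.0, 8.182606, 4.3097796e-5) # reported: 0.1410609
fw = field_weight(1.0, 8.182606, 1.0)            # reported: 8.182606
puts format("queryWeight=%.7f fieldWeight=%.6f score=%.6f", qw, fw, qw * fw)

puts classic_tf(6.0)      # compare with tf(freq=6.0) = 2.4494898
puts classic_idf(2, 3949) # compare with idf(docFreq=2, maxDocs=3949) = 8.182606
```

The results agree with Solr's numbers only to about six or seven significant digits, since the values in the explain text are themselves rounded.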

Another interesting clue that Solr provides in the explain information is which values were searched for in a given field. For example, if I search for the word alcoholism (instead of alcohol) the Solr explain result would show that in one of the fields it used the stemmed version of the search term and in another it used the original text. In our example, this would look more or less like this:

"id:http://vivo.brown.edu/individual/12345":"\n
1.1542457 = (MATCH) max of:\n
  4.409502E-5 = (MATCH) weight(ALLTEXTUNSTEMMED:alcoholism in 831) ...
  ...
  4.27615E-5 = (MATCH) weight(ALLTEXT:alcohol in 831) ...
  ...

Notice how in the unstemmed field (ALLTEXTUNSTEMMED) Solr used the exact word searched (alcoholism) whereas in the stemmed field (ALLTEXT) it used the stemmed version (alcohol). This is very useful to know if you are wondering why a value was (or was not) found in a given field. Likewise, if you are using (query time) synonyms those will show up in the Solr explain results.
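A quick way to see which term Solr actually searched for in each field is to scan the explain text for the field:term pairs. The following is a sketch under the assumption of a single-term query; the method name searched_terms is hypothetical, and a phrase query or a field that appears more than once would need more careful handling.

```ruby
# Returns a hash of field name => term that Solr searched for in that field.
# Assumes each field appears once and each term has no spaces in it.
def searched_terms(explain_text)
  explain_text.scan(/weight\((\w+):([^\s^)]+)/).to_h
end

sample = <<~TEXT
  1.1542457 = (MATCH) max of:
    4.409502E-5 = (MATCH) weight(ALLTEXTUNSTEMMED:alcoholism in 831) ...
    4.27615E-5 = (MATCH) weight(ALLTEXT:alcohol in 831) ...
TEXT

searched_terms(sample).each do |field, term|
  puts "#{field} searched for #{term}"
end
```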

Live examples

In our new Researchers@Brown site we have an option to show the explain information from Solr. This option is intended for developers to troubleshoot tricky queries, not for the average user.

For example, if you pass explain=text to a search URL in the site you’ll get the text of the Solr explain output formatted for each of the results (scroll to the very bottom of the page to see the explain results).

Likewise, if you pass explain=matches to a search URL the response will include only the values of the matches that Solr evaluated (along with the field and boost value) for each document.

Source code

If you are interested in the code that we use to parse the Solr explain results you can find it in our GitHub repo. The code for this lives in two classes: Explainer and ExplainEntry.

Explainer takes a Solr response and creates an array with the explain information for each result. This array is made up of ExplainEntry objects that in turn parse each of the results to make the match information easily accessible. Keep in mind that this code does mostly string parsing and is therefore rather brittle. For example, the code to extract the matches for a given document is as follows:

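class ExplainEntry
  ...
  # Returns the lines of the explain text that describe a match:
  # those flagged with "(MATCH)" plus any "coord(" factor lines.
  def get_matches(text)
    lines = text.split("\n")
    lines.select { |l| l.include?("(MATCH)") || l.include?("coord(") }
  end
end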

As you can imagine, if Solr changes the structure of the text that it returns in the explain results this code will break. I get the impression that this format (as ugly as it is) has been stable across many versions of Solr, so hopefully we won’t have many issues with this implementation.