Searching for hierarchical data in Solr – Brown University Library Digital Technologies

Recently I had to index a dataset into Solr in which the original items had a hierarchical relationship among them. In processing this data I took some time to look into the ancestor_path and descendent_path features that Solr provides out of the box and see if and how they could help to issue searches based on the hierarchy of the data. This post elaborates on what I learned in the process.

Let’s start with some sample hierarchical data to illustrate the kind of relationship that I am describing in this post. Below is a short list of databases and programming languages organized by type.

Databases
  ├─ Relational
  │   ├─ MySQL
  │   └─ PostgreSQL
  └─ Document
      ├─ Solr
      └─ MongoDB
Programming Languages
  └─ Object Oriented
      ├─ Ruby
      └─ Python

For the purposes of this post I am going to index each individual item shown in the hierarchy, not just the children items. In other words I am going to create 11 Solr documents: one for “Databases”, another for “Relational”, another for “MySQL”, and so on.

Each document is saved with an id, a title, and a path. For example, the document for “Databases” is saved as:

{ 
  "id": "001", 
  "title_s": "Databases",
  "x_ancestor_path": "db",
  "x_descendent_path": "db" }

and the one for “MySQL” is saved as:

{ 
  "id": "003", 
  "title_s": "MySQL",
  "x_ancestor_path": "db/rel/mysql",
  "x_descendent_path": "db/rel/mysql" }

The x_ancestor_path and x_descendent_path fields in the JSON data represent the path for each of these documents in the hierarcy. For example, the top level “Databases” document uses the path “db” where the lowest level document “MySQL” uses “db/rel/mysql”. I am storing the exact same value on both fields so that later on we can see how each of them provides different features and addresses different use cases.

ancestor_path and descendent_path

The ancestor_path and descendent_path field types come predefined in Solr. Below is the definition of the descendent_path in a standard Solr 7 core:

$ curl http://localhost:8983/solr/your-core/schema/fieldtypes/descendent_path
{
  ...
  "indexAnalyzer":{
    "tokenizer":{ 
      "class":"solr.PathHierarchyTokenizerFactory", "delimiter":"/"}},
  "queryAnalyzer":{
    "tokenizer":{ 
      "class":"solr.KeywordTokenizerFactory"}}}}

Notice how it uses the PathHierarchyTokenizerFactory tokenizer when indexing values of this type and that it sets the delimiter property to /. This means that when values are indexed they will be split into individual tokens by this delimiter. For example the value “db/rel/mysql” will be split into “db”, “db/rel”, and “db/rel/mysql”. You can validate this in the Analysis Screen in the Solr Admin tool.

The ancestor_path field is the exact opposite, it uses the PathHierarchyTokenizerFactory at query time and the KeywordTokenizerFactory at index time.

There are also two dynamic field definitions *_descendent_path and *_ancestor_path that automatically create fields with these types. Hence the wonky x_descendent_path and x_ancestor_path field names that I am using in this demo.

Finding descendants

The descendent_path field definition in Solr can be used to find all the descendant documents in the hierarchy for a given path. For example, if I query for all documents where the descendant path is “db” (q=x_descendent_path:db) I should get all document in the “Databases” hierarchy, but not the ones under “Programming Languages”. For example:

$ curl "http://localhost:8983/solr/your-core/select?q=x_descendent_path:db&fl=id,title_s,x_descendent_path"
{
  ...
  "response":{"numFound":7,"start":0,"docs":[
  {
    "id":"001",
    "title_s":"Databases",
    "x_descendent_path":"db"},
  {
    "id":"002",
    "title_s":"Relational",
    "x_descendent_path":"db/rel"},
  {
    "id":"003",
    "title_s":"MySQL",
    "x_descendent_path":"db/rel/mysql"},
  {
    "id":"004",
    "title_s":"PostgreSQL",
    "x_descendent_path":"db/rel/pg"},
  {
    "id":"005",
    "title_s":"Document",
    "x_descendent_path":"db/doc"},
  {
    "id":"006",
    "title_s":"MongoDB",
    "x_descendent_path":"db/doc/mongo"},
  {
    "id":"007",
    "title_s":"Solr",
    "x_descendent_path":"db/doc/solr"}]
}}

Finding ancestors

The ancestor_path not surprisingly can be used to achieve the reverse. Given the path of a given document we can query Solr to find all its ancestors in the hierarchy. For example if I query Solr for the documents where x_ancestor_path is “db/doc/solr” (q=x_ancestor_path:db/doc/solr) I should get “Databases”, “Document”, and “Solr” as shown below:

$ curl "http://localhost:8983/solr/your-core/select?q=x_ancestor_path:db/doc/solr&fl=id,title_s,x_ancestor_path"
{
  ...
  "response":{"numFound":3,"start":0,"docs":[
  {
    "id":"001",
    "title_s":"Databases",
    "x_ancestor_path":"db"},
  {
    "id":"005",
    "title_s":"Document",
    "x_ancestor_path":"db/doc"},
  {
    "id":"007",
    "title_s":"Solr",
    "x_ancestor_path":"db/doc/solr"}]
}}

If you are curious how this works internally, you could issue a query with debugQuery=true and look at how the query value “db/doc/solr” was parsed. Notice how Solr splits the query value by the / delimiter and uses something called SynonymQuery() to handle the individual values as synonyms:

$ curl "http://localhost:8983/solr/your-core/select?q=x_ancestor_path:db/doc/solr&debugQuery=true"
{
  ...
  "debug":{
    "rawquerystring":"x_ancestor_path:db/doc/solr",
    "parsedquery":"SynonymQuery(Synonym(x_ancestor_path:db x_ancestor_path:db/doc x_ancestor_path:db/doc/solr))",
...
}

One little gotcha

Given that Solr is splitting the path values by the / delimiter and that we can see those values in the Analysis Screen (or when passing debugQuery=true) we might expect to be able to fetch those values from the document somehow. But that is not the case. The individual tokens are not stored in a way that you can fetch them, i.e. there is no way for us to fetch the individual “db”, “db/doc”, and “db/doc/solr” values when fetching document id “007”. In hindsight this is standard Solr behavior but something that threw me off initially.