As I mentioned in a previous post, we use Phusion Passenger as the application server to host our Ruby applications. A while ago, upon the recommendation of my coworker Ben Cail, I created a cron job that calls passenger-status every 5 minutes to log the status of Passenger on our servers. Below is a sample of the passenger-status output:
Version : 5.1.12
Date    : Mon Jul 30 10:42:54 -0400 2018
Instance: 8x6dq9uX (Apache/2.2.15 (Unix) DAV/2 Phusion_Passenger/5.1.12)

----------- General information -----------
Max pool size : 6
App groups    : 1
Processes     : 6
Requests in top-level queue : 0

----------- Application groups -----------
/path/to/our/app:
  App root: /path/to/our/app
  Requests in queue: 3
  * PID: 43810   Sessions: 1   Processed: 20472   Uptime: 1d 7h 31m 25s
    CPU: 0%      Memory  : 249M    Last used: 1s ago
  * PID: 2628    Sessions: 1   Processed: 1059    Uptime: 4h 34m 39s
    CPU: 0%      Memory  : 138M    Last used: 1s ago
  * PID: 2838    Sessions: 1   Processed: 634     Uptime: 4h 30m 47s
    CPU: 0%      Memory  : 134M    Last used: 1s ago
  * PID: 16836   Sessions: 1   Processed: 262     Uptime: 2h 14m 46s
    CPU: 0%      Memory  : 160M    Last used: 1s ago
  * PID: 27431   Sessions: 1   Processed: 49      Uptime: 25m 27s
    CPU: 0%      Memory  : 119M    Last used: 0s ago
  * PID: 27476   Sessions: 1   Processed: 37      Uptime: 25m 0s
    CPU: 0%      Memory  : 117M    Last used: 0s ago
Our cron job to log this information over time is something like this:
/path/to/.gem/gems/passenger-5.1.12/bin/passenger-status >> ./logs/passenger_status.log
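The crontab entry itself is not shown above, but with the five-minute interval mentioned earlier it would look something like the following sketch (same placeholder paths as above):

*/5 * * * * /path/to/.gem/gems/passenger-5.1.12/bin/passenger-status >> ./logs/passenger_status.log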
Last week we had some issues in which our production server was experiencing short outages. Upon review we noticed that we were receiving an unusual amount of traffic (most of it from crawlers submitting bad requests). One of the tools that we used to check the status of our server was the passenger_status.log file created via the aforementioned cron job.
The key piece of information that we use is the “Requests in queue” value shown in the output above. We parsed this value out of the passenger_status.log file to see how it changed over the last 30 days. The result showed that although we have had a couple of outages recently, the number of “requests in queue” increased dramatically about two weeks ago and had stayed high ever since.
The graph below shows what we found. Notice how after August 19th the value of “requests in queue” was consistently high, whereas before August 19th it was almost always zero or below 10.
We looked closely at our Apache and Rails logs and identified the traffic that was causing the problem. We took a few steps to handle it and now our servers are behaving normally again. Notice how we are back to zero requests in queue on August 31st in the graph above.
The Ruby code that we use to parse our passenger_status.log file is pretty simple: it grabs the line with the date and the line with the number of requests in queue, parses their values, and outputs the result as tab-delimited data that we can then use to create a graph in Excel or RAWGraphs. Below is the Ruby code:
require "date" log_file = "passenger_status.log" excel_date = true def date_from_line(line, excel_date) index = line.index(":") return nil if index == nil date_as_text = line[index+2..-1].strip # Thu Aug 30 14:00:01 -0400 2018 datetime = DateTime.parse(date_as_text).to_s # 2018-08-30T14:00:01-04:00 if excel_date return datetime[0..9] + " " + datetime[11..15] # 2018-08-30 14:00 end datetime end def count_from_line(line) return line.gsub("Requests in queue:", "").to_i end puts "timestamp\trequest_in_queue" date = "N/A" File.readlines(log_file).each do |line| if line.start_with?("Date ") date = date_from_line(line, excel_date) elsif line.include?("Requests in queue:") request_count = count_from_line(line) puts "\"#{date}\"\t#{request_count}" end end
In this particular case the increase in requests in queue was caused by bad/unwanted traffic. If the increase in traffic had been legitimate, we would have taken a different route, such as adding more processes to our Passenger instance to handle the traffic.
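Since we run Passenger under Apache, adding processes would roughly mean raising the PassengerMaxPoolSize directive in the Apache configuration and restarting Apache. The value below is only illustrative, not what we actually run:

# Allow Passenger to spawn more worker processes than the current max pool size of 6
PassengerMaxPoolSize 12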