
Bundler 2.1.4 and homeless accounts

This week we upgraded a couple of our applications to Ruby 2.7 and Bundler 2.1.4 and one of the changes that we noticed was that Bundler was complaining about not being able to write to the /opt/local directory.

It turns out this problem shows up because the account that we use to run our applications is a system account that does not have a home directory.

This is how the problem shows up:

$ su - system_account
$ pwd
/opt/local

$ mkdir test_app
$ cd test_app
$ pwd
/opt/local/test_app

$ gem install bundler -v 2.1.4
$ bundler --version
`/opt/local` is not writable.
Bundler will use `/tmp/bundler20200731-59360-174h3lz59360' as your home directory temporarily.
Bundler version 2.1.4

Notice that Bundler complains about the /opt/local directory not being writable. That's because this user does not have a home directory; in fact, $HOME is set to /opt/local rather than the typical /home/username.

Although Bundler is smart enough to use a temporary folder instead and continue, the net result is that if we set a configuration value for Bundler in one execution and try to use it in the next execution, Bundler won't find the value that we set in the first execution (my guess is because the value was saved in a temporary folder).

Below is an example of this. Notice how we set the path value to vendor/bundle in the first command, but when we inspect the configuration in the second command it does not report the value that we just set:

# First - set the path value
$ bundle config set path 'vendor/bundle'
`/opt/local` is not writable.
Bundler will use `/tmp/bundler20200731-60203-16okmcg60203' as your home directory temporarily.

# Then - inspect the configuration
$ bundle config
`/opt/local` is not writable.
Bundler will use `/tmp/bundler20200731-60292-1r50oed60292' as your home directory temporarily.
Settings are listed in order of priority. The top value will be used.

Ideally the call to bundle config would report the vendor/bundle path that we set, but it does not in this case. In fact, if we run bundle install next, Bundler will install the gems in $GEM_PATH rather than in the custom vendor/bundle directory that we indicated.

Working around the issue

One way to work around this issue is to tell Bundler that the HOME directory is the one from which we are running bundler (i.e. /opt/local/test_app in our case).

# First - set the path value 
# (no warning is reported)
$ HOME=/opt/local/test_app/ bundle config set path 'vendor/bundle'

# Then - inspect the configuration
$ bundle config
`/opt/local` is not writable.
Bundler will use `/tmp/bundler20200731-63230-11dmgcb63230' as your home directory temporarily.
Settings are listed in order of priority. The top value will be used.
path
Set for your local app (/opt/local/test_app/.bundle/config): "vendor/bundle"

Notice that we didn't get a warning in the first command (since we indicated a HOME directory) and then, even though we didn't pass a HOME directory to the second command, our value was picked up and the correct value for the path setting (vendor/bundle) is reported.

So it seems to me that when HOME is set to a non-writable directory (/opt/local in our case) Bundler picks up the values from ./.bundle/config if it is available, even as it complains about /opt/local not being writable.
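For reference, the value that Bundler saved in the first command lives in the app's local .bundle/config file, which is a small YAML file. Based on the path setting above, its contents should look roughly like this:

---
BUNDLE_PATH: "vendor/bundle"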

If we were to run bundle install now, it would install the gems in our local vendor/bundle directory. This is good for us: Bundler is using the value that we configured for the path setting (even though it still complains that it cannot write to /opt/local).

We could avoid the warning in the second command if we pass the HOME value here too:

$ HOME=/opt/local/test-app/ bundle config
Settings are listed in order of priority. The top value will be used.
path
Set for your local app (/opt/local/test-app/.bundle/config): "vendor/bundle"

But the fact that Bundler picks up the correct values from ./.bundle/config when HOME is set to a non-writable directory was important for us because it meant that the app would also work when running under Apache/Passenger. This is more or less what the configuration for our apps looks like in httpd.conf; notice that we are not setting the HOME value.

<Location />  
  PassengerBaseURI /test-app
  PassengerUser system_account
  PassengerRuby /opt/local/rubies/ruby-2.7.1/bin/ruby
  PassengerAppRoot /opt/local/test-app
  SetEnv GEM_PATH /opt/local/.gem/ruby/2.7.1/
</Location>

Some final thoughts

Perhaps a better solution would be to set a HOME directory for our system_account, but we have not tried that; we didn't want to make such a wide-reaching change to our environment just to please Bundler. Plus, this might be problematic on our development servers, where we share the same system_account for multiple applications (this is not a problem on our production servers).
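For reference, giving the account a real home directory would be something along these lines (a sketch only; we have not run this on our servers):

$ sudo mkdir /home/system_account
$ sudo chown system_account: /home/system_account
$ sudo usermod -d /home/system_account system_account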

We have no idea when this change took effect in Bundler. We went from Bundler 1.17.1 (released in October 2018) to Bundler 2.1.4 (released in January 2020) and there were many releases in between. Perhaps this was documented somewhere and we missed it.

In our particular situation we noticed this issue because one of our gems needed very specific parameters to be built during bundle install. We set those values via a call to bundle config build.mysql2 --with-mysql-dir=xxx --with-mysql-lib=yyy and those values were lost by the time we ran bundle install, so the installation kept failing. Luckily we found a workaround and were able to install the gem with the specific parameters.

New RIAMCO website

A few days ago we released a new version of the Rhode Island Archival and Manuscript Collections Online (RIAMCO) website. The new version is a brand new codebase. This post describes a few of the new features that we implemented as part of the rewrite and how we designed the system to support them.

The RIAMCO website hosts information about archival and manuscript collections in Rhode Island. These collections (also known as finding aids) are stored as XML files using the Encoded Archival Description (EAD) standard and indexed into Solr to allow for full text searching and filtering.

Look and feel

The overall look and feel of the RIAMCO site is heavily influenced by the work that the folks at the NYU Libraries did on their site. Like NYU's site and Brown's Discovery tool, the RIAMCO site uses the typical facets-on-the-left, content-on-the-right layout that is common in many library and archive websites.

Below is a screenshot of the main search page:

Search results

Architecture

Our previous site was put together over many years and involved several separate applications written in different languages: the frontend was written in PHP, the indexer in Java, and the admin tool in Python/Django. During this rewrite we bundled the code for the frontend and the indexer into a single application written in Ruby on Rails. [As of September 13th, 2019 the Rails application also provides the admin interface.]

You can view a diagram of this architecture and a few more notes about it in this document.

Indexing

Like the previous version of the site, we are using Solr to power the search feature of the site. However, in the previous version each collection was indexed as a single Solr document whereas in the new version we are splitting each collection into many Solr documents: one document to store the main collection information (scope, biographical info, call number, et cetera), plus one document for each item in the inventory of the collection.

This new indexing strategy significantly increased the number of Solr documents that we store. We went from 1,100+ Solr documents (one for each collection) to 300,000+ Solr documents (one for each item in the inventory of those collections).

The advantage of this approach is that now we can search and find items at a much more granular level than we could before. For example, we can tell a user that we found a match on “Box HE-4 Folder 354” of the Harris Ephemera collection for their search on “blue moon”, rather than just telling them that there is a match somewhere in the 25 boxes (3,000 folders) of the “Harris Ephemera” collection.

In order to keep the relationship between all the Solr documents for a given collection we are using an extra ead_id_s field to store the id of the collection that each document belongs to. If we have a collection “A” with three items in the inventory they will have the following information in Solr:

{id: "A", ead_id_s: "A"} // the main collection record
{id: "A-1", ead_id_s: "A"} // item 1 in the inventory
{id: "A-2", ead_id_s: "A"} // item 2 in the inventory
{id: "A-3", ead_id_s: "A"} // item 3 in the inventory

This structure allows us to use the Result Grouping feature in Solr to group results from a search into the appropriate collection. With this structure in place we can then show the results grouped by collection as you can see in the previous screenshot.
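For example, a grouped query along these lines returns the matches for “blue moon” bundled by collection (the core name is a placeholder, and group.limit controls how many inventory items come back per collection):

$ curl "http://localhost:8983/solr/your-core/select?q=blue+moon&group=true&group.field=ead_id_s&group.limit=3&group.ngroups=true"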

The code to index our EAD files into Solr is in the Ead class.

We had to add some extra logic to handle cases where a match is found only in a Solr document for an inventory item (but not in the main collection document), so that we can also display the main collection information alongside the inventory information in the search results. The code for this is in the search_grouped() function of the Search class.

Hit highlighting

Another feature that we implemented on the new site is hit highlighting. Although this is a feature that Solr supports out of the box, there is some extra coding that we had to do to structure the information in a way that makes sense to our users. In particular, things get tricky when the hit was found in a multi-value field or when Solr only returns a snippet of the original value in the highlighting results. The logic that we wrote to handle this is in the SearchItem class.
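On the Solr side this is driven by the standard highlighting parameters; a request in the spirit of the one below asks Solr to return highlighted snippets for matching fields (the field names here are made up for illustration):

$ curl "http://localhost:8983/solr/your-core/select?q=blue+moon&hl=true&hl.fl=title_txt,scope_txt&hl.snippets=3"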

Advanced Search

We also overhauled the Advanced Search feature. The layout of the page is very typical (it follows the style used in most Blacklight applications) but the code behind it allows us to implement several new features. For example, we allow the user to select any value from the facets (not only one of the first 10 values for that facet) and to select more than one value from those facets.

We also added a “Check” button to show the user what kind of Boolean expression would be generated for the query that they have entered. Below is a screenshot of the results of the check syntax for a sample query.

advanced search

There are several tweaks and optimizations that we would like to do on this page. For example, opening the facet by Format is quite slow and could be optimized. Also, the code that parses the expression could be rewritten to use a more standard Tokenizer/Parser structure. We’ll get to that later on… hopefully : )

Individual finding aids

As in the previous version of the site, the rendering of individual finding aids is done by applying XSLT transformations to the XML with the finding aid data. We made a few tweaks to the XSLT to integrate them into the new site, but the vast majority of the transformations came as-is from the previous site. You can see the XSLT files in our GitHub repo.

It’s interesting that GitHub reports that half of the code for the new site is XSLT: 49% XSLT, 24% HTML, and 24% Ruby. Keep in mind that these numbers do not take into account the Ruby on Rails framework code (which is massive).

GitHub code stats

Source code

The source code for the new application is available in GitHub.

Acknowledgements

Although I wrote the code for the new site, there were plenty of people who helped me along the way, in particular Karen Eberhart and Joe Mancino. Karen provided the specs for the new site, answered my many questions about the structure of EAD files, and suggested many improvements and tweaks to make the site better. Joe helped me find the code for the original site and indexer, and set up the environment for the new one.

Deploying with shiv

I recently watched a talk called “Containerless Django – Deploying without Docker”, by Peter Baumgartner. Peter lists some benefits of Docker: that it gives you a pipeline for getting code tested and deployed, the container adds some security to the app, state can be isolated in the container, and it lets you run the exact same code in development and production.

Peter also lists some drawbacks to Docker: it’s a lot of code that could slow things down or have bugs, Docker artifacts can be relatively large, and it adds extra abstractions to the system (e.g. filesystem, network). He argues that an ideal deployment would include downloading a binary, creating a configuration file, and running it (like one can do with compiled C or Go programs).

Peter describes a process of deploying Django apps by creating a zipapp using shiv and goodconf, and deploying it with systemd constraints that add to the security. He argues that this process achieves most of the benefits of  Docker, but more simply, and that there’s a sweet spot for application size where this type of deploy is a good solution.

I decided to try using shiv with our image server Loris. I ran the shiv command “shiv -o loris.pyz .”, and I got the following error:

User “loris” and or group “loris” do(es) not exist.
Please create this user, e.g.:
`useradd -d /var/www/loris -s /sbin/false loris`

The issue is that in the Loris setup.py file, the install process not only checks for the loris user (as shown in the error), it also sets up directories on the filesystem (including setting owners and permissions, which requires root). I submitted a PR to remove the filesystem setup from the Python package installation (and put it in a script the user can run), and hopefully in the future it will be easier to package up Loris and deploy it in different ways.

Searching for hierarchical data in Solr

Recently I had to index a dataset into Solr in which the original items had a hierarchical relationship among them. In processing this data I took some time to look into the ancestor_path and descendent_path features that Solr provides out of the box to see if and how they could help to issue searches based on the hierarchy of the data. This post elaborates on what I learned in the process.

Let’s start with some sample hierarchical data to illustrate the kind of relationship that I am describing in this post. Below is a short list of databases and programming languages organized by type.

Databases
  ├─ Relational
  │   ├─ MySQL
  │   └─ PostgreSQL
  └─ Document
      ├─ Solr
      └─ MongoDB
Programming Languages
  └─ Object Oriented
      ├─ Ruby
      └─ Python

For the purposes of this post I am going to index each individual item shown in the hierarchy, not just the leaf items. In other words, I am going to create 11 Solr documents: one for “Databases”, another for “Relational”, another for “MySQL”, and so on.

Each document is saved with an id, a title, and a path. For example, the document for “Databases” is saved as:

{ 
  "id": "001", 
  "title_s": "Databases",
  "x_ancestor_path": "db",
  "x_descendent_path": "db" }

and the one for “MySQL” is saved as:

{ 
  "id": "003", 
  "title_s": "MySQL",
  "x_ancestor_path": "db/rel/mysql",
  "x_descendent_path": "db/rel/mysql" }

The x_ancestor_path and x_descendent_path fields in the JSON data represent the path for each of these documents in the hierarchy. For example, the top-level “Databases” document uses the path “db”, whereas the lowest-level document “MySQL” uses “db/rel/mysql”. I am storing the exact same value in both fields so that later on we can see how each of them provides different features and addresses different use cases.

ancestor_path and descendent_path

The ancestor_path and descendent_path field types come predefined in Solr. Below is the definition of the descendent_path in a standard Solr 7 core:

$ curl http://localhost:8983/solr/your-core/schema/fieldtypes/descendent_path
{
  ...
  "indexAnalyzer":{
    "tokenizer":{ 
      "class":"solr.PathHierarchyTokenizerFactory", "delimiter":"/"}},
  "queryAnalyzer":{
    "tokenizer":{ 
      "class":"solr.KeywordTokenizerFactory"}}}}

Notice how it uses the PathHierarchyTokenizerFactory tokenizer when indexing values of this type and that it sets the delimiter property to /. This means that when values are indexed they will be split into individual tokens by this delimiter. For example the value “db/rel/mysql” will be split into “db”, “db/rel”, and “db/rel/mysql”. You can validate this in the Analysis Screen in the Solr Admin tool.

The ancestor_path field type is the exact opposite: it uses the PathHierarchyTokenizerFactory at query time and the KeywordTokenizerFactory at index time.

There are also two dynamic field definitions *_descendent_path and *_ancestor_path that automatically create fields with these types. Hence the wonky x_descendent_path and x_ancestor_path field names that I am using in this demo.
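For reference, the relevant definitions in a default Solr 7 schema look roughly like this (trimmed; the ancestor_path type mirrors it with the two analyzers swapped):

<dynamicField name="*_descendent_path" type="descendent_path" indexed="true" stored="true"/>
<fieldType name="descendent_path" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>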

Finding descendants

The descendent_path field definition in Solr can be used to find all the descendant documents in the hierarchy for a given path. For example, if I query for all documents where the descendant path is “db” (q=x_descendent_path:db) I should get all documents in the “Databases” hierarchy, but not the ones under “Programming Languages”:

$ curl "http://localhost:8983/solr/your-core/select?q=x_descendent_path:db&fl=id,title_s,x_descendent_path"
{
  ...
  "response":{"numFound":7,"start":0,"docs":[
  {
    "id":"001",
    "title_s":"Databases",
    "x_descendent_path":"db"},
  {
    "id":"002",
    "title_s":"Relational",
    "x_descendent_path":"db/rel"},
  {
    "id":"003",
    "title_s":"MySQL",
    "x_descendent_path":"db/rel/mysql"},
  {
    "id":"004",
    "title_s":"PostgreSQL",
    "x_descendent_path":"db/rel/pg"},
  {
    "id":"005",
    "title_s":"Document",
    "x_descendent_path":"db/doc"},
  {
    "id":"006",
    "title_s":"MongoDB",
    "x_descendent_path":"db/doc/mongo"},
  {
    "id":"007",
    "title_s":"Solr",
    "x_descendent_path":"db/doc/solr"}]
}}

Finding ancestors

The ancestor_path field, not surprisingly, can be used to achieve the reverse. Given the path of a document, we can query Solr to find all of its ancestors in the hierarchy. For example, if I query Solr for the documents where x_ancestor_path is “db/doc/solr” (q=x_ancestor_path:db/doc/solr) I should get “Databases”, “Document”, and “Solr”, as shown below:

$ curl "http://localhost:8983/solr/your-core/select?q=x_ancestor_path:db/doc/solr&fl=id,title_s,x_ancestor_path"
{
  ...
  "response":{"numFound":3,"start":0,"docs":[
  {
    "id":"001",
    "title_s":"Databases",
    "x_ancestor_path":"db"},
  {
    "id":"005",
    "title_s":"Document",
    "x_ancestor_path":"db/doc"},
  {
    "id":"007",
    "title_s":"Solr",
    "x_ancestor_path":"db/doc/solr"}]
}}

If you are curious how this works internally, you could issue a query with debugQuery=true and look at how the query value “db/doc/solr” was parsed. Notice how Solr splits the query value by the / delimiter and uses something called SynonymQuery() to handle the individual values as synonyms:

$ curl "http://localhost:8983/solr/your-core/select?q=x_ancestor_path:db/doc/solr&debugQuery=true"
{
  ...
  "debug":{
    "rawquerystring":"x_ancestor_path:db/doc/solr",
    "parsedquery":"SynonymQuery(Synonym(x_ancestor_path:db x_ancestor_path:db/doc x_ancestor_path:db/doc/solr))",
...
}

One little gotcha

Given that Solr is splitting the path values by the / delimiter, and that we can see those values in the Analysis Screen (or when passing debugQuery=true), we might expect to be able to fetch those values from the document somehow. But that is not the case. The individual tokens are not stored in a way that you can fetch them, i.e. there is no way for us to fetch the individual “db”, “db/doc”, and “db/doc/solr” values when fetching document id “007”. In hindsight this is standard Solr behavior, but it was something that threw me off initially.

Monitoring Passenger’s Requests in Queue over time

As I mentioned in a previous post, we use Phusion Passenger as the application server to host our Ruby applications. A while ago, upon the recommendation of my coworker Ben Cail, I created a cron job that calls passenger-status every 5 minutes to log the status of Passenger on our servers. Below is a sample of the passenger-status output:

Version : 5.1.12
Date : Mon Jul 30 10:42:54 -0400 2018
Instance: 8x6dq9uX (Apache/2.2.15 (Unix) DAV/2 Phusion_Passenger/5.1.12)

----------- General information -----------
Max pool size : 6
App groups : 1
Processes : 6
Requests in top-level queue : 0

----------- Application groups -----------
/path/to/our/app:
App root: /path/to/our/app
Requests in queue: 3
* PID: 43810 Sessions: 1 Processed: 20472 Uptime: 1d 7h 31m 25s
CPU: 0% Memory : 249M Last used: 1s ago
* PID: 2628 Sessions: 1 Processed: 1059 Uptime: 4h 34m 39s
CPU: 0% Memory : 138M Last used: 1s ago
* PID: 2838 Sessions: 1 Processed: 634 Uptime: 4h 30m 47s
CPU: 0% Memory : 134M Last used: 1s ago
* PID: 16836 Sessions: 1 Processed: 262 Uptime: 2h 14m 46s
CPU: 0% Memory : 160M Last used: 1s ago
* PID: 27431 Sessions: 1 Processed: 49 Uptime: 25m 27s
CPU: 0% Memory : 119M Last used: 0s ago
* PID: 27476 Sessions: 1 Processed: 37 Uptime: 25m 0s
CPU: 0% Memory : 117M Last used: 0s ago

Our cron job to log this information over time is something like this:

/path/to/.gem/gems/passenger-5.1.12/bin/passenger-status >> ./logs/passenger_status.log

Last week we had some issues in which our production server was experiencing short outages. Upon review we noticed that we were getting an unusual amount of traffic to our server (most of it from crawlers submitting bad requests). One of the tools that we used to validate the status of our server was the passenger_status.log file created via the aforementioned cron job.

The key piece of information that we use is the “Requests in queue” value highlighted above. We parsed this value out of the passenger_status.log file to see how it changed over the last 30 days. The result showed that, although we have had a couple of outages recently, the number of “requests in queue” dramatically increased about two weeks ago and had stayed high ever since.

The graph below shows what we found. Notice how after August 19th the value of “requests in queue” has been constantly high, whereas before August 19th it was almost always zero or below 10.

Request in queue graph

We looked closely at our Apache and Rails logs and determined the traffic that was causing the problem. We took a few steps to handle it and now our servers are behaving normally again. Notice how we are back to zero requests in queue on August 31st in the graph above.

The Ruby code that we use to parse our passenger_status.log file is pretty simple: it just grabs the line with the date and the line with the number of requests in queue, parses their values, and outputs the result to a tab-delimited file that we can then use to create a graph in Excel or RAWGraphs. Below is the Ruby code:

require "date"

log_file = "passenger_status.log"
excel_date = true

def date_from_line(line, excel_date)
  index = line.index(":")
  return nil if index == nil
  date_as_text = line[index+2..-1].strip # Thu Aug 30 14:00:01 -0400 2018
  datetime = DateTime.parse(date_as_text).to_s # 2018-08-30T14:00:01-04:00
  if excel_date
    return datetime[0..9] + " " + datetime[11..15] # 2018-08-30 14:00
  end
  datetime
end

def count_from_line(line)
  return line.gsub("Requests in queue:", "").to_i
end

puts "timestamp\trequest_in_queue"
date = "N/A"
File.readlines(log_file).each do |line|
  if line.start_with?("Date ")
    date = date_from_line(line, excel_date)
  elsif line.include?("Requests in queue:")
    request_count = count_from_line(line)
    puts "\"#{date}\"\t#{request_count}"
  end
end
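Running the script against the log produces tab-delimited output along these lines (the counts below are made up for illustration):

timestamp	request_in_queue
"2018-08-30 14:00"	0
"2018-08-30 14:05"	3
"2018-08-30 14:10"	27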

In this particular case the high number of requests in queue was caused by bad/unwanted traffic. If the increase in traffic had been legitimate we would have taken a different route, like adding more processes to our Passenger instance to handle it.

Configuring Ruby, Bundler, and Passenger

Recently we upgraded several of our applications to a newer version of Ruby, which was relatively simple to do on our local development machines. However, we ran into complications once we started deploying the updated applications to our development and production servers. The problems that we ran into highlighted issues in the way we had configured our applications and Passenger on the servers. This blog post elaborates on the final configurations that we arrived at (at least for now) and explains the rationale for the settings that worked for us.

Our setup

We deploy our applications using a “system account” (e.g. appuser) so that execution permissions and file ownership are not tied to the account of the developer doing the deployment.

We use Apache as our web server and Phusion Passenger as the application server to handle our Ruby applications.

And last but not least, we use Bundler to manage gems in our Ruby applications.

Application-level configuration

We perform all the steps to deploy a new version of our applications with the “system account” for the application (e.g. appuser).

Since we sometimes have more than one version of Ruby on our servers, we use chruby to switch between versions when we are logged in as the appuser. However, we have learned that it is better not to select a particular version of Ruby as part of this user’s bash profile. Executing ruby -v as this user upon login will typically show the version that came with the operating system (e.g. “ruby 1.8.7”).

By leaving the system Ruby as the default we are forced to select the proper version of Ruby for each application. This has the advantage that each application’s configuration is explicit about which version of Ruby it needs. It also makes applications less likely to break when we install a newer version of Ruby on the server, which is particularly useful on our development server where we have many Ruby applications running and each of them might be using a different version of Ruby.

If we want to do something for a particular application (say install gems or run a rake task) then we switch to the version of Ruby (via chruby) that we need for the application before executing the required commands.

We have also found it useful to configure Bundler to install application gems inside the application folder rather than in a global folder. We do this via Bundler’s --path parameter. The only gem that we install globally (i.e. in GEM_HOME) is bundler itself.

A typical deployment script looks more or less like this.

Login to the remote server:

$ ssh our-production-machine

Switch to our system account on the remote server (notice that it references the Ruby that came with the operating system):

$ su - appuser

$ ruby -v
# => ruby 1.8.7 (2013-06-27 patchlevel 374) [x86_64-linux]
 
$ which ruby
# => /usr/bin/ruby

Activate the version of Ruby that we want for this app (notice that it references the Ruby that we installed):

$ source /opt/local/chruby/share/chruby/chruby.sh
$ chruby ruby-2.3.6

$ ruby -v 
# => ruby 2.3.6p384 (2017-12-14 revision 61254) [x86_64-linux] 
 
$ which ruby
# => ~/rubies/ruby-2.3.6/bin/ruby
 
$ env | grep GEM
# => GEM_HOME=/opt/local/.gem/ruby/2.3.6
# => GEM_ROOT=/opt/local/rubies/ruby-2.3.6/lib/ruby/gems/2.3.0
# => GEM_PATH=/opt/local/.gem/ruby/2.3.6:/opt/local/rubies/ruby-2.3.6/lib/ruby/gems/2.3.0

Install bundler (this is only needed the first time, notice how it is installed in GEM_HOME):

$ gem install bundler
$ gem list bundler -d
# => Installed at: /opt/local/.gem/ruby/2.3.6

Install the rest of the app, its gems, and execute some rake tasks (notice that Bundler will indicate that gems are being installed locally to ./vendor/bundle):

$ cd /path/to/appOne
$ git pull

$ RAILS_ENV=production bundle install --path vendor/bundle
# => Bundled gems are installed into `./vendor/bundle`

$ RAILS_ENV=production bundle exec rake assets:precompile

Passenger configuration

Our default passenger configuration is rather bare-bones and indicates only a few settings. For example our /etc/httpd/conf.d/passenger.conf looks more or less like this:

LoadModule passenger_module /opt/local/.gem/gems/passenger-5.1.12/buildout/apache2/mod_passenger.so

<IfModule mod_passenger.c>
  PassengerRoot /opt/local/.gem/gems/passenger-5.1.12
  PassengerUser appuser
  PassengerStartTimeout 300
</IfModule>

Include /path/to/appOne/http/project_passenger.conf
Include /path/to/appTwo/http/project_passenger.conf

Notice that there are no specific Ruby settings indicated above. The Ruby-specific settings are indicated in the individual project_passenger.conf files for each application.

If we look at the passenger config for one of the apps (say /path/to/appOne/http/project_passenger.conf) it would look more or less like this:

<Location /appOne>
  PassengerBaseURI /appOne
  PassengerRuby /opt/local/rubies/ruby-2.3.6/bin/ruby
  PassengerAppRoot /path/to/appOne/
  SetEnv GEM_PATH /opt/local/.gem/ruby/2.3.6/
</Location>

Notice that this configuration indicates both the path to the Ruby version that we want for this application (PassengerRuby) and also where to find (global) gems for this application (GEM_PATH).

The value for PassengerRuby matches the path that which ruby returned above (/opt/local/rubies/ruby-2.3.6/bin/ruby) and clearly indicates that we are using version 2.3.6 for this application.

The GEM_PATH setting is very important since this is what allows Passenger to find bundler when loading our application. Not setting this value results in the application not loading and Apache logging the following error:

Could not spawn process for application /path/to/AppOne: An error occurred while starting up the preloader.
Error ID: dd0dcbd4
Error details saved to: /tmp/passenger-error-3OKItz.html
Message from application: cannot load such file -- bundler/setup (LoadError)
/opt/local/rubies/ruby-2.3.6/lib/ruby/2.3.0/rubygems/core_ext/kernel_require.rb:55:in `require'
/opt/local/rubies/ruby-2.3.6/lib/ruby/2.3.0/rubygems/core_ext/kernel_require.rb:55:in `require'

Notice that we set the GEM_PATH value to the path returned by gem list bundler -d above. This is a bit tricky: if you look closely, we are setting GEM_PATH to the value that GEM_HOME reported above (/opt/local/.gem/ruby/2.3.6/). I suspect we could have set GEM_PATH to /opt/local/.gem/ruby/2.3.6:/opt/local/rubies/ruby-2.3.6/lib/ruby/gems/2.3.0 to match the GEM_PATH above, but we didn’t try that.

UPDATE: The folks at Phusion recommend setting GEM_HOME as well (even if Passenger does not need it) because some gems might need it.

 

— Hector Correa & Joe Mancino

Python/Django warnings

I recently updated a Django project from 1.8 to 1.11. In the process, I started turning warnings into errors. The Django docs recommend resolving any deprecation warnings with the current version before upgrading to a new version of Django. In this case, I didn’t start my upgrade work by resolving warnings, but I did run the tests with warnings enabled for part of the process.

Here’s how to enable all warnings when you’re running your tests:

  1. From the CLI
    • use -Werror to raise Exceptions for all warnings
    • use -Wall to print all warnings
  2. In the code
    • import warnings; warnings.filterwarnings('error') – raise Exceptions on all warnings
    • import warnings; warnings.filterwarnings('always') – print all warnings

If a project runs with no warnings on a Django LTS release, it’ll (generally) run on the next LTS release as well. This is because Django intentionally tries to keep compatibility shims until after an LTS release, so that third-party applications can more easily support multiple LTS releases.

Enabling warnings is nice because you see warnings from Python or other packages, so you can address whatever problems they’re warning about, or at least know that they will be an issue in the future.
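For a Django project, the CLI flag can be combined with the test runner; for example, something like this (the exact invocation depends on the project):

$ python -Werror manage.py test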

Python 2 => 3

We’ve recently been migrating our code from Python 2 to Python 3. There is a lot of documentation about the changes, but these are changes we had to make in our code.

Print

First, the print statement had to be changed to the print function:

print 'message'

became

print('message')

Text and bytes

Python 3 changed bytes and unicode text handling, so here are some changes related to that:

json.dumps required a unicode string, instead of bytes, so

json.dumps(xml.serialize())

became

json.dumps(xml.serialize().decode('utf8'))

basestring was removed, so

isinstance("", basestring)

became

isinstance("", str)

This change to explicit unicode and bytes handling affected the way we opened files. In Python 2, we could open and use a binary file, without specifying that it was binary:

open('file.zip')

In Python 3, we have to specify that it’s a binary file:

open('file.zip', 'rb')

Some functions couldn’t handle unicode in Python 2, so in Python 3 we don’t have to encode the unicode as bytes:

urllib.quote(u'tëst'.encode('utf8'))

became

urllib.quote('tëst')

Of course, Python 3 reorganized parts of the standard library, so the last line would actually be:

urllib.parse.quote('tëst')

Dicts

There were also some changes to Python dicts. The keys() method now returns a view object, so

dict.keys()

became

list(dict.keys())

dict.iteritems()

also became

dict.items()

Virtual environments

Python 3 has virtual environments built in, which means we don’t need to install virtualenv anymore. There’s no activate_this.py in Python 3 environments, though, so we switched to using django-dotenv instead.
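For reference, creating and activating one of the built-in virtual environments looks like this (the directory name is arbitrary):

$ python3 -m venv env
$ source env/bin/activate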

Miscellaneous

Some more changes we made include imports:

from base import * => from .base import *

function names:

func.func_name => func.__name__

and exceptions:

exception.message => str(exception)
except Exception, e => except Exception as e

Optional

Finally, there were optional changes we made. Python 3 uses UTF-8 encoding for source files by default, so we could remove the encoding line from the top of files. Also, the u'' prefix for unicode strings is allowed in Python 3, but not necessary.

Using synonyms in Solr

A few days ago somebody reported that our catalog returns different results if a user searches for “music for the hundred years war” than if the user searches for “music for the 100 years war”.

To handle this issue I decided to use the synonyms feature in Solr. My thought was to tell Solr that “100” and “hundred” are synonyms and they should be treated as such. I had seen a synonyms.txt file in the Solr configuration folder and I thought it was just a matter of adding a few lines to this file and voilà, synonyms would kick in. It turns out using synonyms in Solr is a bit more complicated than that; not too complicated, but not as straightforward as I had thought.

Configuring synonyms in Solr

To configure Solr to use synonyms you need to add a filter to the field type where you want synonyms to be used. For example, to enable synonyms for the text field in Solr I added a filter using the SynonymFilterFactory in our schema.xml:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
 <analyzer type="index">
   <tokenizer class="solr.StandardTokenizerFactory"/>
   <filter class="solr.ICUFoldingFilterFactory" />
   <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
   <filter class="solr.SnowballPorterFilterFactory" language="English" />
 </analyzer>
 <analyzer type="query">
   <tokenizer class="solr.StandardTokenizerFactory"/>
   <filter class="solr.ICUFoldingFilterFactory" />
   <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
   <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
   <filter class="solr.SnowballPorterFilterFactory" language="English" />
 </analyzer>
</fieldType>

You can add this filter for indexing, for querying, or both. In the example above I am only configuring the use of synonyms at query time.

Notice how the SynonymFilterFactory references a synonyms.txt file. This text file is where synonyms are defined. Notice also the expand="true" setting.

The synonyms.txt file accepts the list of synonyms in two formats. The first format is just a list of words that are considered synonyms, for example:

 100,hundred

With this format, every time Solr sees “100” or “hundred” in a value it will automatically expand the value to include both “100” and “hundred”. For example, if we were to search for “music for the hundred years war” it will actually search for “music for the 100 hundred years war”; notice how it now includes both variations (100 and hundred) in the text to search. The same is true if we were to search for “music for the 100 years war”: Solr will search for both variations.

A second format we can use to configure synonyms is by using the => operator to consolidate various terms into a different term, for example:

 100 => hundred

With this format, every time Solr sees “100” it will replace it with “hundred”. For example, if we search for “music for the 100 years war” it will search for “music for the hundred years war”. Notice that in this case Solr includes “hundred” but drops “100”. The => in synonyms.txt is a shortcut that overrides the expand="true" setting and replaces the values on the left with the values on the right side.

Testing synonym matching in Solr

To see how synonyms are applied you can use the “Analysis” option available on the Solr dashboard page.

The following picture shows how this tool can be used to verify how Solr is handling synonyms at index time. Notice, in the highlighted rectangle, how “hundred” was indexed as both “hundred” and “100”.

Solr analysis screen (index)

We can also use this tool to see how values are handled at query time. The following picture shows how a query for “music for the 100 years war” is handled and matched to an original text “music for the hundred years war”. In this particular case synonyms are enabled in the Solr configuration only at query time, which explains why the indexed value (on the left side) only has “hundred” but the value used at query time has been expanded to include both “100” and “hundred”, which results in a match.

Solr analysis screen (query)

Index vs Query time

When configuring synonyms in Solr it is important to consider the advantages and disadvantages of applying them at index time, query time, or both.

Using synonyms at query time is easy because you don’t have to change your index to add or remove synonyms. You just add/remove lines from the synonyms.txt file, restart your Solr core, and the synonyms are applied in subsequent searches.
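For example, reloading the core through Solr's CoreAdmin API is one way to pick up changes to synonyms.txt without a full restart (the core name is a placeholder):

$ curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=your-core"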

However, there are some benefits to using synonyms at index time, particularly when you want to handle multi-term synonyms. This blog post by John Berryman and this page in the Apache documentation for Solr give a good explanation of why multi-term synonyms are tricky and why applying synonyms at index time might be a good idea. An obvious disadvantage of applying synonyms at index time is that you need to reindex your data for changes to synonyms.txt to take effect.

Django vs. Flask Hello-World performance

Flask and Django are two popular Python web frameworks. Recently, I did some basic comparisons of a “Hello-World” minimal application in each framework. I compared the source lines of code, disk usage, RAM usage in a running process, and response times and throughput.

Lines of Code

Both Django and Flask applications can be written in one file. The Flask homepage has an example Hello-World application, and it’s seven lines of code. The Lightweight Django authors have an example one-page application that’s 29 source lines of code. As I played with that example, I trimmed it down to 17 source lines of code, and it still worked.
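For reference, the Flask Hello-World is essentially the canonical quickstart application; reproduced from memory, it looks roughly like this:

from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello():
    # return a plain-text response for the root URL
    return "Hello, World!"

if __name__ == "__main__":
    app.run()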

Disk Usage

I measured disk usage of the two frameworks by setting up two different Python 3.6 virtual environments. In one, I ran “pip install flask”, and in the other I ran “pip install django.” Then I ran “du -sh” on the whole env/ directory. The size of the Django virtual environment was 54M, and the Flask virtual environment was 15M.

Here are the packages in the Django environment:

Django (1.11.1)
pip (9.0.1)
pytz (2017.2)
setuptools (28.8.0)

Here are the packages in the Flask environment:

click (6.7)
Flask (0.12.1)
itsdangerous (0.24)
Jinja2 (2.9.6)
MarkupSafe (1.0)
pip (9.0.1)
setuptools (28.8.0)
Werkzeug (0.12.1)

Memory Usage

I also measured the RAM usage of both applications. I deployed them with Phusion Passenger, and then the passenger-status command told me how much memory the application process was using. According to Passenger, the Django process was using 18-19M, and the Flask process was using 16M.

Load-testing with JMeter

Finally, I did some JMeter load-testing for both applications. I hit both applications with about 1000 requests, and looked at the JMeter results. The response time average was identical: 5.76ms. The Django throughput was 648.54 responses/second, while the Flask throughput was 656.62.
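The runs were driven from a JMeter test plan executed in non-GUI mode, with an invocation along these lines (the .jmx file name is made up for illustration):

$ jmeter -n -t hello_world.jmx -l results.jtl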

Final remarks

This was basic testing, and I’m not an expert in this area. Here are some links related to performance:

  1. Slides from a conference talk
  2. Blog post comparing performance of Django on different application servers, on different versions of Python