
Monitoring Passenger’s Requests in Queue over time

As I mentioned in a previous post, we use Phusion Passenger as the application server to host our Ruby applications. A while ago, on the recommendation of my coworker Ben Cail, I created a cron job that calls passenger-status every 5 minutes to log the status of Passenger on our servers. Below is a sample of the passenger-status output:

Version : 5.1.12
Date : Mon Jul 30 10:42:54 -0400 2018
Instance: 8x6dq9uX (Apache/2.2.15 (Unix) DAV/2 Phusion_Passenger/5.1.12)

----------- General information -----------
Max pool size : 6
App groups : 1
Processes : 6
Requests in top-level queue : 0

----------- Application groups -----------
/path/to/our/app:
App root: /path/to/our/app
Requests in queue: 3
* PID: 43810 Sessions: 1 Processed: 20472 Uptime: 1d 7h 31m 25s
CPU: 0% Memory : 249M Last used: 1s ago
* PID: 2628 Sessions: 1 Processed: 1059 Uptime: 4h 34m 39s
CPU: 0% Memory : 138M Last used: 1s ago
* PID: 2838 Sessions: 1 Processed: 634 Uptime: 4h 30m 47s
CPU: 0% Memory : 134M Last used: 1s ago
* PID: 16836 Sessions: 1 Processed: 262 Uptime: 2h 14m 46s
CPU: 0% Memory : 160M Last used: 1s ago
* PID: 27431 Sessions: 1 Processed: 49 Uptime: 25m 27s
CPU: 0% Memory : 119M Last used: 0s ago
* PID: 27476 Sessions: 1 Processed: 37 Uptime: 25m 0s
CPU: 0% Memory : 117M Last used: 0s ago

Our cron job to log this information over time is something like this:

/path/to/.gem/gems/passenger-5.1.12/bin/passenger-status >> ./logs/passenger_status.log
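
Wired into cron, that becomes an entry like the following (a sketch: the every-5-minutes schedule comes from the description above, and the paths are the same placeholders used throughout this post):

# m h dom mon dow  command
*/5 * * * * /path/to/.gem/gems/passenger-5.1.12/bin/passenger-status >> ./logs/passenger_status.log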

Last week we had some issues in which our production server was experiencing short outages. Upon review we noticed an unusual amount of traffic coming to our server (most of it from crawlers submitting bad requests). One of the tools that we used to check the status of our server was the passenger_status.log file created via the aforementioned cron job.

The key piece of information that we use is the “Requests in queue” value highlighted above. We parsed this value out of the passenger_status.log file to see how it changed over the last 30 days. The result showed that, although we have had a couple of outages recently, the number of “requests in queue” increased dramatically about two weeks ago and had stayed high ever since.

The graph below shows what we found. Notice how after August 19th the value of “requests in queue” has been consistently high, whereas before August 19th it was almost always zero or below 10.

[Requests in queue graph]

We looked closely at our Apache and Rails logs and determined the traffic that was causing the problem. We took a few steps to handle it and now our servers are behaving normally again. Notice how we are back to zero requests in queue on August 31st in the graph above.

The Ruby code that we use to parse our passenger_status.log file is pretty simple: it just grabs the line with the date and the line with the number of requests in queue, parses their values, and outputs the result to a tab-delimited file that we can then use to create a graph in Excel or RAWGraphs. Below is the Ruby code:

require "date"

log_file = "passenger_status.log"
excel_date = true

def date_from_line(line, excel_date)
  index = line.index(":")
  return nil if index.nil?
  date_as_text = line[index+2..-1].strip # Thu Aug 30 14:00:01 -0400 2018
  datetime = DateTime.parse(date_as_text).to_s # 2018-08-30T14:00:01-04:00
  if excel_date
    return datetime[0..9] + " " + datetime[11..15] # 2018-08-30 14:00
  end
  datetime
end

def count_from_line(line)
  return line.gsub("Requests in queue:", "").to_i
end

puts "timestamp\trequest_in_queue"
date = "N/A"
File.readlines(log_file).each do |line|
  if line.start_with?("Date ")
    date = date_from_line(line, excel_date)
  elsif line.include?("Requests in queue:")
    request_count = count_from_line(line)
    puts "\"#{date}\"\t#{request_count}"
  end
end
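
Running the script and redirecting its output produces the tab-delimited file (the script name here is arbitrary):

$ ruby parse_passenger_status.rb > requests_in_queue.tsv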

In this particular case the number of requests in queue was caused by bad/unwanted traffic. If the increase in traffic had been legitimate we would have taken a different route, like adding more processes to our Passenger instance to handle the traffic.

Looking at the Oxford Common Filesystem Layout (OCFL)

Currently, the BDR contains about 34TB of content. The storage layer is Fedora 3, and the data is stored internally by Fedora (instead of being stored externally). However, Fedora 3 is end-of-life, which means we either maintain it ourselves or migrate to something else. We don’t want to migrate 34TB and then have to migrate it again if we change software later; we’d like to be able to change our software without migrating all our data.

This is where the Oxford Common Filesystem Layout (OCFL) work is interesting. OCFL is an effort to define how repository objects should be laid out on the filesystem. OCFL is still very much a work-in-progress, but the “Need” section of the specification speaks directly to what I described above. If we set up our data using OCFL, hopefully we can upgrade and change our software as necessary without having to move all the data around.
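
To give a sense of the approach, the draft describes a repository object as a versioned directory tree on disk, roughly along these lines (a sketch of the general shape only; the specification is still in flux and the details may change):

object_root/
  0=ocfl_object_1.0            (conformance declaration file)
  inventory.json               (manifest of the object's files across versions)
  inventory.json.sha512
  v1/
    inventory.json
    inventory.json.sha512
    content/
      ... the object's files ...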

Another benefit of the OCFL effort is that it’s work being done by people from multiple institutions, building on other work and experience in this area, to define a good, well-thought-out layout for repository objects.

Finally, using a common specification for the filesystem layout of our repository means that there’s a better chance that other software will understand how to interact with our files on disk. The more people using the same filesystem layout, the more potential collaborators and applications for implementing the OCFL specification – safely creating, updating, and serving out content for the repository.

Thumbnail cache

The BDR provides thumbnails for image and PDF objects. The thumbnail service is set up to check for a thumbnail in storage, then try to generate one from the IIIF image server, and fall back to an icon if needed. Thumbnails are cached by the thumbnail service for 30 days, up to a maximum of 5000 thumbnails. In practice, the maximum count was the limiting factor: thumbnails weren’t being purged from the cache because they were older than 30 days, but because the cache filled up.

We recently needed to purge some thumbnails because we had updated the images for some objects. We decided to update the thumbnail caching so that the cache is timestamped. We already use our API to check permissions on an object before displaying the thumbnail, and we added an API check for when the object in storage was last modified. If the object was modified more recently than the cache timestamp, the cache is stale and we grab an updated thumbnail. This should keep the thumbnail cache from serving stale images.
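
The staleness check itself boils down to a timestamp comparison, something like this (a minimal sketch; the function and variable names are hypothetical, not the actual BDR code):

from datetime import datetime, timezone

def thumbnail_is_stale(cache_timestamp, object_last_modified):
    # The cache is stale if the object changed after the thumbnail was cached.
    return object_last_modified > cache_timestamp

# Example: the object was updated after the thumbnail was cached,
# so we would grab a fresh thumbnail and re-cache it.
cached_at = datetime(2018, 8, 1, tzinfo=timezone.utc)
modified_at = datetime(2018, 8, 15, tzinfo=timezone.utc)
print(thumbnail_is_stale(cached_at, modified_at))  # => True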

 

Automating Streaming Objects

In the BDR, we can provide streaming versions for audio or video content, in addition to (or instead of) file download. We used Wowza for streaming content in the past, but now we use Panopto.

The process for getting content to Panopto has been manual: download the file from the BDR, upload it to Panopto, set the correct session name in Panopto, and associate the Panopto ID with the BDR record. I’ve been working on automating this process, though.

Here are the steps that the automation code performs:

  • download the audio or video file from the BDR to a temporary file
  • hit the Panopto API to upload the file
    • create the session and the upload in Panopto
    • use the Amazon S3 protocol to upload the file
    • mark the upload as complete so Panopto starts processing the file
  • create a streaming object in the BDR with the correct Panopto session ID

We want to make sure that the process can handle large files without running out of memory or taxing the server too much. So, we stream the content to the temporary file in chunks. Then, when we upload the file to Panopto, we’d like to do that in chunks as well, so we’re never reading the whole file into memory – unfortunately, we’re currently running into an error with the multipart upload.
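
For the download step, streaming the content to a temporary file in chunks looks roughly like this (a sketch using the requests library; the URL and chunk size are illustrative):

import tempfile
import requests

def download_to_tempfile(url, chunk_size=1024 * 1024):
    # Stream the response so the whole file is never held in memory.
    tmp = tempfile.NamedTemporaryFile(delete=False, suffix='.download')
    with requests.get(url, stream=True) as response:
        response.raise_for_status()
        for chunk in response.iter_content(chunk_size=chunk_size):
            tmp.write(chunk)
    tmp.close()
    return tmp.name  # path to hand off to the Panopto upload step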

This automation will reduce the amount of manual work we do for streaming content, and could open the door to creating streaming objects automatically on request from non-BDR staff (or even users).

MySQL 5.7 migration

We recently migrated the BDR databases from MySQL version 5.5 to 5.7. Here are a couple of benefits for us as application developers:

Stricter Data Handling

By default, MySQL 5.7 uses stricter data handling than 5.5, so we don’t have to manually put MySQL into strict mode.

MySQL 5.5’s loose data handling bit us last summer. We have an application where files can be uploaded, and the file names are stored in the database. A user started getting errors trying to upload new files, because the file names were duplicates (all the file names in the database are required to be unique). It turned out that the file names were too long for the field, so they were being truncated and put into the table anyway. Then, duplicate errors were thrown if a new file name truncated to the same name as another truncated file name. After that, we put MySQL into strict mode for some of our databases, but now it will be that way by default.
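
For reference, forcing strict mode from a Django project looked something like this (a sketch; the OPTIONS key is the relevant part, the other values are placeholders):

# settings.py -- ask MySQL for strict mode on every new connection
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'our_database',
        'OPTIONS': {
            'init_command': "SET sql_mode='STRICT_TRANS_TABLES'",
        },
    },
}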

Support

The second benefit is that Django 2.1 won’t support MySQL 5.5 anymore, and MySQL 5.5 will be end-of-life this year, so this migration gets us onto a better-supported version of MySQL.

Now, if only ‘UTF-8’ in MySQL actually meant UTF-8… Actually, MySQL 8.0 was recently released, and it looks like it uses UTF8MB4 (i.e., real UTF-8) by default, so that may be helpful in the future when we move to 8.0.

Configuring Ruby, Bundler, and Passenger

Recently we upgraded several of our applications to a newer version of Ruby, which was relatively simple to do on our local development machines. However, we ran into complications once we started deploying the updated applications to our development and production servers. The problems that we ran into highlighted issues in the way we had configured our applications and Passenger on the servers. This blog post elaborates on the final configurations that we arrived at (at least for now) and explains the rationale for the settings that worked for us.

Our setup

We deploy our applications using a “system account” (e.g. appuser) so that execution permissions and file ownership are not tied to the account of the developer doing the deployment.

We use Apache as our web server and Phusion Passenger as the application server to handle our Ruby applications.

And last but not least, we use Bundler to manage gems in our Ruby applications.

Application-level configuration

We perform all the steps to deploy a new version of our applications with the “system account” for the application (e.g. appuser).

Since we sometimes have more than one version of Ruby on our servers, we use chruby to switch between versions when we are logged in as the appuser. However, we have learned that it is better not to select a particular version of Ruby as part of this user’s bash profile. Executing ruby -v as this user upon login will typically show the version that came with the operating system (e.g. “ruby 1.8.7”).

By leaving the system Ruby as the default, we are forced to select the proper version of Ruby for each application. This has the advantage that each application’s configuration is explicit about which version of Ruby it needs. It also makes applications less likely to break when we install a newer version of Ruby on the server. This is particularly useful on our development server, where we have many Ruby applications running and each of them might be using a different version of Ruby.

If we want to do something for a particular application (say install gems or run a rake task) then we switch to the version of Ruby (via chruby) that we need for the application before executing the required commands.

We have also found it useful to configure Bundler to install application gems inside the application folder rather than in a global folder. We do this via Bundler’s --path parameter. The only gem that we install globally (i.e. in GEM_HOME) is bundler itself.

A typical deployment script looks more or less like this:

Login to the remote server:

$ ssh our-production-machine

Switch to our system account on the remote server (notice that it references the Ruby that came with the operating system):

$ su - appuser

$ ruby -v
# => ruby 1.8.7 (2013-06-27 patchlevel 374) [x86_64-linux]
 
$ which ruby
# => /usr/bin/ruby

Activate the version of Ruby that we want for this app (notice that it references the Ruby that we installed):

$ source /opt/local/chruby/share/chruby/chruby.sh
$ chruby ruby-2.3.6

$ ruby -v 
# => ruby 2.3.6p384 (2017-12-14 revision 61254) [x86_64-linux] 
 
$ which ruby
# => ~/rubies/ruby-2.3.6/bin/ruby
 
$ env | grep GEM
# => GEM_HOME=/opt/local/.gem/ruby/2.3.6
# => GEM_ROOT=/opt/local/rubies/ruby-2.3.6/lib/ruby/gems/2.3.0
# => GEM_PATH=/opt/local/.gem/ruby/2.3.6:/opt/local/rubies/ruby-2.3.6/lib/ruby/gems/2.3.0

Install bundler (this is only needed the first time; notice how it is installed in GEM_HOME):

$ gem install bundler
$ gem list bundler -d
# => Installed at: /opt/local/.gem/ruby/2.3.6

Update the app, install its gems, and execute some rake tasks (notice that Bundler will indicate that gems are being installed locally to ./vendor/bundle):

$ cd /path/to/appOne
$ git pull

$ RAILS_ENV=production bundle install --path vendor/bundle
# => Bundled gems are installed into `./vendor/bundle`

$ RAILS_ENV=production bundle exec rake assets:precompile

Passenger configuration

Our default Passenger configuration is rather bare-bones and indicates only a few settings. For example, our /etc/httpd/conf.d/passenger.conf looks more or less like this:

LoadModule passenger_module /opt/local/.gem/gems/passenger-5.1.12/buildout/apache2/mod_passenger.so

<IfModule mod_passenger.c>
  PassengerRoot /opt/local/.gem/gems/passenger-5.1.12
  PassengerUser appuser
  PassengerStartTimeout 300
</IfModule>

Include /path/to/appOne/http/project_passenger.conf
Include /path/to/appTwo/http/project_passenger.conf

Notice that there are no Ruby-specific settings indicated above. The Ruby-specific settings are indicated in the individual project_passenger.conf files for each application.

If we look at the passenger config for one of the apps (say /path/to/appOne/http/project_passenger.conf) it would look more or less like this:

<Location /appOne>
  PassengerBaseURI /appOne
  PassengerRuby /opt/local/rubies/ruby-2.3.6/bin/ruby
  PassengerAppRoot /path/to/appOne/
  SetEnv GEM_PATH /opt/local/.gem/ruby/2.3.6/
</Location>

Notice that this configuration indicates both the path to the Ruby version that we want for this application (PassengerRuby) and also where to find (global) gems for this application (GEM_PATH).

The value for PassengerRuby matches the path that which ruby returned above (/opt/local/rubies/ruby-2.3.6/bin/ruby) and clearly indicates that we are using version 2.3.6 for this application.

The GEM_PATH setting is very important since this is what allows Passenger to find bundler when loading our application. Not setting this value results in the application not loading and Apache logging the following error:

Could not spawn process for application /path/to/AppOne: An error occurred while starting up the preloader.
Error ID: dd0dcbd4
Error details saved to: /tmp/passenger-error-3OKItz.html
Message from application: cannot load such file -- bundler/setup (LoadError)
/opt/local/rubies/ruby-2.3.6/lib/ruby/2.3.0/rubygems/core_ext/kernel_require.rb:55:in `require'
/opt/local/rubies/ruby-2.3.6/lib/ruby/2.3.0/rubygems/core_ext/kernel_require.rb:55:in `require'

Notice that we set the GEM_PATH value to the path returned by gem list bundler -d above. This is a bit tricky: if you look closely, we are actually setting GEM_PATH to the value that GEM_HOME reported above (/opt/local/.gem/ruby/2.3.6/). I suspect we could have set GEM_PATH to /opt/local/.gem/ruby/2.3.6:/opt/local/rubies/ruby-2.3.6/lib/ruby/gems/2.3.0 to match the GEM_PATH above, but we didn’t try that.

UPDATE: The folks at Phusion recommend setting GEM_HOME as well (even if Passenger does not need it) because some gems might need it.

 

— Hector Correa & Joe Mancino

Python/Django warnings

I recently updated a Django project from 1.8 to 1.11. In the process, I started turning warnings into errors. The Django docs recommend resolving any deprecation warnings with the current version before upgrading to a new version of Django. In this case, I didn’t start my upgrade work by resolving warnings, but I did run the tests with warnings enabled for part of the process.

Here’s how to enable all warnings when you’re running your tests:

  1. From the CLI
    • use -Werror to raise Exceptions for all warnings
    • use -Wall to print all warnings
  2. In the code
    • import warnings; warnings.filterwarnings('error') – raise Exceptions on all warnings
    • import warnings; warnings.filterwarnings('always') – print all warnings
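
Running a Django test suite with warnings raised as errors would then look like this (manage.py test being the standard Django test entry point):

$ python -W error manage.py test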

If a project runs with no warnings on a Django LTS release, it’ll (generally) run on the next LTS release as well. This is because Django intentionally tries to keep compatibility shims until after an LTS release, so that third-party applications can more easily support multiple LTS releases.

Enabling warnings is nice because you see warnings from Python or other packages, so you can address whatever problems they’re warning about, or at least know that they will be an issue in the future.

Understanding scoring of documents in Solr

During the development of our new Researchers@Brown front-end, I spent a fair amount of time looking at the results that Solr gives when users execute searches. Although I have always known that Solr uses a sophisticated algorithm to determine why a particular document matches a search and why one document ranks higher in the search results than another, I had never looked very closely into the details of how this works.

This post is an introduction on how to interpret the ranking (scoring) that Solr reports for documents returned in a search.

Requesting Solr “explain” information

When submitting a search request to Solr, it is possible to request debug information that clarifies how Solr interpreted the client request, along with an explanation of how the score for each document was calculated for the given search terms.

To request this information you just need to pass debugQuery=true as a query string parameter to a normal Solr search request. For example:

$ curl "http://someserver/solr/collection1/select?q=alcohol&wt=json&debugQuery=true"

The response for this query will include debug information with a property called explain, where the ranking of each of the documents is explained.

"debug": {
  ...
  "explain": {
    "id:1": "a lot of text here",
    "id:2": "a lot of text here",
    "id:3": "a lot of text here",
    "id:4": "a lot of text here",
    ...
  }
}

Raw “explain” output

Although Solr gives information about how the score for each document was calculated, the format that it uses to provide this information is horrendous. This is an example of what the explain information for a given document looks like:

"id:http://vivo.brown.edu/individual/12345":"\n
1.1542457 = (MATCH) max of:\n
  4.409502E-5 = (MATCH) weight(ALLTEXTUNSTEMMED:alcohol in 831) [DefaultSimilarity], result of:\n
    4.409502E-5 = score(doc=831,freq=6.0 = termFreq=6.0\n
), product of:\n
      2.2283042E-4 = queryWeight, product of:\n
        5.170344 = idf(docFreq=60, maxDocs=3949)\n
        4.3097796E-5 = queryNorm\n
      0.197886 = fieldWeight in 831, product of:\n
        2.4494898 = tf(freq=6.0), with freq of:\n
          6.0 = termFreq=6.0\n
        5.170344 = idf(docFreq=60, maxDocs=3949)\n
        0.015625 = fieldNorm(doc=831)\n
  4.27615E-5 = (MATCH) weight(ALLTEXT:alcohol in 831) [DefaultSimilarity], result of:\n
    4.27615E-5 = score(doc=831,freq=6.0 = termFreq=6.0\n), product of:\n
      2.1943514E-4 = queryWeight, product of:\n
        5.0915627 = idf(docFreq=65, maxDocs=3949)\n
        4.3097796E-5 = queryNorm\n
      0.1948708 = fieldWeight in 831, product of:\n
        2.4494898 = tf(freq=6.0), with freq of:\n
          6.0 = termFreq=6.0\n
        5.0915627 = idf(docFreq=65, maxDocs=3949)\n
        0.015625 = fieldNorm(doc=831)\n
  1.1542457 = (MATCH) weight(research_areas:alcohol^400.0 in 831) [DefaultSimilarity], result of:\n
    1.1542457 = score(doc=831,freq=1.0 = termFreq=1.0\n
), product of:\n
      0.1410609 = queryWeight, product of:\n
        400.0 = boost\n
        8.182606 = idf(docFreq=2, maxDocs=3949)\n
        4.3097796E-5 = queryNorm\n
      8.182606 = fieldWeight in 831, product of:\n
        1.0 = tf(freq=1.0), with freq of:\n
          1.0 = termFreq=1.0\n
        8.182606 = idf(docFreq=2, maxDocs=3949)\n
        1.0 = fieldNorm(doc=831)\n",

It’s unfortunate that the information comes in a format that is not easy to parse, but since it’s plain text, we can read it and analyze it.

Explaining “explain” information

If you look closely at the information in the previous text you’ll notice that Solr reports the score of a document as the maximum value (max of) from a set of other scores. For example, below is a simplified version of the text (the ellipses represent text that I suppressed):

"id:http://vivo.brown.edu/individual/12345":"\n
1.1542457 = (MATCH) max of:\n
  4.409502E-5 = (MATCH) weight(ALLTEXTUNSTEMMED:alcohol in 831) ...
  ...
  4.27615E-5 = (MATCH) weight(ALLTEXT:alcohol in 831) ...
  ...
  1.1542457 = (MATCH) weight(research_areas:alcohol^400.0 in 831) ...
  ..."

In this example, the score of 1.1542457 for the document with id http://vivo.brown.edu/individual/12345 was the maximum of three scores (4.409502E-5, 4.27615E-5, 1.1542457). Notice that the scores are in E-notation (4.409502E-5 is 0.00004409502). If you look closely you’ll see that each of those scores is associated with a different field in Solr where the search term, alcohol, was found.

From the text above we can determine that the term alcohol was found in the ALLTEXTUNSTEMMED, ALLTEXT, and research_areas fields. What’s more, we can also tell that for this particular search we gave the research_areas field a boost of 400, which explains why that particular score was much higher than the rest.

The information that I omitted in the previous example provides a more granular explanation on how each of those individual field scores was calculated. For example, below is the detail for the research_areas field:

1.1542457 = (MATCH) weight(research_areas:alcohol^400.0 in 831) [DefaultSimilarity], result of:\n
  1.1542457 = score(doc=831,freq=1.0 = termFreq=1.0\n), product of:\n
    0.1410609 = queryWeight, product of:\n
      400.0 = boost\n
      8.182606 = idf(docFreq=2, maxDocs=3949)\n
      4.3097796E-5 = queryNorm\n
    8.182606 = fieldWeight in 831, product of:\n
      1.0 = tf(freq=1.0), with freq of:\n
        1.0 = termFreq=1.0\n
      8.182606 = idf(docFreq=2, maxDocs=3949)\n
      1.0 = fieldNorm(doc=831)\n",

Again, if we look closely at this text we see that the score of 1.1542457 for the research_areas field was the product of two other factors (0.1410609 x 8.182606). There is even information about how these individual factors were calculated. I will not go into the details in this blog post, but if you are interested this is a good place to start.
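
We can check the arithmetic with the numbers Solr reports. The queryWeight is itself a product of the boost, the idf, and the queryNorm:

queryWeight = 400.0 x 8.182606 x 4.3097796E-5 ≈ 0.1410609

and the final score for the field is the product of queryWeight and fieldWeight:

score = 0.1410609 x 8.182606 ≈ 1.1542457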

Another interesting clue that Solr provides in the explain information is which values were searched for in a given field. For example, if I search for the word alcoholism (instead of alcohol), the Solr explain result shows that in one field it used the stemmed version of the search term and in another it used the original text. In our example, this would look more or less like this:

"id:http://vivo.brown.edu/individual/12345":"\n
1.1542457 = (MATCH) max of:\n
  4.409502E-5 = (MATCH) weight(ALLTEXTUNSTEMMED:alcoholism in 831) ...
  ...
  4.27615E-5 = (MATCH) weight(ALLTEXT:alcohol in 831) ...
  ...

Notice how in the unstemmed field (ALLTEXTUNSTEMMED) Solr used the exact word searched (alcoholism), whereas in the stemmed field (ALLTEXT) it used the stemmed version (alcohol). This is very useful to know if you are wondering why a value was (or was not) found in a given field. Likewise, if you are using (query-time) synonyms, those will show up in the Solr explain results.

Live examples

In our new Researchers@Brown site we have an option to show the explain information from Solr. This option is intended for developers to troubleshoot tricky queries, not for the average user.

For example, if you pass explain=text to a search URL in the site you’ll get the text of the Solr explain output formatted for each of the results (scroll to the very bottom of the page to see the explain results).

Likewise, if you pass explain=matches to a search URL the response will include only the values of the matches that Solr evaluated (along with the field and boost value) for each document.

Source code

If you are interested in the code that we use to parse the Solr explain results, you can find it in our GitHub repo. The code for this lives in two classes: Explainer and ExplainEntry.

Explainer takes a Solr response and creates an array with the explain information for each result. This array is composed of ExplainEntry objects that in turn parse each of the results to make the match information easily accessible. Keep in mind that this code mostly does string parsing and is therefore rather brittle. For example, the code to extract the matches for a given document is as follows:

class ExplainEntry
  ...
  def get_matches(text)
    lines = text.split("\n")
    lines.select {|l| l.include?("(MATCH)") || l.include?("coord(")}
  end
end

As you can imagine, if Solr changes the structure of the text that it returns in the explain results, this code will break. I get the impression that this format (as ugly as it is) has been stable across many versions of Solr, so hopefully we won’t have many issues with this implementation.

Fedora Functionality

We are currently using Fedora 3 for storing our repository object binaries and metadata. However, Fedora 3 is end-of-life and unsupported, so eventually we’ll have to decide what our plan for the future is. Here we inventory some of the functions that we use (or could use) from Fedora. We’ll use this as a start for determining the features we’ll be looking for in a replacement.

  1. Binary & metadata storage
  2. Binary & metadata versioning
  3. Tracks object & file created/modified dates
  4. Checksum calculation/verification (after ingestion, during transmission to Fedora). Note: in Fedora 3.8.1, Fedora returns a 500 response with an empty body if the checksums don’t match – that makes Fedora’s checking less useful, since the API client can’t tell why the ingest caused an exception.
  5. SSL REST API for interacting with objects/content
  6. Messages generated whenever an object is added/updated/deleted
  7. Grouping of multiple binaries in one object
  8. Works with binaries stored outside of Fedora
  9. Files are human-readable
  10. Search (by state, date created, date modified – search provided by the database)
  11. Safety when updating the same object from multiple processes

Python 2 => 3

We’ve recently been migrating our code from Python 2 to Python 3. There is a lot of documentation about the changes, but these are changes we had to make in our code.

Print

First, the print statement had to be changed to the print function:

print 'message'

became

print('message')

Text and bytes

Python 3 changed bytes and unicode text handling, so here are some changes related to that:

json.dumps required a unicode string instead of bytes, so

json.dumps(xml.serialize())

became

json.dumps(xml.serialize().decode('utf8'))

basestring was removed, so

isinstance("", basestring)

became

isinstance("", str)

This change to explicit unicode and bytes handling affected the way we opened files. In Python 2, we could open and use a binary file without specifying that it was binary:

open('file.zip')

In Python 3, we have to specify that it’s a binary file:

open('file.zip', 'rb')

Some functions couldn’t handle unicode in Python 2, so in Python 3 we don’t have to encode the unicode as bytes:

urllib.quote(u'tëst'.encode('utf8'))

became

urllib.quote('tëst')

Of course, Python 3 reorganized parts of the standard library, so the last line would actually be:

urllib.parse.quote('tëst')

Dicts

There were also some changes to Python dicts. The keys() method now returns a view object, so

dict.keys()

became

list(dict.keys())

and

dict.iteritems()

became

dict.items()

Virtual environments

Python 3 has virtual environments built in, which means we don’t need to install virtualenv anymore. There’s no activate_this.py in Python 3 environments, though, so we switched to using django-dotenv instead.
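
Creating and activating one of the built-in virtual environments looks like this (the env directory name is arbitrary):

$ python3 -m venv env
$ source env/bin/activate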

Miscellaneous

Some more changes we made include imports:

from base import * => from .base import *

function names:

func.func_name => func.__name__

and exceptions:

exception.message => str(exception)
except Exception, e => except Exception as e

Optional

Finally, there were optional changes we made. Python 3 uses UTF-8 encoding for source files by default, so we could remove the encoding line from the top of our files. Also, the unicode u'' prefix is allowed in Python 3, but is not necessary.