Brown University Library

Python/Django warnings

I recently updated a Django project from 1.8 to 1.11. In the process, I started turning warnings into errors. The Django docs recommend resolving any deprecation warnings with the current version before upgrading to a new version of Django. In this case, I didn’t start my upgrade work by resolving warnings, but I did run the tests with warnings enabled for part of the process.

Here’s how to enable all warnings when you’re running your tests:

  1. From the CLI
    • use -Werror to raise Exceptions for all warnings
    • use -Wall to print all warnings
  2. In the code
    • import warnings; warnings.filterwarnings('error') – raise Exceptions on all warnings
    • import warnings; warnings.filterwarnings('always') – print all warnings
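As a quick sketch of the in-code approach, turning warnings into errors makes them impossible to miss:

```python
import warnings

# Raise an exception for every warning, instead of just printing it
warnings.filterwarnings('error')

try:
    warnings.warn('old_function is going away', DeprecationWarning)
except DeprecationWarning as e:
    print('caught:', e)
```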

If a project runs with no warnings on a Django LTS release, it’ll (generally) run on the next LTS release as well. This is because Django intentionally keeps compatibility shims until after an LTS release, so that third-party applications can more easily support multiple LTS releases.

Enabling warnings is nice because you also see warnings from Python itself or from other packages, so you can address whatever problems they’re warning about, or at least know that they will be an issue in the future.

Fedora Functionality

We are currently using Fedora 3 for storing our repository object binaries and metadata. However, Fedora 3 is end-of-life and unsupported, so eventually we’ll have to decide what our plan for the future is. Here we inventory some of the functions that we use (or could use) from Fedora. We’ll use this as a start for determining the features we’ll be looking for in a replacement.

  1. Binary & metadata storage
  2. Binary & metadata versioning
  3. Tracks object & file created/modified dates
  4. Checksum calculation/verification (after ingestion, during transmission to Fedora). Note: in Fedora 3.8.1, Fedora returns a 500 response with an empty body if the checksums don’t match – that makes Fedora’s checking less useful, since the API client can’t tell why the ingest caused an exception.
  5. SSL REST API for interacting with objects/content
  6. Messages generated whenever an object is added/updated/deleted
  7. Grouping of multiple binaries in one object
  8. Works with binaries stored outside of Fedora
  9. Files are human-readable
  10. Search (by state, date created, date modified – search provided by the database)
  11. Safety when updating the same object from multiple processes

Python 2 => 3

We’ve recently been migrating our code from Python 2 to Python 3. There is a lot of documentation about the changes, but these are changes we had to make in our code.

Print

First, the print statement had to be changed to the print function:

print 'message'

became

print('message')

Text and bytes

Python 3 changed bytes and unicode text handling, so here are some changes related to that:

json.dumps required a unicode string, instead of bytes, so

json.dumps(xml.serialize())

became

json.dumps(xml.serialize().decode('utf8'))
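A minimal sketch of that fix (the byte string here just stands in for whatever xml.serialize() returns):

```python
import json

xml_bytes = b'<root/>'  # stand-in for xml.serialize() output

# json.dumps() in Python 3 wants str, not bytes, so decode first
payload = json.dumps(xml_bytes.decode('utf8'))
print(payload)  # a JSON string: "<root/>"
```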

basestring was removed, so

isinstance("", basestring)

became

isinstance("", str)

This change to explicit unicode and bytes handling affected the way we opened files. In Python 2, we could open and use a binary file, without specifying that it was binary:

open('file.zip')

In Python 3, we have to specify that it’s a binary file:

open('file.zip', 'rb')

Some functions couldn’t handle unicode in Python 2, so in Python 3 we don’t have to encode the unicode as bytes:

urllib.quote(u'tëst'.encode('utf8'))

became

urllib.quote('tëst')

Of course, Python 3 reorganized parts of the standard library, so the last line would actually be:

urllib.parse.quote('tëst')
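For example, quote() in Python 3 accepts a str directly and percent-encodes its UTF-8 bytes:

```python
from urllib.parse import quote

# No manual .encode('utf8') needed; quote() handles the UTF-8 encoding
print(quote('tëst'))  # t%C3%ABst
```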

Dicts

There were also some changes to Python dicts. The keys() method now returns a view object, so

dict.keys()

became

list(dict.keys())

dict.iteritems()

also became

dict.items()
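A small sketch of the new dict behavior:

```python
d = {'zebra': 1, 'aardvark': 10}

# keys() returns a view object in Python 3, not a list
keys = d.keys()
print(type(keys).__name__)  # dict_keys

# Materialize the view when you need list behavior (indexing, a stable copy)
key_list = list(d.keys())
print(key_list)

# iteritems() is gone; items() is now a lazy view
for name, count in d.items():
    print(name, count)
```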

Virtual environments

Python 3 has virtual environments built in, which means we don’t need to install virtualenv anymore. There’s no activate_this.py in Python 3 environments, though, so we switched to using django-dotenv instead.
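Creating and activating a built-in virtual environment looks like this (the env directory name is just an example):

```shell
# Create a virtual environment using the built-in venv module
python3 -m venv env

# Activate it for the current shell session
source env/bin/activate
```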

Miscellaneous

Some more changes we made include imports:

from base import * => from .base import *

function names:

func.func_name => func.__name__

and exceptions:

exception.message => str(exception)
except Exception, e => except Exception as e
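Putting those two exception changes together in a runnable sketch:

```python
try:
    raise ValueError('bad input')
except ValueError as e:   # Python 3 requires "as", not a comma
    # exception.message is gone in Python 3; use str(exception)
    print(str(e))
```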

Optional

Finally, there were optional changes we made. Python 3 uses UTF-8 encoding for source files by default, so we could remove the encoding line from the top of files. Also, the unicode u'' prefix is allowed in Python 3, but not necessary.
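For instance, both of these now work in a Python 3 source file with no coding declaration:

```python
# Python 2 needed "# -*- coding: utf-8 -*-" at the top of files with
# non-ASCII literals; Python 3 source files default to UTF-8.

greeting = 'tëst'            # fine without a coding line
assert greeting == u'tëst'   # u'' is allowed again (3.3+), just redundant
print(greeting)
```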

Testing HTTP calls in Python

Many applications make calls to external services, or other services that are part of the application. Testing those HTTP calls can be challenging, but there are some different options available in Python.

Mocking

One option for testing your HTTP calls is to mock out your function that makes the HTTP call. This way, your function doesn’t make the HTTP call, since it’s replaced by a mock function that just returns whatever you want it to.

Here’s an example of mocking out your HTTP call:

import requests

class SomeClass:

  def __init__(self):
    self.data = self._fetch_data()

  def _fetch_data(self):
    r = requests.get('https://repository.library.brown.edu/api/collections/')
    return r.json()

  def get_collection_ids(self):
    return [c['id'] for c in self.data['collections']]

from unittest.mock import patch
MOCK_DATA = {'collections': [{'id': 1}, {'id': 2}]}

with patch.object(SomeClass, '_fetch_data', return_value=MOCK_DATA) as mock_method:
  thing = SomeClass()
  assert thing.get_collection_ids() == [1, 2]

Another mocking option is the responses package. Responses mocks out the requests library specifically, so if you’re using requests, you can tell the responses package what you want each requests call to return.

Here’s an example using the responses package (SomeClass is defined the same way as in the first example):

import responses
import json
MOCK_JSON_DATA = json.dumps({'collections': [{'id': 1}, {'id': 2}]})

@responses.activate
def test_some_class():
  responses.add(responses.GET, 'https://repository.library.brown.edu/api/collections/',
                body=MOCK_JSON_DATA,
                status=200,
                content_type='application/json')
  thing = SomeClass()
  assert thing.get_collection_ids() == [1, 2]

test_some_class()

Record & Replay Data

A different type of solution is to use a package to record the responses from your HTTP calls, and then replay those responses automatically for you.

  • VCR.py – VCR.py is a Python version of the Ruby VCR library, and it supports various HTTP clients, including requests.

Here’s a VCR.py example, again using SomeClass from the first example:

import vcr 
IDS = [674, 278, 280, 282, 719, 300, 715, 659, 468, 720, 716, 687, 286, 288, 290, 296, 298, 671, 733, 672, 334, 328, 622, 318, 330, 332, 625, 740, 626, 336, 340, 338, 725, 724, 342, 549, 284, 457, 344, 346, 370, 350, 656, 352, 354, 356, 358, 406, 663, 710, 624, 362, 721, 700, 661, 364, 660, 718, 744, 702, 688, 366, 667]

with vcr.use_cassette('vcr_cassettes/cassette.yaml'):
  thing = SomeClass()
  fetched_ids = thing.get_collection_ids()
  assert sorted(fetched_ids) == sorted(IDS)

  • betamax – From the documentation: “Betamax is a VCR imitation for requests.” Note that it is more limited than VCR.py, since it only works for the requests package.

Here’s a betamax example (note: I modified the code in order to test it – maybe there’s a way to test the code with betamax without modifying it?):

import requests

class SomeClass:
    def __init__(self, session=None):
        self.data = self._fetch_data(session)

    def _fetch_data(self, session=None):
        if session:
            r = session.get('https://repository.library.brown.edu/api/collections/')
        else:
            r = requests.get('https://repository.library.brown.edu/api/collections/')
        return r.json()

    def get_collection_ids(self):
        return [c['id'] for c in self.data['collections']]


import betamax
CASSETTE_LIBRARY_DIR = 'betamax_cassettes'
IDS = [674, 278, 280, 282, 719, 300, 715, 659, 468, 720, 716, 687, 286, 288, 290, 296, 298, 671, 733, 672, 334, 328, 622, 318, 330, 332, 625, 740, 626, 336, 340, 338, 725, 724, 342, 549, 284, 457, 344, 346, 370, 350, 656, 352, 354, 356, 358, 406, 663, 710, 624, 362, 721, 700, 661, 364, 660, 718, 744, 702, 688, 366, 667]

session = requests.Session()
recorder = betamax.Betamax(
    session, cassette_library_dir=CASSETTE_LIBRARY_DIR
)

with recorder.use_cassette('our-first-recorded-session', record='none'):
    thing = SomeClass(session)
    fetched_ids = thing.get_collection_ids()
    assert sorted(fetched_ids) == sorted(IDS)

Integration Test

Note that with all the solutions I listed above, it’s probably safest to cover the HTTP calls with an integration test that interacts with the real service, in addition to whatever you do in your unit tests.

Another possible solution is to test as much as possible with unit tests without testing the HTTP call, and then just rely on the integration test(s) to test the HTTP call. If you’ve constructed your application so that the HTTP call is only a small, isolated part of the code, this may be a reasonable option.

Here’s an example where the class fetches the data if needed, but the data can easily be put into the class for testing the rest of the functionality (without any mocking or external packages):

import requests

class SomeClass:

    def __init__(self):
        self._data = None

    @property
    def data(self):
        if not self._data:
            r = requests.get('https://repository.library.brown.edu/api/collections/')
            self._data = r.json()
        return self._data

    def get_collection_ids(self):
        return [c['id'] for c in self.data['collections']]


import json
MOCK_DATA = {'collections': [{'id': 1}, {'id': 2}]}

def test_some_class():
    thing = SomeClass()
    thing._data = MOCK_DATA
    assert thing.get_collection_ids() == [1, 2]

test_some_class()

Upgrades and Architecture changes in the BDR

Recently we have been making some architectural changes in the BDR. One big change was migrating from RHEL 5 to RHEL 7, but we also moved from basically a one-server setup to four separate servers.

RHEL 5 => RHEL 7

RHEL 5 support ended in March, so we needed to upgrade. We initially got a RHEL 6 server, but then decided to upgrade to RHEL 7, which gives us more time before we have to upgrade again. Moving to RHEL 7 lets us use more up-to-date software like Redis 2.8.19 instead of 2.4.10, but the biggest issue is that security updates are no longer available for RHEL 5.

Added a Server for Loris

We started using Loris back in the fall. We installed Loris on a new server, and eventually we shut down our previous image server that was running on the same server as most of our other services.

Added Servers for Fedora & Solr

We also added a new server for Solr, and then a new server for Fedora. These two services previously ran on the one server that handled almost everything for the BDR, but now each one is on its own server.

Our fourth server is also RHEL 7 now – that’s where we moved our internet-facing services.

Pros & Cons

One advantage of being on four servers is the security we get from having our services isolated. Processes on a single server can be separated to some extent with different users and firewall rules, but having our backend servers firewalled off from the Internet and separated from each other encourages better security practices.

Also, the resources our services use are separated. If one service has an issue and starts using all the CPU or memory, it can’t take resources from the other services.

One downside of using four servers is that it increases the amount of work to set up and maintain things. There are four servers to set up and install updates on, instead of one. Also, the correct firewall rules have to be set up between the servers.

Django vs. Flask Hello-World performance

Flask and Django are two popular Python web frameworks. Recently, I did some basic comparisons of a “Hello-World” minimal application in each framework. I compared the source lines of code, disk usage, RAM usage in a running process, and response times and throughput.

Lines of Code

Both Django and Flask applications can be written in one file. The Flask homepage has an example Hello-World application, and it’s seven lines of code. The Lightweight Django authors have an example one-page application that’s 29 source lines of code. As I played with that example, I trimmed it down to 17 source lines of code, and it still worked.
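The Flask Hello-World looks roughly like this (a sketch, not an exact copy of the homepage example):

```python
from flask import Flask

app = Flask(__name__)

@app.route('/')
def hello():
    return 'Hello, World!'

if __name__ == '__main__':
    app.run()
```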

Disk Usage

I measured disk usage of the two frameworks by setting up two different Python 3.6 virtual environments. In one, I ran “pip install flask”, and in the other I ran “pip install django.” Then I ran “du -sh” on the whole env/ directory. The size of the Django virtual environment was 54M, and the Flask virtual environment was 15M.

Here are the packages in the Django environment:

Django (1.11.1)
pip (9.0.1)
pytz (2017.2)
setuptools (28.8.0)

Here are the packages in the Flask environment:

click (6.7)
Flask (0.12.1)
itsdangerous (0.24)
Jinja2 (2.9.6)
MarkupSafe (1.0)
pip (9.0.1)
setuptools (28.8.0)
Werkzeug (0.12.1)

Memory Usage

I also measured the RAM usage of both applications. I deployed them with Phusion Passenger, and then the passenger-status command told me how much memory the application process was using. According to Passenger, the Django process was using 18-19M, and the Flask process was using 16M.

Load-testing with JMeter

Finally, I did some JMeter load-testing of both applications. I hit each application with about 1000 requests and looked at the JMeter results. The response time average was identical: 5.76ms. The Django throughput was 648.54 responses/second, while the Flask throughput was 656.62 responses/second.

Final remarks

This was basic testing, and I’m not an expert in this area. Here are some links related to performance:

  1. Slides from a conference talk
  2. Blog post comparing performance of Django on different application servers, on different versions of Python

Storing Embargo Data in Fedora

We have been storing dissertations in the BDR for a while. Students have the option to embargo their dissertations, and in that case we set the access rights so that the dissertation documents are only accessible to the Brown community (although the metadata is still accessible to everyone). The problem is that embargoes can be extended upon request, so we really needed to store the embargo extension information.

We wanted to use a common, widely-used vocabulary for describing the embargoes, instead of using our own terms.  We investigated some options, including talking with Hydra developers on Slack, and emailing the PCDM community. Eventually, we opened a PCDM issue to address the question of embargoes in PCDM. As part of the discussion and work from that issue, we created a shared document that lists many vocabularies that describe rights, access rights, embargoes, … Eventually, the consensus in the PCDM community was to recommend the PSO and FaBiO ontologies (part of the SPAR Ontologies suite), and a wiki page was created with this information.

At Brown, we’re using the “Slightly more complex” option on that wiki page. It looks like this:

<pcdm:Object> pso:withStatus pso:embargoed .

<pcdm:Object> fabio:hasEmbargoDate "2018-11-27T00:00:01Z"^^xsd:dateTime .

In our repository, we’re not on Fedora 4 or PCDM, so we just put statements like these in the RELS-EXT datastream of our Fedora 3 instance. It looks like this:

<rdf:RDF xmlns:fabio="http://purl.org/spar/fabio/#" xmlns:pso="http://purl.org/spar/pso/#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about="info:fedora/test:230789">
<pso:withStatus rdf:resource="http://purl.org/spar/pso/#embargoed"></pso:withStatus>
<fabio:hasEmbargoDate>2018-11-27T00:00:01Z</fabio:hasEmbargoDate>
<fabio:hasEmbargoDate>2020-11-27T00:00:01Z</fabio:hasEmbargoDate>
</rdf:Description>
</rdf:RDF>
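Since RELS-EXT is just RDF/XML, the embargo dates can be read back out with the standard library; a sketch (the snippet below mirrors the datastream above):

```python
import xml.etree.ElementTree as ET

RELS_EXT = '''<rdf:RDF xmlns:fabio="http://purl.org/spar/fabio/#"
        xmlns:pso="http://purl.org/spar/pso/#"
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="info:fedora/test:230789">
    <pso:withStatus rdf:resource="http://purl.org/spar/pso/#embargoed"/>
    <fabio:hasEmbargoDate>2018-11-27T00:00:01Z</fabio:hasEmbargoDate>
    <fabio:hasEmbargoDate>2020-11-27T00:00:01Z</fabio:hasEmbargoDate>
  </rdf:Description>
</rdf:RDF>'''

FABIO = '{http://purl.org/spar/fabio/#}'
root = ET.fromstring(RELS_EXT)
dates = [el.text for el in root.iter(FABIO + 'hasEmbargoDate')]
# With multiple hasEmbargoDate statements, the latest one is the
# effective embargo end (ISO 8601 strings sort chronologically)
print(max(dates))
```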

In the future, we may want to track various statuses for an item (e.g. a dataset) over its lifetime. In that case, we may move toward more complex PSO metadata that describes the various states the item has been in.

Fedora 4 – testing

Fedora 4.7.1 is scheduled to be released on 1/5/2017, and testing is important to ensure good quality releases (release testing page for Fedora 4.7.1).

Sanity Builds

Some of the testing is for making sure the Fedora .war files can be built with various options on different platforms. To perform this testing, you need to have three required dependencies installed and run a couple of commands.

Dependencies

Java 8 is required for running Fedora. Git is required to clone the Fedora code repositories. Finally, Fedora uses Maven as its build/management tool. For each of these dependencies, you can grab it from your package manager, or download it (Java, Git, Maven).

Build Tests

Once your dependencies are installed, it’s time to build the .war files. First, clone the repository you want to test (e.g. fcrepo-webapp-plus):

git clone https://github.com/fcrepo4-exts/fcrepo-webapp-plus

Next, in the directory you just created, run the following command to test building it:

mvn clean install

If the output shows a successful build, you can report that to the mailing list. If an error was generated, you can ask the developers about that (also on the mailing list). The generated .war files will be installed to your local Maven repository (as noted in the output of the “mvn clean install” command).

Manual Testing

Another part of the testing is to perform different functions on a deployed version of Fedora.

Deploy

One way to deploy Fedora is on Tomcat 7. After downloading Tomcat, uncompress it and run ./bin/startup.sh. You should see the Tomcat Welcome page at localhost:8080.

To deploy the Fedora application, shut down your tomcat instance (./bin/shutdown.sh) and copy the fcrepo-webapp-plus war file you built in the steps above to the tomcat webapps directory. Next, add the following line to a new setenv.sh file in the bin directory of your tomcat installation (update the fcrepo.home directory as necessary for your environment):

export JAVA_OPTS="${JAVA_OPTS} -Dfcrepo.home=/fcrepo-data -Dfcrepo.modeshape.configuration=classpath:/config/file-simple/repository.json"

By default, the fcrepo-webapp-plus application is built with WebACLs enabled, so you’ll need a user with the “fedoraAdmin” role to be able to access Fedora. Edit your tomcat conf/tomcat-users.xml file to add the “fedoraAdmin” role and give that role to whatever user you’d like to log in as.

Now start tomcat again, and you should be able to navigate to http://localhost:8080/fcrepo-webapp-plus-4.7.1-SNAPSHOT/ and start testing Fedora functionality.

Django project update

Recently, I worked on updating one of our Django projects. It hadn’t been touched for a while, and Django needed to be updated to a current version. I also added some automated tests, switched from mod_wsgi to Phusion Passenger, and moved the source code from subversion to git.

Django Update

The Django update didn’t end up being too involved. The project was running Django 1.6.x, and I updated it to the 1.8.x LTS release. Django migrations were added in Django 1.7, and as part of the update I added an initial migration for the app. In my test script, I needed to add a django.setup() call for the new Django version, but otherwise there weren’t any code changes required.

Automated Tests

This project didn’t have any automated tests. I added a few tests that exercised the basic functionality of the project by hitting different URLs with the Django test client. These tests were not comprehensive, but they did run a significant portion of the code.

mod_wsgi => Phusion Passenger

We used to use mod_wsgi for serving our Python code, but now we use Phusion Passenger. Passenger lets us easily run Ruby and Python code on the same server, and different versions of Python if we want (e.g. Python 2.7 and Python 3). (The mod_wsgi site has details on when it can and can’t run different versions of Python.)

Subversion => Git

Here at the Brown University Library, we used to store our source code in Subversion. Now we put our code in Git, either on Bitbucket or GitHub, so one of my changes was to move this project’s code from Subversion to Git.

Hopefully these changes will make it easier to work with the code and maintain it in the future.

Python/Django Quicktips: Ordered JSON Load and Django Email Testing

Ordered JSON Load

Recently, I had the need to load some data from our JSON Item API in the same order it was created. When we construct the data, we use an OrderedDict to preserve the order and then we dump it to JSON.

In [1]: import json
In [2]: from collections import OrderedDict
In [3]: info = OrderedDict()
In [4]: info['zebra'] = 1
In [5]: info['aardvark'] = 10

In [6]: info
 Out[6]: OrderedDict([('zebra', 1), ('aardvark', 10)])

In [7]: json.dumps(info)
 Out[7]: '{"zebra": 1, "aardvark": 10}'

By default, though, the JSON module loads that data into a regular dict, and the order is lost.

In [8]: json.loads(json.dumps(info))
 Out[8]: {u'aardvark': 10, u'zebra': 1}

What’s the solution? Tell the json module to load the data into an OrderedDict:

In [9]: json.loads(json.dumps(info), object_pairs_hook=OrderedDict)
 Out[9]: OrderedDict([(u'zebra', 1), (u'aardvark', 10)])

Django email testing

Some of our Django projects send out notification emails, to a user or a site admin. Django has the handy mail_admins and send_mail functions, but what if you want to test that the email was sent?

Django makes it easy to unit-test the emails – its test runner automatically uses a dummy email backend. Then you can import the mail outbox and verify its contents. Here’s a code snippet that tests an email being sent:

from django.core.mail import send_mail
def send_email():
    send_mail('Blog post', 'Test for the blog post', 'digital_technologies@brown.edu',
              ['public@example.com'], fail_silently=False)

from django.test import SimpleTestCase
from django.core import mail

class TestEmail(SimpleTestCase):

    def test_email(self):
        send_email()
        self.assertEqual(len(mail.outbox), 1)
        self.assertEqual(mail.outbox[0].subject, 'Blog post')
        self.assertEqual(mail.outbox[0].body, 'Test for the blog post')

Note: you can’t import outbox from django.core.mail and check that len(outbox) == 1. This is because outbox is just a list, and it gets re-initialized to a new list before each test case.