Deploying Django without downtime

One of my active projects is abiapp.net, a SaaS content management tool for the collaborative creation of yearbook contents, primarily targeted at German high school students. While we currently have few enough users to run all of abiapp.net on a single machine, we have enough of them that a user is online whenever one of us is working on the project, no matter the time of day. We have a philosophy of pushing new code at very short intervals; days with multiple updates are not uncommon.

However, I do not enjoy pushing new code if I know that all users currently using the website will be interrupted – pretty much everything in our application happens in real-time, so users will surely notice a downtime of up to a minute. There were several ways to shorten this time frame, but I decided to work on a solution without any measurable downtime at all.

The general setup

Our application is implemented in Python using Django and runs on a Debian stable virtual machine. The Django application lives inside a gunicorn instance, which is a lightweight WSGI HTTP server you should use if you do web stuff in Python. In front of it all, there is an nginx webserver which serves static content and acts as a proxy for gunicorn. We use MySQL as our database backend and we have a background task queue for long-running tasks, powered by Celery with RabbitMQ as the message broker. The gunicorn instance is controlled by supervisord.

The old deployment setup

We currently deploy using git. We have a git remote on our production server with a post-receive hook, which used to execute the following tasks:

Load the new source code into the working directory
Make a database backup (we learned this the hard way)
Perform database migrations
Compile LESS to CSS
Compress static files
Execute unit tests, just to be sure
Reload gunicorn

However, this setup has some serious problems. The biggest one is that the moment we load our new code into the working directory, Django will use our new templates and static files even though we are still running the old Python code. This is already bad, but it gets way worse in the unlikely event that the unit tests fail and the new Python code is not loaded – then we're stuck in this intermediate state of broken things.

The new deployment setup

We now have two completely independent instances of the application. We have our git repository three times on the production server:

$ ls app.*
app.src/
app.run.A/
app.run.B/

app.src is the bare git repository we push our changes to and app.run.A and app.run.B are two copies of it used for running the application. The application always runs twice:

$ supervisorctl status
abiapp.A       RUNNING    pid 6184, uptime 0:00:02
abiapp.B       RUNNING    pid 6185, uptime 0:00:02

One of those processes runs with the code and templates from app.run.A, the other with those from app.run.B. They listen on different sockets and supervisord knows them as distinct services.

We also have two copies of our nginx webserver config, one of them pointing to the socket of process A and one to the socket of process B. Only one of them is enabled at any time:

$ ls /etc/nginx/sites-available/
abiapp.net-A
abiapp.net-B
$ ls /etc/nginx/sites-enabled/
abiapp.net-A

The nginx config

The nginx configuration looks a bit like this:

upstream abiapp_app_server_A {
    server unix:/home/abiapp/run/gunicorn.A.sock fail_timeout=0;
}

server {
    listen 443;
    server_name .abiapp.net;

    # … SSL foo …

    location /static/ { # The static files directory
        alias /var/www/users/abiapp/www.abiapp.net/static.A/;
        access_log off;
        expires 7d;
        add_header Cache-Control public;
        add_header Pragma public;
    }

    location /media/ {
        # …
    }

    location / { # The application proxy
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header Host $http_host;
        proxy_redirect off;
        if (!-f $request_filename) {
            proxy_pass http://abiapp_app_server_A;
            break;
        }
        root /var/www/users/abiapp/www.abiapp.net/;
    }

    # error pages …
}

The git hook

So our post-receive hook has to find out which of those two processes currently serves users and replace the other process with the new code. We'll now walk through this git hook piece by piece.

The first part determines which instance is running by looking into a text file containing the string A or B. We'll write the current instance name into this file later in our hook.

#!/bin/bash
if [[ "$(cat /home/abiapp/run/selected)" == "B" ]]; then
    RUNNING="B"
    STARTING="A"
else
    RUNNING="A"
    STARTING="B"
fi

We now activate the virtual Python environment our application lives in and move to the source code of the currently running instance.

unset GIT_DIR
source /home/abiapp/appenv/bin/activate
cd /home/abiapp/app.run.$RUNNING/src

First of all, we'll do a backup, just to be sure. You could also use mysqldump here.

echo "* Backup"
python manage.py dumpdata > ~/backup/dump_$(date +"%Y-%m-%d-%H-%M")_push.json
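
If you prefer mysqldump, the backup step could look roughly like the following sketch. The database name abiapp, the credentials file ~/.my.cnf.backup and the two helper functions are assumptions made for illustration; they are not part of our actual hook.

```shell
#!/bin/bash
# Sketch of a mysqldump-based backup step.
# DB_NAME and DB_DEFAULTS are hypothetical; adjust them to your setup.
BACKUP_DIR="${BACKUP_DIR:-$HOME/backup}"
DB_NAME="${DB_NAME:-abiapp}"                        # assumed database name
DB_DEFAULTS="${DB_DEFAULTS:-$HOME/.my.cnf.backup}"  # assumed [client] credentials file

# Build a timestamped file name in the same style as the dumpdata backup.
backup_filename() {
    echo "$BACKUP_DIR/dump_$(date +"%Y-%m-%d-%H-%M")_push.sql"
}

# Dump the database and compress the result.
# --single-transaction takes a consistent snapshot of InnoDB tables
# without locking them while users are working.
backup_database() {
    local target
    target="$(backup_filename)"
    mkdir -p "$BACKUP_DIR" || return 1
    if mysqldump --defaults-extra-file="$DB_DEFAULTS" --single-transaction "$DB_NAME" > "$target"; then
        gzip -f "$target" && echo "$target.gz"
    else
        # Do not leave a partial dump behind.
        rm -f "$target"
        return 1
    fi
}
```

In the hook, this would be called as backup_database || exit 1, so that a failed backup aborts the deployment like every other early step.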

We now change into the checkout of the idle instance and pull the new source code into it.

echo "* Reset server repository"
cd /home/abiapp/app.run.$STARTING || exit 1
git reset --hard
git pull /home/abiapp/app.src master || exit 1
cd src

Then, we perform database migrations and deploy our static files.

echo "* Perform database migrations"
python manage.py migrate || exit 1
echo "* Deploy static files"
python manage.py collectstatic --noinput || exit 1
echo "* Compress static files"
python manage.py compress || exit 1

Note: The two source code directories have slightly different Django configuration files: Their STATIC_ROOT settings point to different directories.

We now perform our unit tests on the production server:

echo "* Unit Tests"
python manage.py test app || exit 1;

And finally restart the gunicorn process.

echo "* Restart app"
sudo supervisorctl restart abiapp.$STARTING

Remember, we just restarted a process which was not visible to the Internet: we replaced the idle one of our two instances. Now the time has come to reload our webserver:

echo "* Reload webserver"
sudo /usr/local/bin/abiapp-switch $STARTING

The abiapp-switch script does nothing more than replace the link in /etc/nginx/sites-enabled with a link to the other configuration file and then call service nginx reload.
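
To give an idea of what such a script can look like, here is a minimal sketch. It is not our actual abiapp-switch: the function name switch_instance and the variables NGINX_AVAILABLE, NGINX_ENABLED and RELOAD_CMD are assumptions, introduced so that the logic can be tried out without touching a real nginx installation.

```shell
#!/bin/bash
# Hypothetical sketch of an abiapp-switch-like script.
NGINX_AVAILABLE="${NGINX_AVAILABLE:-/etc/nginx/sites-available}"
NGINX_ENABLED="${NGINX_ENABLED:-/etc/nginx/sites-enabled}"
RELOAD_CMD="${RELOAD_CMD:-service nginx reload}"

switch_instance() {
    local target="$1" other
    case "$target" in
        A) other="B" ;;
        B) other="A" ;;
        *) echo "usage: abiapp-switch A|B" >&2; return 2 ;;
    esac

    # Refuse to switch to a configuration that does not exist.
    [ -f "$NGINX_AVAILABLE/abiapp.net-$target" ] || return 1

    # Enable the new configuration, then disable the old one.
    # nginx only notices the change on reload, so the order is not critical.
    ln -sfn "$NGINX_AVAILABLE/abiapp.net-$target" "$NGINX_ENABLED/abiapp.net-$target" || return 1
    rm -f "$NGINX_ENABLED/abiapp.net-$other"

    # Intentionally unquoted so that the command and its arguments are split.
    $RELOAD_CMD
}
```

A real script would end with a call such as switch_instance "$1" and would have to run as root, which is why the hook invokes it through sudo.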

This is the moment our new code goes live. On the reload call, nginx spawns new workers using the new configuration¹. All old workers finish their current requests and then shut down, so there really is no measurable downtime. To complete the hook, we restart celery (which waits for all running workers to finish their tasks, then restarts with the new code):

echo "* Restart task queue"
sudo service celeryd restart

And finally we report success and store the name of the newly running instance.

echo "Done :-)"
echo "Instance $STARTING is running now"
echo $STARTING > /home/abiapp/run/selected

So we're done here. As you may have noticed, all early steps of the hook include an || exit 1. If a migration, a unit test or the compression fails, the whole process just aborts and leaves us virtually unharmed, as the working instance keeps running.
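
The repeated || exit 1 could also be wrapped in a small helper that prints the step name and aborts on failure. The following is only a sketch of that pattern, not what our hook does; run_step is a hypothetical name.

```shell
#!/bin/bash
# run_step is a hypothetical helper: print a label, run the given command,
# and abort the whole hook if the command fails.
run_step() {
    local label="$1"
    shift
    echo "* $label"
    if ! "$@"; then
        echo "! $label failed, aborting" >&2
        exit 1
    fi
}

# Possible usage inside the hook:
# run_step "Perform database migrations" python manage.py migrate
# run_step "Unit Tests" python manage.py test app
```

Because the helper exits before the restart and the nginx switch are reached, a failing step has the same effect as the explicit || exit 1 in the listing above.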

A word on database migrations

As you may have noticed, we still have one flaw in our workflow: the database migrations are applied some time before the new code is running, so the old code briefly runs against the new schema. The only really clean solution is to split each of your 'destructive' database migrations across multiple deployment iterations. If, for example, you remove a field from a model (a column from a table), you'd first push the new code with all uses of the field removed and then, in a second push, deploy the database migration which removes the column.

¹ The nginx documentation on the reload command