
Increasing Django Performance

By Marcelo Fernandes Nov 7, 2017


What you will learn:


  • How to debug your page and look for fragile spots.
  • How to make better use of the Django ORM by loading only the data you need.
  • How to work with F() objects.
  • DB tricks.
  • Using Django Cache.
  • Going "Raw" with SQL.


Basics


If you program in Python and have been doing some web development with Django, you have probably faced questions like: "Why don't you use NodeJS? Django is so slow, Python itself is slow!", or "Django is such a huge framework, why don't you go with Flask? I'm sure it's gonna be better!". And so on...

Most of the time, these questions don't come from deep thinking about the tools involved. But even so, we (Django programmers) often don't follow good practices, or worse, we don't even know that those rules exist and where to find them! So sometimes we mess up our code base and we perpetuate this reputation.

I'm writing this post hoping it helps the community make better use of Django, and (maybe) you can end up running a more performant Django app!



Knowing where the problem is.


This is the most complicated part. While working on some applications, I realized that it's sometimes hard to spot where the big problems are and where the lack of performance is hidden. As your project grows, you might end up with a team of more than 50 developers and more than 70 Django apps. How easy is it going to be to measure everything? Should I wait for code reviews to catch a bug or a badly written query? What if there are some fat queries in the middle of the code that we never hear about, simply because we forgot them? Maybe the slow pieces of code live only in the template rendering; who would possibly know?

I guess it is pretty clear that we need a good debugging tool. At the very least, it should tell us how many queries are being executed on a certain page, how long those queries take to execute, how many hits on the cache system we made in order to retrieve chunks of content, and so on.

It turns out there is a tool that does exactly that, in a pretty cool way. It's been around for a while and it's recommended by Django itself. So let's take a look at the Django Debug Toolbar.



First Level: Finding Issues with Django Debug Toolbar



Django Debug Toolbar is a must-have. This tool helps a lot and there are tons of extensions to help you check how good or bad your response times are. The installation is pretty straightforward, as stated on the documentation website. And please feel free to contribute to it as well.
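
For reference, a minimal setup (for the toolbar versions around the time of writing) usually looks like the sketch below; double check the project's documentation for the exact settings of the version you install:


# settings.py -- development only
INSTALLED_APPS = [
    # ... your apps ...
    'debug_toolbar',
]

MIDDLEWARE = [
    # ... your other middleware ...
    'debug_toolbar.middleware.DebugToolbarMiddleware',
]

# The toolbar only renders for requests coming from these IPs.
INTERNAL_IPS = ['127.0.0.1']


# urls.py
from django.conf import settings
from django.conf.urls import include, url

if settings.DEBUG:
    import debug_toolbar
    urlpatterns += [
        url(r'^__debug__/', include(debug_toolbar.urls)),
    ]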

The usage is interesting. You can see how many queries were executed by a certain page, and you can dive into them. You get the raw SQL that ran against your application, and you can see how long your queries took to run and retrieve their results. With some extensions you can even check how many hits your cache system took, including the names of the cache keys that were used and whether each operation was a get, a set, or something else.

With this tool in hand you can keep track of what you are doing with your code so far, or maybe what your devs are doing so far (lol).

A tip that I'd like to give: whenever you are checking a webpage (or endpoint), look for duplicated queries in your code (believe me, it happens pretty often), or for queries that could be using inner joins but end up as a whole new SELECT instead (we will discuss that later); those are the spots you want to find and optimize. Another rule that I like to use: always check whether a single page is hitting the cache less than 10 times; more than that can grow into something hard to manage in the future.
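
If you want to lock a query count in place once a page has been optimized, Django's test framework can help as well. A minimal sketch (the '/dogs/' URL and the expected count of 3 are made up for illustration; use the numbers the Debug Toolbar reports for your own page):


# tests.py -- guard a page against query-count regressions
from django.test import TestCase


class DogListQueryCountTest(TestCase):
    def test_dog_list_query_count(self):
        # '/dogs/' and the count of 3 are hypothetical placeholders.
        with self.assertNumQueries(3):
            self.client.get('/dogs/')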



Second Level: Fixing the code after finding the initial problems


For the second level I'm not able to help you that much; you have to dive in and see where in your code you need to optimize. Remember that you might be inheriting problems from other places (such as mixins and high-level templates). Take a look at the debug toolbar's traceback and see if you can come up with a good idea of where you should look.



Third Level: Avoid querying the DB (as much as you can), using select_related(), prefetch_related() and defer()


Remember: Django is lazy, which means it only hits the DB when you evaluate a query. Try to get more advanced with the Django ORM and take advantage of it. Use select_related() and keep in mind that the SQL SELECT statement is an expensive operation; when you use select_related(), the data from the foreign keys related to your model instances is retrieved in a single SELECT. Check it out:

This example shows the difference between plain lookups and select_related(). Let's say that you have an Owner model that relates to a Dog model.


# Hits the database.
owner = Owner.objects.get(id=5)

# Hits the database again to get the related Dog object.
dog = owner.dog

And then, how it should be:


# Hits the database.
owner = Owner.objects.select_related('dog').get(id=5)

# Doesn't hit the database, because owner.dog has been prepopulated
# in the previous query.
dog = owner.dog

Another trick: prefetch_related(). It gives you similar functionality to select_related(), but extended to many-to-many and many-to-one relationships.

Take a look at this example:


from django.db import models

class Author(models.Model):
    name = models.CharField(max_length=30)

class Book(models.Model):
    name = models.CharField(max_length=50)
    authors = models.ManyToManyField(Author)

    def __str__(self):
        return "%s (%s)" % (
            self.name,
            ", ".join(author.name for author in self.authors.all()),
        )

If we run the query:


Book.objects.all()
# ["Awesome Book (Maria, John)", "Another Book ("John, Mark)", ....]

The problem here is that whenever we ask for the books via Book.objects.all(), Book.__str__() asks for all the author objects, and that happens for every single Book instance. We want to avoid that by issuing only two queries: one hitting the books, and another hitting the authors.

The solution is pretty simple, just run:


Book.objects.all().prefetch_related('authors')

Just name the relation that is being used as a many-to-many or many-to-one relationship, and now each time we run self.authors.all() the ORM will look up a prefetched queryset that is stored in the cache, rather than going back to the table and making a huge number of queries. (That might save some time.)

And finally, use defer() to skip the columns on your models that don't matter at the moment. Sometimes we have tables that are huge and have a lot of fields, even PDFs, videos, images, etc. But often you only need a few columns, and obviously you don't want to spend extra time on your query loading data that you aren't going to use by any means. So let's get rid of that with defer().

Take a look at an example:


Map.objects.defer('pdf').filter(area__gt=5000)

In this case we don't want to load the pdf column from the database when we query our Map model; let's just skip this field and move on. (You can take a look at the only() method as well, which is the "inverse" of defer() and loads only the fields you name.)
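
For comparison, a quick sketch of only() against the same Map model (the name field is assumed here, since the model isn't shown in full):


# Loads only name and area right away; everything else, including the
# heavy pdf column, is deferred until it is actually accessed.
maps = Map.objects.only('name', 'area').filter(area__gt=5000)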



Fourth Level: Use F() expressions instead of doing database work in Python


This feature is unknown and underrated, but it can boost your performance in an amazing way.

The F() object is pretty useful when you don't want to bring a model into Python memory and all you want is to modify some data directly at the SQL level. When you use F(), the Django ORM will generate a SQL expression that updates the field directly, so you don't have to pull data out of your database, modify it and push it back in; you can do it all in a single query.

Example: let's say you have a Student model, and you want to increase the grade of a certain student based on their effort during the class.

Naturally, you would do something like:

student = Student.objects.get(name='Peter')
student.grade += 1.0
student.save()

In the example above, we pulled our student from the database into Python memory, manipulated it using Python operations and finally saved it. In summary, we did two queries: one SELECT and another that saves the data to the database. We also lost some time on the Python assignment operation. But we can improve that with F(), check it out:


from django.db.models import F

student = Student.objects.get(name='Peter')
student.grade = F('grade') + 1
student.save()

It looks like an ordinary Python operation, but whenever Django steps on an F() instance, it encapsulates the operation into the SQL expression instead of hitting the DB beforehand; in this case it builds an SQL expression that increments the 'grade' field of a certain student.

In the example above we still have two SQL operations, one for the get and another for the save, but we are no longer doing the arithmetic in Python memory. BUT, we can improve even more!


students = Student.objects.filter(name='Peter')
students.update(grade=F('grade') + 1)

Looks good now! All in one query, with no data being loaded into Python and manipulated there. Cool. Before finishing, one last tip:

F() instances persist. (What does that mean?)


student.grade                      # grade = 5
student.grade = F('grade') + 1
student.save()                     # grade = 6
student.name = 'Lucas'
student.save()                     # grade = 7

Whenever you hit save again, your instance is going to be updated with the F() expression once more. To avoid that, you should either fetch the object again OR call refresh_from_db():


student.grade                      # grade = 5
student.grade = F('grade') + 1
student.save()
print(student.grade)               # <CombinedExpression: F(grade) + Value(1)>
student.refresh_from_db()          # grade = 6

Not so bad, right? But hold on, the best is yet to come:



Fifth Level: Use indexes on your tables


Queries that run over unique or indexed columns run much faster than queries over columns that are neither unique nor indexed. This happens essentially because, without an index, the database has to scan the whole table to make sure it finds every row that matches the query; with an index, it can go straight to the matching rows and return them much faster.

Example:


# Using the "Car" model
from django.db import models

class Car(models.Model):
    name = models.CharField(max_length=60)
    brand = models.CharField(max_length=20)


# This query
car = Car.objects.get(id=1000)
# Is much faster than:
car = Car.objects.get(name='Ford Ka')

This happens because the id, or primary key, is a unique key by default and is indexed by the database. On the other hand, the name field is neither unique nor indexed, which results in a slower query.

So how do we fix that? Let's index our name column, it is pretty easy:


class Car(models.Model):
    name = models.CharField(db_index=True, max_length=60)
    brand = models.CharField(db_index=True, max_length=20)

Just remember that you will have slower updates and inserts, because every time you write, the index has to be updated as well; but it usually pays off (don't trust me, though: test it!).
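
If you are on Django 1.11 or newer, you can also declare indexes in the model's Meta class, which keeps them all in one place. A sketch, equivalent in effect to the db_index=True version above:


from django.db import models


class Car(models.Model):
    name = models.CharField(max_length=60)
    brand = models.CharField(max_length=20)

    class Meta:
        indexes = [
            # One index per column; composite indexes are also possible.
            models.Index(fields=['name']),
            models.Index(fields=['brand']),
        ]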



Sixth Level: Still slow? Start using the Cache


Remember: caching is not worth using to prop up code that is already bad. Instead of using the cache as make-up for bad code, use it as one of the last steps to increase performance, so at least you can guarantee you gave everything else a shot. Caching is not a lazy shortcut to fast performance, and if done wrong it can cause you more problems than you can imagine. (Refactoring cached pieces of code might leave your developers a little afraid of touching them, since they are not so trivial to debug and measure, and you might end up with legacy code in the middle of your repo.)

So, what does Django have to offer?

Well, Django has one of the most robust cache systems compared to other frameworks. It lets you use the cache from a high-level perspective (by just caching everything) all the way down to a low-level perspective (by setting different cache blocks in your template, for example).

The Django documentation already makes some cool statements about the cache set-up, so let's just dive into the interesting stuff:
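
Just so the later examples have something to run against, a minimal cache backend configuration might look like the sketch below (the local-memory backend is only a stand-in; in production you would typically point this at Memcached or Redis):


# settings.py
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.locmem.LocMemCache',
        'LOCATION': 'unique-snowflake',
    }
}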

Caching Everything:

I don't like the idea of caching everything, because you may lose control of your application by doing it. But there are scopes in which it might be an interesting idea. Take this site (marcelonet) for instance: it's a regular blog-ish website, there isn't a lot of dynamic and mutually dependent stuff, it's just a really, really simple Django application rendering some templates. Well, this is a good place to cache everything!

So, in order to cache everything, what's the first idea that pops into your mind? Yeah! A middleware that captures every request, checks whether the page is already cached, and if not, caches it!

You would have something like this in the MIDDLEWARE section of your settings.py:


MIDDLEWARE = [
    'django.middleware.cache.UpdateCacheMiddleware',
    # ... middlewares that change the Vary header.
    'django.middleware.cache.FetchFromCacheMiddleware',
]

CACHE_MIDDLEWARE_SECONDS = 900  # Cache timeout (in seconds)

It's as easy as that. But you may wonder... what is a Vary header? Well, let's say that you want to cache your entire site, but you still want some control over your cache. You might have a view that you want to cache per Authorization token, which means that for every different token the cache stores a different version of the same view. It can be done like this:


from django.views.decorators.vary import vary_on_headers

@vary_on_headers('Authorization')
def my_view(request):
    ...

And now you have your entire site cached, with a certain degree of control.

Caching per-view

This is my favourite: you get much more control over your application, and you cache only the places where you really need a performance improvement.

Take a look at this example:

Your views.py

from django.views import View


class MyClassView(View):
    def get(self, request):
        ...

    def post(self, request):
        ...

Your urls.py

from django.conf.urls import url
from django.views.decorators.cache import cache_page
from .views import MyClassView

urlpatterns = [
    url(r'^my_url/?$', cache_page(60 * 60)(MyClassView.as_view())),
]

Why not use the cache_page decorator on a function view or on a method? Well, if you want to inherit from this class later, you will be tied to that decision, so it's better to apply it directly in the URL conf.
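
For comparison, this is roughly what baking the cache into the view itself looks like, using Django's method_decorator; every subclass and every URL pointing at this view now inherits the caching whether it wants it or not, which is exactly the coupling the URL-conf approach avoids:


from django.utils.decorators import method_decorator
from django.views import View
from django.views.decorators.cache import cache_page


# The cache is baked into the view: any subclass is stuck with it.
@method_decorator(cache_page(60 * 60), name='dispatch')
class MyCachedView(View):
    def get(self, request):
        ...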

Caching on the template

It's not that common to find a project in which you cache fragments of your templates, and nowadays, with the popularity of APIs, templates themselves aren't so popular. But there are still a lot of applications out there where this pattern makes sense, especially in high-speed-demanding applications where it's cheaper to compute everything server-side and send it to the client than to rely on the client's machine to compute dynamic stuff for you. It might even increase your SEO authority, if that is something that matters to you.

Let's imagine an example: you have a page that displays some posts, and since different fragments of your website can be cached for different amounts of time and in different ways, you want control over what you cache and for how long.


    {% load cache %}
    {% cache 900 'article_text' article.id article.modified %}
        ...
    {% endcache %}
        ...
    {% cache 90000 'article_footer' article.id article.modified %}
        ....
    {% endcache %}

What we are saying here is: for each article that has a different id or a different modification date, we want to cache the article text for 900 seconds; but our footer doesn't need to be refreshed so often, so we stick with 90000 seconds.

You gain more control, but it costs you a bit of extra logic in your template.

You can also access the cache directly using from django.core.cache import cache, which might be handy when you only want to cache some calculations inside your view, but not the entire thing.


from django.core.cache import cache
    
cache.get_or_set('my_new_key', 'my new value', 100)
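
For example, a sketch of caching a single expensive computation inside a view; expensive_report() and the 10-minute timeout are hypothetical, so adapt the key and timeout to your own data:


from django.core.cache import cache
from django.http import JsonResponse


def report_view(request):
    # Only recompute the report when it isn't cached (or has expired).
    report = cache.get('monthly_report')
    if report is None:
        report = expensive_report()  # hypothetical slow function returning a dict
        cache.set('monthly_report', report, 60 * 10)
    return JsonResponse(report)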

Read more at the documentation



Final level: Go SQL


You are mad. Your ORM queries can't be improved any further, you have tried everything, you even cached your stuff, but your queries were slow even before caching, and so your entire application runs slow. Well, time to open a connection to the database and optimize your queries yourself:


from django.db import connection

def my_custom_sql(self):
    with connection.cursor() as cursor:
        cursor.execute("UPDATE bar SET foo = 1 WHERE bar = %s", [self.bar])
        cursor.execute("SELECT foo FROM bar WHERE bar = %s", [self.bar])
        row = cursor.fetchone()

    return row

Take care! Having a lot of raw SQL in the middle of your platform can leave you with a hard-to-maintain website. Use it wisely and only in the places where it is really needed.
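
As a middle ground between the ORM and a bare cursor, Django also offers Manager.raw(), which runs your SQL but still maps the rows back onto model instances. A minimal sketch reusing the Student model from earlier (the myapp_student table name is an assumption about the app label):


# Raw SQL, but the rows still come back as Student instances.
students = Student.objects.raw(
    'SELECT id, name, grade FROM myapp_student WHERE grade >= %s',
    [9.0],
)

for student in students:
    print(student.name, student.grade)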



Final Thoughts:


I hope you learned something new today, and that you are now able to write more performant Django apps. If you find something interesting, or you think this post is missing something, please let me know in the comments section! I will be more than glad to update it.


Cheers!