Basic Performance Optimization in Django

Python has this cool little module called timeit that tells you how long it takes to execute a function.
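The IPython sessions below assume timeit has been imported directly. For reference, here's the basic usage pattern (with a throwaway work function purely for illustration):

```python
from timeit import timeit

def work():
    return sum(range(1000))

# Runs `work` 10 times and returns the total elapsed time in seconds.
print(timeit(work, number=10))
```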

```
In [12]: timeit(get_books_by_library_id, number=10)
Out[12]: 6.598360636999132

In [13]: timeit(get_books_by_library_id_one_query, number=10)
Out[13]: 0.677092163998168
```

Even though our second function has to loop over 10,000 books, it still runs nearly 10x faster than the original function by eliminating round trips to the database.
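The two functions being compared aren't shown here; as a rough sketch of their shape (my reconstruction, not the original code, assuming the article's Library and Book models with a library_id foreign key on Book), they might look something like this:

```python
from collections import defaultdict

# One query per library: the number of round trips scales with the data.
def get_books_by_library_id():
    result = {}
    for library in Library.objects.all():
        result[library.id] = list(Book.objects.filter(library_id=library.id))
    return result

# One query for all books, grouped in Python afterwards.
def get_books_by_library_id_one_query():
    result = defaultdict(list)
    for book in Book.objects.all():
        result[book.library_id].append(book)
    return result
```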

Keep in mind this all depends on the sparseness/density and scale of your data.

In my experience, code generally runs slower when the number of SQL queries it makes scales with the size of some other input.

select_related()

Imagine you wanted a list of strings in the format Harry Potter and the Sorcerer's Stone by J.K. Rowling for each library.

Django makes it easy to write code that executes unnecessary queries.

If you don’t know what’s happening under the hood or you’re working with unfamiliar models, you might write something like this:

```python
from collections import defaultdict

def get_books_by_author():
    books = Book.objects.all()
    result = defaultdict(list)
    for book in books:
        author = book.author
        title_and_author = '{} by {}'.format(
            book.title,
            author.name
        )
        result[book.library_id].append(title_and_author)
    return result
```

Here’s the problem: each time you access book.author you’re making a query equivalent to Author.objects.get(id=book.author_id).

If you’re looping over thousands of books you’re also making thousands of completely unnecessary queries.
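You can see this for yourself in a shell by counting the queries Django records (it only records them when DEBUG=True):

```python
from django.db import connection, reset_queries

reset_queries()
get_books_by_author()
# One query for the books, plus roughly one per book for its author.
print(len(connection.queries))
```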

Use select_related to avoid this:

```python
def get_books_by_author_select_related():
    books = Book.objects.all().select_related('author')
    result = defaultdict(list)
    for book in books:
        author = book.author
        title_and_author = '{} by {}'.format(
            book.title,
            author.name
        )
        result[book.library_id].append(title_and_author)
    return result
```

select_related works by making a single, more complex SQL query which also returns the fields of the related object.

So instead of just getting back data about all of the books, you’re also grabbing data for each book’s author as well.

Now when you access book.author you’re actually accessing a cached version of the author rather than making a separate database query.
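If you’re curious what that single query looks like, you can print the queryset’s query attribute. The output below is a simplification; the exact SQL depends on your models and database backend:

```python
qs = Book.objects.select_related('author')
print(qs.query)
# Roughly: SELECT book.*, author.*
#          FROM book
#          INNER JOIN author ON (book.author_id = author.id)
```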

How much faster is the naive approach vs the approach using select_related? I ran both of these functions using the timeit module and found that the function using select_related was 32x faster:

```
In [12]: timeit(get_books_by_author, number=10)
Out[12]: 41.363460485998075

In [13]: timeit(get_books_by_author_select_related, number=10)
Out[13]: 1.2787263889913447
```

select_related won’t work in all situations, however: "to avoid the much larger result set that would result from joining across a ‘many’ relationship, select_related is limited to single-valued relationships – foreign key and one-to-one."

In order to traverse reverse ForeignKey or ManyToMany relationships we’ll need prefetch_related.

prefetch_related()

prefetch_related is similar to select_related in that it prevents unnecessary SQL queries. Instead of fetching the primary and related objects in one go, prefetch_related makes separate queries for each relationship and “joins” the results together in Python. The downside of this approach is that it requires multiple round trips to the database.

```python
Author.objects.filter(name__startswith='R').prefetch_related('books')
```

How it works: first a request is fired off which runs the primary query, Author.objects.filter(name__startswith='R'). Then a second query equivalent to Book.objects.filter(author_id__in=PKS_OF_AUTHORS_FROM_FIRST_REQUEST) is executed, and both of the responses are merged together into an Author queryset that has each author’s books cached in memory.

So you end up with a similar result to select_related, but you arrive there through different means.
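Once the prefetch has run, iterating over the related manager is served from the cache. A quick sketch (this assumes the reverse relation from Author to Book is named books, as in the snippet above):

```python
authors = Author.objects.filter(name__startswith='R').prefetch_related('books')

for author in authors:
    # .all() hits the prefetch cache: no additional queries.
    # (Calling author.books.filter(...) instead would bypass the cache
    # and issue a new query.)
    titles = [book.title for book in author.books.all()]
    print(author.name, titles)
```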

Although you can use prefetch_related anywhere you use select_related, generally your code will run faster with select_related since it will grab everything you need in one SQL query.

But if your data is particularly sparse (millions of books to just a few libraries) you may see a performance boost despite the extra trips to the database.

So when in doubt, try both methods and see which one comes out on top.

To summarize: use select_related and prefetch_related if you’re going to access a model’s relationship after making an initial query.

Use prefetch_related if you’re following a reverse ForeignKey or ManyToMany relationship.

Use select_related if you’re following a ForeignKey or OneToOne relationship but keep in mind that prefetch_related could be a better choice for these types of relationships depending on the nature of your data.
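The two can also be combined on a single queryset. For example, a Prefetch object lets the prefetched queryset use select_related internally; here's a sketch using this article's models:

```python
from django.db.models import Prefetch

# Two queries total: one for libraries, one for books joined to their authors.
libraries = Library.objects.prefetch_related(
    Prefetch('books', queryset=Book.objects.select_related('author'))
)
```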

values() and values_list()

The time it takes to serialize SQL responses into Python objects scales with both the number of rows and columns being returned. In the function below, we’re serializing all of the fields defined on the book and author even though we only need the author’s name and the book’s library_id and title. We’re also initializing a Django model instance for no reason, since we’re not doing anything special with it (like calling model methods).

```python
def get_books_by_author_select_related():
    books = Book.objects.all().select_related('author')
    result = defaultdict(list)
    for book in books:
        author = book.author
        title_and_author = '{} by {}'.format(
            book.title,
            author.name
        )
        result[book.library_id].append(title_and_author)
    return result
```

Because of this, we incur pretty significant overhead that can be eliminated by calling .values() or .values_list() on the queryset to ask only for the fields we need:

```python
def get_books_by_author_select_related_values():
    books = (
        Book.objects
        .all()
        .select_related('author')
        .values('title', 'library_id', 'author__name')
    )
    result = defaultdict(list)
    for book in books.iterator():
        title_and_author = '{} by {}'.format(
            book['title'],
            book['author__name']
        )
        result[book['library_id']].append(title_and_author)
    return result


def get_books_by_author_select_related_values_list():
    books = (
        Book.objects
        .all()
        .select_related('author')
        .values_list('title', 'library_id', 'author__name')
    )
    result = defaultdict(list)
    for book in books.iterator():
        title_and_author = '{} by {}'.format(
            book[0],
            book[2]
        )
        result[book[1]].append(title_and_author)
    return result
```

.values() returns a list of dictionary representations of the model instances, [{'title': 'Snow Crash', 'library_id': 9, 'author__name': 'Neil'}, ...], and .values_list() returns a list of tuples representing the model instances, [('Snow Crash', 9, 'Neil'), ...].

So how much faster are these functions? By only grabbing the fields we need, the functions using .values() and .values_list() run ~7x faster than the original function.

```
In [13]: timeit(get_books_by_author_select_related, number=10)
Out[13]: 1.2787263889913447

In [14]: timeit(get_books_by_author_select_related_values, number=1)
Out[14]: 0.19064296898432076

In [15]: timeit(get_books_by_author_select_related_values_list, number=1)
Out[15]: 0.17425400999491103
```

Something to note here is that these models are super lightweight: there are only 4 fields on the Book model.

If we were working with models which had tons of fields, the results would be more extreme.

Even if you need all of the fields on the object, .values() and .values_list() will still provide a substantial performance boost by skipping model initialization. You can get all of the fields back by not passing in any fields as arguments.

```python
# Returns a list of model instances.
def get_book_instances():
    return list(
        Book.objects
        .all()
        .select_related('author')
    )


# Returns a list of dictionaries representing model instances.
def get_book_dictionaries():
    return list(
        Book.objects
        .all()
        .select_related('author')
        .values()
    )


# Returns a list of dictionaries with the title of each book.
def get_book_dictionaries_title_only():
    return list(
        Book.objects
        .all()
        .select_related('author')
        .values('title')
    )
```

Getting the dictionary representation of the books is 6.5x faster, and asking for a particular field on the book is 8.8x faster, than requesting the model instances.

```
In [64]: timeit(get_book_instances, number=100)
Out[64]: 12.904168864974054

In [65]: timeit(get_book_dictionaries, number=100)
Out[65]: 2.049193776998436

In [66]: timeit(get_book_dictionaries_title_only, number=100)
Out[66]: 1.4734381759772077
```

Initializing Django model instances is expensive.

If you’re only using the data on the models, you’re likely better off working with their dictionary or tuple representations.

bulk_create()

This one is pretty simple. If you’re going to create more than one object at a time, use bulk_create instead of creating the objects in a loop. As the name suggests, bulk_create will insert a list of objects into the database using one query, regardless of how many objects you’re inserting.
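A minimal sketch of the difference (author, library, and titles here are hypothetical stand-ins):

```python
# Slow: one INSERT statement per book.
for title in titles:
    Book.objects.create(title=title, author=author, library=library)

# Fast: a single INSERT statement for the whole list.
Book.objects.bulk_create([
    Book(title=title, author=author, library=library)
    for title in titles
])
```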

It’s possible to insert so many objects that the single query generated by bulk_create will be slower than multiple smaller queries.

In that case, you can pass in batch_size=SIZE_OF_BATCH as an argument which will break the master query into smaller queries.
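Usage is just one extra keyword argument (500 below is an arbitrary illustrative value):

```python
# Breaks the insert into chunks of 500 rows per query.
Book.objects.bulk_create(books, batch_size=500)
```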

I’m not aware of any rules of thumb regarding the number of objects you need to insert before it makes sense to use batch_size or how to decide on the size of the batch.

Usually I omit it until it becomes clear that there’s a bottleneck and then determine a suitable batch_size from there.

SQL is (generally) faster than Python

Let’s imagine you wanted a function that returns the total number of pages for each library. Using what we’ve learned above you might end up with something like this:

```python
def get_page_count_by_library_id():
    result = defaultdict(int)
    books = Book.objects.all().prefetch_related('pages')
    for book in books:
        result[book.library_id] += book.get_page_count()
    return result
```

Even though this will only fire off two queries, we still have to pull all of the books into memory and loop over each one of them.

This sort of thing can easily be turned into an annotation.

```python
from django.db.models import Sum

def get_page_count_by_library_id_using_annotation():
    result = {}
    libraries = (
        Library.objects
        .all()
        .annotate(page_count=Sum('books__pages'))
        .values_list('id', 'page_count')
    )
    for library_id, page_count in libraries:
        result[library_id] = page_count
    return result
```

Now, instead of pulling a bunch of Django instances into memory and calling a model method on each of them, you’re only pulling the two values you actually care about for each Library: the id and page_count.

Our new function runs about 115x faster than the original.

```
In [66]: timeit(get_page_count_by_library_id, number=10)
Out[66]: 158.0743614450039

In [67]: timeit(get_page_count_by_library_id_using_annotation, number=10)
Out[67]: 1.3725216790044215
```

This example is pretty simple but there’s a lot you can do with annotations.

If you’re running into performance issues while doing math on a large queryset, consider writing an annotation to offload that work to the database.
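For example, counting related rows instead of looping over them in Python is a one-line annotation (a sketch, again assuming the reverse relation from Library to its books is named books):

```python
from django.db.models import Count

# The database counts each library's books; Python never sees the book rows.
book_counts = (
    Library.objects
    .annotate(book_count=Count('books'))
    .values_list('id', 'book_count')
)
```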

Wrapping up

If you’re working on a Django-based project where there has been little or no thought given to performance, there should be quite a bit of low-hanging fruit to get you started. In the majority of cases, the techniques described above will provide a substantial performance boost without much of a trade-off in terms of readability.

