Django: select_related() and memory usage

https://stackoverflow.com/questions/8745979

14-04-2021

Question

I am working on an API, and I have a question. I was looking into using select_related() to save myself some database queries. It does reduce the number of database queries performed, at the expense of bigger and more complex queries.

My question is, does using select_related() cause heavier memory usage? Running some experiments, I noticed that this is indeed the case, but I'm wondering why. Regardless of whether I use select_related(), the response will contain the exact same data, so why does the use of select_related() cause more memory to be used?

Is it because of caching? Maybe separate data objects are used to cache the same model instances? I don't know what else to think.

Thanks in advance.

Solution

It's a tradeoff. It takes time to send a query to the database, for the database to prepare the results, and then to send those results back. select_related works on the principle that the most expensive part of this process is the request and response cycle, not the actual query, so it lets you combine what would otherwise have been distinct queries into just one. That way there's only one request and response instead of several.
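To make the "distinct queries combined into one" idea concrete, here is a minimal sketch that does not use Django at all. It uses Python's built-in sqlite3 module to mimic, roughly, what happens at the SQL level with and without select_related(). The author and book tables, the run() helper, and the query counter are all made up for illustration; the SQL Django actually generates will look different.

```python
import sqlite3

# Hypothetical schema: each book has a foreign key to one author.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT,
                       author_id INTEGER REFERENCES author(id));
    INSERT INTO author VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO book VALUES (1, 'A1', 1), (2, 'A2', 1), (3, 'B1', 2);
""")

query_count = 0


def run(sql, params=()):
    """Execute one statement and count it as one request/response cycle."""
    global query_count
    query_count += 1
    return conn.execute(sql, params).fetchall()


def books_without_join():
    # Roughly the behaviour without select_related(): one query for the
    # books, then one extra query per book to look up its author.
    books = run("SELECT id, title, author_id FROM book ORDER BY id")
    result = []
    for _book_id, title, author_id in books:
        author = run("SELECT name FROM author WHERE id = ?", (author_id,))
        result.append((title, author[0][0]))
    return result


def books_with_join():
    # Roughly the behaviour with select_related(): a single, larger
    # query that joins the related table in.
    return run(
        "SELECT book.title, author.name FROM book "
        "JOIN author ON author.id = book.author_id ORDER BY book.id"
    )


query_count = 0
separate = books_without_join()
separate_queries = query_count

query_count = 0
joined = books_with_join()
joined_queries = query_count

print(separate_queries, joined_queries)
```

Both functions return the same data; the difference is only in how many times the application has to talk to the database.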

However, if your database server is under-powered (not enough RAM, processing power, etc.), the larger query could actually end up taking longer than the request and response cycle. If that's the case, you probably need to upgrade the server, though, rather than not use select_related.

The rule of thumb is that if you need related data, you use select_related. If it's not actually faster, then that's a sign that you need to optimize your database.

UPDATE (adding more explanation)

Querying a database actually involves multiple steps:

1. Application generates the query (negligible)
2. Query is sent to the database server (milliseconds to seconds)
3. Database processes the query (milliseconds to seconds)
4. Query results are sent back to application (milliseconds to seconds)

In a well-tuned environment (sufficient server resources, fast connections) the entire process is finished in mere milliseconds. However, steps 2 and 4 still usually take more time overall than step 3. This is why it makes more sense to send fewer, more complex queries than many simpler ones: the bottleneck is usually the transport layer, not the processing.
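As a rough illustration of that reasoning, the steps can be put into a toy cost model. The function name and all the timings below are invented for the example, not measurements; the point is only that the transport cost is paid once per query.

```python
def total_ms(n_queries, send_ms, process_ms, return_ms):
    """Total time when every query pays for steps 2, 3 and 4 in full."""
    return n_queries * (send_ms + process_ms + return_ms)


# Hypothetical numbers: 1 query for a list of 100 rows plus 100 follow-up
# lookups, versus a single join that is slower to process and returns
# a bigger result.
many_simple = total_ms(n_queries=101, send_ms=1, process_ms=0.5, return_ms=1)
one_complex = total_ms(n_queries=1, send_ms=1, process_ms=5, return_ms=3)

print(many_simple, one_complex)
```

With these made-up numbers the single complex query wins comfortably, even though it is ten times more expensive for the database to process.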

However, a poorly optimized database on an under-powered machine with large and complex tables could take a very long time to run the query, becoming the bottleneck. That would negate the time saved by sending one complex query instead of multiple simpler ones, i.e. the database would have responded quicker to the simpler queries and the entire process would have taken less net time.
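One way to picture where that tipping point lies is to compute how long the database may spend on the single complex query before the multiple simple queries become faster overall. This is only a sketch with invented parameter names and timings, using the same four-step breakdown as above.

```python
def break_even_process_ms(n_simple, simple_query_ms,
                          complex_send_ms, complex_return_ms):
    """Processing time at which one complex query costs exactly as much
    as n_simple simple queries (each taking simple_query_ms end to end)."""
    return n_simple * simple_query_ms - (complex_send_ms + complex_return_ms)


# Hypothetical: 101 simple queries at 2.5 ms each, versus one complex
# query with 1 ms to send and 3 ms to return its results.
threshold = break_even_process_ms(
    n_simple=101, simple_query_ms=2.5,
    complex_send_ms=1, complex_return_ms=3,
)

print(threshold)
```

If the struggling database needs longer than this threshold to process the complex query, the simpler queries finish sooner in total, which is the situation described above.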

Nevertheless, if this is the case, the proper response is to fix the database side: optimize the database and its configuration, add more server resources, etc., rather than reverting to multiple simple queries.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow