Question

I was recently on the OEIS (Online Encyclopedia of Integer Sequences), trying to look up a particular sequence I had on hand.

Now, this database is fairly large. The website states that if the 2006 edition (already 5 years old!) were printed, it would occupy 750 volumes of text.

I'm sure this is the same sort of issue Google has to handle as well. But they also have a distributed system that takes advantage of load balancing.

Neglecting load balancing, however, how does the time to do a query grow with database size?

Or in other words, what is the time complexity of a query with respect to DB size?

Edit: To make things more specific, assume the input query is simply looking up a string of numbers such as:

1, 4, 9, 16, 25, 36, 49

Solution

It strongly depends on the query, the structure of the database, contention, and so on. But in general most databases will find a way to use an index, and that index will either be some kind of tree structure (see http://en.wikipedia.org/wiki/B-tree for one option), in which case access time is O(log(n)), or else a hash, in which case access time is O(1) on average (see http://en.wikipedia.org/wiki/Hash_function#Hash_tables for an explanation of how they work).

So the answer is typically O(1) or O(log(n)) depending on which type of data structure is used.
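
To make those two complexities concrete, here is a minimal Python sketch (not how a real database engine is implemented; the toy key layout is made up for illustration) contrasting a hash lookup with a sorted-structure lookup, using the squares sequence from the question:

```python
import bisect

# Toy "database" of OEIS-style sequences, keyed by a comma-separated prefix.
# A dict is a hash table, so lookups are O(1) on average.
hash_index = {
    "1, 4, 9, 16, 25, 36, 49": "A000290 (the squares)",
    "1, 1, 2, 3, 5, 8, 13": "A000045 (Fibonacci)",
}

# Hash lookup: one hash computation, then a direct probe.
print(hash_index["1, 4, 9, 16, 25, 36, 49"])

# A sorted list of keys stands in for a B-tree: binary search takes
# O(log n) comparisons, just like descending a balanced tree.
sorted_keys = sorted(hash_index)
key = "1, 4, 9, 16, 25, 36, 49"
i = bisect.bisect_left(sorted_keys, key)
if i < len(sorted_keys) and sorted_keys[i] == key:
    print(hash_index[sorted_keys[i]])
```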

This may cause you to wonder why we don't always use hash functions. There are multiple reasons. Hash functions make it hard to retrieve ranges of values (more on this below). If the hash function fails to distribute data well, access time can degrade to O(n). Hashes need occasional resizing, which is potentially very expensive. And log(n) grows slowly enough that you can treat it as reasonably close to constant across all practical data sets (from a thousand rows to a petabyte it varies only by a factor of about 5). Finally, the actively requested data frequently shows some sort of locality, which trees do a better job of keeping in RAM. As a result, trees are somewhat more commonly seen in practice, though hashes are by no means rare.
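
The range-query point in particular is easy to demonstrate. A sorted structure can answer "everything between X and Y" by locating the two endpoints, while a hash scatters keys and forces a full scan. A rough Python illustration (the keys here are invented):

```python
import bisect

keys = sorted([3, 8, 15, 21, 42, 57, 99])  # stands in for tree-ordered index keys

# Range query on the sorted structure: two O(log n) searches bracket the
# answer, then the matches are read off contiguously.
lo = bisect.bisect_left(keys, 10)
hi = bisect.bisect_right(keys, 60)
print(keys[lo:hi])  # [15, 21, 42, 57]

# With a hash table the keys are in no useful order, so the same query
# degenerates into checking every entry: O(n).
hashed = {k: None for k in keys}
print(sorted(k for k in hashed if 10 <= k <= 60))
```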

OTHER TIPS

That depends on a number of factors including the database engine implementation, indexing strategy, specifics of the query, available hardware, database configuration, etc.

There is no way to answer such a general question.

A properly designed and implemented database with terabytes of data may actually outperform a badly designed little one (particularly one with no indexing that uses badly performing non-sargable queries and things such as correlated subqueries). This is why anyone expecting to have large amounts of data needs to hire an expert in large-database design to do the initial design, not later when the database is already large. You may also need to invest in the kind of equipment required to handle the size.
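
As a rough illustration of how much indexing matters (a toy sketch with made-up tables, not a benchmark of any real engine), compare a full scan of a small unindexed table against an indexed lookup in a table ten times larger:

```python
import time

# Hypothetical tables: a small one with no index and a 10x larger indexed one.
small_unindexed = [(i, f"row-{i}") for i in range(100_000)]   # 100k rows, no index
large_indexed = {i: f"row-{i}" for i in range(1_000_000)}     # 1M rows, hash-indexed

# Unindexed lookup: scan every row until the key matches, O(n).
start = time.perf_counter()
match = next(row for row in small_unindexed if row[0] == 99_999)
scan_time = time.perf_counter() - start

# Indexed lookup on ten times the data: a single hash probe, O(1) on average.
start = time.perf_counter()
match = large_indexed[999_999]
index_time = time.perf_counter() - start

# The indexed lookup is typically orders of magnitude faster despite the
# much larger table.
print(f"full scan: {scan_time:.6f}s, indexed lookup: {index_time:.6f}s")
```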

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow