Question

I have a data warehouse based on Postgresql.

Until now, I have been trying out queries on a database containing just a fraction of my real data. Once I have written the queries in a way that makes them efficient for this small test database, I run them on the real one.

The problem is that once I run the queries on the real database, it runs out of memory and starts writing things like indexes and temp tables to disk. This means that different queries could be optimal for the test database and for the real database. Do I really have to run queries that take several minutes to complete in order to know which one is optimal?
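I can see the spilling happen, for example, by turning on temp-file logging and looking at EXPLAIN (ANALYZE, BUFFERS) output. The query below is just a placeholder for my real ones:

```sql
-- Log every temp file written (value is a size threshold in kB; 0 logs all).
-- Requires superuser; otherwise set it in postgresql.conf.
SET log_temp_files = 0;

-- Placeholder query; the real warehouse queries are more complex.
EXPLAIN (ANALYZE, BUFFERS)
SELECT customer_id, sum(amount)
FROM sales
GROUP BY customer_id
ORDER BY sum(amount) DESC;
-- Lines like "Sort Method: external merge  Disk: ...kB" or
-- "temp read=... written=..." show the work spilling to disk.
```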

Was it helpful?

Solution

Learn how to interpret EXPLAIN output, then check that the plan chosen on your large database is close to what you expect before actually running the query.
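For example (a sketch; the table and column names are placeholders), plain EXPLAIN only plans the query and does not execute it, so it comes back immediately even against the full-size warehouse:

```sql
-- Plain EXPLAIN plans the query without executing it.
EXPLAIN
SELECT c.region, sum(s.amount)
FROM sales s
JOIN customers c ON c.id = s.customer_id
WHERE s.sale_date >= DATE '2012-01-01'
GROUP BY c.region;
```

Compare the join methods, scan types, and row estimates with the plan on the small test database; a switch from index scans to sequential scans, or from in-memory hash joins to on-disk sorts, is the warning sign.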

Other tips

Three questions:

1) How complex are the queries? The generation of indexes and temp tables suggests the server has to build these things because of complex operations on unindexed columns. How likely is this? From what you report, the likely answer is "complex".

2) How large are the result sets? Is the end result 100 rows or 1 million? From what you report, the answer could be anything. I suspect this question is less important, but it is important at least to know.

3) Restating question 1 in a different way: even if the returned sets are small, are there enormous intermediate results that have to be compiled on the way to the small result? Again, I suspect the answer here is that large, complex intermediate results are being generated.

This would suggest that at the very least some things need to be indexed, and perhaps the data needs to be structured on the way in so that it is closer to what you are trying to query.
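As a sketch (the table and column names are placeholders), indexing the columns used in joins and filters and then re-checking the plan looks roughly like this:

```sql
-- Hypothetical columns used in the joins/filters of the slow queries.
CREATE INDEX idx_sales_customer_id ON sales (customer_id);
CREATE INDEX idx_sales_sale_date   ON sales (sale_date);

-- Refresh planner statistics so the new indexes are actually considered.
ANALYZE sales;

-- Then re-run EXPLAIN and confirm the plan now uses the indexes.
```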

One last question: is this a pervasive problem for most of your more important queries, or only for one or two?

EDIT IN RESPONSE TO COMMENT: I do data warehouse queries all day, and some of them take 10 minutes or so. Some take hours, and I push them off into a background job and break them up into stages to prevent bogging everything down. That is the nature of handling very large data sets.
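Breaking a query into stages can be as simple as materializing the intermediate result into its own table and letting later steps read from that; a rough sketch with placeholder names:

```sql
-- Stage 1: materialize the expensive intermediate result once.
CREATE TABLE monthly_sales AS
SELECT customer_id,
       date_trunc('month', sale_date) AS month,
       sum(amount) AS total
FROM sales
GROUP BY customer_id, date_trunc('month', sale_date);

CREATE INDEX ON monthly_sales (customer_id, month);
ANALYZE monthly_sales;

-- Stage 2: later queries read the much smaller staged table.
SELECT month, sum(total)
FROM monthly_sales
GROUP BY month
ORDER BY month;
```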

My questions in the original answer are aimed at figuring out whether your problem queries will ever finish. It is possible to unwittingly write a query that produces so much intermediate data that you can walk away, come back 2 days later, and it is still running. So I would restate my original three questions; they are in fact the only way to answer your question completely.

Recap: Yes, some queries take much longer; it is the nature of the beast. The best you can hope for is performance linear in the amount of data being read, and if there are 100 million rows to process, that will take minutes instead of seconds. But much more importantly, if a query runs in 4 seconds on 1 million rows but on 100 million rows takes >> 400 seconds (like an hour), then the original questions I asked will help you figure out why, with the aim of optimizing those queries.
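Since your server is spilling sorts and hashes to disk, one knob worth checking (standard PostgreSQL, nothing specific to your setup) is work_mem, the per-sort/per-hash memory budget. A sketch, with a placeholder query:

```sql
-- work_mem applies per sort/hash node, so raise it per session, not globally.
SET work_mem = '256MB';

-- Re-run the problem query (placeholder below) and check whether the
-- "external merge  Disk: ..." lines disappear from the EXPLAIN ANALYZE output.
EXPLAIN (ANALYZE, BUFFERS)
SELECT customer_id, sum(amount)
FROM sales
GROUP BY customer_id;

RESET work_mem;
```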

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow