Question

I have data consisting of a matrix of integer values that represent a banded distribution curve. I'm optimizing for SELECT performance over INSERT performance. There are at most 100 bands. I'll primarily be querying this data by summing or averaging bands across a period of time.

My question is: can I achieve better performance by flattening this data into a table with one column for each band, or by using a single column representing the band value?

Flattened data

UserId  ActivityId  DateValue  Band1  Band2  Band3  ...  Band100
10001   10002       1/1/2013   1      5      100    ...  200

OR Normalized

UserId ActivityId DateValue Band BandValue
10001  10002      1/1/2013  1    1
10001  10002      1/1/2013  2    5
10001  10002      1/1/2013  3    100

Sample query

SELECT AVG(Band1), AVG(Band2), AVG(Band3)...AVG(Band100)
FROM ActivityBands
WHERE DateValue > '1/1/2012' AND DateValue < '1/1/2013'
GROUP BY UserId

Solution

Store the data in the normalized format.

If you are not getting acceptable performance from this scheme, instead of denormalizing, first consider what indexes you have on the table. You are likely missing an index that would make this perform similarly to the denormalized table. Next, try writing a query against the normalized table whose result set looks like the denormalized table, and use that query to create an indexed view. This will give you select performance identical to that of the denormalized table, while retaining the data-organization benefits of proper normalization.
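
A hedged sketch of that approach, assuming SQL Server and that the normalized table is dbo.ActivityBands with a non-nullable BandValue column (the view and index names below are made up for illustration). Indexed views cannot use AVG, so the view pivots each (UserId, ActivityId, DateValue) group into band columns with SUM(CASE ...) and carries the COUNT_BIG(*) that indexed views require; only three bands are written out, the rest follow the same pattern.

-- Hypothetical names; assumes dbo.ActivityBands(UserId, ActivityId,
-- DateValue, Band, BandValue) with BandValue declared NOT NULL.
CREATE VIEW dbo.vActivityBandsPivot
WITH SCHEMABINDING
AS
SELECT
    UserId,
    ActivityId,
    DateValue,
    SUM(CASE WHEN Band = 1 THEN BandValue ELSE 0 END) AS Band1,
    SUM(CASE WHEN Band = 2 THEN BandValue ELSE 0 END) AS Band2,
    SUM(CASE WHEN Band = 3 THEN BandValue ELSE 0 END) AS Band3,
    -- ...repeat through Band100...
    COUNT_BIG(*) AS RowCnt   -- required in an indexed view with GROUP BY
FROM dbo.ActivityBands
GROUP BY UserId, ActivityId, DateValue;
GO

-- Materializing the view requires a unique clustered index on it.
CREATE UNIQUE CLUSTERED INDEX IX_vActivityBandsPivot
    ON dbo.vActivityBandsPivot (UserId, ActivityId, DateValue);
GO

-- The sample query can then read from the view; WITH (NOEXPAND) forces use of
-- the view's index on editions that do not match indexed views automatically.
SELECT UserId, AVG(Band1) AS AvgBand1, AVG(Band2) AS AvgBand2
FROM dbo.vActivityBandsPivot WITH (NOEXPAND)
WHERE DateValue > '1/1/2012' AND DateValue < '1/1/2013'
GROUP BY UserId;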

Other tips

Denormalization optimizes exactly one means of accessing the data, at the expense of (almost all) others.

If you have only one access method that is performance critical, denormalization is likely to help, though proper index selection is of greater benefit. However, if you have multiple performance-critical access paths to the data, you are better off seeking other optimizations.

Creating an appropriate clustered index, putting your non-clustered indexes on SSDs, and increasing the memory on your server are all techniques that improve performance for all accesses, rather than trading off between them.
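
As a hedged illustration of that index-first approach (the index names are hypothetical, and the right key order depends on your real query mix), two options for the normalized ActivityBands table might look like this:

-- Clustering on (UserId, DateValue, Band) keeps each user's bands together in
-- date order, which suits queries that group by UserId over a date range.
CREATE CLUSTERED INDEX IX_ActivityBands_User_Date_Band
    ON dbo.ActivityBands (UserId, DateValue, Band);

-- Alternatively, a covering nonclustered index keyed on the date lets the
-- sample query satisfy its range predicate with a single scan of the index.
CREATE NONCLUSTERED INDEX IX_ActivityBands_Date
    ON dbo.ActivityBands (DateValue)
    INCLUDE (UserId, Band, BandValue);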

If you are accessing all (or most) of the bands in each row, then the denormalized form is better. Much better in my experience.

The reason is simple: the data is much smaller, so far fewer pages need to be read to satisfy the query. Storing one band per row costs the band value plus the repeated key columns and per-row overhead, roughly 32 bytes per band, so 100 bands take about 3,200 bytes. In the denormalized form, the whole record is about 100*4 + 8 bytes, or roughly 408 bytes. If your query reads a significant number of records, this reduces the I/O requirements significantly.

There is a caveat. If you are only reading one record's worth of data, the 100 normalized band rows fit on a single page and the one denormalized record also fits on a single page, so the I/O for a single page read could be identical in the two cases. The benefit grows as you read more and more data.

Your sample query is reading hundreds or thousands of rows, so denormalization should benefit such a query.
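
For reference, the same aggregation against the normalized layout (table and column names taken from the question) returns one row per user and band rather than 100 columns per user, and per the arithmetic above it reads roughly eight times as many bytes for the same date range:

SELECT UserId, Band, AVG(BandValue) AS AvgBandValue
FROM ActivityBands
WHERE DateValue > '1/1/2012' AND DateValue < '1/1/2013'
GROUP BY UserId, Band;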

If you want to fetch data very fast, you should flatten the table and use indexes to improve selects over a broad column range, similar to what you have proposed. However, if you are optimizing for quick updates, then normalizing to third or fourth normal form, at the cost of more table joins when reading, should offer better performance.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow