Advice on how to scale and improve execution times of a “pivot-based query” on a billion-row table, growing by one million rows a day

StackOverflow https://stackoverflow.com/questions/1002086

Question

Our company is developing an internal project to parse text files. Those text files are composed of metadata which is extracted using regular expressions. Ten computers parse the text files 24/7 and feed a high-end Intel Xeon SQL Server 2005 database with the extracted metadata.

The simplified database schema looks like this:

Items

| Id | Name   |
|----|--------|
| 1  | Sample |
Items_Attributes

| ItemId | AttributeId |
|--------|-------------|
| 1      | 1           |
| 1      | 2           |
Attributes

| Id | AttributeTypeId | Value |
|----|-----------------|-------|
| 1  | 1               | 500mB |
| 2  | 2               | 1.0.0 |
AttributeTypes

| Id | Name    |
|----|---------|
| 1  | Size    |
| 2  | Version |

There are many distinct text file types, each with distinct metadata inside. For every text file we have an Item, and for every extracted metadata value we have an Attribute.

Items_Attributes lets us avoid duplicate Attribute values, which keeps the database size from growing by an order of magnitude.

This particular schema allows us to dynamically add new regular expressions and to obtain new metadata from newly processed files, no matter what internal structure they have.

Additionally, this allows us to filter the data and to obtain dynamic reports based on user criteria. We filter by Attribute and then pivot the resultset (http://msdn.microsoft.com/en-us/library/ms177410.aspx). So this example pseudo-SQL query

SELECT FROM Items WHERE Size = @A AND Version = @B

would return a pivoted table like this

| ItemName | Size  | Version |
|----------|-------|---------|
| Sample   | 500mB | 1.0.0   |
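
For reference, against the schema above that pseudo-query has to be expressed as one pair of joins per attribute type, roughly like the sketch below (the AttributeTypeId values 1 and 2 come from the sample data; @A and @B are the same placeholders as in the pseudo-query):

-- Sketch only: the pseudo-query expanded against the EAV schema.
-- Every additional attribute in the filter adds another pair of joins,
-- which is a large part of why these reports grow so expensive.
SELECT  i.Name    AS ItemName,
        a_s.Value AS Size,
        a_v.Value AS Version
FROM Items AS i
JOIN Items_Attributes AS ia_s ON ia_s.ItemId = i.Id
JOIN Attributes       AS a_s  ON a_s.Id = ia_s.AttributeId
                             AND a_s.AttributeTypeId = 1   -- Size
JOIN Items_Attributes AS ia_v ON ia_v.ItemId = i.Id
JOIN Attributes       AS a_v  ON a_v.Id = ia_v.AttributeId
                             AND a_v.AttributeTypeId = 2   -- Version
WHERE a_s.Value = @A
  AND a_v.Value = @B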

The application has been running for months and performance has degraded terribly, to the point where it is no longer usable. Reports should take no more than 2 seconds, and the Items_Attributes table grows by an average of 10,000,000 rows per week. Everything is properly indexed and we have spent considerable time analyzing and optimizing query execution plans.

So my question is, how would you scale this in order to decrease report execution times?

We came up with these possible solutions:

  • Buy more hardware and set up a SQL Server cluster. (we need advice on the proper "clustering" strategy)
  • Use a key/value database like HBase (we don't really know if it would solve our problem)
  • Use an ODBMS rather than an RDBMS (we have been considering db4o)
  • Move our software to the cloud (we have zero experience)
  • Statically generate reports at runtime. (we don't really want to)
  • Static indexed views for common reports (performance is almost the same)
  • De-normalize schema (some of our reports involve up to 50 tables in a single query)

Solution

Perhaps this white paper by SQL Server CAT team on the pitfalls of Entity-Attribute-Value database model can help: http://sqlcat.com/whitepapers/archive/2008/09/03/best-practices-for-semantic-data-modeling-for-performance-and-scalability.aspx

OTHER TIPS

I'd start by posting the exact table metadata (along with indexing details), the exact query text, and the execution plan.

With your current table layout, a query similar to this:

SELECT FROM Items WHERE Size = @A AND Version = @B

cannot benefit from using a composite index on (Size, Version), since it's impossible to build such an index.

You cannot even build an indexed view, since it would contain a self-join on attributes.

Probably the best decision would be to denormalize the table like this:

id  name  size  version

and create an index on (size, version)
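
As a sketch of that suggestion (the table name, column types, and index name below are assumptions for illustration, not taken from the real schema):

-- Illustrative only: denormalized item table with the composite index
-- that the EAV layout cannot offer.
CREATE TABLE dbo.ItemsFlat
(
    Id      int           NOT NULL PRIMARY KEY,
    Name    nvarchar(255) NOT NULL,
    Size    nvarchar(50)  NULL,
    Version nvarchar(50)  NULL
);

CREATE NONCLUSTERED INDEX IX_ItemsFlat_Size_Version
    ON dbo.ItemsFlat (Size, Version)
    INCLUDE (Name);

-- The report query then becomes a simple index seek:
-- SELECT Name, Size, Version FROM dbo.ItemsFlat WHERE Size = @A AND Version = @B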

I have worked with such schemas many times. They never perform well. The best thing is to just store the data in the form you need it:

| ItemName | Size  | Version |
|----------|-------|---------|
| Sample   | 500mB | 1.0.0   |

Then you don't need to pivot. And by the way, please do not call your original EAV schema "normalized" - it is not normalized.

This looks to me like issuing OLAP queries against a database optimized for OLTP transactions. Not knowing the details, I'd recommend building a separate "data warehouse" optimized for the kind of queries you are running. That would involve aggregating data (if possible), denormalizing, and accepting data that is a day old or so. You would incrementally update the data each day, or at any interval you wish.
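
As a rough sketch of such an incremental refresh (the dbo.Report_Items table, the @LastLoadedItemId watermark, and the AttributeTypeId values are assumptions for illustration), a nightly job could flatten only the newly arrived items:

-- Sketch of a daily incremental load into a flattened reporting table.
INSERT INTO dbo.Report_Items (ItemId, ItemName, Size, Version)
SELECT  i.Id,
        i.Name,
        MAX(CASE WHEN a.AttributeTypeId = 1 THEN a.Value END) AS Size,
        MAX(CASE WHEN a.AttributeTypeId = 2 THEN a.Value END) AS Version
FROM Items AS i
JOIN Items_Attributes AS ia ON ia.ItemId = i.Id
JOIN Attributes       AS a  ON a.Id = ia.AttributeId
WHERE i.Id > @LastLoadedItemId          -- only items added since the last run
GROUP BY i.Id, i.Name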

Please post the exact DDL and indexes. If you have indexes on the ID columns, then your query will result in a scan;

instead of something like this

SELECT FROM Items WHERE Size = @A AND Version = @B

you need to do this

SELECT FROM Items WHERE ID = 1

in other words, you need to look up the text values first, find the ids that you are indexing on, and then use those ids in your query to return the results.
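
A sketch of that two-step approach (an index on Attributes (AttributeTypeId, Value) and the type ids 1 and 2 are assumptions based on the sample data):

-- Step 1: resolve the text values to attribute ids (a seek if Attributes
-- is indexed on (AttributeTypeId, Value)).
DECLARE @SizeAttrId int, @VersionAttrId int

SELECT @SizeAttrId = Id
FROM Attributes
WHERE AttributeTypeId = 1 AND Value = @A      -- 1 = Size

SELECT @VersionAttrId = Id
FROM Attributes
WHERE AttributeTypeId = 2 AND Value = @B      -- 2 = Version

-- Step 2: filter Items_Attributes on the ids only.
SELECT i.Id, i.Name
FROM Items AS i
JOIN Items_Attributes AS s ON s.ItemId = i.Id AND s.AttributeId = @SizeAttrId
JOIN Items_Attributes AS v ON v.ItemId = i.Id AND v.AttributeId = @VersionAttrId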

It is probably also a good idea to look at a partitioning function to distribute your data.

Clustering is done for availability, not performance: if one node (the active node) dies, the other node (the passive node) becomes active. Of course, there is also active/active clustering, but that is another story.

A short-term fix may be to use horizontal partitioning. I am assuming your largest table is Items_Attributes. You could horizontally partition this table, putting each partition on a separate filegroup on a separate disk controller.

That's assuming you are not trying to report across all ItemIds at once.
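
A sketch of what that could look like on SQL Server 2005 (table partitioning requires Enterprise Edition; the boundary values, filegroup names, and index name below are purely illustrative):

-- Partition Items_Attributes by ItemId across several filegroups.
CREATE PARTITION FUNCTION pf_ItemId (int)
    AS RANGE RIGHT FOR VALUES (250000000, 500000000, 750000000);

CREATE PARTITION SCHEME ps_ItemId
    AS PARTITION pf_ItemId TO (FG1, FG2, FG3, FG4);

-- Rebuilding the clustered index on the scheme spreads the ranges of
-- ItemId over the filegroups (use DROP_EXISTING if one already exists).
CREATE CLUSTERED INDEX IX_Items_Attributes_ItemId
    ON Items_Attributes (ItemId, AttributeId)
    ON ps_ItemId (ItemId);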

You mention 50 tables in a single query. Whilst SQL Server supports up to 256 tables in a single, monolithic query, taking this approach reduces the chances of the optimiser producing an efficient plan.

If you are wedded to the schema as it stands, consider breaking your reporting queries down into a series of steps which materialise their results into temporary (#) tables. This approach enables you to carry out the most selective parts of the query in isolation, and can, in my experience, offer big performance gains. The queries are generally more maintainable too.
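
For example, a sketch of staging the most selective step first (which attribute is most selective, and the type id 1, are assumptions):

-- Step 1: materialise the most selective filter into a temp table.
SELECT ia.ItemId
INTO #MatchingItems
FROM Items_Attributes AS ia
JOIN Attributes AS a ON a.Id = ia.AttributeId
WHERE a.AttributeTypeId = 1       -- assumed to be the most selective attribute
  AND a.Value = @A

CREATE CLUSTERED INDEX IX_Tmp ON #MatchingItems (ItemId)

-- Step 2: later steps join against the small temp table rather than
-- the billion-row base tables.
SELECT i.Id, i.Name
FROM #MatchingItems AS m
JOIN Items AS i ON i.Id = m.ItemId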

Also (a bit of a long shot, this) you don't say which SQL Server version you're on; but if you're on SQL 2005, given the number of tables involved in your reports and the volume of data, it's worth checking that your SQL Server is patched to at least SP2.

I worked on an ETL project using tables with rowcounts in the hundreds of millions, where we found that the query optimiser in SQL 2005 RTM/SP1 could not consistently produce efficient plans for queries joining more than 5 tables where one or more of the tables was of this scale. This issue was resolved in SP2.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow