Question

I am about to import around 500 million rows of telemetry data into SQL Server 2008 R2, and I want to make sure I get the indexing/schema right to allow for fast searches of the data. I've been working with databases for a while but nothing on this scale. I'm hoping I can describe my data and the application, and someone can advise me on a good strategy for indexing it.

The data is instrument readings from a data collection system, and has 3 columns: SentTime (datetime2(3)), Topic (nvarchar(255)), and Value (float). The SentTime precision is to the millisecond, and is NOT unique. There are around 400 distinct Topics (e.g. "Voltage1", "PumpPressure", etc.) in the data. My plan is to break the data out into about 30 tables, each with 10-15 columns, grouped logically into Voltages, Pressures, Temperatures, etc., each with its own SentTime column.
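To make the plan concrete, here is a sketch of what one of those grouped tables might look like. All table and column names here are illustrative, not taken from the actual system:

```sql
-- Hypothetical sketch of one "group" table: its own SentTime
-- plus 10-15 float columns, one per related topic.
CREATE TABLE dbo.Voltages
(
    SentTime datetime2(3) NOT NULL,  -- millisecond precision, not unique
    Voltage1 float NULL,
    Voltage2 float NULL,
    Voltage3 float NULL
    -- ... remaining voltage topics as additional columns
);
```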

A typical search will be to retrieve various Values (could be across several tables) for a given time range. Another possible search will be to retrieve all times/values for a given value range and topic. The user interface will show coarse graphs of the data, to allow the user to find the interesting data and export it to Excel or CSV.
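Assuming the hypothetical grouped tables above, the two search patterns would look roughly like this (table and column names are placeholders):

```sql
-- Pattern 1: values across a time range
SELECT SentTime, Voltage1, Voltage2
FROM dbo.Voltages
WHERE SentTime >= '2012-06-01T00:00:00'
  AND SentTime <  '2012-06-02T00:00:00';

-- Pattern 2: all times/values for a given topic and value range
SELECT SentTime, PumpPressure
FROM dbo.Pressures
WHERE PumpPressure BETWEEN 80.0 AND 120.0;
```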

My main question is, if I add an index based on SentTime alone, will that speed searches for a given time range? Would it be better to make a composite index on time and value, since the time is not unique? Any point in adding a unique primary key? Is there any other overall strategy or schema I should be looking at for this application?

Another note, I will not be inserting any data once the import is done, so no need to worry about the insertion overhead of indexes.


Solution

It seems that you'll be doing a lot of range searches over the SentTime column. In that case, I would create a clustered index on SentTime; with a nonclustered index there would be the overhead of key lookups to retrieve the additional columns. It doesn't matter that SentTime is not unique: the engine adds a hidden uniquifier to duplicate key values.
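That could look like the following, using a placeholder table name from the proposed schema:

```sql
-- Clustered index on the range-search column. It does not need to be
-- UNIQUE: SQL Server silently appends a 4-byte uniquifier to rows
-- that share the same SentTime.
CREATE CLUSTERED INDEX IX_Voltages_SentTime
    ON dbo.Voltages (SentTime);
```

With the data physically ordered by SentTime, a time-range predicate becomes a single index seek plus a sequential scan of adjacent pages, with no lookups back to a heap.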

Does the Topic column really have to be nvarchar? If the topic names are plain ASCII, varchar would halve their storage.

My relational self will punish me for this, but it seems that you don't need an additional PK. The data is read-only, right?

One more thought: check out the sparse columns feature; it seems like it would be a perfect fit for your scenario. A wide table can hold a very large number of sparse columns (up to 30,000), they can be grouped and manipulated as XML via a column set, and the main point is that NULLs are almost free storage-wise.
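As a hedged sketch of that alternative (names invented for illustration), the 400 topics could live in one wide table instead of 30 grouped ones:

```sql
-- One wide table: one SPARSE float column per topic, so the ~400
-- topics fit in a single table and NULLs cost almost nothing.
CREATE TABLE dbo.Readings
(
    SentTime     datetime2(3) NOT NULL,
    Voltage1     float SPARSE NULL,
    PumpPressure float SPARSE NULL,
    -- ... one sparse column per remaining topic
    AllValues XML COLUMN_SET FOR ALL_SPARSE_COLUMNS  -- non-NULL values as XML
);
```

Each row would then carry only the topics that actually have a reading at that SentTime, and the optional column set exposes all non-NULL sparse values of a row as a single XML fragment.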

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow