Question

We're using SQL Server 2008 R2 Enterprise Edition.

We are measuring meteorological data from what we call MetMasts. Basically this is a mast with lots of equipment: anemometers (for wind speed) at different heights on the mast, thermometers, and air pressure sensors. We measure every second.

And it takes up far too much disk space. The next generation of this equipment will generate over 10 GB per year each, and we’re going to have more than 1000 of them.

The current table design looks a bit like this:

    CREATE TABLE #MetMast (
        MetMastID INT NOT NULL IDENTITY(1,1), 
        MetMastName NVARCHAR(100), 
        CountryID INT, 
        InstallDate DATE
    )
    CREATE TABLE #MetMastData (
        MetMastDataID BIGINT NOT NULL IDENTITY(1,1),
        MetMastID INT NOT NULL,
        MeasuredAt DATETIME2(0) NOT NULL,
        Temperature REAL NULL,
        WindSpeedAt10m REAL NULL, 
        WindSpeedAt30m REAL NULL,
        AirPressure REAL NULL,
        OneHundredMoreColumns VARCHAR(200),
        CONSTRAINT PK_MetMastData PRIMARY KEY CLUSTERED 
        (
            MetMastID ASC,
            MeasuredAt ASC
        )
    )
    WITH (DATA_COMPRESSION = ROW) 
    -- ON a file group, with table partitioning
    ALTER TABLE #MetMastData WITH NOCHECK ADD CONSTRAINT FK_MetMast_MetMastID FOREIGN KEY (MetMastID) REFERENCES #MetMast (MetMastID)

The data is write once, read many, many times.
We use it in our data warehouse, where a typical question would be: count how many times there is a 2 m/s difference between WindSpeedAt10m and WindSpeedAt30m while the temperature is above 20 degrees, per MetMast.

    SELECT MetMastID, COUNT_BIG(*) 
    FROM #MetMastData 
    WHERE Temperature > 20 AND ABS(WindSpeedAt10m - WindSpeedAt30m) > 2 
    GROUP BY MetMastID

In the future, a small amount of data loss will be acceptable.
We’re talking about lossy compression of the data here. I know we will have to define an acceptable error for each of the fields, e.g. 1% if we measure with 10% accuracy.
It worked for sound files (raw audio is quite big; MP3 is not), so it might work for us as well.

But how is this done?
What table design should I go for?
How do I get started with lossy compression of data in database tables?

Best regards,

Henrik Staun Poulsen


Solution

For each of your data points, consider the accuracy you need to store.

REAL takes up four bytes for each row. If you could drop all decimal places for WindSpeed, you could probably do with a TINYINT (1 byte, 0-255). Given that you most likely need some precision, you could use a SMALLINT (2 bytes, -32,768 to 32,767) instead and store the actual value multiplied by 100:

150.55 m/s = 15055
3.67 m/s = 367

This would save you two bytes per row per column while keeping two decimal places of precision, though values would be capped at 327.67. Since it seems you'll have quite a lot of these columns, a 2-byte saving per column amounts to quite a lot.
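As a sketch of what that could look like (table and column names here are hypothetical, not from your schema), you could store the scaled SMALLINT and expose the original value through a computed column so queries stay readable:

    -- Sketch: scaled SMALLINT storage with a computed read-back column.
    CREATE TABLE #MetMastDataCompact (
        MetMastID INT NOT NULL,
        MeasuredAt DATETIME2(0) NOT NULL,
        WindSpeedAt10mScaled SMALLINT NULL,                -- value * 100; 3.67 m/s -> 367
        WindSpeedAt10m AS (WindSpeedAt10mScaled / 100.0),  -- computed, takes no storage
        CONSTRAINT PK_MetMastDataCompact PRIMARY KEY CLUSTERED (MetMastID, MeasuredAt)
    );

    INSERT INTO #MetMastDataCompact (MetMastID, MeasuredAt, WindSpeedAt10mScaled)
    VALUES (1, '2011-06-01 12:00:00', CONVERT(SMALLINT, ROUND(3.67 * 100, 0)));

The worst-case rounding error of this scheme is 0.005 m/s, which is likely well within your measurement accuracy anyway.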

You've got an 8-byte BIGINT for your MetMastDataID. Is it necessary? Won't everything be queried by MetMastID and MeasuredAt? Dropping that will save you 8 bytes. It will however result in fragmentation since your clustered key will no longer be sequential, so defragmentation will be necessary. Since this sounds like an archival/OLAP system, that shouldn't be a big problem.

EDIT: I just realized you're not clustered on the MetMastDataID, so fragmentation won't change from now. The question is then: do you ever use the MetMastDataID for anything?

Further, if you can avoid all variable-length columns, that'll save you 2 bytes of record overhead plus 2 bytes per variable-length column, per row, not including the actual variable-length data itself.
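To illustrate that overhead (column names here are hypothetical): a VARCHAR column adds a 2-byte entry to the row's variable-length column offset array on top of its data, while a fixed-length CHAR of the same maximum size does not:

    -- Hypothetical columns: fixed-length CHAR avoids the per-column
    -- variable-length bookkeeping that VARCHAR incurs in each row.
    CREATE TABLE #OverheadDemo (
        StatusVar VARCHAR(3) NULL,  -- data bytes + 2-byte offset entry per row
        StatusFix CHAR(3) NULL      -- always exactly 3 bytes, no offset entry
    );

The trade-off is that CHAR always pays its full declared width, so it only wins when values are near the maximum length or the column is rarely NULL.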

OTHER TIPS

Lossy compression formats such as MP3 and JPEG exploit the limits of human perception: what the eye or ear cannot distinguish can be discarded. In your case that kind of lossy compression makes little sense, because you are operating on numeric measurements, not audio/video data. To implement lossless compression you can use a CLR function; an example is here: http://www.codeproject.com/KB/database/blob_compress.aspx.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow