Question

I have a collection of data that looks as follows:

id   name     c1    c2    c3    c4   ...  c50
-----------------------------------------------
1    string1  0.1   0.32  0.54 -1.2  ...  2.3
2    string2  0.12  0.12 -0.34  2.45 ...  1.3
...
(millions of records)

So I have an id column, a string column, then 50 floating point columns.

There will be only one type of query run on this data that in a traditional SQL SELECT statement would look like this:

SELECT name FROM table WHERE ((a1-c1)+(a2-c2)+(a3-c3)+...+(a50-c50)) > 1;

where a1, a2, a3, etc. are values that are generated before the query is sent (they are not stored in the data table).

My question is this: does anyone have recommendations as to what type of database would handle this type of query the fastest? I have used SQL Server (which is very slow for this), so I am looking for other opinions.

Would there be a way to optimize SQL Server for this type of query? I have also been curious about column-store databases such as MonetDB, or perhaps a document store such as MongoDB. Does anyone have any suggestions?

Many thanks, Brett


Solution

You can continue using SQL Server: add a persisted computed column that calculates the sum of all the value columns, then index that column.

ALTER TABLE tablename ADD SumOfAllColumns AS (c1 + c2 + ... + c50) PERSISTED

Then you can rearrange your query as:

SELECT name FROM tablename WHERE SumOfAllColumns < a1+a2+a3+...+a50 - 1

This query will be able to use the index on the computed column and should find the relevant rows quickly.
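As a sketch, the supporting index could look like this (the index name and the INCLUDE clause are assumptions, not part of the original answer):

CREATE INDEX IX_SumOfAllColumns ON tablename (SumOfAllColumns) INCLUDE (name);

Including name makes the index covering, so the matching rows can be returned without a lookup back to the base table.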

OTHER TIPS

To stick with SQL Server:

If your queries always include the same calculations (the same field plus or minus the same other field, and so on), you can create computed columns with persisted values.

Currently your queries will be slow because the engine is running a complicated mathematical operation for each row.

If you add a column with the results, the math is all done once and then it will be a lot faster to run queries.
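As a minimal sketch, assuming every query uses the same c1 - c2 difference (the column and index names are made up for illustration):

ALTER TABLE tablename ADD DiffC1C2 AS (c1 - c2) PERSISTED;
CREATE INDEX IX_DiffC1C2 ON tablename (DiffC1C2);

The subtraction is then evaluated once when a row is written, instead of once per row on every query.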

An in-memory database would be best. Have a look at http://hsqldb.org/

Depending on how many millions of rows you have...

Your query condition can be rewritten as:

(a1 + a2 + a3 + ... + a50) > 1 + (c1 + c2 + c3 + ... + c50)

You can precompute c = 1 + c1 + ... + c50 on the database side and a = a1 + ... + a50 on the client side. The query then reduces to ... WHERE @a > c. This opens an opportunity to use an index.
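One possible sketch in SQL Server terms, assuming a persisted computed column named SumPlusOne for the database-side value (the names and the @a parameter are illustrative):

ALTER TABLE tablename ADD SumPlusOne AS (1 + c1 + c2 + ... + c50) PERSISTED;
CREATE INDEX IX_SumPlusOne ON tablename (SumPlusOne) INCLUDE (name);

-- @a is computed on the client as a1 + a2 + ... + a50 and sent as a single parameter
SELECT name FROM tablename WHERE @a > SumPlusOne;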

However, floating point numbers do not index well in most databases (including SQL Server). If we can make some assumptions about the data, we might be able to work around this. For example, if the numbers are only stored to two digits of precision as in the example, then we can multiply all the numbers by 100 to obtain integers. Then, indexing will work well. Reasonably well, that is... it depends on how many rows meet the condition. Half of "millions of rows" is still a lot of rows.

Even if the values have truly variable precision, so two digits are not accurate enough, it might still make sense to create the integer index to reduce the rows that need to be checked. The query can check both the approximate value (to hit the index) and the exact value (to get the precise result). If you do that, make sure the original values are rounded in the right direction, so that the approximate check never filters out a row the exact condition would match.
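A sketch of that two-step check, assuming a persisted integer column SumScaled that is rounded down with FLOOR so the approximate filter can only admit extra rows, never drop qualifying ones (all names are illustrative):

ALTER TABLE tablename ADD SumScaled AS (CAST(FLOOR((c1 + c2 + ... + c50) * 100) AS int)) PERSISTED;
CREATE INDEX IX_SumScaled ON tablename (SumScaled) INCLUDE (name);

-- @a = a1 + a2 + ... + a50, computed on the client
SELECT name
FROM tablename
WHERE SumScaled < (@a - 1) * 100              -- approximate check, can seek on the index
  AND (c1 + c2 + ... + c50) < @a - 1;         -- exact check on the rows that remain

Because SumScaled is always less than or equal to 100 times the true sum, every row that satisfies the exact condition also satisfies the approximate one.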
