SQLite3 c++ perfomance issues with SELECT query on 5 double fields

https://stackoverflow.com//questions/25000887

20-12-2019
|

Question

I'm using SQLite3 to store a 5D regular grid of about 1 000 000 nodes and have some issues with the performance of the "SELECT" query.

Context

Database Description

Each entry is composed of 5+25 doubles and represent a point on the regular grid (node):

5 firsts double : coordinates of the point on the 5D regular grid (v1,v2,...,v5)
25 following doubles : some characteristics (p1,p2,...,p25)

Each point is unique (any combinaison of the 5 first values is unique). The table is created with CREATE TABLE myTable(v1 double,..., v5 double, p1 double,..., p25 double). I added no specific constraint.

The entries are sorted by ascending order after their coordinates (v1, then v2, then v3,...) :

v1|v2|v3|v4|v5|p1|p2|p3|...
 0| 0| 0| 0| 0| x| x| x|...
 0| 0| 0| 0| 1| x| x| x|...
 0| 0| 0| 0| 2| x| x| x|...
...
 0| 0| 0| 1| 0| x| x| x|...
 0| 0| 0| 1| 1| x| x| x|...
 0| 0| 0| 1| 2| x| x| x|...
...

I have created an INDEX on this table, using CREATE INDEX idx ON myTable (v1,v2,v3,v4,v5)

SELECT Query Description

I want to make a "cubic" interpolation in the 5D grid. So I have to extract 4 points in each dimension around the point I want. My SELECT query should return 4*4*4*4*4=1024 points.

Because of symetric properties, I have to make 16 queries instead of 1. Each request is of the form SELECT * FROM myTable WHERE (v1=X AND v2=X AND v3 BETWEEN x1 AND x2 AND v4 BETWEEN y1 AND y2 AND v5 BETWEEN z1 AND z2). Indeed, v1 and v2 are angles. In my 5D grid, I have values from -165 to 180 with a step of 15°. So if I want to interpolate a value at -160, I can't make a query of the type v1 BETWEEN -180 AND -135 (to get -180, -165, -150, -135) because -180 doesn't exists in my table. I can optimize this part to make only 4 queries but that's not my main concern here.

I'm using one prepared statement : sqlite3_prepare_v2(db,"SELECT * FROM myTable WHERE (v1=? AND v2=? AND v3 BETWEEN ? AND ? AND v4 BETWEEN ? AND ? AND v5 BETWEEN ? AND ?"),length,statement,NULL)

Then, for each query, I do :

sqlite3_bind_double(statement, int, double);
while(sqlite3_step(statement)==SQLITE_ROW) {
    // for each row (for each "node") :
    //   1) retrieve its properties
    double myvar1 = sqlite3_colum_double(statement,6)
    double myvar2 = sqlite3_colum_double(statement,7)
    ....
    double myvar25 = sqlite3_colum_double(statement,25)
    //   2) create the object and add it to a vector for the interpolation 
}
sqlite3_reset(statement);
sqlite3_clear_bindings(statement);

The database is opened with sqlite3_open_v2(path,&db,SQLITE_OPEN_READONLY,NULL)

Problem

It takes about 2.5 seconds to make 16 queries and extract 1024 points. Sqlite3 operations take 99.8% of the computing effort (callgrind)...

What am I doing wrong ?

I have tried EXPLAIN QUERY PLAN SELECT * FROM myTable WHERE (v1=0 AND v2=0 AND v3 BETWEEN 0 AND 3 AND v4 BETWEEN 0 AND 3 AND v5 BETWEEN 0 AND 3. The result is SEARCH TABLE myTable USING INDEX idx (v1=? AND v2=? AND v3>? AND v3<?)

Will it be faster to create an unique index based on the 5 coordinates and make 1024 queries ? Before, I used a vector and direct access via indices.

Solution

Normal indexes work best when you do lookups using equality comparisons (=). As you have seen in the EXPLAIN QUERY PLAN output, a non-equality comparison prevents any further columns of the index from being used; the database must scan through all possible v4 and v5 rows to find the results.

You are querying rather few points in a regular grid, so you know exactly the coordinates of all points you want. Just use a simply query that searches for a point with all five exact coordinates, and execute it 1024 times. This will result in a single index lookup that is much more efficient, even if it is executed for each point.

To make multiple queries even more efficient, wrap all of them in a single transaction.
Use a separate R-tree index to look up the points. R-trees are optimized for (multi-dimensional) interval queries. This would result in a query like this:
```
SELECT *
FROM myTable
WHERE rowid IN (SELECT id
                FROM RtreeIndexTable
                WHERE v1 = ?
                  AND v2 = ?
                  AND v3 BETWEEN ? AND ?
                  AND v4 BETWEEN ? AND ?
                  AND v5 BETWEEN ? AND ?)
```
R-trees are typically used for irregular or sparse data; if you can get away with querying individual points, an R-tree might be overkill.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow