SQLite3 c++ perfomance issues with SELECT query on 5 double fields
-
20-12-2019 - |
Question
I'm using SQLite3 to store a 5D regular grid of about 1 000 000 nodes and have some issues with the performance of the "SELECT" query.
Context
Database Description
Each entry is composed of 5+25 doubles and represent a point on the regular grid (node):
- 5 firsts double : coordinates of the point on the 5D regular grid (v1,v2,...,v5)
- 25 following doubles : some characteristics (p1,p2,...,p25)
Each point is unique (any combinaison of the 5 first values is unique).
The table is created with CREATE TABLE myTable(v1 double,..., v5 double, p1 double,..., p25 double)
. I added no specific constraint.
The entries are sorted by ascending order after their coordinates (v1, then v2, then v3,...) :
v1|v2|v3|v4|v5|p1|p2|p3|...
0| 0| 0| 0| 0| x| x| x|...
0| 0| 0| 0| 1| x| x| x|...
0| 0| 0| 0| 2| x| x| x|...
...
0| 0| 0| 1| 0| x| x| x|...
0| 0| 0| 1| 1| x| x| x|...
0| 0| 0| 1| 2| x| x| x|...
...
I have created an INDEX on this table, using CREATE INDEX idx ON myTable (v1,v2,v3,v4,v5)
SELECT Query Description
I want to make a "cubic" interpolation in the 5D grid. So I have to extract 4 points in each dimension around the point I want. My SELECT query should return 4*4*4*4*4=1024 points.
Because of symetric properties, I have to make 16 queries instead of 1. Each request is of the form SELECT * FROM myTable WHERE (v1=X AND v2=X AND v3 BETWEEN x1 AND x2 AND v4 BETWEEN y1 AND y2 AND v5 BETWEEN z1 AND z2)
.
Indeed, v1 and v2 are angles. In my 5D grid, I have values from -165 to 180 with a step of 15°. So if I want to interpolate a value at -160, I can't make a query of the type v1 BETWEEN -180 AND -135
(to get -180, -165, -150, -135) because -180 doesn't exists in my table. I can optimize this part to make only 4 queries but that's not my main concern here.
I'm using one prepared statement :
sqlite3_prepare_v2(db,"SELECT * FROM myTable WHERE (v1=? AND v2=? AND v3 BETWEEN ? AND ? AND v4 BETWEEN ? AND ? AND v5 BETWEEN ? AND ?"),length,statement,NULL)
Then, for each query, I do :
sqlite3_bind_double(statement, int, double);
while(sqlite3_step(statement)==SQLITE_ROW) {
// for each row (for each "node") :
// 1) retrieve its properties
double myvar1 = sqlite3_colum_double(statement,6)
double myvar2 = sqlite3_colum_double(statement,7)
....
double myvar25 = sqlite3_colum_double(statement,25)
// 2) create the object and add it to a vector for the interpolation
}
sqlite3_reset(statement);
sqlite3_clear_bindings(statement);
The database is opened with
sqlite3_open_v2(path,&db,SQLITE_OPEN_READONLY,NULL)
Problem
It takes about 2.5 seconds to make 16 queries and extract 1024 points. Sqlite3 operations take 99.8% of the computing effort (callgrind)...
What am I doing wrong ?
I have tried EXPLAIN QUERY PLAN SELECT * FROM myTable WHERE (v1=0 AND v2=0 AND v3 BETWEEN 0 AND 3 AND v4 BETWEEN 0 AND 3 AND v5 BETWEEN 0 AND 3
.
The result is SEARCH TABLE myTable USING INDEX idx (v1=? AND v2=? AND v3>? AND v3<?)
Will it be faster to create an unique index based on the 5 coordinates and make 1024 queries ? Before, I used a vector and direct access via indices.
Solution
Normal indexes work best when you do lookups using equality comparisons (=
).
As you have seen in the EXPLAIN QUERY PLAN output, a non-equality comparison prevents any further columns of the index from being used; the database must scan through all possible v4
and v5
rows to find the results.
You are querying rather few points in a regular grid, so you know exactly the coordinates of all points you want. Just use a simply query that searches for a point with all five exact coordinates, and execute it 1024 times. This will result in a single index lookup that is much more efficient, even if it is executed for each point.
To make multiple queries even more efficient, wrap all of them in a single transaction.
Use a separate R-tree index to look up the points. R-trees are optimized for (multi-dimensional) interval queries. This would result in a query like this:
SELECT * FROM myTable WHERE rowid IN (SELECT id FROM RtreeIndexTable WHERE v1 = ? AND v2 = ? AND v3 BETWEEN ? AND ? AND v4 BETWEEN ? AND ? AND v5 BETWEEN ? AND ?)
R-trees are typically used for irregular or sparse data; if you can get away with querying individual points, an R-tree might be overkill.