Sorting Multi-indexed Large Database Table in Sqlite

https://stackoverflow.com/questions/17956867

04-06-2022
|

Frage

I am attempting to select from (with a WHERE clause) and sort a large database table in sqlite3 via python. The sort is currently taking 30+ minutes on about 36 MB of data. I have a feeling it can work faster than this with indices, but I think the order of my code may be incorrect.

The code is executed in the order listed here.

My CREATE TABLE statement looks like this:

c.execute('''CREATE table gtfs_stop_times (
  trip_id text , --REFERENCES gtfs_trips(trip_id),
  arrival_time text, -- CHECK (arrival_time LIKE '__:__:__'),
  departure_time text, -- CHECK (departure_time LIKE '__:__:__'),
  stop_id text , --REFERENCES gtfs_stops(stop_id),
  stop_sequence int NOT NULL --NOT NULL
)''')

The rows are then inserted in the next step:

stop_times = csv.reader(open("tmp\\avl_stop_times.txt"))
c.executemany('INSERT INTO gtfs_stop_times VALUES (?,?,?,?,?)', stop_times)

Next, I create an index out of two columns (trip_id and stop_sequence):

c.execute('CREATE INDEX trip_seq ON gtfs_stop_times (trip_id, stop_sequence)')

Finally, I run a SELECT statement with a WHERE clause that sorts this data by the two columns used in the index and then write that to a csv file:

c.execute('''SELECT gtfs_stop_times.trip_id, gtfs_stop_times.arrival_time, gtfs_stop_times.departure_time, gtfs_stops.stop_id, gtfs_stop_times.stop_sequence
FROM gtfs_stop_times, gtfs_stops
WHERE gtfs_stop_times.stop_id=gtfs_stops.stop_code
ORDER BY gtfs_stop_times.trip_id, gtfs_stop_times.stop_sequence;
)''')

f = open("gtfs_update\\stop_times.txt", "w")
writer = csv.writer(f, dialect = 'excel')
writer.writerow([i[0] for i in c.description]) # write headers
writer.writerows(c)
del writer

Is there any way to speed up Step 4 (possibly be changing how I add and/or use the index) or should I just go to lunch while this runs?

I have added PRAGMA statements to try to improve performance to no avail:

c.execute('PRAGMA main.page_size = 4096')
c.execute('PRAGMA main.cache_size=10000')
c.execute('PRAGMA main.locking_mode=EXCLUSIVE')
c.execute('PRAGMA main.synchronous=NORMAL')
c.execute('PRAGMA main.journal_mode=WAL')
c.execute('PRAGMA main.cache_size=5000')

Lösung

The SELECT executes extremely fast because there is no gtfs_stops table and you get nothing but an error message.

If we assume that there is a gtfs_stops table, then your trip_seq index is already quite optimal for the query. However, you also need an index for looking up stop_code values in the gtfs_stops column.

Lizenziert unter: CC-BY-SA mit Zuschreibung

Nicht verbunden mit StackOverflow