Frage

I am attempting to select from (with a WHERE clause) and sort a large database table in sqlite3 via python. The sort is currently taking 30+ minutes on about 36 MB of data. I have a feeling it can work faster than this with indices, but I think the order of my code may be incorrect.

The code is executed in the order listed here.

  1. My CREATE TABLE statement looks like this:

    c.execute('''CREATE table gtfs_stop_times (
      trip_id text , --REFERENCES gtfs_trips(trip_id),
      arrival_time text, -- CHECK (arrival_time LIKE '__:__:__'),
      departure_time text, -- CHECK (departure_time LIKE '__:__:__'),
      stop_id text , --REFERENCES gtfs_stops(stop_id),
      stop_sequence int NOT NULL --NOT NULL
    )''')
    
  2. The rows are then inserted in the next step:

    stop_times = csv.reader(open("tmp\\avl_stop_times.txt"))
    c.executemany('INSERT INTO gtfs_stop_times VALUES (?,?,?,?,?)', stop_times)
    
  3. Next, I create an index out of two columns (trip_id and stop_sequence):

    c.execute('CREATE INDEX trip_seq ON gtfs_stop_times (trip_id, stop_sequence)')
    
  4. Finally, I run a SELECT statement with a WHERE clause that sorts this data by the two columns used in the index and then write that to a csv file:

    c.execute('''SELECT gtfs_stop_times.trip_id, gtfs_stop_times.arrival_time, gtfs_stop_times.departure_time, gtfs_stops.stop_id, gtfs_stop_times.stop_sequence
    FROM gtfs_stop_times, gtfs_stops
    WHERE gtfs_stop_times.stop_id=gtfs_stops.stop_code
    ORDER BY gtfs_stop_times.trip_id, gtfs_stop_times.stop_sequence;
    )''')
    
    f = open("gtfs_update\\stop_times.txt", "w")
    writer = csv.writer(f, dialect = 'excel')
    writer.writerow([i[0] for i in c.description]) # write headers
    writer.writerows(c)
    del writer
    

Is there any way to speed up Step 4 (possibly be changing how I add and/or use the index) or should I just go to lunch while this runs?

I have added PRAGMA statements to try to improve performance to no avail:

c.execute('PRAGMA main.page_size = 4096')
c.execute('PRAGMA main.cache_size=10000')
c.execute('PRAGMA main.locking_mode=EXCLUSIVE')
c.execute('PRAGMA main.synchronous=NORMAL')
c.execute('PRAGMA main.journal_mode=WAL')
c.execute('PRAGMA main.cache_size=5000')    
War es hilfreich?

Lösung

The SELECT executes extremely fast because there is no gtfs_stops table and you get nothing but an error message.

If we assume that there is a gtfs_stops table, then your trip_seq index is already quite optimal for the query. However, you also need an index for looking up stop_code values in the gtfs_stops column.

Lizenziert unter: CC-BY-SA mit Zuschreibung
Nicht verbunden mit StackOverflow
scroll top