Domanda

I have a data set in the size range 1-5 billion 'box' objects stored in an SQLite database file in the format:

[x1,y1,z1,x2,y2,z2,box_id]

and currently I have an operation in a python script that does something like:

import sqlite3 as lite

box_data = lite.connect('boxes.db')
cur = box_data.cursor()
editor_cursor = box_data.cursor()

cur.execute("SELECT * FROM boxes")
while True:
    row = cur.fetchone()
    if row == None:
        break

    row_id = row[6]

    x1_normalized = int(round(row[0]/smallest_box_size))
    y1_normalized = int(round(row[1]/smallest_box_size))
    z1_normalized = int(round(row[2]/smallest_box_size))

    x2_normalized = int(round(row[3]/smallest_box_size))
    y2_normalized = int(round(row[4]/smallest_box_size))
    z2_normalized = int(round(row[5]/smallest_box_size))

    editor_cursor.execute("UPDATE boxes SET x1=?,y1=?,z1=?,x2=?,y2=?,z2=? WHERE id=?",(x1_normalized,y1_normalized,z1_normalized,x2_normalized,y2_normalized,z2_normalized,row_id))

where 'smallest box size' is just some float. It's a simple normalization task whereby basically each box coordinate must be converted from its 'physical' size to normalized integer coordinates.

Currently the process takes on the order of several hours, and I would like to reduce this operating time. Would it be possible to speed up this process in my current python-SQLite process?

Any suggestions on how to implement this process in another faster database program may also be helpful :)

È stato utile?

Soluzione

Have SQLite do all the work for you instead:

editor_cursor.execute("""
    UPDATE boxes SET x1=CAST(x1/:smallest_box_size as INTEGER),
                     y1=CAST(y1/:smallest_box_size as INTEGER),
                     z1=CAST(z1/:smallest_box_size as INTEGER),
                     x2=CAST(x2/:smallest_box_size as INTEGER),
                     y2=CAST(y2/:smallest_box_size as INTEGER),
                     z2=CAST(z2/:smallest_box_size as INTEGER)""", 
    {'smallest_box_size': smallest_box_size})

In other words, SQLite is perfectly capable of normalizing all the rows for you without piping them all through Python.

The CAST to INTEGER will already round a REAL value, no need to add an explicit round() call here.

For future reference: you can loop over a result set by iterating over the cursor. No need to call .fetchone() for each row:

cur.execute("SELECT * FROM boxes")
for row in cur:
    # loop will terminate automatically when the rows are exhausted.

This is implemented very efficiently, it only loads results as required to iterate.

Autorizzato sotto: CC-BY-SA insieme a attribuzione
Non affiliato a StackOverflow
scroll top