sqlalchemy bulk update performance problems

https://stackoverflow.com/questions/16731280

30-05-2022
|

Question

I need to increment values in a column periodically with data I receive in a file. The table has > 400000 rows. So far, all my attempts result in very poor performance. I have written an experiment that reflects my requirements:

#create table
engine = create_engine('sqlite:///bulk_update.db', echo=False)
metadata = MetaData()

sometable = Table('sometable',  metadata,
    Column('id', Integer, Sequence('sometable_id_seq'), primary_key=True),
    Column('column1', Integer),
    Column('column2', Integer),
)

sometable.create(engine, checkfirst=True)

#initial population
conn = engine.connect()
nr_of_rows = 50000
insert_data = [ { 'column1': i, 'column2' : 0 } for i in range(1, nr_of_rows)]
result = conn.execute(sometable.insert(), insert_data)

#update
update_data = [ {'col1' : i, '_increment': randint(1, 500)} for i in range(1, nr_of_rows)]

print "nr_of_rows", nr_of_rows
print "start time   : " + str(datetime.time(datetime.now()))

stmt = sometable.update().\
        where(sometable.c.column1 == bindparam('col1')).\
        values({sometable.c.column2 : sometable.c.column2 +     bindparam('_increment')})

conn.execute(stmt, update_data)

print "end time : " + str(datetime.time(datetime.now()))

the times I get are these:

nr_of_rows 10000
start time  : 10:29:01.753938
end time    : 10:29:16.247651

nr_of_rows 50000
start time  : 10:30:35.236852
end time    : 10:36:39.070423

so doing a 400000+ amount of rows will take much too long.

I am new to sqlalchemy, but I did do a lot of doc reading, and I just can't understand what I am doing wrong.

thanks in advance!

Solution

You are using the correct approach by doing bulk update with single query.

The reason why it takes that long is because the table doesn't have index on the sometable.column1. It has only primary index on column id.

Your update query uses sometable.column1 in where clause to identify record. So database has to scan through the all table records for every single column update.

To make update run much faster you need to update your table schema definition code to add index creation to the column1 definition with , index=True:

sometable = Table('sometable',  metadata,
    Column('id', Integer, Sequence('sometable_id_seq'), primary_key=True),
    Column('column1', Integer, index=True),
    Column('column2', Integer),
)

I tested updated code in my machine - it took <2 seconds for the program to run.

BTW kudos to your question description - you put all code needed to reproduce your problem.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow