Question

Consider the following two Python code examples, which achieve the same result but with a significant and surprising performance difference.

import psycopg2, time

conn = psycopg2.connect("dbname=mydatabase user=postgres")
cur = conn.cursor('cursor_unique_name')  
cur2 = conn.cursor()

startTime = time.clock()
cur.execute("SELECT * FROM test for update;")
print ("Finished: SELECT * FROM test for update;: " + str(time.clock() - startTime));
for i in range (100000):
    cur.fetchone()
    cur2.execute("update test set num = num + 1 where current of cursor_unique_name;")
print ("Finished: update starting commit: " + str(time.clock() - startTime));
conn.commit()
print ("Finished: update : " + str(time.clock() - startTime));

cur2.close()
conn.close()

And:

import psycopg2, time

conn = psycopg2.connect("dbname=mydatabase user=postgres")
cur = conn.cursor('cursor_unique_name')  
cur2 = conn.cursor()

startTime = time.clock()
for i in range(100000):
    cur2.execute("update test set num = num + 1 where id = " + str(i) + ";")
print("Finished: update starting commit: " + str(time.clock() - startTime))
conn.commit()
print("Finished: update : " + str(time.clock() - startTime))

cur2.close()
conn.close()

The CREATE statement for the test table is:

CREATE TABLE test (id serial PRIMARY KEY, num integer, data varchar);

The table contains 100,000 rows, and VACUUM ANALYZE test; has been run.
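(For anyone reproducing the setup, a minimal population script might look like the following sketch; the connection string and row contents are illustrative, not the original data.)

import psycopg2

# Hypothetical setup: create and fill test with 100,000 rows, then vacuum.
conn = psycopg2.connect("dbname=mydatabase user=postgres")
conn.autocommit = True  # VACUUM cannot run inside a transaction block
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS test (id serial PRIMARY KEY, num integer, data varchar);")
cur.execute("INSERT INTO test (num, data) SELECT 0, 'x' FROM generate_series(1, 100000);")
cur.execute("VACUUM ANALYZE test;")
cur.close()
conn.close()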

I got the following results consistently on several attempts.

First code example:

Finished: SELECT * FROM test for update;: 0.00609304950429
Finished: update starting commit: 37.3272754429
Finished: update : 37.4449708474

Second code example:

Finished: update starting commit: 24.574401185
Finished committing: 24.7331461431

This is very surprising to me, as I would have thought it should be exactly the opposite, meaning that an update using a cursor should be significantly faster, according to this answer.

Was it helpful?

Solution

I don't think the test is balanced: your first example fetches the data through the cursor and then updates, whereas the second blindly updates by ID without fetching the data at all. I assume the first code sequence translates to a FETCH command followed by an UPDATE, so that's two client/server command round trips as opposed to one.
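One rough, hypothetical way to rebalance the comparison is to make the ID-based version also fetch each row before updating it, so both loops issue two commands per row. A sketch, reusing the question's connection details:

import psycopg2, time

# Hypothetical rebalanced version of the second example: fetch each row
# first, then update it by id, so the loop does two round trips per row,
# just like the WHERE CURRENT OF version.
conn = psycopg2.connect("dbname=mydatabase user=postgres")
cur2 = conn.cursor()

startTime = time.perf_counter()
for i in range(1, 100001):  # serial ids start at 1
    cur2.execute("select * from test where id = %s;", (i,))
    cur2.fetchone()
    cur2.execute("update test set num = num + 1 where id = %s;", (i,))
print("Finished: update starting commit: " + str(time.perf_counter() - startTime))
conn.commit()
print("Finished: update : " + str(time.perf_counter() - startTime))

cur2.close()
conn.close()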

(Also, the first example starts by locking every row in the table, which pulls the entire table into the buffer cache; although, thinking about it, I doubt this actually impacts performance, and you didn't mention it.)

Also, to be honest, I think that for a simple table there won't be much difference between updating by ctid (which I assume is how where current of ... works) and updating through the primary key: the primary-key update costs an extra index lookup, but unless the index is huge that's not much of a degradation.
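A rough way to compare the two access paths is to look at the plans; this is only a sketch (id = 42 is an arbitrary row, and the exact plan output will vary):

import psycopg2

# Hypothetical check of the two lookup paths: an update by primary key
# (index lookup) versus an update by ctid (tid scan). EXPLAIN without
# ANALYZE only plans the statements; it does not run them.
conn = psycopg2.connect("dbname=mydatabase user=postgres")
cur = conn.cursor()

cur.execute("EXPLAIN UPDATE test SET num = num + 1 WHERE id = 42;")
for (line,) in cur.fetchall():
    print(line)

cur.execute("SELECT ctid FROM test WHERE id = 42;")
tid = cur.fetchone()[0]
cur.execute("EXPLAIN UPDATE test SET num = num + 1 WHERE ctid = %s::tid;", (tid,))
for (line,) in cur.fetchall():
    print(line)

cur.close()
conn.close()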

For updating 100,000 rows like this, I suspect most of the time is taken up generating the new tuple versions and inserting or appending them to the table, rather than locating the previous tuple in order to mark it as deleted.
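A quick sanity check of that is to time a single set-based UPDATE, which involves only one round trip and one scan, so almost all of its runtime is the cost of writing the new row versions. A sketch, not part of the original benchmark:

import psycopg2, time

# Hypothetical sanity check: rewrite all 100,000 rows in one statement.
# There is one round trip and one sequential scan, so the remaining cost
# is dominated by producing and writing the new row versions.
conn = psycopg2.connect("dbname=mydatabase user=postgres")
cur = conn.cursor()

start = time.perf_counter()
cur.execute("UPDATE test SET num = num + 1;")
conn.commit()
print("Finished: set-based update: " + str(time.perf_counter() - start))

cur.close()
conn.close()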

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow