I have an idempotent background processing task that takes a row of information, does some cleanup, and inserts it into a database. My problem is that the same information may be processed more than once.
To deal with this, I create a hashed key from the information in each row and enforce a unique constraint via a unique index to prevent duplicates.
The problem: I check whether the data already exists in the DB by doing:
SELECT key FROM items WHERE key IN ('key1', 'key2', 'key3', 'key4');
I found the following query to be a bit faster, but it still has some slow responses:
SELECT key FROM items WHERE key = ANY (VALUES ('key1'), ('key2'));
I then compare the keys returned against the keys I expect, and only process the rows whose keys were not returned.
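For reference, the filtering step described above amounts to a set difference; a minimal sketch (function and sample key names are mine, not from the actual task):

```python
def keys_to_process(expected_keys, existing_keys):
    """Return the keys we expected that are NOT already in the database."""
    # Set difference: expected minus existing
    return set(expected_keys) - set(existing_keys)

# Example: the DB already contains 'b', so only 'a' and 'c' need processing
remaining = keys_to_process(["a", "b", "c"], ["b"])
```
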
This worked well until the table grew past 100 million rows. I can be checking 100+ keys at a time, which causes a fair amount of I/O scanning and retrieving each row.
My question: Is there a more efficient way to check for existence using the unique constraint and its index? Perhaps something that does not actually have to visit each row?
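One thing worth checking, assuming PostgreSQL 9.2 or later: since the query only reads the indexed column, it can qualify for an index-only scan, which answers the lookup from the index without touching the heap, but only for pages the visibility map marks all-visible. On a heavily inserted table the map goes stale, so vacuuming regularly may help (the values below are placeholders):

```sql
-- Keep the visibility map current so index-only scans can skip heap fetches
VACUUM (ANALYZE) items;

-- Then check whether the plan shows "Index Only Scan" instead of "Index Scan"
EXPLAIN (ANALYZE)
SELECT key FROM items WHERE key = ANY (VALUES ('key1'), ('key2'));
```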
Or is there a different approach that might work? For example, would simply attempting the insert and catching the unique constraint violation actually be faster?
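A related option, if an upgrade to PostgreSQL 9.5+ is possible (the "Total runtime" line in the plan below suggests an older version): `INSERT ... ON CONFLICT DO NOTHING` skips duplicate keys inside a single statement, so the separate existence check disappears entirely. A sketch with placeholder values:

```sql
-- PostgreSQL 9.5+: insert rows, silently skipping keys that already exist
INSERT INTO items (created_at, updated_at, key, item_name)
VALUES (now(), now(), 'key1', 'name1'),
       (now(), now(), 'key2', 'name2')
ON CONFLICT (key) DO NOTHING
RETURNING key;  -- returns only the keys that were actually inserted
```

On older versions the equivalent is to attempt each insert and trap the `unique_violation` error (e.g. in a PL/pgSQL loop or with per-row savepoints), which is considerably more expensive per duplicate.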
Simplified table definition:
      Column       |            Type             |                     Modifiers                      | Storage  | Description
-------------------+-----------------------------+----------------------------------------------------+----------+-------------
 id                | integer                     | not null default nextval('items_id_seq'::regclass) | plain    |
 created_at        | timestamp without time zone | not null                                           | plain    |
 updated_at        | timestamp without time zone | not null                                           | plain    |
 key               | character varying(255)      |                                                    | extended |
 item_attributes   | hstore                      |                                                    | extended |
 item_name         | character varying(255)      |                                                    | plain    |
Indexes:
"items_pkey" PRIMARY KEY, btree (id)
"index_items_on_key" UNIQUE, btree (key)
And a query plan:
Nested Loop (cost=0.10..108.25 rows=25 width=41) (actual time=0.315..2.169 rows=25 loops=1)
-> HashAggregate (cost=0.10..0.17 rows=25 width=32) (actual time=0.071..0.097 rows=25 loops=1)
-> Values Scan on "*VALUES*" (cost=0.00..0.09 rows=25 width=32) (actual time=0.009..0.033 rows=25 loops=1)
-> Index Scan using index_items_on_key on items (cost=0.00..4.32 rows=1 width=41) (actual time=0.076..0.077 rows=1 loops=25)
Index Cond: ((key)::text = "*VALUES*".column1)
Total runtime: 2.406 ms