Question

I have an idempotent background processing task that takes rows of information, does some clean-up, and inserts them into a database. My problem is that the same information may be processed more than once.

To deal with this, I created a key (a hash) from the information I have about each row of data and added a unique index on it to prevent duplicates.

The problem: I check if the data already exists in the DB by doing:

SELECT key FROM items WHERE key IN ('key1', 'key2', 'key3', 'key4');

I found the following query to be a bit faster, but I still see some slow responses:

SELECT key FROM items WHERE key = ANY (VALUES ('key1'), ('key2'));

I then intersect the keys returned with the keys I expect and process only the data that does not already exist.

This worked well until the table passed 100 million rows. I can be checking for 100+ keys at a time, which causes a fair amount of I/O to scan and retrieve each row.

My question: Is there a more efficient way to check for existence using the unique constraint and the index? Perhaps something that does not actually go to each row?

Or, is there a different approach that might work? Would simply attempting to insert and catching the unique constraint violation actually be faster?

Simplified table definition:

         Column         |            Type             |                      Modifiers                      | Storage  | Description
------------------------+-----------------------------+-----------------------------------------------------+----------+-------------
 id                     | integer                     | not null default nextval('items_id_seq'::regclass)  | plain    |
 created_at             | timestamp without time zone | not null                                            | plain    |
 updated_at             | timestamp without time zone | not null                                            | plain    |
 key                    | character varying(255)      |                                                     | extended |
 item_attributes        | hstore                      |                                                     | extended |
 item_name              | character varying(255)      |                                                     | plain    |
Indexes:
    "items_pkey" PRIMARY KEY, btree (id)
    "index_items_on_key" UNIQUE, btree (key)

And a query plan:

Nested Loop  (cost=0.10..108.25 rows=25 width=41) (actual time=0.315..2.169 rows=25 loops=1)
   ->  HashAggregate  (cost=0.10..0.17 rows=25 width=32) (actual time=0.071..0.097 rows=25 loops=1)
         ->  Values Scan on "*VALUES*"  (cost=0.00..0.09 rows=25 width=32) (actual time=0.009..0.033 rows=25 loops=1)
   ->  Index Scan using index_items_on_key on items  (cost=0.00..4.32 rows=1 width=41) (actual time=0.076..0.077 rows=1 loops=25)
         Index Cond: ((key)::text = "*VALUES*".column1)
 Total runtime: 2.406 ms

Solution

You don't say where the data comes from or how it is processed. Here is a generic approach:

with to_be_inserted (id, key) as (
    values (1, 'the_hash'), (2, 'another_hash')
)
insert into items (id, key)
select tbi.id, tbi.key -- apply your clean-up processing to the columns here
from to_be_inserted tbi
where not exists (
    select 1
    from items
    where key = tbi.key
);
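
As for simply attempting the insert and catching the unique constraint violation: on PostgreSQL 9.5 or newer (an assumption, since the version is not stated) that can be sketched with ON CONFLICT, which skips conflicting rows instead of raising an error:

-- requires PostgreSQL 9.5+ and the unique index on key
insert into items (id, key)
values (1, 'the_hash'), (2, 'another_hash')
on conflict (key) do nothing;

Unlike catching the error in the application, this handles all duplicates in a single statement and stays correct under concurrent inserts.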

There is potential for a significant performance gain if you store the hash as bytea instead of text: it is half the size, which also halves the index. Using the smaller md5 hash helps as well.
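
A minimal sketch of that conversion, assuming the keys are hex-encoded md5 digests (the key_md5 column and index names are illustrative):

-- decode() turns the 32 hex characters of an md5 digest
-- into 16 bytes of bytea, halving the column and the index
alter table items add column key_md5 bytea;
update items set key_md5 = decode(key, 'hex');
create unique index index_items_on_key_md5 on items (key_md5);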

If the processing can't be done in SQL, this key seek might be faster:

with might_be_inserted (key) as (
    values ('hash1'), ('hash2')
)
select key 
from might_be_inserted mbi
where not exists (
    select 1
    from items
    where key = mbi.key
);
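
Unlike the IN list in the question, this returns only the keys that are missing, so the results can be fed straight into the processing step without computing an intersection.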
Licensed under: CC-BY-SA with attribution