Question

I have to read a CSV file every 20 seconds. Each file contains between 500 and 60,000 lines. I have to insert the data into a Postgres table, but first I need to check whether the items have already been inserted, because there is a high probability of duplicate items. The field checked for uniqueness is indexed.

So, I read the file in chunks and use the IN clause to get the items already in the database.

Is there a better way of doing it?

Solution

This should perform well:

CREATE TEMP TABLE tmp AS SELECT * FROM tbl LIMIT 0; -- copy layout, but no data

COPY tmp FROM '/absolute/path/to/file' (FORMAT csv);

INSERT INTO tbl
SELECT tmp.*
FROM   tmp
LEFT   JOIN tbl USING (tbl_id)
WHERE  tbl.tbl_id IS NULL;

DROP TABLE tmp; -- else dropped at end of session automatically

Closely related to this answer.
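
COPY with a file path reads the file on the database server. If the CSV lives on the client machine instead, the same steps can be driven from application code. Below is a minimal sketch, assuming the psycopg2 driver (the question does not say which client is used) and the tbl / tbl_id names from the SQL above. The function name insert_new_rows is hypothetical.

```python
def insert_new_rows(conn, csv_file):
    """Load csv_file into a temp table, then insert only rows whose tbl_id is new.

    Assumptions: conn is a psycopg2 connection and csv_file is a file-like
    object opened on the client. Returns the number of rows inserted.
    """
    cur = conn.cursor()
    try:
        # Empty temp table with the same column layout as tbl.
        cur.execute("CREATE TEMP TABLE tmp AS SELECT * FROM tbl LIMIT 0")
        # Stream the CSV from the client instead of reading a server-side path.
        cur.copy_expert("COPY tmp FROM STDIN (FORMAT csv)", csv_file)
        # Anti-join: keep only the tmp rows that have no match in tbl.
        cur.execute(
            "INSERT INTO tbl "
            "SELECT tmp.* FROM tmp "
            "LEFT JOIN tbl USING (tbl_id) "
            "WHERE tbl.tbl_id IS NULL"
        )
        inserted = cur.rowcount
        cur.execute("DROP TABLE tmp")
        conn.commit()
        return inserted
    except Exception:
        # Undo the partial load so the next 20-second run starts clean.
        conn.rollback()
        raise
    finally:
        cur.close()
```

With a real connection this might be called as insert_new_rows(psycopg2.connect(dsn), open(path)), where dsn and path are placeholders for your own settings.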

OTHER TIPS

First, just for completeness, I changed Erwin's code to use EXCEPT:

CREATE TEMP TABLE tmp AS SELECT * FROM tbl LIMIT 0; -- copy layout, but no data
COPY tmp FROM '/absolute/path/to/file' (FORMAT csv);

INSERT INTO tbl
SELECT tmp.*
FROM   tmp
EXCEPT
SELECT *
FROM   tbl;

DROP TABLE tmp;
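
The two variants are only interchangeable when a row is identified by all of its columns. EXCEPT compares whole rows, while the join compares just the key, so a row whose key already exists but whose other columns differ counts as new for EXCEPT and would then violate the primary key on insert. A small sketch of that difference follows. It uses SQLite through Python's sqlite3 module purely as a convenient stand-in for Postgres, and the val column is a hypothetical addition for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl (id INTEGER PRIMARY KEY, val TEXT);
    INSERT INTO tbl VALUES (1, 'a'), (2, 'b');
    CREATE TEMP TABLE tmp AS SELECT * FROM tbl LIMIT 0;
    -- id 2 already exists in tbl with a different val; id 3 is new
    INSERT INTO tmp VALUES (2, 'changed'), (3, 'c');
""")

# Whole-row comparison: (2, 'changed') differs from (2, 'b'), so it is kept.
except_rows = sorted(
    conn.execute("SELECT * FROM tmp EXCEPT SELECT * FROM tbl").fetchall()
)

# Key-only comparison: id 2 matches an existing row, so it is filtered out.
anti_join_rows = sorted(
    conn.execute(
        "SELECT tmp.* FROM tmp LEFT JOIN tbl USING (id) WHERE tbl.id IS NULL"
    ).fetchall()
)

print(except_rows)     # [(2, 'changed'), (3, 'c')]
print(anti_join_rows)  # [(3, 'c')]
```

For the single-column table used in the test below, both queries return the same rows.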

Then I decided to test it myself, on PostgreSQL 9.1 with a mostly untouched postgresql.conf. The target table contains 10 million rows and the source table about 30 thousand, of which about 15 thousand already exist in the target table.

create table tbl (id integer primary key)
;
insert into tbl
select generate_series(1, 10000000)
;
create temp table tmp as select * from tbl limit 0
;
insert into tmp
select generate_series(9985000, 10015000)
;

I ran EXPLAIN on the SELECT part only. The EXCEPT version:

explain
select *
from tmp
except
select *
from tbl
;
                                       QUERY PLAN                                       
----------------------------------------------------------------------------------------
 HashSetOp Except  (cost=0.00..270098.68 rows=200 width=4)
   ->  Append  (cost=0.00..245018.94 rows=10031897 width=4)
         ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..771.40 rows=31920 width=4)
               ->  Seq Scan on tmp  (cost=0.00..452.20 rows=31920 width=4)
         ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..244247.54 rows=9999977 width=4)
               ->  Seq Scan on tbl  (cost=0.00..144247.77 rows=9999977 width=4)
(6 rows)

The outer join version:

explain
select *
from 
    tmp
    left join
    tbl using (id)
where tbl.id is null
;
                                QUERY PLAN                                
--------------------------------------------------------------------------
 Nested Loop Anti Join  (cost=0.00..208142.58 rows=15960 width=4)
   ->  Seq Scan on tmp  (cost=0.00..452.20 rows=31920 width=4)
   ->  Index Scan using tbl_pkey on tbl  (cost=0.00..7.80 rows=1 width=4)
         Index Cond: (tmp.id = id)
(4 rows)

In these plans, the EXCEPT version scans all 10 million rows of tbl sequentially, while the outer join is planned as an anti join that looks up each tmp row in the primary key index. The estimated total cost is lower for the outer join (208142.58 versus 270098.68).

Licensed under: CC-BY-SA with attribution