Question

There are two tables:

tmp_stat:
date, site_id, ip, block_id, count
Primary Key (date, site_id, ip, block_id)

main_stat:
date, site_id, ip, block_id, count
Primary Key (date, site_id, ip, block_id)

I need to insert rows from tmp_stat into main_stat when no row with the same key (date, site_id, ip, block_id) exists yet, and update count when one already exists, as quickly as possible.

tmp_stat contains about 500,000 rows; main_stat contains millions.

Solution 3

I'm building on gsimes's answer as I understand the question.

with agg_tmp_stat as (
    -- collapse tmp_stat to one row per key, summing the counts
    select date, site_id, ip, block_id, sum(count)::integer as count
    from tmp_stat
    group by 1, 2, 3, 4
), upd as (
    -- add the aggregated count to the rows that already exist in main_stat
    update main_stat t
    set count = t.count + s.count
    from agg_tmp_stat s
    where
        (t.date, t.site_id, t.ip, t.block_id)
        = (s.date, s.site_id, s.ip, s.block_id)
    returning s.date, s.site_id, s.ip, s.block_id
)
-- insert the keys that the update did not touch
insert into main_stat (date, site_id, ip, block_id, count)
select s.date, s.site_id, s.ip, s.block_id, s.count
from
    agg_tmp_stat s
    left join
    upd on
        upd.date = s.date
        and upd.site_id = s.site_id
        and upd.ip = s.ip
        and upd.block_id = s.block_id
where upd.date is null;

Basically, this aggregates the temp table and adds each aggregated count to the count already stored in main_stat.
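To try the statement on a small scale, a throwaway fixture along these lines may help. The column types (date, integer, inet) are assumptions, since the question only gives column names, and the sample values are invented for illustration.

```sql
-- Hypothetical fixture: column types are assumed, not taken from the question.
CREATE TABLE main_stat (
    date     date,
    site_id  integer,
    ip       inet,
    block_id integer,
    count    integer NOT NULL,
    PRIMARY KEY (date, site_id, ip, block_id)
);

-- Same columns and primary key as main_stat.
CREATE TABLE tmp_stat (LIKE main_stat INCLUDING ALL);

-- One key that already exists in main_stat ...
INSERT INTO main_stat VALUES ('2012-01-01', 1, '10.0.0.1', 7, 5);

-- ... and a batch holding that key plus a new one.
INSERT INTO tmp_stat VALUES
    ('2012-01-01', 1, '10.0.0.1', 7, 3),
    ('2012-01-01', 1, '10.0.0.2', 7, 4);
```

With the summing behaviour described above, main_stat should end up with a count of 8 for 10.0.0.1 (5 + 3) and a newly inserted row with a count of 4 for 10.0.0.2.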

Other suggestions

Does the following work?

WITH upd AS (
    -- overwrite count for the rows that already exist in main_stat
    UPDATE main_stat t
       SET count = s.count
      FROM tmp_stat s
     WHERE t.date = s.date
            AND t.site_id = s.site_id
            AND t.ip = s.ip
            AND t.block_id = s.block_id
 RETURNING s.date, s.site_id, s.ip, s.block_id, s.count
)
-- insert the tmp_stat rows that the update did not match
INSERT INTO main_stat (date, site_id, ip, block_id, count)
     SELECT s.date, s.site_id, s.ip, s.block_id, s.count
       FROM tmp_stat s
       LEFT JOIN upd ON (upd.date = s.date AND upd.site_id = s.site_id AND upd.ip = s.ip AND upd.block_id = s.block_id)
      WHERE upd.date IS NULL
;

Update:

It looks like this is only available in PostgreSQL 9.1 or newer, the first version that allows data-modifying statements (INSERT, UPDATE, DELETE) inside WITH.

Using just-somebody's suggestion of WHERE (t.date, t.site_id, t.ip, t.block_id) = (s.date, s.site_id, s.ip, s.block_id) appears to give better performance.

WITH upd AS (
    UPDATE main_stat t
       SET count = s.count
      FROM tmp_stat s
     WHERE ( t.date, t.site_id, t.ip, t.block_id ) = ( s.date, s.site_id, s.ip, s.block_id )
 RETURNING s.date, s.site_id, s.ip, s.block_id
)
INSERT INTO main_stat (date, site_id, ip, block_id, count)
     SELECT s.date, s.site_id, s.ip, s.block_id, s.count
       FROM tmp_stat s
       LEFT JOIN upd
            ON ( upd.date = s.date
                AND upd.site_id = s.site_id
                AND upd.ip = s.ip
                AND upd.block_id = s.block_id )
      WHERE upd.date IS NULL
;

What's happening here is we are using a CTE to do the UPDATE with the CTE returning the identifying columns for the updated rows.

The INSERT then uses the updated row information to filter tmp_stat to only insert the new records.
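Whether the row-wise comparison really is faster will depend on the data and the PostgreSQL version, so it may be worth measuring. A minimal sketch, assuming the table and column names from the question: EXPLAIN ANALYZE actually executes the statement, so it is wrapped in a transaction that is rolled back.

```sql
BEGIN;

-- Show the real execution plan, timings and buffer usage of the UPDATE step.
EXPLAIN (ANALYZE, BUFFERS)
UPDATE main_stat t
   SET count = s.count
  FROM tmp_stat s
 WHERE (t.date, t.site_id, t.ip, t.block_id)
     = (s.date, s.site_id, s.ip, s.block_id);

-- Undo the changes made while measuring.
ROLLBACK;
```

Running the same block with the four separate equality conditions in the WHERE clause gives a plan to compare against.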

There are some concurrency caveats which Dimitri Fontaine covers in this blog entry.
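On PostgreSQL 9.5 or newer, which postdates this answer, INSERT ... ON CONFLICT is a different technique that avoids the race between the UPDATE and the INSERT. A minimal sketch, assuming the column names from the question and that the batch should be added to the existing count (swap the SET expression for EXCLUDED.count to overwrite instead):

```sql
-- Requires PostgreSQL 9.5+. The conflict target is main_stat's primary key.
INSERT INTO main_stat AS m (date, site_id, ip, block_id, count)
SELECT date, site_id, ip, block_id, count
  FROM tmp_stat
ON CONFLICT (date, site_id, ip, block_id)
DO UPDATE SET count = m.count + EXCLUDED.count;  -- EXCLUDED is the row that failed to insert
```

Because tmp_stat has the same primary key, each key appears at most once in the SELECT, so no row of main_stat is updated twice by the same statement.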

More information on CTEs can be found in the PostgreSQL documentation.

This looks like a job for a simple EXISTS query. If the columns are indexed, it should be fast enough.

Example:

-- Run both statements in a single transaction.
-- update count for existing rows (done first, so that the rows
-- inserted below are not counted twice)
UPDATE main_stat main
SET count = main.count + (SELECT tmp.count FROM tmp_stat tmp
                           WHERE tmp.date     = main.date
                           AND   tmp.site_id  = main.site_id
                           AND   tmp.ip       = main.ip
                           AND   tmp.block_id = main.block_id)
WHERE EXISTS (SELECT 1 FROM tmp_stat tmp
                           WHERE tmp.date     = main.date
                           AND   tmp.site_id  = main.site_id
                           AND   tmp.ip       = main.ip
                           AND   tmp.block_id = main.block_id
             );
-- insert missing rows
INSERT INTO main_stat (date, site_id, ip, block_id, count)
SELECT tmp.date, tmp.site_id, tmp.ip, tmp.block_id, tmp.count FROM tmp_stat tmp
WHERE NOT EXISTS (SELECT 1 FROM main_stat main
                           WHERE tmp.date     = main.date
                           AND   tmp.site_id  = main.site_id
                           AND   tmp.ip       = main.ip
                           AND   tmp.block_id = main.block_id
                 );

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow