Question

This seems like it should be easy, but isn't. I'm migrating a query from MySQL to Redshift of the form:

INSERT INTO table
(...)
VALUES
(...)
ON DUPLICATE KEY UPDATE
  value = LEAST(value, VALUES(value))

For primary keys we're inserting that aren't already in the table, those are just inserted. For primary keys that are already in the table, we update the row's values based on a condition that depends on the existing and new values in the row.
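For concreteness, the MySQL statement looks something like this (hypothetical names: a table target with key id and a single value column):

insert into target (id, value)
values (1, 42), (2, 7)
on duplicate key update
  value = least(value, values(value));

New ids are simply inserted; for an existing id the row keeps the smaller of its current value and the incoming one.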

http://docs.aws.amazon.com/redshift/latest/dg/merge-replacing-existing-rows.html does not work, because filter_expression in my case depends on the current entries in the table. I'm currently creating a staging table, inserting into it with a COPY statement and am trying to figure out the best way to merge the staging and real tables.

Solution

I'm having to do exactly this for a project right now. The method I'm using involves 3 steps:

1.

Run an update that addresses changed fields (I'm updating whether or not the fields have changed, but you can certainly qualify that; see the sketch after step 3):

update table1 set col1=s.col1, col2=s.col2, ...
from stagetable s
where table1.primkey = s.primkey;

2.

Run an insert that addresses new records:

insert into table1
select s.* 
from stagetable s 
 left outer join table1 t on s.primkey=t.primkey
where t.primkey is null;

3.

Mark rows no longer in the source as inactive (our reporting tool uses views that filter inactive records):

update table1
set is_active_flag='N', last_updated=sysdate
where not exists
  (select 1 from stagetable s where s.primkey = table1.primkey);
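
To get the questioner's "keep the smaller of the existing and new value" behaviour, step 1 can also be qualified so a row is only touched when the staged value wins. A minimal sketch, assuming a single value column alongside the same table names as above:

update table1
set value = s.value
from stagetable s
where table1.primkey = s.primkey
  and s.value < table1.value;

Redshift also supports LEAST(), so the same effect can be had with set value = least(table1.value, s.value) and no extra predicate.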

OTHER TIPS

It is possible to create a temp table. In Redshift it is better to delete and then insert the records. Check this doc:

http://docs.aws.amazon.com/redshift/latest/dg/merge-replacing-existing-rows.html
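
A minimal sketch of that delete-and-insert pattern from the linked doc, reusing the staging table and key names from the answer above (stagetable, primkey):

begin transaction;

-- drop the target rows that are about to be replaced
delete from table1
using stagetable s
where table1.primkey = s.primkey;

-- load every staged row: replacements and brand-new keys alike
insert into table1
select * from stagetable;

end transaction;

Note that plain delete-and-insert discards the old values, so for the original question the smaller-of-old-and-new logic would have to be applied while building the staging table.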

Here is a fully working approach for Redshift.

Assumptions:

A. Data is available in S3 in gzip format with '|'-separated columns; it may contain some garbage rows (see maxerror below).

B. A sales fact table with two dimension tables, to keep it simple (TIME and SKU; a SKU may belong to many groups and categories).

C. You have a sales table like this:

CREATE TABLE sales (
  sku_id int encode zstd,
  date_id int encode zstd,
  quantity numeric(10,2) encode delta32k
);

1) Create a staging table; it should resemble the online table used by your app(s).

CREATE TABLE stg_sales_onetime (
 sku_number varchar(255) encode zstd,
 time varchar(255) encode zstd,
 qty_str varchar(20) encode zstd,
 quantity numeric(10,2) encode delta32k,
 sku_id int encode zstd,
 date_id int encode zstd
);

2) Copy the data from S3 (this could also be done over SSH).

copy stg_sales_onetime (sku_number,time,qty_str) from
  's3://<bucket_name>/<full_file_path>' CREDENTIALS 'aws_access_key_id=<your_key>;aws_secret_access_key=<your_secret>' delimiter '|' ignoreheader 1 maxerror as 1000 gzip;
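
Since maxerror lets up to 1000 bad rows be skipped, it's worth checking what the COPY actually rejected. Redshift records this in the STL_LOAD_ERRORS system table:

select starttime, filename, line_number, colname, err_reason
from stl_load_errors
order by starttime desc
limit 20;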

3) This step is optional. If the data isn't well formatted, this is your transformation step (e.g. converting the string quantity '12.555654' to the number 12.56):

update  stg_sales_onetime set quantity=convert(decimal(10,2),qty_str);

4) Populate the correct IDs from the dimension tables:

update stg_sales_onetime set sku_id=<your_sku_dimension_table>.sku_id from <your_sku_dimension_table>
  where stg_sales_onetime.sku_number=<your_sku_dimension_table>.sku_number;
update stg_sales_onetime set date_id=<your_time_dimension_table>.time_id from <your_time_dimension_table>
  where stg_sales_onetime.time=<your_time_dimension_table>.time;

5) Finally, the data is ready to go from the staging table to the online sales table:

insert into sales(sku_id,date_id,quantity) select sku_id,date_id,quantity from stg_sales_onetime;
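
Step 5 appends blindly, so re-running the load (or loading keys that already exist in sales) would create duplicates. If that can happen, the delete-then-insert pattern shown earlier applies here too; a sketch, assuming (sku_id, date_id) identifies a sales row:

begin transaction;

delete from sales
using stg_sales_onetime s
where sales.sku_id = s.sku_id
  and sales.date_id = s.date_id;

insert into sales (sku_id, date_id, quantity)
select sku_id, date_id, quantity from stg_sales_onetime;

end transaction;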
Licensed under: CC-BY-SA with attribution