Delete duplicate rows from a large (> 100 million rows) PostgreSQL table (truncate with condition?)

StackOverflow https://stackoverflow.com/questions/22802685

  •  26-06-2023

Question

Like here, I have a large table which stores all events in our system; for one event type there are duplicate rows (mistakenly exported from another system several times). I need to delete them to clean up the stats. The solution proposed above was to

  • insert the records -- without duplicates -- into a temporary table,
  • truncate the original table and insert them back in.

But in my situation I need to delete only one class of events, not all rows, which is impossible with TRUNCATE. I'm wondering whether I can benefit from Postgres' USING syntax, as in this SO answer, which offers the following solution:

DELETE FROM user_accounts
USING user_accounts ua2
WHERE user_accounts.email = ua2.email AND user_accounts.id < ua2.id;

The problem is that I don't have an id column in this large table. So what is the fastest approach in this situation? Is DELETE + INSERT from a temporary table the only option?


Solution

You could use the ctid column as a "replacement id":

DELETE FROM user_accounts
USING user_accounts ua2
WHERE user_accounts.email = ua2.email
  AND user_accounts.ctid < ua2.ctid;

Although that raises another question: why doesn't your user_accounts table have a primary key?
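If the self-join feels awkward, the same idea can be written with a grouped subquery that keeps one physical row per duplicate key. A sketch, assuming email is the column that defines a duplicate; note that the min() aggregate for the tid type only exists on PostgreSQL 14 or later:

```sql
-- keep one arbitrary physical row per email, delete every other copy
-- (min(ctid) requires PostgreSQL 14+)
DELETE FROM user_accounts
WHERE ctid NOT IN (
    SELECT min(ctid)
    FROM user_accounts
    GROUP BY email
);
```

Like the self-join, this gets no index support on ctid, so it is only practical when the duplicates are a small share of the table.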

But if you delete a substantial part of the rows in the table, DELETE will never be very efficient (and the comparison on ctid isn't quick either, because ctid has no index). So the delete will most probably take a very long time.

For a one-time operation, and if you need to delete many rows, inserting the rows you want to keep into an intermediate table is going to be much faster.

That method can be improved by simply keeping the intermediate table instead of copying the rows back to the original table.

begin;

-- this creates the same table including indexes and NOT NULL constraints,
-- but NOT foreign key constraints!
create table temp (like user_accounts including all);

insert into temp
select distinct ... -- this is your query that removes the duplicates
from user_accounts;

-- you might need CASCADE if the table is referenced by others
drop table user_accounts;

alter table temp rename to user_accounts;

commit;

The only drawback is that you have to re-create the foreign keys involving the original table: both those in other tables referencing it and those from it to other tables.
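Re-creating those foreign keys is plain ALTER TABLE work. A sketch with hypothetical names (an orders table whose account_email column references user_accounts.email); the referenced column must be covered by a unique or primary key constraint first:

```sql
-- hypothetical: restore a unique constraint on the rebuilt table,
-- then the incoming foreign key from orders
alter table user_accounts
    add constraint user_accounts_email_key unique (email);

alter table orders
    add constraint orders_account_email_fkey
    foreign key (account_email) references user_accounts (email);
```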

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow