Deleting duplicate subscribers using max uuid postgresql

https://dba.stackexchange.com/questions/264754

27-02-2021

Question

I am trying to eliminate duplicate subscribers from the subscribers table for a given api_customer_id. This is the query I use to find those duplicate subscribers:

select subscribers.email, count(*) from subscribers
inner join subscriptions on subscriptions.subscriber_id = subscribers.id
inner join subscription_lists on subscription_lists.id = subscriptions.subscription_list_id
where 793 in (subscribers.api_customer_id, subscription_lists.api_customer_id) 
group by subscribers.email
having count(*) > 1

I tried to delete every row whose id is not max(subscribers.id), but I could not, because the id column is of type uuid. Here is the error:

ERROR:  function max(uuid) does not exist
LINE 1: select max(subscribers.id), subscribers.email, count(*) from...
               ^
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
SQL state: 42883
Character: 8

What can I do in the meantime to get rid of the duplicates? There is a created column on subscribers, but the times are so close together that I'm wary of using max/min on it. Any other suggestions?

Update 1:

Per a question in the comments: I do have another column, updated at, too. The values look like this:

But they're all within the same second, so relying on them is still risky. I don't have the flexibility to change this design now, though maybe I can going forward.

Solution

Here is a generic method that works as long as there is a less-than operator (<) on the column used to discriminate, which is the case for uuid. For each email, it keeps the row with the smallest id and deletes the others.

WITH dups as (
  select email, id FROM...< complete this with the duplicates detection>
)
DELETE FROM subscribers USING dups
WHERE dups.email = subscribers.email
  AND dups.id < subscribers.id;
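
As a sketch of how the placeholder might be filled in, the following reuses the joins and the customer id 793 from the question's detection query. The aliases (s, sub, sl) and the extra IN (SELECT id FROM dups) guard are assumptions added for illustration and are not part of the original answer; the guard restricts the delete to rows that the CTE itself selected.

```sql
-- Sketch only: the CTE body is adapted from the question's query.
WITH dups AS (
  -- every subscriber row reachable from customer 793, one row per subscriber
  SELECT DISTINCT s.email, s.id
  FROM subscribers s
  INNER JOIN subscriptions sub ON sub.subscriber_id = s.id
  INNER JOIN subscription_lists sl ON sl.id = sub.subscription_list_id
  WHERE 793 IN (s.api_customer_id, sl.api_customer_id)
)
DELETE FROM subscribers USING dups
WHERE dups.email = subscribers.email
  AND dups.id < subscribers.id                  -- a row with a smaller id exists for this email
  AND subscribers.id IN (SELECT id FROM dups);  -- assumption: only delete rows in the candidate set
```

If subscriptions has a foreign key to subscribers, the delete may fail or cascade depending on how that constraint is defined, which the question does not say.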

You may replace id with a creation timestamp if the timestamps are guaranteed to be distinct. Even if they are not, rows that tie on the earliest (email, timestamp) pair are simply left untouched by this DELETE, because neither is strictly less than the other. You can then run a second DELETE, this time with the uuid as the discriminator, as in the query above. If the timestamps were all distinct, this second pass deletes nothing; otherwise it finishes the job.
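
The two passes described above might look like the sketch below. It assumes the timestamp column is named created, as in the question, and uses a simplified stand-in for the duplicate detection (all subscribers with api_customer_id = 793) rather than the question's full join.

```sql
-- Pass 1: delete rows for which an earlier-created row with the same email exists.
WITH dups AS (
  SELECT email, created
  FROM subscribers
  WHERE api_customer_id = 793        -- simplified stand-in for the duplicate detection
)
DELETE FROM subscribers USING dups
WHERE dups.email = subscribers.email
  AND dups.created < subscribers.created
  AND subscribers.api_customer_id = 793;

-- Pass 2: any remaining duplicates share the same created value; break the tie on id.
WITH dups AS (
  SELECT email, id
  FROM subscribers
  WHERE api_customer_id = 793
)
DELETE FROM subscribers USING dups
WHERE dups.email = subscribers.email
  AND dups.id < subscribers.id
  AND subscribers.api_customer_id = 793;
```

After both passes, each email is left with the row that has the earliest created value, with ties resolved in favor of the smallest id.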

You may also run this inside a transaction, then check that no duplicates remain and that the number of deleted rows looks right before issuing COMMIT, or ROLLBACK otherwise.
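
A minimal sketch of that workflow, assuming an interactive session such as psql and the same simplified detection on api_customer_id = 793 (an illustrative stand-in, not the question's full join):

```sql
BEGIN;

WITH dups AS (
  SELECT email, id
  FROM subscribers
  WHERE api_customer_id = 793
)
DELETE FROM subscribers USING dups
WHERE dups.email = subscribers.email
  AND dups.id < subscribers.id
  AND subscribers.api_customer_id = 793;
-- The command tag ("DELETE n") reports how many rows were removed.

-- Should return no rows if the duplicates are gone.
SELECT email, count(*)
FROM subscribers
WHERE api_customer_id = 793
GROUP BY email
HAVING count(*) > 1;

-- Then issue exactly one of the following, depending on what the checks show:
-- COMMIT;
-- ROLLBACK;
```

Nothing is made permanent until COMMIT, so an unexpected row count can be undone with ROLLBACK.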

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange