TSQL Keep valid duplicates and remove invalid duplicates

https://stackoverflow.com/questions/18977909

29-06-2022

Question

I've been bashing my head against this for a while now and am getting nowhere fast; the data has to remain at line level.

For each combination of Code1, Code2 and Code3, I want to keep the rows that arrived with the earliest load; duplicates within that load are valid and must stay. Load1 represents a batch ID. Not all values have duplicates.

What I want to return

Code1   Code2   Code3   Load1   LoadTime
a1      a1      a1      1       2013-09-10
a1      a1      a1      1       2013-09-10
a1      a1      a1      1       2013-09-10
a2      a1      a1      2       2013-09-12
a1      a2      a1      3       2013-09-13
a1      a2      a1      3       2013-09-13

Any suggestions?

 CREATE TABLE #Test (
 Code1  varchar(10),
 Code2  varchar(10),
 Code3  varchar(10),
 Load1  varchar(10),
 LoadTime DATE
 )


  INSERT INTO #Test
  VALUES ('a1','a1','a1','1','2013-09-10') --Keep

  INSERT INTO #Test
  VALUES ('a1','a1','a1','1','2013-09-10') --Keep

  INSERT INTO #Test
  VALUES ('a1','a1','a1','1','2013-09-10') --Keep

  INSERT INTO #Test
  VALUES ('a1','a1','a1','2','2013-09-11') --Delete

  INSERT INTO #Test
  VALUES ('a2','a1','a1','2','2013-09-12') --Keep

  INSERT INTO #Test
  VALUES ('a2','a1','a1','3','2013-09-13') --Delete

  INSERT INTO #Test
  VALUES ('a1','a2','a1','3','2013-09-13') --Keep

  INSERT INTO #Test
  VALUES ('a1','a2','a1','3','2013-09-13') --Keep

  INSERT INTO #Test
  VALUES ('a1','a2','a1','4','2013-09-13')-- Delete

  INSERT INTO #Test
  VALUES ('a1','a2','a1','4','2013-09-13')-- Delete

Solution

You can use a SQL Server common table expression (CTE):

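;with cte as (
    select
        -- every row of the earliest load in a code combination gets rn = 1
        dense_rank() over(partition by Code1, Code2, Code3 order by LoadTime, Load1 asc) as rn
    from #Test  -- the table from the question (the original demo used Table1)
)
delete from cte where rn > 1;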


This is straightforward in SQL Server because a simple common table expression is treated as an updatable view: you don't have to join the CTE back to the original table, you can delete from the CTE directly.
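To try the delete-through-the-CTE behaviour safely, one option is to run it inside a transaction and inspect what survives before committing. The following is a minimal sketch, assuming a scratch temp table named #Demo (a hypothetical name, not from the question) loaded with the question's sample rows:

```sql
-- Hypothetical scratch copy of the question's data (#Demo is an illustrative name).
CREATE TABLE #Demo (
    Code1 varchar(10), Code2 varchar(10), Code3 varchar(10),
    Load1 varchar(10), LoadTime date
);

INSERT INTO #Demo (Code1, Code2, Code3, Load1, LoadTime)
VALUES ('a1','a1','a1','1','2013-09-10'),  -- keep
       ('a1','a1','a1','1','2013-09-10'),  -- keep
       ('a1','a1','a1','1','2013-09-10'),  -- keep
       ('a1','a1','a1','2','2013-09-11'),  -- delete
       ('a2','a1','a1','2','2013-09-12'),  -- keep
       ('a2','a1','a1','3','2013-09-13'),  -- delete
       ('a1','a2','a1','3','2013-09-13'),  -- keep
       ('a1','a2','a1','3','2013-09-13'),  -- keep
       ('a1','a2','a1','4','2013-09-13'),  -- delete
       ('a1','a2','a1','4','2013-09-13');  -- delete

BEGIN TRANSACTION;

WITH cte AS (
    SELECT DENSE_RANK() OVER (PARTITION BY Code1, Code2, Code3
                              ORDER BY LoadTime, Load1) AS rn
    FROM #Demo
)
DELETE FROM cte WHERE rn > 1;  -- removes the underlying #Demo rows

SELECT * FROM #Demo;           -- inspect the surviving rows

ROLLBACK TRANSACTION;          -- swap for COMMIT once the result looks right

DROP TABLE #Demo;
```

With the sample data, the SELECT should show the six rows listed in the question's expected output. One caveat: Load1 is a varchar, so the tie-break sorts as text ('10' sorts before '2'); if batch IDs are numeric and can exceed one digit, the ORDER BY would need a numeric conversion.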

OTHER TIPS

You probably want to look at row_number() or dense_rank().

It's hard to tell the logic for deleting or keeping from your sample data, but something like this should be close:

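;with cte as (
    select *,
        -- Load1 is included as a tie-breaker: loads 3 and 4 share a LoadTime,
        -- so ordering by loadtime alone would rank them equally and keep both
        dense_rank() over (partition by code1, code2, code3 order by loadtime, load1) rn
    from #Test
)
delete t
from #Test t
    inner join cte
        on t.Code1 = cte.Code1
        and t.Code2 = cte.Code2
        and t.Code3 = cte.Code3
        and t.Load1 = cte.Load1
        and t.LoadTime = cte.LoadTime
where cte.rn > 1;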

(The join is much easier if your data has a unique ID)
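As a rough illustration of that last point, here is a sketch that assumes the table carries a surrogate key. The #TestWithId table and its Id identity column are hypothetical additions, not part of the original question:

```sql
-- Hypothetical variant of the question's table with a surrogate key.
CREATE TABLE #TestWithId (
    Id       int IDENTITY(1,1) PRIMARY KEY,  -- assumed unique ID
    Code1    varchar(10),
    Code2    varchar(10),
    Code3    varchar(10),
    Load1    varchar(10),
    LoadTime date
);

INSERT INTO #TestWithId (Code1, Code2, Code3, Load1, LoadTime)
VALUES ('a1','a1','a1','1','2013-09-10'),  -- keep
       ('a1','a1','a1','1','2013-09-10'),  -- keep
       ('a1','a1','a1','2','2013-09-11');  -- delete

WITH cte AS (
    SELECT Id,
           DENSE_RANK() OVER (PARTITION BY Code1, Code2, Code3
                              ORDER BY LoadTime, Load1) AS rn
    FROM #TestWithId
)
DELETE t
FROM #TestWithId AS t
    INNER JOIN cte ON cte.Id = t.Id  -- single-column join replaces the five-column match
WHERE cte.rn > 1;

SELECT * FROM #TestWithId;  -- the two rows from load 1 should remain

DROP TABLE #TestWithId;
```

Joining on one key also avoids the row multiplication that a join on non-unique columns produces when a load contains exact duplicates.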

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow