Rimuovere i duplicati con avvertenze

https://stackoverflow.com/questions/150891

02-07-2019
|

Domanda

Ho una tabella con rowID, longitudine, latitudine, businessName, url, didascalia. Questo potrebbe apparire come:

rowID | long  | lat |  businessName | url | caption

  1      20     -20     Pizza Hut   yum.com  null

Come posso eliminare tutti i duplicati, ma mantenere solo quello che ha un URL (prima priorità), o mantenere quello che ha una didascalia se l'altro non ha un URL (seconda priorità) ed eliminare il riposare?

Soluzione

Ecco la mia tecnica di looping. Probabilmente questo verrà annullato per non essere mainstream - e io sono forte con quello.

DECLARE @LoopVar int

DECLARE
  @long int,
  @lat int,
  @businessname varchar(30),
  @winner int

SET @LoopVar = (SELECT MIN(rowID) FROM Locations)

WHILE @LoopVar is not null
BEGIN
  --initialize the variables.
  SELECT 
    @long = null,
    @lat = null,
    @businessname = null,
    @winner = null

  -- load data from the known good row.  
  SELECT
    @long = long,
    @lat = lat,
    @businessname = businessname
  FROM Locations
  WHERE rowID = @LoopVar

  --find the winning row with that data
  SELECT top 1 @Winner = rowID
  FROM Locations
  WHERE @long = long
    AND @lat = lat
    AND @businessname = businessname
  ORDER BY
    CASE WHEN URL is not null THEN 1 ELSE 2 END,
    CASE WHEN Caption is not null THEN 1 ELSE 2 END,
    RowId

  --delete any losers.
  DELETE FROM Locations
  WHERE @long = long
    AND @lat = lat
    AND @businessname = businessname
    AND @winner != rowID

  -- prep the next loop value.
  SET @LoopVar = (SELECT MIN(rowID) FROM Locations WHERE @LoopVar < rowID)
END

Altri suggerimenti

Questa soluzione è offerta da "cose ??che ho imparato su Stack Overflow" nell'ultima settimana:

DELETE restaurant
WHERE rowID in 
(SELECT rowID
    FROM restaurant
    EXCEPT
    SELECT rowID 
    FROM (
        SELECT rowID, Rank() over (Partition BY BusinessName, lat, long ORDER BY url DESC, caption DESC ) AS Rank
        FROM restaurant
        ) rs WHERE Rank = 1)

Avviso: non l'ho testato su un vero database

Soluzione basata su set:

delete from T as t1
where /* delete if there is a "better" row
         with same long, lat and businessName */
  exists(
    select * from T as t2 where
      t1.rowID <> t2.rowID
      and t1.long = t2.long
      and t1.lat = t2.lat
      and t1.businessName = t2.businessName 
      and
        case when t1.url is null then 0 else 4 end
          /* 4 points for non-null url */
        + case when t1.businessName is null then 0 else 2 end
          /* 2 points for non-null businessName */
        + case when t1.rowID > t2.rowId then 0 else 1 end
          /* 1 point for having smaller rowId */
        <
        case when t2.url is null then 0 else 4 end
        + case when t2.businessName is null then 0 else 2 end
        )

delete MyTable
from MyTable
left outer join (
        select min(rowID) as rowID, long, lat, businessName
        from MyTable
        where url is not null
        group by long, lat, businessName
    ) as HasUrl
    on MyTable.long = HasUrl.long
    and MyTable.lat = HasUrl.lat
    and MyTable.businessName = HasUrl.businessName
left outer join (
        select min(rowID) as rowID, long, lat, businessName
        from MyTable
        where caption is not null
        group by long, lat, businessName
    ) HasCaption
    on MyTable.long = HasCaption.long
    and MyTable.lat = HasCaption.lat
    and MyTable.businessName = HasCaption.businessName
left outer join (
        select min(rowID) as rowID, long, lat, businessName
        from MyTable
        where url is null
            and caption is null
        group by long, lat, businessName
    ) HasNone 
    on MyTable.long = HasNone.long
    and MyTable.lat = HasNone.lat
    and MyTable.businessName = HasNone.businessName
where MyTable.rowID <> 
        coalesce(HasUrl.rowID, HasCaption.rowID, HasNone.rowID)

Simile a un'altra risposta, ma si desidera eliminare in base al numero di riga anziché al rango. Mescola anche con espressioni di tabella comuni:


;WITH GroupedRows AS
(   SELECT rowID, Row_Number() OVER (Partition BY BusinessName, lat, long ORDER BY url DESC, caption DESC) rowNum 
    FROM restaurant
)
DELETE r
FROM restaurant r
JOIN GroupedRows gr ON r.rowID = gr.rowID
WHERE gr.rowNum > 1

Se possibile, puoi omogeneizzare, quindi rimuovere i duplicati?

Passaggio 1:

UPDATE BusinessLocations
SET BusinessLocations.url = LocationsWithUrl.url
FROM BusinessLocations
INNER JOIN (
  SELECT long, lat, businessName, url, caption
  FROM BusinessLocations 
  WHERE url IS NOT NULL) LocationsWithUrl 
    ON BusinessLocations.long = LocationsWithUrl.long
    AND BusinessLocations.lat = LocationsWithUrl.lat
    AND BusinessLocations.businessName = LocationsWithUrl.businessName

UPDATE BusinessLocations
SET BusinessLocations.caption = LocationsWithCaption.caption
FROM BusinessLocations
INNER JOIN (
  SELECT long, lat, businessName, url, caption
  FROM BusinessLocations 
  WHERE caption IS NOT NULL) LocationsWithCaption 
    ON BusinessLocations.long = LocationsWithCaption.long
    AND BusinessLocations.lat = LocationsWithCaption.lat
    AND BusinessLocations.businessName = LocationsWithCaption.businessName

Passaggio 2: Rimuovi duplicati.

Autorizzato sotto: CC-BY-SA insieme a attribuzione

Non affiliato a StackOverflow