Question

Similar: How can I delete duplicate rows in a table

I have a feeling this is impossible and I'm going to have to do it the tedious way, but I'll see what you guys have to say.

I have a pretty big table, about 4 million rows, and 50-odd columns. It has a column that is supposed to be unique, Episode. Unfortunately, Episode is not unique - the logic behind this was that occasionally other fields in the row change, despite Episode being repeated. However, there is an actually unique column, Sequence.

I want to try and identify rows that have the same episode number, but something different between them (aside from sequence), so I can pick out how often this occurs, and whether it's worth allowing for or I should just nuke the rows and ignore possible mild discrepancies.

My hope is to create a table that shows the Episode number, and a column for each table column, identifying the value on both sides, where they are different:

SELECT a.Episode, 
       CASE WHEN a.Value1<>b.Value1 
            THEN a.Value1 + ',' + b.Value1 
            ELSE '' END AS Value1,
       CASE WHEN a.Value2<>b.Value2 
            THEN a.Value2 + ',' + b.Value2 
            ELSE '' END AS Value2
FROM Table1 a INNER JOIN Table1 b ON a.Episode = b.Episode
WHERE a.Value1<>b.Value1
      OR a.Value2<>b.Value2

(That is probably full of holes, but the idea of highlighting changed values comes through, I hope.)

Unfortunately, making a query like that for fifty columns is pretty painful. Obviously, it doesn't exactly have to be rock-solid if it will only be used the once, but at the same time, the more copy-pasta the code, the more likely something will be missed. As far as I know, I can't just do a search for DISTINCT, since Sequence is distinct and the same row will pop up as different.

Does anyone have a query or function that might help? Either something that will output a query result similar to the above, or a different solution? As I said, right now I'm not really looking to remove the duplicates, just identify them.


Solution 5

A relatively simple solution that Ponies sparked:

SELECT  t.*
FROM    Table t
    INNER JOIN ( SELECT episode
                 FROM   Table
                 GROUP BY Episode
                 HAVING COUNT(*) > 1
               ) AS x ON t.episode = x.episode
ORDER BY t.episode, t.sequence -- keeps rows for the same episode adjacent

And then, copy-paste into Excel, and use this as conditional highlighting for the entire result set:

=AND($C2=$C1,A2<>A1)

Column C is Episode. This way, you get a visual highlight when the data's different from the row above (as long as both rows have the same value for episode).
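
If you would rather stay in SQL than round-trip through Excel, the same compare-with-the-row-above idea can be sketched with LAG. This assumes SQL Server 2012 or later (or another database with window functions), and it reuses Table1, Value1 and Value2 from the question as stand-ins for the real names:

```sql
-- Sketch only: flags a value that differs from the previous row of the same
-- Episode, ordered by Sequence. Table1, Value1 and Value2 are placeholders.
SELECT  t.Episode,
        t.Sequence,
        t.Value1,
        CASE WHEN LAG(t.Value1) OVER (PARTITION BY t.Episode ORDER BY t.Sequence) <> t.Value1
             THEN 1 ELSE 0 END AS Value1Changed,
        t.Value2,
        CASE WHEN LAG(t.Value2) OVER (PARTITION BY t.Episode ORDER BY t.Sequence) <> t.Value2
             THEN 1 ELSE 0 END AS Value2Changed
FROM    Table1 t
ORDER BY t.Episode, t.Sequence;
```

The first row of each episode has nothing to compare against, so its flags are 0. A change to or from NULL is also not flagged, because a comparison with NULL is never true.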

OTHER TIPS

Use:

  SELECT DISTINCT t.*
    FROM TABLE t
ORDER BY t.episode --, and whatever other columns

DISTINCT is just shorthand for writing a GROUP BY with all the columns involved. Grouping by all the columns will show you all the unique groups of records associated with the episode column in this case. So there's a risk of not having an accurate count of duplicates, but you will have the values so you can decide what to remove when you get to that point.

50 columns is a lot, but setting the ORDER BY will allow you to eyeball the list. Another alternative would be to export the data to Excel if you don't want to construct the ORDER BY, and use Excel's sorting.

UPDATE: I didn't catch that the sequence column is a unique value. In that case you'd have to provide a list of all the columns you want to see, i.e.:

  SELECT DISTINCT t.episode, t.column1, t.column2 --etc.
    FROM TABLE t
ORDER BY t.episode --, and whatever other columns

There's no notation that will let you use t.* but not this one column. Once the sequence column is omitted from the output, the duplicates will become apparent.
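
Since DISTINCT behaves like a GROUP BY over every selected column, writing the GROUP BY out lets you add a COUNT(*) and see how many rows collapse into each distinct version. A rough sketch, with YourTable, column1 and column2 standing in for the real table and column list:

```sql
-- Sketch only: list every column except sequence in both the SELECT and the
-- GROUP BY. YourTable, column1 and column2 are placeholders.
SELECT  t.episode,
        t.column1,
        t.column2,
        COUNT(*) AS copies
FROM    YourTable t
GROUP BY t.episode, t.column1, t.column2
ORDER BY t.episode;
```

An episode that appears on more than one line has rows that differ somewhere; a copies value above 1 means rows that are identical apart from sequence.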

Instead of typing out all 50 columns, you could do this:

select column_name from information_schema.columns where table_name = 'your table name'

then paste them into a query that groups by all of the columns EXCEPT sequence, and filters by count > 1:

select 
  count(episode)
, col1
, col2
, col3
, ...
from YourTable
group by
  col1
, col2
, col3
, ...
having count(episode) > 1

This should give you every group of rows that match on all of those columns (but neither the sequence nor the episode numbers themselves). Here's the rub: you will need to join this result set back to YourTable on ALL the columns except sequence and episode, since you don't have those two columns here.

Here's where I like to use SQL to generate more SQL. This should get you started:

select 't1.' + column_name + ' = t2.' + column_name
from information_schema.columns
where table_name = 'YourTable'
  and column_name not in ('Sequence', 'Episode') -- these are not in the grouped result set
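
The query above returns one condition per row, so the conditions still need stitching together with AND. On SQL Server 2017 or later, STRING_AGG can do that in one go. This is a sketch that assumes the table is YourTable and that the two excluded columns are literally named Sequence and Episode:

```sql
-- Sketch only: builds a single "t1.[col] = t2.[col] AND ..." string.
-- The CAST to NVARCHAR(MAX) avoids the length limit on the aggregated result.
SELECT STRING_AGG(
           CAST('t1.' + QUOTENAME(column_name) + ' = t2.' + QUOTENAME(column_name)
                AS NVARCHAR(MAX)),
           ' AND '
       ) AS join_conditions
FROM   information_schema.columns
WHERE  table_name = 'YourTable'
  AND  column_name NOT IN ('Sequence', 'Episode');
```

Keep in mind that a plain = never matches NULL to NULL, so rows with a NULL in any compared column will drop out of the join unless you handle them separately.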

You'll plug in those join parameters to this query:

select * from YourTable t1 
inner join (
    select 
      count(episode) as epcount
    , col1
    , col2
    , col3
    , ...
    from YourTable
    group by
      col1
    , col2
    , col3
    , ...
    having count(episode) > 1
) t2 on 

...plug in all those join parameters here...

select count distinct ....

Should show you without having to guess. You can get your columns by viewing your table definition so you can copy/paste your non-sequence columns.

I think something like this is what you want:

select *
from t
where t.episode in (select episode from t group by episode having count(episode) > 1)
order by episode

This will give all rows that have episodes that are duplicated. Non-duplicate rows should stick out fairly obviously.

Of course, if you have access to some sort of scripting, you could just write a script to generate your query for you. It seems pretty straightforward (i.e. describe t and iterate over all the fields).

Also, your query should have some sort of ordering, like FROM Table1 a INNER JOIN Table1 b ON a.Episode = b.Episode AND a.Sequence < b.Sequence, otherwise you'll get duplicate non-duplicates.
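
Short of an external script, the database can write the repetitive CASE expressions itself. The following sketch assumes SQL Server, the Table1 name from the question, and columns that are either strings or convert cleanly to NVARCHAR. It outputs one line of SQL per column, to be pasted into the SELECT list of the self-join:

```sql
-- Sketch only: generates one CASE expression per column of Table1,
-- skipping Episode and Sequence. Each output line ends with a comma,
-- so remove the comma from the last line before running the result.
SELECT  'CASE WHEN a.' + QUOTENAME(column_name) + ' <> b.' + QUOTENAME(column_name)
      + ' THEN CAST(a.' + QUOTENAME(column_name)
      + ' AS NVARCHAR(4000)) + '','' + CAST(b.' + QUOTENAME(column_name)
      + ' AS NVARCHAR(4000)) ELSE '''' END AS ' + QUOTENAME(column_name) + ','
FROM    information_schema.columns
WHERE   table_name = 'Table1'
  AND   column_name NOT IN ('Episode', 'Sequence')
ORDER BY ordinal_position;
```

For a column named Value1 this produces `CASE WHEN a.[Value1] <> b.[Value1] THEN CAST(a.[Value1] AS NVARCHAR(4000)) + ',' + CAST(b.[Value1] AS NVARCHAR(4000)) ELSE '' END AS [Value1],`. The same pattern can generate the OR conditions for the WHERE clause. As with any <> comparison, a difference between NULL and a real value will not be reported.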

Generate and store a hash key for each row, designed so the hash values mirror your definition of sameness. Depending on the complexity of your rows, updating the hash might be a simple trigger on modifying the row.

Query for duplicates of the hash key, which are your "very probably" identical rows.
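
One way this could look on SQL Server 2012 or later, computing the hash on the fly instead of storing it: HASHBYTES over a delimited concatenation of the columns that define sameness. Table1, Value1 and Value2 stand in for the real table and column list, and the '|' delimiter assumes that character never occurs in the data:

```sql
-- Sketch only: counts, per duplicated Episode, how many rows there are and
-- how many distinct versions of the row content they represent.
WITH hashed AS (
    SELECT  Episode,
            HASHBYTES('SHA2_256', CONCAT(Value1, '|', Value2)) AS RowHash
    FROM    Table1
)
SELECT  Episode,
        COUNT(*)                AS row_count,
        COUNT(DISTINCT RowHash) AS distinct_versions
FROM    hashed
GROUP BY Episode
HAVING  COUNT(*) > 1
ORDER BY distinct_versions DESC;
```

Episodes with distinct_versions = 1 are very probably pure duplicates apart from Sequence; anything higher has rows that differ somewhere. CONCAT treats NULL as an empty string, so NULL and '' hash the same here, and before SQL Server 2016 HASHBYTES rejects input longer than 8000 bytes.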

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow