Duplicates in Database, Help Edit My Query to Filter Them Out?

https://stackoverflow.com/questions/8312182

25-10-2019

Question

I have just finished my latest task of creating an RSS Feed using PHP to fetch data from a database.

I've only just noticed that a lot (if not all) of these items have duplicates and I was trying to work out how to only fetch one of each.

I thought that in my PHP loop I could print only every second row, so that only one of each set of duplicates appears, but in some cases there are 3 or 4 copies of an article, so it has to be solved in the query.

Query:

SELECT * 
FROM uk_newsreach_article t1
    INNER JOIN uk_newsreach_article_photo t2
        ON t1.id = t2.newsArticleID
    INNER JOIN uk_newsreach_photo t3
        ON t2.newsPhotoID = t3.id
ORDER BY t1.publishDate DESC;

Table Structures:

uk_newsreach_article
--------------------
id | headline | extract | text | publishDate | ...

uk_newsreach_article_photo
--------------------------
id | newsArticleID | newsPhotoID

uk_newsreach_photo
------------------
id | htmlAlt | URL | height | width | ...

For some reason there are lots of duplicates, and the only thing truly unique within each set is uk_newsreach_article_photo.id; uk_newsreach_article_photo.newsArticleID and uk_newsreach_article_photo.newsPhotoID are identical across a set of duplicates. All I need is one row from each set, e.g.

Sample Data

id | newsArticleID | newsPhotoID
--------------------------------
 2 |     800482746 |     7044521
10 |     800482746 |     7044521
19 |     800482746 |     7044521
29 |     800482746 |     7044521
39 |     800482746 |     7044521
53 |     800482746 |     7044521
67 |     800482746 |     7044521

I tried adding DISTINCT to the query and listing the actual columns I wanted, but this didn't work.

Solution

As you have noticed, DISTINCT still returns every row, because the id differs in each one. You could use GROUP BY instead.
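To see why DISTINCT does not help while the id is selected, here is a minimal sketch. It assumes an in-memory SQLite database as a stand-in for the real one and loads only the sample rows from the question; the variable names are illustrative.

```python
import sqlite3

# In-memory SQLite stands in for the poster's database (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE uk_newsreach_article_photo ("
    "id INTEGER PRIMARY KEY, newsArticleID INTEGER, newsPhotoID INTEGER)"
)
# Rows taken from the sample data in the question.
rows = [(i, 800482746, 7044521) for i in (2, 10, 19, 29, 39, 53, 67)]
conn.executemany("INSERT INTO uk_newsreach_article_photo VALUES (?, ?, ?)", rows)

# DISTINCT compares whole rows; the unique id keeps every row distinct.
with_id = conn.execute(
    "SELECT DISTINCT id, newsArticleID, newsPhotoID "
    "FROM uk_newsreach_article_photo"
).fetchall()

# Without the id column the seven rows are identical and collapse into one.
without_id = conn.execute(
    "SELECT DISTINCT newsArticleID, newsPhotoID "
    "FROM uk_newsreach_article_photo"
).fetchall()

print(len(with_id), len(without_id))  # prints "7 1"
```

DISTINCT only works once every selected column is identical within a set of duplicates. If the id is still needed in the output, grouping is the better fit.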

You will have to decide which id you want to retain. In the example I have used MIN, but another aggregate function such as MAX would also do.

SQL Statement

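-- t2.id is the only column that differs within a set of duplicates,
-- so MIN(t2.id) keeps one row per (newsArticleID, newsPhotoID) pair.
SELECT MIN(t2.id) AS id, t2.newsArticleID, t2.newsPhotoID
FROM uk_newsreach_article t1
    INNER JOIN uk_newsreach_article_photo t2
        ON t1.id = t2.newsArticleID
    INNER JOIN uk_newsreach_photo t3
        ON t2.newsPhotoID = t3.id
GROUP BY t2.newsArticleID, t2.newsPhotoID
ORDER BY MAX(t1.publishDate) DESC;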
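The grouping approach can be tried end to end with a small sketch. It assumes an in-memory SQLite database, cut-down versions of the three tables, and hypothetical headlines, dates and URLs; only the ids of the first article/photo pair come from the question. It keeps the lowest uk_newsreach_article_photo.id per group, which is one possible reading of "the id you want to retain".

```python
import sqlite3

# In-memory SQLite stands in for the real database (an assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE uk_newsreach_article (
    id INTEGER PRIMARY KEY, headline TEXT, publishDate TEXT);
CREATE TABLE uk_newsreach_photo (id INTEGER PRIMARY KEY, URL TEXT);
CREATE TABLE uk_newsreach_article_photo (
    id INTEGER PRIMARY KEY, newsArticleID INTEGER, newsPhotoID INTEGER);

-- Hypothetical articles and photos for illustration.
INSERT INTO uk_newsreach_article VALUES
    (800482746, 'First headline', '2011-11-29'),
    (800482747, 'Second headline', '2011-11-28');
INSERT INTO uk_newsreach_photo VALUES (7044521, 'a.jpg'), (7044522, 'b.jpg');

-- Duplicated link rows: three for the first article, two for the second.
INSERT INTO uk_newsreach_article_photo VALUES
    (2, 800482746, 7044521), (10, 800482746, 7044521),
    (19, 800482746, 7044521),
    (3, 800482747, 7044522), (11, 800482747, 7044522);
""")

# One row per (article, photo) pair; every non-aggregated column is grouped.
deduped = conn.execute("""
    SELECT MIN(t2.id), t1.headline, t3.URL
    FROM uk_newsreach_article t1
        INNER JOIN uk_newsreach_article_photo t2
            ON t1.id = t2.newsArticleID
        INNER JOIN uk_newsreach_photo t3
            ON t2.newsPhotoID = t3.id
    GROUP BY t2.newsArticleID, t2.newsPhotoID,
             t1.headline, t1.publishDate, t3.URL
    ORDER BY t1.publishDate DESC
""").fetchall()

for row in deduped:
    print(row)
```

Five joined rows collapse into two, one per article, newest first. Listing the article and photo columns in GROUP BY keeps the query valid on engines that reject non-aggregated columns outside the grouping.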

Disclaimer

While this is an easy solution to your immediate problem, if you decide that duplicates should not happen, you really should consider redesigning your tables to keep duplicates out in the first place.
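One way to keep duplicates out, sketched under assumptions: remove the existing duplicates, then add a unique index on the (newsArticleID, newsPhotoID) pair. The sketch uses an in-memory SQLite database, and the index name uq_article_photo is hypothetical. The exact DELETE and index syntax may differ on other database engines.

```python
import sqlite3

# In-memory SQLite stands in for the real database (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE uk_newsreach_article_photo ("
    "id INTEGER PRIMARY KEY, "
    "newsArticleID INTEGER NOT NULL, newsPhotoID INTEGER NOT NULL)"
)
# Rows taken from the sample data in the question.
rows = [(i, 800482746, 7044521) for i in (2, 10, 19, 29, 39, 53, 67)]
conn.executemany("INSERT INTO uk_newsreach_article_photo VALUES (?, ?, ?)", rows)

# Step 1: delete every duplicate except the row with the lowest id.
conn.execute("""
    DELETE FROM uk_newsreach_article_photo
    WHERE id NOT IN (
        SELECT MIN(id)
        FROM uk_newsreach_article_photo
        GROUP BY newsArticleID, newsPhotoID
    )
""")

# Step 2: a unique index (hypothetical name) rejects future duplicates.
conn.execute(
    "CREATE UNIQUE INDEX uq_article_photo "
    "ON uk_newsreach_article_photo (newsArticleID, newsPhotoID)"
)

# Step 3: inserting the same pair again now fails.
try:
    conn.execute(
        "INSERT INTO uk_newsreach_article_photo VALUES (99, 800482746, 7044521)"
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

remaining = conn.execute(
    "SELECT id FROM uk_newsreach_article_photo"
).fetchall()
print(remaining, rejected)  # prints "[(2,)] True"
```

With the index in place, whatever process loads the feed data gets an error instead of silently adding a duplicate, so the SELECT no longer needs to filter anything out.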

OTHER TIPS

Grouping by every selected column except the unique t2.id collapses each set of duplicates into a single row. Do not add HAVING COUNT(*) > 1 for this: it would return only the sets that have duplicates and drop any article that appears just once. Like this:

SELECT t1.id, t1.headline, t1.extract, t1.text, t1.publishDate,
       t3.id AS photoID, t3.htmlAlt, t3.URL, t3.height, t3.width
FROM uk_newsreach_article t1
    INNER JOIN uk_newsreach_article_photo t2
      ON t1.id = t2.newsArticleID
    INNER JOIN uk_newsreach_photo t3
      ON t2.newsPhotoID = t3.id
GROUP BY  t1.id, t1.headline, t1.extract, t1.text, t1.publishDate,
          t3.id, t3.htmlAlt, t3.URL, t3.height, t3.width
ORDER BY t1.publishDate DESC;
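To check what HAVING COUNT(*) > 1 does to grouped rows, here is a minimal sketch. It assumes an in-memory SQLite database holding the sample rows from the question plus one hypothetical pair (id 70) that has no duplicates.

```python
import sqlite3

# In-memory SQLite stands in for the real database (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE uk_newsreach_article_photo ("
    "id INTEGER PRIMARY KEY, newsArticleID INTEGER, newsPhotoID INTEGER)"
)
# Rows taken from the sample data in the question.
rows = [(i, 800482746, 7044521) for i in (2, 10, 19, 29, 39, 53, 67)]
# Hypothetical pair that occurs only once.
rows.append((70, 800482747, 7044522))
conn.executemany("INSERT INTO uk_newsreach_article_photo VALUES (?, ?, ?)", rows)

grouped = (
    "SELECT newsArticleID, newsPhotoID, COUNT(*) "
    "FROM uk_newsreach_article_photo "
    "GROUP BY newsArticleID, newsPhotoID"
)

# Plain grouping: one row per pair, duplicated or not.
all_pairs = conn.execute(grouped + " ORDER BY newsArticleID").fetchall()

# With HAVING: only the pairs that occur more than once remain.
duplicated = conn.execute(
    grouped + " HAVING COUNT(*) > 1 ORDER BY newsArticleID"
).fetchall()

print(all_pairs)   # [(800482746, 7044521, 7), (800482747, 7044522, 1)]
print(duplicated)  # [(800482746, 7044521, 7)]
```

The HAVING clause is handy for auditing which pairs are duplicated and how often, while the plain grouped query is the one that returns one row for every pair.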

Licensed under: CC-BY-SA with attribution
