Question

I have a table full of items from different sources. Some of the sources might have the same location (in my example, different BBC news feeds would be different sources, but they all come from the BBC). Each item has a "unique" ID which can be used to identify it among others from the same location. This means that items relating to the same news story on a site but published under different feeds will have the same "unique ID", but that's not necessarily globally unique.

The problem is that I want to eliminate duplicates at display time, so that (depending which feeds you're seeing) you only get at most one version of each story, even though two or three of your feeds might contain links to it.

I have a sources table with information about each source, and location_id and location_precedence fields. I then have an items table that contains each item, its unique_id, source_id, and content. Items with the same unique_id and source location_id should appear at most once, with the highest source location_precedence winning.

I would have thought that something like:

SELECT `sources`.`name` AS `source`,
       `items`.`content`,
       `items`.`published`
FROM `items` INNER JOIN `sources`
  ON `items`.`source_id` = `sources`.`id` AND `sources`.`active` = 1
GROUP BY `items`.`unique_id`, `sources`.`location_id`
ORDER BY `sources`.`location_priority` DESC

would do the trick, but that seems to ignore the location priority field. What have I missed?


Example data:

CREATE TABLE IF NOT EXISTS `sources` (
  `id` int(10) unsigned NOT NULL auto_increment,
  `location_id` int(10) unsigned NOT NULL,
  `location_priority` int(11) NOT NULL,
  `active` tinyint(1) unsigned NOT NULL default '1',
  `name` varchar(150) NOT NULL,
  `url` text NOT NULL,
  PRIMARY KEY  (`id`),
  KEY `active` (`active`)
);

INSERT INTO `sources` (`id`, `location_id`, `location_priority`, `active`, `name`, `url`) VALUES
(1, 1, 25, 1, 'BBC News Front Page', 'http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/front_page/rss.xml'),
(2, 1, 10, 1, 'BBC News England', 'http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/england/rss.xml'),
(3, 1, 15, 1, 'BBC Technology News', 'http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/technology/rss.xml'),
(4, 2, 0, 1, 'Slashdot', 'http://rss.slashdot.org/Slashdot/slashdot'),
(5, 3, 0, 1, 'The Daily WTF', 'http://syndication.thedailywtf.com/TheDailyWtf');

CREATE TABLE IF NOT EXISTS `items` (
  `id` bigint(20) unsigned NOT NULL auto_increment,
  `source_id` int(10) unsigned NOT NULL,
  `published` datetime NOT NULL,
  `content` text NOT NULL,
  `unique_id` varchar(255) NOT NULL,
  PRIMARY KEY  (`id`),
  UNIQUE KEY `unique_id` (`unique_id`,`source_id`),
  KEY `published` (`published`),
  KEY `source_id` (`source_id`)
);

INSERT INTO `items` (`id`, `source_id`, `published`, `content`, `unique_id`) VALUES
(1,  1, '2009-12-01 16:25:53', 'Story about Subject One',                     'abc'),
(2,  2, '2009-12-01 16:21:31', 'Subject One in story',                        'abc'),
(3,  3, '2009-12-01 16:17:20', 'Techy goodness',                              'def'),
(4,  2, '2009-12-01 16:05:57', 'Further updates on Foo case',                 'ghi'),
(5,  3, '2009-12-01 15:53:39', 'Foo, Bar and Quux in court battle',           'ghi'),
(6,  2, '2009-12-01 15:52:02', 'Anti-Fubar protests cause disquiet',          'mno'),
(7,  4, '2009-12-01 15:39:00', 'Microsoft Bleh meets lukewarm reception',     'pqr'),
(8,  5, '2009-12-01 15:13:45', 'Ever thought about doing it in VB?',          'pqr'),
(9,  1, '2009-12-01 15:13:15', 'Celebrity has 'new friend'',        'pqr'),
(10, 1, '2009-12-01 15:09:57', 'Microsoft launches Bleh worldwide',           'stu'),
(11, 2, '2009-12-01 14:57:22', 'Microsoft launches Bleh in UK',               'stu'),
(12, 3, '2009-12-01 14:57:22', 'Microsoft launches Bleh',                     'stu'),
(13, 3, '2009-12-01 14:42:15', 'Tech round-up',                               'vwx'),
(14, 2, '2009-12-01 14:36:26', 'Estates 'old news' say government', 'yza'),
(15, 1, '2009-12-01 14:15:21', 'Iranian doctor 'was poisoned'',     'bcd'),
(16, 4, '2009-12-01 14:14:02', 'Apple fans overjoyed by iBlah',               'axf');

Expected content after query:

  • Story about Subject One
  • Techy goodness
  • Foo, Bar and Quux in court battle
  • Anti-Fubar protests cause disquiet
  • Microsoft Bleh meets lukewarm reception
  • Ever thought about doing it in VB?
  • Celebrity has 'new friend'
  • Microsoft launches Bleh worldwide
  • Tech round-up
  • Estates 'old news' say government
  • Iranian doctor 'was poisoned'
  • Apple fans overjoyed by iBlah

I've tried a variation on the solution by Andomar, with some success:

SELECT      s.`name` AS `source`,
            i.`content`,
            i.`published`
FROM        `items` i
INNER JOIN  `sources` s
ON          i.`source_id` = s.`id`
AND         s.`active` = 1
INNER JOIN (
  SELECT `unique_id`, `source_id`, MAX(`location_priority`) AS `prio` 
  FROM `items` i
  INNER JOIN `sources` s ON s.`id` = i.`source_id` AND s.`active` = 1
  GROUP BY `location_id`, `unique_id`
) `filter`
ON          i.`unique_id` = `filter`.`unique_id`
AND         s.`location_priority` = `filter`.`prio`
ORDER BY    i.`published` DESC
LIMIT 50

With AND s.location_priority = filter.prio things almost work as I want. Because an item can come from multiple sources with the same priority, items can be repeated. In this case, an extra GROUP BY i.unique_id on the outer query does the job, and I suppose it doesn't matter which source "wins" if priorities are equal.

I had tried with AND i.source_id = filter.source_id instead, which almost works (i.e. eliminates the extra GROUP BY) but doesn't give results from the right sources. In the example above, it gives me "Further updates on Foo case" (source "BBC News England") rather than "Foo, Bar and Quux in court battle" (source "BBC Technology News". Looking at the results of the inner query, I get:

unique_id: 'ghi'
source_id: 2
prio: 15

Note that the source ID is not correct (expected: 3).

Was it helpful?

Solution

Order by merely orders the rows, it doesn't pick among them.

One of the ways to filter out rows with a lower location_priority is to use an inner join as filter:

SELECT     s.name, i.content, i.published
FROM       items i 
INNER JOIN sources s
ON         i.source_id = s.id
AND        s.active = 1
INNER JOIN (
    SELECT unique_id, max(location_priority) as prio
    FROM items i
    INNER JOIN sources s ON s.id = i.source_id AND s.active = 1
    GROUP BY unique_id) filter
ON         i.unique_id = filter.unique_id
AND        s.location_priority = filter.prio;

An alternative is a where ... in <subquery> clause, for example:

SELECT     s.name, i.content, i.published
FROM       items i 
INNER JOIN sources s
ON         i.source_id = s.id
AND        s.active = 1
WHERE      (i.unique_id, s.location_priority) IN (
    SELECT unique_id, max(location_priority)
    FROM items i
    INNER JOIN sources s ON s.id = i.source_id AND s.active = 1
    GROUP BY unique_id
);

This problem is also known as "Selecting records holding a group-wide maximum." Quassnoi has written a nice article on it.

EDIT: One way to break ties with multiple sources at the same priority is a WHERE clause with a subquery. This example breaks ties on i.id DESC:

SELECT     s.name, i.unique_id, i.content, i.published
FROM       (
           SELECT unique_id, min(location_priority) as prio
           FROM items i
           INNER JOIN sources s ON s.id = i.source_id AND s.active = 1
           GROUP BY unique_id
           ) filter
JOIN       items i
JOIN       sources s
ON         s.id = i.source_id 
           AND s.active = 1
WHERE      i.id =
           (
           SELECT   i.id
           FROM     items i
           JOIN     sources s 
           ON       s.id = i.source_id 
                    AND s.active = 1
           WHERE    i.unique_id = filter.unique_id
           AND      s.location_priority = filter.prio
           ORDER BY i.id DESC
           LIMIT 1
           )

Quassnoi also has an article on selecting records holding group-wise maximum (resolving ties) :)

OTHER TIPS

do a self join to a derived table like

select max(location_priority) from table where ...

What have I missed?

The ORDER BY happens after the GROUP BY has already reduced each group to a single row. Paul gives one resolution.

As for the problem with the query:

SELECT `unique_id`, `source_id`, MAX(`location_priority`) AS `prio` 
FROM `items` i
INNER JOIN `sources` s ON s.`id` = i.`source_id` AND s.`active` = 1
GROUP BY `location_id`, `unique_id`

source_id is neither an aggregate nor grouped. As a result, which value you get is indeterminate.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top