Question

I have a table with contents that look similar to this:

id | title
------------
1  | 5. foo
2  | 5.foo
3  | 5. foo*
4  | bar
5  | bar*
6  | baz
6  | BAZ

…and so on. I would like to group by the titles and ignore the extra bits. I know Postgres can do this:

SELECT * FROM (
  SELECT regexp_replace(title, '[*.]+$', '') AS title
  FROM table
) AS a
GROUP BY title

However, that's quite simple and would get very unwieldy if I tried to anticipate all the possible variations. So, the question is, is there a more general way to do fuzzy grouping than using regexp? Is it even possible, at least without breaking one's back doing it?

Edit: To clarify, there is no preference for any of the variations, and this is what the table should look like after grouping:

title
------
5. foo
bar
baz

I.e., the variations would be items that are different just by a few characters or capitalization, and it doesn't matter which ones are left as long as they're grouped.

Was it helpful?

Solution

For any grouping you should have transitive equality, that is a ~= b, b ~= c => a ~= c.

Formulate it strictly using words and we'll try to formulate it using SQL.

For instance, which group should foo*bar go to?

Update:

This query replaces all non-alphanumerical characters with spaces and returns first title from each group:

SELECT  DISTINCT ON (REGEXP_REPLACE(UPPER(title), '[^[:alnum:]]', '', 'g')) title
FROM    (
        VALUES
        (1, '5. foo'),
        (2, '5.foo'),
        (3, '5. foo*'),
        (4, 'bar'),
        (5, 'bar*'),
        (6, 'baz'),
        (7, 'BAZ')
        ) rows (id, title)

OTHER TIPS

At some time, you are going to have to define what makes a set of values belong together in a group. If that's too hard, maybe you should prohibit and inhibit the entry of fuzzy data, or if you must permit it, add a column that contains a sanitized version of the title for use by the grouping operations.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top