Question

I have a large database of URLs, and some of them are duplicates that differ only by a trailing slash. I would like to find the duplicate values among those ending with a trailing slash, but not URLs with text after the slash, such as http://www.google.com/asdfasdf

CREATE TABLE link_info (
  id INT,
  url VARCHAR(32)
);

INSERT INTO link_info VALUES
(1, 'http://www.yahoo.com/'),
(2, 'http://www.google.com/'),
(3, 'http://www.google.com/asdfasdf'),
(4, 'http://www.yahoo.com');

I am trying to select the duplicates by trimming the trailing slash, but the query below also returns http://www.google.com/asdfasdf as if it were a duplicate.

SELECT DISTINCT TRIM(TRAILING '/' FROM url) url
FROM link_info

I was hoping to use regexp, but that doesn't work.

SELECT DISTINCT TRIM(TRAILING REGEXP('[/]$') FROM url) url
FROM link_info

Solution

Your query returns every URL with its trailing slash trimmed, not just the duplicates. I think you need something like this:

SELECT TRIM(TRAILING '/' FROM url) trimmed_url
FROM link_info
GROUP BY trimmed_url
HAVING COUNT(DISTINCT url)>1

Please see fiddle here.
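
If you also want to see which original rows make up each duplicate group, one possible sketch is to join the grouped result back to the table. The alias names l and d are only illustrative, and this assumes MySQL, which accepts a select alias in GROUP BY:

```sql
-- Sketch: list every original row whose trimmed URL occurs
-- under more than one distinct spelling.
SELECT l.id, l.url
FROM link_info l
INNER JOIN (
  SELECT TRIM(TRAILING '/' FROM url) AS trimmed_url
  FROM link_info
  GROUP BY trimmed_url
  HAVING COUNT(DISTINCT url) > 1
) d ON TRIM(TRAILING '/' FROM l.url) = d.trimmed_url
ORDER BY d.trimmed_url, l.id;
```

With the sample data from the question this should return only rows 1 and 4 (the two yahoo.com variants); http://www.google.com/asdfasdf is left alone because nothing else trims to the same value.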

Edit

If there are no exact duplicates, and you just want to keep the row with no trailing slash, you could use this delete query:

DELETE l1.*
FROM
  link_info l1 INNER JOIN link_info l2
  ON l1.url = CONCAT(l2.url, '/')

Please see fiddle here. Notice that this query only removes the duplicated http://www.yahoo.com/ row (the one with the trailing slash); it won't strip the trailing slash from http://www.google.com/, because that URL has no slash-less counterpart.
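
If you then also want the remaining URLs normalized, a follow-up along these lines might work. This is only a sketch and goes beyond the original answer; it assumes the DELETE above has already been run, so that trimming cannot create new exact duplicates:

```sql
-- Sketch: strip the trailing slash from the rows that still end with one.
-- Run this only after the DELETE, otherwise 'http://www.yahoo.com/'
-- would be turned into a second 'http://www.yahoo.com' row.
UPDATE link_info
SET url = TRIM(TRAILING '/' FROM url)
WHERE url LIKE '%/';
```

On the sample data this would change http://www.google.com/ to http://www.google.com and leave http://www.google.com/asdfasdf untouched.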

Other suggestions

You may use this:

 SELECT  TRIM(TRAILING '/' FROM url) url
 FROM link_info
 group by SUBSTRING_INDEX(url, '.com', 1)

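But this works only with links that contain .com; for .net or any other TLD you would have to add a UNION. Note also that it groups on everything before .com, so http://www.google.com/asdfasdf ends up in the same group as http://www.google.com/.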

DEMO HERE

Try:

select *
  from link_info
 where url in
       (select url
          from link_info
         group by case
                    when replace(url, substring_index(url, '.', 1), '') like '%/' then
                     replace(url, substring_index(url, '.', 1), '')
                    else
                     concat(replace(url, substring_index(url, '.', 1), ''),
                            '/')
                  end
        having count(*) > 1)
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow