Question

I have a folder with *.txt files. I want to regularly check those files for duplicate URLs.

Actually, I save my bookmarks in these files, always with at least two lines, such as:

www.domain.com
Quite a popular domain name

As it happens, I save the same URL with another description, such as:

www.domain.com
I should buy this domain
Whenever I happen to have enough money for this

All entries are separated by single blank lines. And sometimes the URLs are in markdown format:

[domain.com](www.domain.com)

How would I crawl the folder for duplicate URLs?

The only solution I have found so far is cat piped through sort and uniq:

cat folder/* | sort | uniq | less > dupefree.txt

The problem with this is:

  1. It only checks fully identical lines - markdown URLs are not matched and the associated comments are lost
  2. I don't want to output a cleaned text file; I just need a hint about which URLs are duplicates

How can I do a proper duplicate check?


Solution

Here is the source file I made from your description:

cat file

www.domain.com
Quite a popular domain name

www.domain.com
I should buy this domain
Whenever I happen to have enough money for this

[domain.com](www.domain.com)

Using awk to report the duplicate domain names:

awk 'BEGIN{FS="\n";RS=""}                             # paragraph mode: blank-line separated records, one line per field
{ if ($1~/\[/) { split($1,a,"[)(]"); domain[a[2]]++}  # markdown entry: count the URL between the parentheses
  else {domain[$1]++}                                 # plain entry: the first line is the URL
}
END{ for (i in domain) 
      if (domain[i]>1) print "Duplicate domain found: ",i
    }' file

Duplicate domain found:  www.domain.com
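
Since you want to crawl a whole folder rather than a single file, the same program can be pointed at all bookmark files at once. This is a minimal sketch, assuming your files match folder/*.txt; it uses awk's built-in FILENAME variable so the report also tells you where each duplicate was seen:

awk 'BEGIN{FS="\n";RS=""}
{ if ($1~/\[/) { split($1,a,"[)(]"); url=a[2] }   # markdown entry: URL between the parentheses
  else { url=$1 }                                 # plain entry: first line is the URL
  count[url]++
  files[url]=files[url] " " FILENAME              # remember every file the URL appears in
}
END{ for (i in count)
      if (count[i]>1) print "Duplicate domain found: ", i, "(in:" files[i] ")"
    }' folder/*.txt

If the same file contains the URL more than once, its name is simply listed more than once in the report.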