Question

I have a list of URLs, most of which are duplicates:

> http://example.com/some/a-test-link.html
> http://example.com/some/a-test-link.html
> http://example.com/some/another-link.html
> http://example.com/some/another-link.html
> http://example.com/some/again-link.html
> http://example.com/some/again-link.html

I don't need the same link twice, so I need to remove the duplicates and keep only one copy of each link. How can I do this using regular expressions, sed, or awk? I am not sure which tool would be best. I am using Ubuntu as the operating system and Sublime Text 3 as my editor.


Solution

This is trivial with awk:

awk '!seen[$0]++' file

which basically means:

awk '!($0 in seen) {seen[$0]; print}' file

So if the line is not yet in the array, awk adds it and prints it. Any later line that is already in the array is skipped.
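The same logic can be spelled out step by step, which may help when adapting it. This is only a sketch: dedupe_keep_order is a hypothetical helper name, and it reads the files given as arguments, or standard input if there are none.

```shell
# dedupe_keep_order is a hypothetical helper name used only for this sketch.
dedupe_keep_order() {
  awk '
    {
      # seen[$0] is unset (compares equal to 0) the first time a line appears.
      if (seen[$0] == 0) print
      # Count the line so that later copies are skipped.
      seen[$0]++
    }
  ' "$@"
}
```

It would be called as dedupe_keep_order file, and it keeps the first occurrence of each line in its original position.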

$ cat file
> http://example.com/some/a-test-link.html
> http://example.com/some/a-test-link.html
> http://example.com/some/another-link.html
> http://example.com/some/another-link.html
> http://example.com/some/again-link.html
> http://example.com/some/again-link.html
$ awk '!seen[$0]++' file
> http://example.com/some/a-test-link.html
> http://example.com/some/another-link.html
> http://example.com/some/again-link.html

OTHER TIPS

$ sort -u file
> http://example.com/some/again-link.html
> http://example.com/some/another-link.html
> http://example.com/some/a-test-link.html
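Note that sort -u does not keep the original order, and the order it does produce depends on the locale's collation rules. If a predictable byte-wise order is wanted, one option is to force the C locale. A minimal sketch, where sort_unique_bytes is a hypothetical helper name:

```shell
# sort_unique_bytes is a hypothetical helper name used only for this sketch.
sort_unique_bytes() {
  # LC_ALL=C makes sort compare raw bytes instead of using locale collation.
  LC_ALL=C sort -u "$@"
}
```

With the C locale, a-test-link.html sorts before again-link.html, because "-" has a lower byte value than "g".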

This might work for you (GNU sed):

sed -r 'G;/(http[^\n]*)\n.*\1/d;s/\n.*//;H' file

This uses the hold space to store the URLs seen so far and deletes any line that duplicates one of them.
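One caveat with the command above: its back-reference is not anchored to whole lines, so a URL that is only a prefix of an earlier one (say http://example.com/a after http://example.com/ab) is deleted as well. The variant below is a sketch, assuming GNU sed, that compares whole lines instead. dedupe_sed is a hypothetical helper name.

```shell
# dedupe_sed is a hypothetical helper name; this sketch assumes GNU sed.
dedupe_sed() {
  sed -E '
    # Append the hold space (the lines printed so far) to the current line.
    G
    # Delete the line if an identical whole line appears in that list.
    /^([^\n]*)\n.*\n\1(\n|$)/d
    # Otherwise strip the appended list again.
    s/\n.*//
    # Remember the line for later cycles; it is printed at the end of this one.
    H
  ' "$@"
}
```

Like the original, this keeps the first occurrence of each line in input order, but the whole hold space is rescanned for every line, so it may be slow on long lists.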

You could also use a combination of sort and uniq:

sort input.txt | uniq

Sorting groups the duplicate links together, and uniq then collapses each run of consecutive identical lines into one.
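To see why the sort step matters, here is a small sketch with made-up input in which the duplicate is not on an adjacent line (the URLs in the urls variable are hypothetical):

```shell
# Hypothetical input: the same URL appears twice, but not on adjacent lines.
urls='http://example.com/a
http://example.com/b
http://example.com/a'

# uniq alone compares only neighbouring lines, so all three lines survive.
printf '%s\n' "$urls" | uniq

# After sorting, the two copies are adjacent and uniq collapses them.
printf '%s\n' "$urls" | sort | uniq
```

If the input is already grouped, as in the question, uniq on its own would be enough.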

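Not sure if this works for you, but if the links are in the order you've posted, the following regex matches each adjacent pair of identical links and captures one copy in group 1. Replacing every match with that group leaves only the unique links.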

/(http:\/\/.*?)\s+(?:\1)/gm

http://regex101.com/r/zB0pW3

Licensed under: CC-BY-SA with attribution