Question

I need a script that sorts a text file and removes the duplicates. Most, if not all, of the examples out there use the sort file1 | uniq > file2 approach. The sort man page, though, documents a -u option that removes duplicates at the time of sorting.

Is there a reason to use one over the other? Maybe availability of the -u option? Or a memory/speed concern?

Solution 2

I'm not sure that it's about availability. Most systems I've seen have both sort and uniq, as they are usually provided by the same package. I just checked a Solaris system from 2001 and its sort has the -u option.

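Technically, a shell pipeline (|) runs sort and uniq as two separate processes connected by a pipe, so it is going to be somewhat more resource intensive than a single sort -u: the OS has to create an extra process, and all of the sorted output has to be copied through the pipe.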

If you go to the source code for sort, which comes in the coreutils package, you can see that it actually just skips printing duplicates as it's printing its own sorted list and doesn't make use of the independent uniq code.

To see how it works, look in sort's source at the functions below this comment:

 /* If uniquified output is turned on, output only the first of
   an identical series of lines. */

Although I believe sort -u should be faster, the performance gain is going to be minimal unless you're running sort | uniq on huge files, because uniq has to read through the entire sorted output again.
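
If you want to convince yourself that the two approaches produce the same result in the simple case, a minimal sketch like the following should do. The scratch file and its contents are made up for illustration, and LC_ALL=C is my own assumption to keep the sort order predictable across systems.

```shell
#!/bin/sh
# Hypothetical scratch file; the words in it are only sample data.
input=$(mktemp)
printf '%s\n' pear apple pear fig apple fig fig > "$input"

# One process: sort drops duplicates while writing its sorted output.
one_step=$(LC_ALL=C sort -u "$input")

# Two processes: sort writes every line, uniq then filters adjacent repeats.
two_step=$(LC_ALL=C sort "$input" | uniq)

if [ "$one_step" = "$two_step" ]; then
    echo "same output"
else
    echo "outputs differ"
fi

rm -f "$input"
```

To get a rough feel for the speed difference, you could run each command under time on a large file of your own; on small inputs the difference is unlikely to be measurable.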

OTHER TIPS

They should be equivalent in the simple case, but will behave differently if you're using the -k option to define only certain fields of the input line to use as sort keys. In that case, sort -u will suppress lines which have the same key even if other parts of the line differ, whereas uniq will only suppress lines that are exactly identical.

$ cat example 
foo baz
quux ping
foo bar
$ sort -k 1,1 --stable example # use just the first word as sort key
foo baz
foo bar
quux ping
$ sort -k 1,1 --stable -u example # suppress lines with the same first word
foo baz
quux ping

but

$ sort -k 1,1 --stable example | uniq
foo baz
foo bar
quux ping

One difference is that uniq -c can count (and print) the number of occurrences of each line. You lose this ability when you use sort -u to remove the duplicates.
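
As a small illustration of the counting case, here is a sketch with made-up input. The awk step is my own addition: it only normalizes the padding that uniq -c puts in front of each count, which can vary between implementations.

```shell
#!/bin/sh
# Made-up input: "a" appears twice, "b" three times, "c" once.
counts=$(printf '%s\n' b a b c a b |
    LC_ALL=C sort |              # group identical lines together
    uniq -c |                    # prefix each distinct line with its count
    awk '{ print $1, $2 }')      # normalize whitespace: "count line"

echo "$counts"
# prints:
# 2 a
# 3 b
# 1 c
```

sort -u on the same input would print only a, b and c, with no counts.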

They should be functionally equivalent, and sort -u should be more efficient.

I'm guessing the examples you're looking at simply didn't consider (or didn't have) "sort -u" as an option.

Does uniq sort? No. At least on Ubuntu 18.04 and CentOS 6, it does not; it only removes consecutive duplicates.

You can simply conduct a mini experiment.

Let the file sample.txt be:

a
a
a
b
b
b
a
a
a
b
b
b

cat sample.txt | uniq will output:

a
b
a
b

while cat sample.txt | sort -u will output:

a
b

sort | uniq, on the other hand, should be functionally equivalent to sort -u here.
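
The mini experiment can be put into one script, roughly like this. It is only a sketch: the temporary file stands in for sample.txt, and LC_ALL=C is an assumption I added so the sort order does not depend on the locale.

```shell
#!/bin/sh
# Recreate the sample data from the experiment in a temporary file.
sample=$(mktemp)
printf '%s\n' a a a b b b a a a b b b > "$sample"

# uniq alone only collapses runs of adjacent identical lines.
uniq_only=$(uniq "$sample")

# Sorting first makes all identical lines adjacent.
sort_u=$(LC_ALL=C sort -u "$sample")
sort_then_uniq=$(LC_ALL=C sort "$sample" | uniq)

echo "uniq alone:"
echo "$uniq_only"
echo "sort -u:"
echo "$sort_u"
echo "sort | uniq:"
echo "$sort_then_uniq"

rm -f "$sample"
```

The first listing should still contain repeated lines, while the last two should match each other.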

Licensed under: CC-BY-SA with attribution