Question

In order to use the uniq command, you have to sort your file first.

But in the file I have, the order of the information is important. How can I keep the original order of the file but still get rid of the duplicate content?


Solution

Another awk version:

awk '!_[$0]++' infile
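For readers unfamiliar with the idiom: `_` is an associative array, and `_[$0]++` post-increments the count for the current line, so it evaluates to 0 (false) the first time a line is seen. The negation makes the pattern true only on first sight, and awk's default action for a true pattern is to print the line. A quick run on hypothetical sample data:

```shell
# Hypothetical sample data: duplicates present, original order matters.
printf 'b\na\nb\nc\na\n' > infile
awk '!_[$0]++' infile
# prints: b, a, c  (first occurrences, original order kept)
```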

OTHER TIPS

This awk keeps the first occurrence. Same algorithm as other answers use:

awk '!($0 in lines) { print $0; lines[$0]; }'

Here's one that only needs to store duplicated lines (as opposed to all lines) using awk:

sort file | uniq -d | awk '
   FNR == NR { dups[$0] }
   FNR != NR && (!($0 in dups) || !lines[$0]++)
' - file
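How this works: the `-` makes awk read the `uniq -d` output (the duplicated lines) from the pipe first; that is the `FNR == NR` pass, which fills `dups`. On the second pass over `file`, non-duplicated lines print unconditionally and duplicated lines print only on first sight. One caveat: if the file has no duplicates at all, `uniq -d` emits nothing, `FNR == NR` stays true for the whole second pass, and nothing is printed. A run on hypothetical sample data:

```shell
# Hypothetical sample data with duplicates.
printf 'b\na\nb\nc\na\n' > file
sort file | uniq -d | awk '
   FNR == NR { dups[$0] }
   FNR != NR && (!($0 in dups) || !lines[$0]++)
' - file
# prints: b, a, c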

There's also the "line-number, double-sort" method.

 nl -n ln file | sort -u -k 2 | sort -k 1n | cut -f 2-
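Spelled out on hypothetical sample data: `nl` prefixes each line with its number, the first `sort` deduplicates on the text (field 2 onward), the second restores original order by line number, and `cut` strips the numbers again. Note that plain `sort -u` does not guarantee *which* of several equal lines survives; with GNU or BSD sort, adding `-s` (stable) makes it reliably the first occurrence:

```shell
# Hypothetical sample data; -s (stable sort) keeps the first occurrence.
printf 'b\na\nb\nc\na\n' > infile
nl -n ln infile | sort -s -u -k 2 | sort -k 1n | cut -f 2-
# prints: b, a, c
```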

You can run uniq -d on the sorted version of the file to find the duplicate lines, then run some script that says:

if this_line is in duplicate_lines {
    if not i_have_seen[this_line] {
        output this_line
        i_have_seen[this_line] = true
    }
} else {
    output this_line
}
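That pseudo-code translates almost line for line into awk. A sketch, assuming the duplicated lines are saved to a file first (hypothetical file names); like the two-file trick above, it misbehaves if there are no duplicates at all, since the `NR == FNR` test then matches the main file too:

```shell
# Hypothetical sample data.
printf 'b\na\nb\nc\na\n' > file
sort file | uniq -d > duplicate_lines     # each duplicated line, once

awk '
    NR == FNR { dup[$0]; next }   # first file: remember duplicated lines
    !($0 in dup)                  # unique line: always print
    ($0 in dup) && !seen[$0]++    # duplicated line: print first time only
' duplicate_lines file
# prints: b, a, c
```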

Using only uniq and grep:

Create d.sh:

#!/bin/sh
# Usage: ./d.sh file  -- writes the deduplicated result to file_out
sort "$1" | uniq > "$1_uniq"
while IFS= read -r line; do
    # print the first occurrence, then drop that line from the remaining set
    grep -F -x -m1 "$line" "$1_uniq" >> "$1_out"
    grep -F -x -v "$line" "$1_uniq" > "$1_uniq2"
    mv "$1_uniq2" "$1_uniq"
done < "$1"
rm "$1_uniq"

Example:

./d.sh infile

You could use some horrible O(n^2) thing, like this (Pseudo-code):

file2 = EMPTY_FILE
for each line in file1:
  if not line in file2:
    file2.append(line)

This is potentially rather slow, especially if implemented at the Bash level. But if your files are reasonably short, it will probably work just fine, and would be quick to implement (not line in file2 is then just grep -v, and so on).
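A direct shell rendering of that pseudo-code (a sketch with hypothetical file names; `grep -qxF` is the quiet, whole-line, fixed-string form of the membership test):

```shell
# Hypothetical sample data.
printf 'b\na\nb\nc\na\n' > file1
: > file2                         # start with an empty output file
while IFS= read -r line; do
    # "if not line in file2": whole-line, fixed-string match
    grep -qxF -- "$line" file2 || printf '%s\n' "$line" >> file2
done < file1
cat file2
# prints: b, a, c
```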

Otherwise you could of course code up a dedicated program, using some more advanced data structure in memory to speed it up.

for line in $(sort file1 | uniq); do
    grep -n -m1 -Fx "$line" file1 >> out
done

sort -n out

First, do the sort.

For each unique value, grep for the first match (-m1) and preserve the line numbers.

Sort the output numerically (-n) by line number.

You could then remove the line numbers with sed or awk.
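Assembled into one pipeline on hypothetical sample data. Two assumptions worth flagging: the word-splitting `for` loop only works for lines without whitespace, and the `sed` step strips the `N:` prefix that `grep -n` adds:

```shell
# Hypothetical sample data; lines must contain no whitespace.
printf 'b\na\nb\nc\na\n' > file1
for line in $(sort file1 | uniq); do
    grep -n -m1 -Fx -- "$line" file1
done | sort -n | sed 's/^[0-9]*://'
# prints: b, a, c
```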

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow