How to keep a file's format if you use the uniq command (in shell)?
-
22-07-2019 - |
Question
In order to use the uniq command, you have to sort your file first.
But in the file I have, the order of the information is important, thus how can I keep the original format of the file but still get rid of duplicate content?
Solution
Another awk version:
awk '!_[$0]++' infile
OTHER TIPS
This awk
keeps the first occurrence. Same algorithm as other answers use:
awk '!($0 in lines) { print $0; lines[$0]; }'
Here's one that only needs to store duplicated lines (as opposed to all lines) using awk
:
sort file | uniq -d | awk '
FNR == NR { dups[$0] }
FNR != NR && (!($0 in dups) || !lines[$0]++)
' - file
There's also the "line-number, double-sort" method.
nl -n ln | sort -u -k 2| sort -k 1n | cut -f 2-
You can run uniq -d on the sorted version of the file to find the duplicate lines, then run some script that says:
if this_line is in duplicate_lines {
if not i_have_seen[this_line] {
output this_line
i_have_seen[this_line] = true
}
} else {
output this_line
}
Using only uniq and grep:
Create d.sh:
#!/bin/sh
sort $1 | uniq > $1_uniq
for line in $(cat $1); do
cat $1_uniq | grep -m1 $line >> $1_out
cat $1_uniq | grep -v $line > $1_uniq2
mv $1_uniq2 $1_uniq
done;
rm $1_uniq
Example:
./d.sh infile
You could use some horrible O(n^2) thing, like this (Pseudo-code):
file2 = EMPTY_FILE
for each line in file1:
if not line in file2:
file2.append(line)
This is potentially rather slow, especially if implemented at the Bash level. But if your files are reasonably short, it will probably work just fine, and would be quick to implement (not line in file2
is then just grep -v
, and so on).
Otherwise you could of course code up a dedicated program, using some more advanced data structure in memory to speed it up.
for line in $(sort file1 | uniq ); do
grep -n -m1 line file >>out
done;
sort -n out
first do the sort,
for each uniqe value grep for the first match (-m1)
and preserve the line numbers
sort the output numerically (-n) by line number.
you could then remove the line #'s with sed or awk