Question

I am writing a script to manipulate a text file.

The first thing I want to do is check whether duplicate entries exist and, if so, ask the user whether they want to keep or remove them.

I know how to display duplicate lines if they exist, but what I want to learn is just to get a yes/no answer to the question "Do duplicates exist?"

It seems uniq returns 0 whether or not duplicates were found, as long as the command completed without issues.

What command can I put in an if statement just to tell me whether duplicate lines exist?

My file is very simple: it is just values in a single column.

Solution 2

You can use awk combined with the boolean || operator:

# Ask question if awk found a duplicate
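awk 'a[$0]++{exit 1}' test.txt || (
    echo -n "remove duplicates? [y/n] "
    read -r answer
    # Remove duplicates if the answer was "y". `[` is the shorthand
    # for the test command. Check `help [`
    # sort -u is used because uniq alone only drops adjacent duplicates,
    # while the awk check also finds non-adjacent ones (the output is sorted).
    [ "$answer" = "y" ] && sort -u test.txt > test.uniq.txt
)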

The block after the || is executed only if the awk command exits with a non-zero status (here 1), meaning it found duplicates.

However, for a basic understanding I'll also show an example using an if block:
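To see the exit status that drives the || block, the awk check can be run against two throwaway files. This is only a sketch: the helper name has_dupes and the temporary files are made up for the demo and are not part of the answer above.

```shell
#!/bin/bash
# has_dupes is a hypothetical helper name.
# It inverts awk's status so that success (0) means "duplicates found".
has_dupes() {
    ! awk 'a[$0]++{exit 1}' "$1"
}

tmp=$(mktemp -d)
printf '%s\n' a b a > "$tmp/dups.txt"      # "a" appears twice, not adjacent
printf '%s\n' a b c > "$tmp/nodups.txt"    # all lines distinct

# awk exits with 1 on the first repeated line, and with 0 otherwise.
awk 'a[$0]++{exit 1}' "$tmp/dups.txt";   echo "dups.txt:   awk exit status $?"
awk 'a[$0]++{exit 1}' "$tmp/nodups.txt"; echo "nodups.txt: awk exit status $?"

if has_dupes "$tmp/dups.txt"; then
    echo "duplicates exist"
fi
rm -r "$tmp"
```

Wrapping the check in a function like this lets the calling code read as a plain yes/no question.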

awk 'a[$0]++{exit 1}' test.txt

# $? contains the exit status of the last command
if [ $? -ne 0 ] ; then
    echo -n "remove duplicates? [y/n] "
    read -r answer
    # check answer
    if [ "$answer" = "y" ] ; then
        # sort -u also removes non-adjacent duplicates (uniq alone would not)
        sort -u test.txt > test.uniq.txt
    fi
fi

However, the [] are not just brackets as in other programming languages. [ is a synonym for the test builtin command, and ] is its last argument. Read help [ to understand how it works.
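Because if simply runs a command and branches on its exit status, the awk call can also go straight into the if line, without $? or [. A minimal sketch, assuming bash; the temporary sample file and the result variable are invented for the demo.

```shell
#!/bin/bash
# Throwaway input for the demo (hypothetical file).
sample=$(mktemp)
printf '%s\n' one two one > "$sample"

# "[" is an ordinary command; type shows how the shell resolves it.
type [

# if runs the command and branches on its exit status.
# "!" negates it, because awk exits with 1 when it finds a duplicate.
if ! awk 'a[$0]++{exit 1}' "$sample"; then
    result="duplicates found"
else
    result="no duplicates"
fi
echo "$result"
rm -f "$sample"
```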

OTHER TIPS

I'd probably use awk to do this but, for the sake of variety, here is a brief pipe to accomplish the same thing:

$ { sort | uniq -d | grep . -qc; } < noduplicates.txt; echo $?
1
$ { sort | uniq -d | grep . -qc; } < duplicates.txt; echo $?
0

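sort followed by uniq -d makes sure that only duplicated lines (which don't have to be adjacent in the file) are printed to stdout. grep . then matches any non-empty line and exits with status 1 if nothing matched, that is, if there were no duplicates. -q silences the output so the check can be used quietly in a script. With -q present, -c (which would otherwise print a line count, much like wc -l) has no visible effect and could be dropped. Note that because grep . only matches non-empty lines, repeated blank lines are not reported.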

has_duplicates()
{
  {
    sort | uniq -d | grep . -qc
  } < "$1"
}

if has_duplicates myfile.txt; then
  echo "myfile.txt has duplicate lines"
else
  echo "myfile.txt has no duplicate lines"
fi
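One caveat with grep . is that it only matches lines containing at least one character, so a file whose only repeated line is the empty line would be reported as duplicate-free. If that matters, a variant using grep -q '^' (which matches every line, empty or not) may help. The function name has_duplicates_any and the demo file are assumptions made for this sketch.

```shell
#!/bin/bash
# has_duplicates_any is a hypothetical name for this variant.
# grep -q '^' succeeds as soon as uniq -d prints any line, even an empty one.
has_duplicates_any() {
    sort < "$1" | uniq -d | grep -q '^'
}

demo=$(mktemp)
printf 'a\n\nb\n\n' > "$demo"    # the empty line occurs twice
if has_duplicates_any "$demo"; then
    echo "duplicate lines found (blank lines included)"
fi
rm -f "$demo"
```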

A quick bash solution:

#!/bin/bash

INPUT_FILE=words

declare -A a 
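# IFS= and -r keep leading whitespace and backslashes in each line intact
while IFS= read -r line ; do
    [ "${a[$line]}" = 'nonempty' ] && duplicates=yes && break
    a[$line]=nonempty
done < "$INPUT_FILE"

removeDuplicates() {
    sort -u "$INPUT_FILE" > "$INPUT_FILE.tmp"
    mv "$INPUT_FILE.tmp" "$INPUT_FILE"
}

# Only ask (and only rewrite the file) when a duplicate was actually found
if [ "$duplicates" = yes ] ; then
    echo -n "Keep duplicates? [Y/n] "
    read -r keepDuplicates
    [ "$keepDuplicates" != "Y" ] && removeDuplicates
fi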

The script reads INPUT_FILE line by line and stores each line as a key in the associative array a, with the string nonempty as its value. Before storing a line, it checks whether the key is already there; if it is, the script has found a duplicate, so it sets the duplicates flag and breaks out of the loop.

Afterwards, if the flag is set, it asks the user whether to keep the duplicates. If they answer anything other than Y, it calls the removeDuplicates function, which uses sort -u to remove the duplicates (and, as a side effect, sorts the file). ${a[$line]} evaluates to the value stored in the associative array a under the key $line. [ "$duplicates" = yes ] is a test using the [ builtin. If the test succeeds, whatever follows && is run.

But note that the awk solutions will likely be faster, so you may want to use them if you expect to process bigger files.
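sort -u also reorders the file. If the original line order matters, a common awk idiom keeps the first occurrence of each line and drops later repeats. The sketch below is one possible way to use it; the function name dedupe_keep_order and the .tmp suffix are assumptions for illustration, mirroring the temporary-file approach of the script above.

```shell
#!/bin/bash
# dedupe_keep_order is a hypothetical helper name.
# awk prints a line only the first time it is seen, so order is preserved.
dedupe_keep_order() {
    awk '!seen[$0]++' "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

demo=$(mktemp)
printf '%s\n' c a c b a > "$demo"
dedupe_keep_order "$demo"
cat "$demo"    # first occurrences only, in their original order
rm -f "$demo"
```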

You can get a yes/no answer to the question "are all lines unique?" using this awk one-liner:

awk '!seen[$0]{seen[$0]++; i++} END{print (NR>i)?"no":"yes"}' file
  • awk keeps an array of the lines seen so far, called seen.
  • Every time a new line is added to seen, the counter i is incremented.
  • Finally, the END block compares the number of records (NR) with the number of unique records (i): (NR>i)?
  • If the condition is true, there are duplicate records and the script prints no; otherwise it prints yes.
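
The steps in the list above can be tried on two small inputs. This is only a sketch: the function name all_unique and the temporary file are invented for the demo, and the extra parentheses around the ternary are an addition to keep the print statement unambiguous.

```shell
#!/bin/bash
# all_unique is a hypothetical wrapper around the one-liner.
# It prints "yes" when every line is unique and "no" otherwise.
all_unique() {
    awk '!seen[$0]{seen[$0]++; i++} END{print ((NR>i)?"no":"yes")}' "$1"
}

demo=$(mktemp)
printf '%s\n' a b a > "$demo"   # 3 records, 2 unique
all_unique "$demo"              # prints: no
printf '%s\n' a b c > "$demo"   # 3 records, 3 unique
all_unique "$demo"              # prints: yes
rm -f "$demo"
```

Since the answer is printed rather than returned as an exit status, a script would compare the captured output, for example [ "$(all_unique file)" = no ].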
Licensed under: CC-BY-SA with attribution