Question

I have tab-delimited data that reads:

1 0 0 1 1 Black Swan
0 0 1 0 0 Golden Duck
1 0 0 1 0 Brown Eagle
0 0 1 0 1 Golden Duck
1 0 0 1 0 Black Swan
1 0 1 0 0 Golden Duck
1 0 0 1 1 Sparrow

The last column is a combination of one or more words separated by spaces. I want to count the unique values in the last column and replace each value with a number that is unique to its group. I know I can count the unique values using

awk -F '\t' '{print $NF}'  infile | sort | uniq | wc -l

But how do I replace the names with numbers? For example, replace every Black Swan with 1, every Golden Duck with 2, and so on. I want the result to be:

1 0 0 1 1 1
0 0 1 0 0 2
1 0 0 1 0 3
0 0 1 0 1 2
1 0 0 1 0 1
1 0 1 0 0 2
1 0 0 1 1 4

I also want to generate the list of numbers assigned to each value, like:

Black Swan 1
Golden Duck 2
Brown Eagle 3
Sparrow 4

Solution

You can use an associative array, keyed by the last field, and increment a counter the first time each name is seen:

awk '
    BEGIN { 
        FS = OFS = "\t" 
        i = 0
    }
    {
        if (! names[$NF]) {
            names[$NF] = ++i
        }
        $NF = names[$NF]
        print $0
    }
    END {
        for (name in names) {
            printf "%s %d\n", name, names[name]
        }
    }
' infile

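It yields (the order of the name list at the end may vary between awk implementations, because the iteration order of for (name in names) is unspecified):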

1       0       0       1       1       1
0       0       1       0       0       2
1       0       0       1       0       3
0       0       1       0       1       2
1       0       0       1       0       1
1       0       1       0       0       2
1       0       0       1       1       4
Golden Duck 2
Brown Eagle 3
Sparrow 4
Black Swan 1
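
If the name list should follow the numbering, as in the question, one option is to remember each name under its assigned number and walk the numbers in the END block. The following is a minimal sketch under that assumption; the order array, the map variable and the file names sample.tsv, recoded.tsv and map.txt are illustrative and not part of the original answer. It also writes the list to a separate file so it does not mix with the recoded data.

```shell
# Rebuild the sample from the question as a tab-delimited file.
tmp=$(mktemp -d)
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
    1 0 0 1 1 'Black Swan' \
    0 0 1 0 0 'Golden Duck' \
    1 0 0 1 0 'Brown Eagle' \
    0 0 1 0 1 'Golden Duck' \
    1 0 0 1 0 'Black Swan' \
    1 0 1 0 0 'Golden Duck' \
    1 0 0 1 1 'Sparrow' > "$tmp/sample.tsv"

awk -v map="$tmp/map.txt" '
    BEGIN { FS = OFS = "\t" }
    {
        if (!($NF in names)) {      # first time this name is seen
            names[$NF] = ++i
            order[i] = $NF          # remember the name under its number
        }
        $NF = names[$NF]
        print
    }
    END {
        # Walk the numbers 1..i instead of "for (name in names)",
        # whose iteration order is unspecified.
        for (n = 1; n <= i; n++)
            print order[n], n > map
    }
' "$tmp/sample.tsv" > "$tmp/recoded.tsv"

cat "$tmp/map.txt"
```

The recoded rows end up in recoded.tsv and the name list, tab-separated and in first-seen order, in map.txt.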

OTHER TIPS

I started writing this so I'll finish:

awk '
BEGIN {FS = OFS = "\t"}
{
    last[$NF] = (last[$NF] ? last[$NF] : ++cnt)
    $NF = last[$NF]
    line[NR] = $0
}
END {
    for (nr=1; nr<=NR; nr++) 
        print line[nr]
    for (name in last) 
        print name, last[name]
}' file

Output:

1       0       0       1       1       1
0       0       1       0       0       2
1       0       0       1       0       3
0       0       1       0       1       2
1       0       0       1       0       1
1       0       1       0       0       2
1       0       0       1       1       4
Brown Eagle     3
Black Swan      1
Sparrow         4
Golden Duck     2

Update:

Here is a Perl alternative:

perl -F'\t' -lane '
    $h{$F[-1]} = ++$c unless exists $h{$F[-1]}; 
    $F[-1] = $h{$F[-1]}; 
    print join "\t", @F }{ print "$_  $h{$_}" for keys %h
' file

Output:

1       0       0       1       1       1
0       0       1       0       0       2
1       0       0       1       0       3
0       0       1       0       1       2
1       0       0       1       0       1
1       0       1       0       0       2
1       0       0       1       1       4
Golden Duck  2
Brown Eagle  3
Black Swan  1
Sparrow  4

Here is another update based on mpapec's excellent comment:

perl -F'\t' -lane '
    $F[-1] = $h{$F[-1]} ||= ++$c; 
    print join "\t", @F }{ print "$_  $h{$_}" for keys %h
' file 

What you want to do is create a set of the unique values in the last column. A set can be implemented as a dictionary (hash table) whose keys are the unique elements. Once the set is built, you can look up each string in it and replace the string with the corresponding number.

Here is a link on sets to help you out:

http://world.std.com/~swmcd/steven/perl/pm/set.html
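
The set-then-replace idea above can be sketched as two passes with awk. This is only one possible reading of that suggestion, not the answerer's own code; the file names sample.tsv, map.tsv and recoded.tsv and the array names seen and id are made up for the illustration.

```shell
# Rebuild the sample from the question as a tab-delimited file.
work=$(mktemp -d)
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
    1 0 0 1 1 'Black Swan' \
    0 0 1 0 0 'Golden Duck' \
    1 0 0 1 0 'Brown Eagle' \
    0 0 1 0 1 'Golden Duck' \
    1 0 0 1 0 'Black Swan' \
    1 0 1 0 0 'Golden Duck' \
    1 0 0 1 1 'Sparrow' > "$work/sample.tsv"

# Pass 1: build the set of unique names, numbered in first-seen order.
# seen[] acts as the set: the pattern is true only on a name's first occurrence.
awk 'BEGIN { FS = OFS = "\t" }
     !seen[$NF]++ { print $NF, ++n }' "$work/sample.tsv" > "$work/map.tsv"

# Pass 2: load the name-to-number map (first file), then replace the
# last field of every data row (second file) by looking it up.
awk 'BEGIN { FS = OFS = "\t" }
     NR == FNR { id[$1] = $2; next }
     { $NF = id[$NF]; print }' "$work/map.tsv" "$work/sample.tsv" > "$work/recoded.tsv"

cat "$work/recoded.tsv" "$work/map.tsv"
```

Two passes read the input twice, but the map file can be kept and reused to recode other files with the same numbering.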

Licensed under: CC-BY-SA with attribution