Question

I have tab-delimited data that reads:

1 0 0 1 1 Black Swan
0 0 1 0 0 Golden Duck
1 0 0 1 0 Brown Eagle
0 0 1 0 1 Golden Duck
1 0 0 1 0 Black Swan
1 0 1 0 0 Golden Duck
1 0 0 1 1 Sparrow

The last column is a combination of one or more words separated by spaces. I want to count the number of unique values in the last column and replace each value with a number that is unique to its group. I know I can count the unique values using

awk -F '\t' '{print $NF}'  infile | sort | uniq | wc -l

But how do I replace the names with numbers? For example, replace every Black Swan with 1, every Golden Duck with 2, and so on. I want the result to be:

1 0 0 1 1 1
0 0 1 0 0 2
1 0 0 1 0 3
0 0 1 0 1 2
1 0 0 1 0 1
1 0 1 0 0 2
1 0 0 1 1 4

I also want to generate the list of numbers assigned to each value, like:

Black Swan 1
Golden Duck 2
Brown Eagle 3
Sparrow 4

Solution

You can use an associative array to assign an incrementing counter to each distinct name:

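awk '
    BEGIN {
        # Read and write tab-separated fields
        FS = OFS = "\t"
        i = 0
    }
    {
        # First occurrence of this name: give it the next number
        if (! names[$NF]) {
            names[$NF] = ++i
        }
        # Replace the name in the last field with its number
        $NF = names[$NF]
        print $0
    }
    END {
        # List each name with its number (iteration order is unspecified)
        for (name in names) {
            printf "%s %d\n", name, names[name]
        }
    }
' infile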

It yields:

1       0       0       1       1       1
0       0       1       0       0       2
1       0       0       1       0       3
0       0       1       0       1       2
1       0       0       1       0       1
1       0       1       0       0       2
1       0       0       1       1       4
Golden Duck 2
Brown Eagle 3
Sparrow 4
Black Swan 1
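
The name list above comes out in arbitrary order because awk does not guarantee the iteration order of for (name in names). If the list should follow the order in which the numbers were assigned, one option is to keep a second array indexed by number. The following is a minimal sketch, not part of the original answer; the array names id and name, the variable names sample and result, and the temporary file are assumptions made for illustration.

```shell
# Hypothetical sample file with the tab-separated rows from the question.
sample=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
    1 0 0 1 1 'Black Swan' \
    0 0 1 0 0 'Golden Duck' \
    1 0 0 1 0 'Brown Eagle' \
    0 0 1 0 1 'Golden Duck' \
    1 0 0 1 0 'Black Swan' \
    1 0 1 0 0 'Golden Duck' \
    1 0 0 1 1 'Sparrow' > "$sample"

result=$(awk '
    BEGIN { FS = OFS = "\t" }
    {
        # First occurrence: assign the next number and remember
        # which name belongs to that number.
        if (!($NF in id)) {
            id[$NF] = ++n
            name[n] = $NF
        }
        $NF = id[$NF]
        print
    }
    END {
        # Walk the numbers 1..n so the list follows assignment order.
        for (i = 1; i <= n; i++)
            print name[i], i
    }
' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

With this variant the trailing list is tab-separated and always starts with the name that appeared first in the input.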

Other tips

I started writing this so I'll finish:

awk '
BEGIN {FS = OFS = "\t"}
{
    last[$NF] = (last[$NF] ? last[$NF] : ++cnt)
    $NF = last[$NF]
    line[NR] = $0
}
END {
    for (nr=1; nr<=NR; nr++) 
        print line[nr]
    for (name in last) 
        print name, last[name]
}' file

Output:

1       0       0       1       1       1
0       0       1       0       0       2
1       0       0       1       0       3
0       0       1       0       1       2
1       0       0       1       0       1
1       0       1       0       0       2
1       0       0       1       1       4
Brown Eagle     3
Black Swan      1
Sparrow         4
Golden Duck     2

Update:

Here is a Perl alternative:

perl -F'\t' -lane '
    $h{$F[-1]} = ++$c unless exists $h{$F[-1]}; 
    $F[-1] = $h{$F[-1]}; 
    print join "\t", @F }{ print "$_  $h{$_}" for keys %h
' file

Output:

1       0       0       1       1       1
0       0       1       0       0       2
1       0       0       1       0       3
0       0       1       0       1       2
1       0       0       1       0       1
1       0       1       0       0       2
1       0       0       1       1       4
Golden Duck  2
Brown Eagle  3
Black Swan  1
Sparrow  4

Here is another update based on mpapec's excellent comment:

perl -F'\t' -lane '
    $F[-1] = $h{$F[-1]} ||= ++$c; 
    print join "\t", @F }{ print "$_  $h{$_}" for keys %h
' file 

What you want is to build a set of the unique values. A set can be implemented as a dictionary (hash table), since its keys are all distinct. Once the set is built, you can look each string up in it and replace the string with the number assigned to it.

Here is a link on sets in Perl that may help:

http://world.std.com/~swmcd/steven/perl/pm/set.html
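
The build-then-replace idea described above can be sketched in awk by reading the file twice: the first pass collects the distinct names, the second pass substitutes them. This is only an illustration under assumptions, not the method from the linked page; the array name id, the variable names sample and result, and the temporary file are hypothetical.

```shell
# Hypothetical sample file with tab-separated rows like those in the question.
sample=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
    1 0 0 1 1 'Black Swan' \
    0 0 1 0 0 'Golden Duck' \
    1 0 0 1 0 'Brown Eagle' \
    0 0 1 0 1 'Golden Duck' \
    1 0 0 1 1 'Sparrow' > "$sample"

# The same file is passed twice; NR == FNR is true only while
# awk is reading the first copy.
result=$(awk '
    BEGIN { FS = OFS = "\t" }
    # Pass 1: build the set of names, numbering each new one.
    NR == FNR {
        if (!($NF in id)) id[$NF] = ++n
        next
    }
    # Pass 2: every name is already in the set, so just substitute.
    { $NF = id[$NF]; print }
' "$sample" "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

Because the input is read twice, this form needs a regular file and will not work on data piped through standard input; the single-pass versions above do not have that restriction.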

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow