Question

I would like to transform a two-column file into a table of zeros and ones, ready for PCA (principal component analysis). The input file has the bacteria name in the first column and a bacteria descriptor in the second column.

Possible way: store the input file in a hash, then take the unique values of each column (some kind of 'uniq') to build the row and column labels of the output file. Finally, for each combination in the output file, write 1 if the bacteria name and the descriptor occur together in the input hash, and 0 otherwise.

Input file (tab-delimited):

bacteria_1  protein:plasmid:149679
bacteria_1  protein:proph:183386
bacteria_2  protein:proph:183386
bacteria_3  protein:plasmid:147856
bacteria_3  protein:proph:183386

Desired output (tab-delimited):

    protein:plasmid:149679  protein:proph:183386    protein:plasmid:147856
bacteria_1  1   1   0
bacteria_2  0   1   0
bacteria_3  0   1   1

Solution

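Here is one way with GNU awk (asorti() is a gawk extension, so rows and columns come out sorted). The repeated tabs in the printf formats only pad the columns for display; use a single "\t" per field if the output must be strictly tab-delimited: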

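awk '{
    header[$2]++      # every descriptor seen (column labels)
    bacteria[$1]++    # every bacteria name seen (row labels)
    map[$1,$2]++      # every (name, descriptor) pair present in the input
}
END {
    # asorti() sorts the indices of an array into a new array (gawk only)
    x = asorti(header, header_s)
    for (i = 1; i <= x; i++) {
        printf "\t%s\t", header_s[i]
    }
    print ""
    y = asorti(bacteria, bacteria_s)
    for (j = 1; j <= y; j++) {
        printf "%s\t\t", bacteria_s[j]
        for (z = 1; z <= x; z++) {
            # 1 if this pair occurred in the input, 0 otherwise
            printf "%s\t\t\t\t", (map[bacteria_s[j],header_s[z]]) ? "1" : "0"
        }
        print ""
    }
}' file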
        protein:plasmid:147856          protein:plasmid:149679          protein:proph:183386
bacteria_1              0                               1                               1
bacteria_2              0                               0                               1
bacteria_3              1                               0                               1

Here is a solution with regular awk. Rows and columns keep the order in which they first appear in the input, as in the desired output:

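awk '
# Record each name and descriptor the first time it appears (keeps input order).
# Separate "seen" arrays avoid a clash if a name ever equals a descriptor.
!seen_bacteria[$1]++ {bacteria[++x] = $1}
!seen_protein[$2]++  {protein[++y] = $2}
{map[$1,$2]++}
END {
    for (i = 1; i <= y; i++) {
        printf "\t%s\t", protein[i]
    }
    print ""
    for (j = 1; j <= x; j++) {
        printf "%s\t\t", bacteria[j]
        for (a = 1; a <= y; a++) {
            printf "%s\t\t\t\t", (map[bacteria[j], protein[a]]) ? "1" : "0"
        }
        print ""
    }
}' file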
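
For readers who prefer Python, the same first-seen-order logic can be sketched as below. This is only a sketch: the function name presence_table and the in-memory sample are illustrative assumptions, not part of the original answers. Unlike the awk versions, it joins fields with a single tab, so the result is strictly tab-delimited as in the desired output.

```python
def presence_table(lines):
    """Return tab-delimited table rows for 'name<whitespace>descriptor' lines."""
    names = {}        # dict used as an ordered set (first-seen order, Python 3.7+)
    descriptors = {}  # same, for the column labels
    seen = set()      # (name, descriptor) pairs present in the input
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        name, desc = fields
        names[name] = None
        descriptors[desc] = None
        seen.add((name, desc))
    # Header row starts with an empty cell above the row labels.
    rows = ["\t" + "\t".join(descriptors)]
    for name in names:
        cells = ["1" if (name, d) in seen else "0" for d in descriptors]
        rows.append(name + "\t" + "\t".join(cells))
    return rows


# Hypothetical usage with the sample input from the question.
sample = [
    "bacteria_1\tprotein:plasmid:149679",
    "bacteria_1\tprotein:proph:183386",
    "bacteria_2\tprotein:proph:183386",
    "bacteria_3\tprotein:plasmid:147856",
    "bacteria_3\tprotein:proph:183386",
]
print("\n".join(presence_table(sample)))
```

To read from a file instead, pass an open file object, since iterating over it yields lines.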

OTHER TIPS

Quick Python script (Python 3):

#!/usr/bin/env python3

import fileinput
from collections import defaultdict

output = defaultdict(set)   # bacteria name -> set of its descriptors
proteins = set()            # every descriptor seen in the input

for line in fileinput.input():
    if not line.strip():
        continue  # skip blank lines
    bacteria, protein = line.split()
    proteins.add(protein)
    output[bacteria].add(protein)

columns = sorted(proteins)

# Print header
print(' ' * 12, end=' ')
for header in columns:
    print('{:25}'.format(header), end=' ')
print()

# Print table (rows in first-seen order on Python 3.7+)
for key in output:
    print('{:12}'.format(key), end=' ')
    for header in columns:
        print('{:22}'.format(1 if header in output[key] else 0), end=' ')
    print()

Outputs:

$ python3 table.py inputfile
             protein:plasmid:147856    protein:plasmid:149679    protein:proph:183386
bacteria_1                        0                      1                      1
bacteria_2                        0                      0                      1
bacteria_3                        1                      0                      1
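
Since the stated goal is PCA, a numeric matrix may be more convenient than printed text. A minimal sketch, assuming the (name, descriptor) pairs have already been read into memory; presence_matrix is a hypothetical helper name, and rows and columns are sorted alphabetically.

```python
def presence_matrix(pairs):
    """Return (row_labels, column_labels, matrix) with 0/1 integer cells."""
    pairs = set(pairs)
    row_labels = sorted({name for name, _ in pairs})
    column_labels = sorted({desc for _, desc in pairs})
    # One row per bacteria name, one column per descriptor.
    matrix = [
        [int((name, desc) in pairs) for desc in column_labels]
        for name in row_labels
    ]
    return row_labels, column_labels, matrix


# Hypothetical usage with the sample data from the question.
pairs = [
    ("bacteria_1", "protein:plasmid:149679"),
    ("bacteria_1", "protein:proph:183386"),
    ("bacteria_2", "protein:proph:183386"),
    ("bacteria_3", "protein:plasmid:147856"),
    ("bacteria_3", "protein:proph:183386"),
]
rows, cols, matrix = presence_matrix(pairs)
for name, row in zip(rows, matrix):
    print(name, row)
```

The nested list can then be handed to whatever PCA routine is in use, with the label lists kept alongside to identify rows and columns.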
Licensed under: CC-BY-SA with attribution