Question

I have some data in the form of a character array:

cgcgcg
aacacg
cgcaag
cgcacg
agaacg
cacaag
agcgcg
cgcaca
cacaca
agaacg
cgcacg
cgcgaa

Notice that each column consists of only two types of characters. I need to transform them into the integers 0 or 1, based on their share of the column. For instance, in the 1st column there are 8 c's and 4 a's, so c is in the majority and should be coded as 0, and the other character as 1.

Using zip() I can transpose this array in Python and get each column as a tuple:

In [28]: lines = [l.strip() for l in open(inputfn)]

In [29]: list(zip(*lines))
Out[29]: 
[('c', 'a', 'c', 'c', 'a', 'c', 'a', 'c', 'c', 'a', 'c', 'c'),
 ('g', 'a', 'g', 'g', 'g', 'a', 'g', 'g', 'a', 'g', 'g', 'g'),
 ('c', 'c', 'c', 'c', 'a', 'c', 'c', 'c', 'c', 'a', 'c', 'c'),
 ('g', 'a', 'a', 'a', 'a', 'a', 'g', 'a', 'a', 'a', 'a', 'g'),
 ('c', 'c', 'a', 'c', 'c', 'a', 'c', 'c', 'c', 'c', 'c', 'a'),
 ('g', 'g', 'g', 'g', 'g', 'g', 'g', 'a', 'a', 'g', 'g', 'a')]

It's not necessary to transform them strictly into integers, i.e. 'c' to '0' or 'c' to int(0) will both be ok, since we are going to write them to a tab delimited file anyway.

Solution

Something like this:

lis = [('c', 'a', 'c', 'c', 'a', 'c', 'a', 'c', 'c', 'a', 'c', 'c'),
 ('g', 'a', 'g', 'g', 'g', 'a', 'g', 'g', 'a', 'g', 'g', 'g'),
 ('c', 'c', 'c', 'c', 'a', 'c', 'c', 'c', 'c', 'a', 'c', 'c'),
 ('g', 'a', 'a', 'a', 'a', 'a', 'g', 'a', 'a', 'a', 'a', 'g'),
 ('c', 'c', 'a', 'c', 'c', 'a', 'c', 'c', 'c', 'c', 'c', 'a'),
 ('g', 'g', 'g', 'g', 'g', 'g', 'g', 'a', 'a', 'g', 'g', 'a')]
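def solve(lis):
    for row in lis:
        # Each column is assumed to hold exactly two distinct characters.
        item1, item2 = set(row)
        c1, c2 = row.count(item1), row.count(item2)
        # The majority character maps to 0, the minority to 1.
        dic = {item1 : int(c1 < c2), item2 : int(c2 < c1)}
        yield [dic[x] for x in row]
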
>>> list(solve(lis))
[[0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0],
[0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0],
[1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1],
[0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1]]
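
To see what the dictionary built inside solve() holds, here is a minimal sketch that applies the same logic to the first column only. The helper name encode_column is made up for illustration and is not part of the original answer. It keeps the answer's assumption that a column contains exactly two distinct characters; with any other number, the two-name unpacking from set(row) raises a ValueError.

```python
def encode_column(row):
    # Hypothetical helper: the body of solve() applied to a single column.
    item1, item2 = set(row)  # assumes exactly two distinct characters
    c1, c2 = row.count(item1), row.count(item2)
    # The less frequent character gets 1, the more frequent one gets 0.
    mapping = {item1: int(c1 < c2), item2: int(c2 < c1)}
    return mapping, [mapping[x] for x in row]

first_column = ('c', 'a', 'c', 'c', 'a', 'c', 'a', 'c', 'c', 'a', 'c', 'c')
mapping, encoded = encode_column(first_column)
print(mapping)   # 'c' (8 occurrences) maps to 0, 'a' (4 occurrences) maps to 1
print(encoded)
```

If both characters occur equally often, neither comparison is true, so both end up coded as 0.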

Using collections.Counter:

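from collections import Counter
from pprint import pprint

def solve(lis):
    for row in lis:
        c = Counter(row)
        # Count of the most frequent character in this column.
        maxx = max(c.values())
        # Characters rarer than the majority become 1; the majority becomes 0.
        yield [int(c[x] < maxx) for x in row]
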
>>> pprint(list(solve(lis)))
[[0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0],
 [0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0],
 [0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0],
 [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1],
 [0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1],
 [0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1]]
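
Since the question wants the result in a tab-delimited file, the encoded columns have to be transposed back into rows before writing. A rough sketch of that round trip, reusing the Counter-based approach; the helper name encode_lines is hypothetical, and the input is kept in memory here instead of being read from inputfn.

```python
from collections import Counter

def solve(columns):
    # Same Counter-based encoding as above: majority -> 0, minority -> 1.
    for col in columns:
        counts = Counter(col)
        top = max(counts.values())
        yield [int(counts[x] < top) for x in col]

def encode_lines(lines):
    # Hypothetical helper: transpose, encode each column, transpose back,
    # then join each resulting row with tabs.
    encoded_columns = solve(zip(*lines))
    return ["\t".join(map(str, row)) for row in zip(*encoded_columns)]

lines = ["cgcgcg", "aacacg", "cgcaag", "cgcacg", "agaacg", "cacaag",
         "agcgcg", "cgcaca", "cacaca", "agaacg", "cgcacg", "cgcgaa"]
rows = encode_lines(lines)
for out in rows:
    print(out)
```

Each element of rows corresponds to one line of the original input, so writing "\n".join(rows) to a file gives the tab-delimited output.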
Licensed under: CC-BY-SA with attribution