Question

I have a huge input file that looks like this,

c651    OS05T0-00    492    749 29.07
c651    OS01T0-00    1141   1311    55.00
c1638   MLOC_8.3     27 101 72.00
c1638   MLOC_8.3     25 117 70.97
c2135   TRIUR3_3-P1  124    210 89.66
c2135   EMT17965    25  117 70.97
c1914   OS02T0-00    2  109 80.56
c1914   OS02T0-00    111    155 93.33
c1914   OS08T0-00    528    617 50.00

I would like to iterate over the lines of each c, check whether they all have the same element in line[1], and print to 2 separate files:

  1. cs whose lines all contain the same element, and
  2. cs whose lines do not.

In the case of c1914, since 2 of its elements are the same and 1 is not, it goes to file 2. So the desired 2 output files will look like this, file1.txt

c1638   MLOC_8.3     27 101 72.00
c1638   MLOC_8.3     25 117 70.97

file2.txt

c651    OS05T0-00    492    749 29.07
c651    OS01T0-00    1141   1311    55.00
c2135   TRIUR3_3-P1  124    210 89.66
c1914   OS02T0-00    2  109 80.56
c1914   OS02T0-00    111    155 93.33
c1914   OS08T0-00    528    617 50.00

This is what I tried,

oh1=open('result.txt','w')
oh2=open('result2.txt','w')
f=open('file.txt','r')
lines=f.readlines()
for line in lines:
    new_list=line.split()
    protein=new_list[1]
    for i in range(1,len(protein)):
        (p, c) = protein[i-1], protein[i]
        if c == p:
            new_list.append(protein)
            oh1.write(line)
        else:
            oh2.write(line)

Solution

If I understand you correctly, you want to send all lines of your input file that share a first element txt1 to the first output file if the second element txt2 of all those lines is the same; otherwise, all those lines go to the second output file. (Your original loop compares consecutive characters of the string protein rather than comparing the second field across lines, which is why it doesn't do what you want.) Here is a program that does that.

from collections import defaultdict

# Read in file line-by-line for the first time
# Build up dictionary of txt1 to set of txt2 s
txt1totxt2 = defaultdict(set)
f = open('file.txt', 'r')
for line in f:
    lst = line.split()
    txt1 = lst[0]
    txt2 = lst[1]
    txt1totxt2[txt1].add(txt2)

# The dictionary tells us whether the second text
# is unique or not. If it's unique the set has
# just one element; otherwise the set has > 1 elts.
# Read in file for second time, sending each line
# to the appropriate output file
f.seek(0)
oh1 = open('result1.txt', 'w')
oh2 = open('result2.txt', 'w')

for line in f:
    lst = line.split()
    txt1 = lst[0]
    if len(txt1totxt2[txt1]) == 1:
        oh1.write(line)
    else:
        oh2.write(line)

f.close()
oh1.close()
oh2.close()

The program logic is very simple. For each txt1 it builds up the set of txt2s that it sees. When you're done reading the file, if a set has just one element, then you know that the txt2 for that txt1 is unique; if the set has more than one element, then there are at least two different txt2s. Note that this means that if you only have one line in the input file with a particular txt1, it will always be sent to the first output file. There are ways around this if this is not the behaviour you want; one is sketched below.
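
For instance, here is a minimal sketch of one such workaround, assuming you also want single-line groups sent to the second output file: count the lines per txt1 during the first pass (the linecount dictionary is my addition, not part of the program above) and check the count alongside the set.

from collections import defaultdict

txt1totxt2 = defaultdict(set)
linecount = defaultdict(int)  # extra counter: lines seen per txt1

with open('file.txt') as f:
    for line in f:
        lst = line.split()
        txt1totxt2[lst[0]].add(lst[1])
        linecount[lst[0]] += 1

with open('file.txt') as f, \
     open('result1.txt', 'w') as oh1, open('result2.txt', 'w') as oh2:
    for line in f:
        txt1 = line.split()[0]
        # require at least two lines sharing txt1, not just a unique txt2
        if len(txt1totxt2[txt1]) == 1 and linecount[txt1] > 1:
            oh1.write(line)
        else:
            oh2.write(line)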

Note also that because the file is large, I've read it in line by line: lines=f.readlines() in your original program reads the whole file into memory at once. I've stepped through the file twice: the second pass does the output. If this increases the run time too much, you can go back to holding the lines in memory instead of reading the file a second time. However, the program as written should be much more robust for very large files. Conversely, if your files are very large indeed, it would be worth reworking the program to reduce the memory usage further (the dictionary txt1totxt2 could be replaced with something more compact, albeit more complicated, if necessary).
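
To illustrate the trade-off, here is a sketch of that single-pass alternative, assuming the same file names: it buffers the original lines grouped by txt1, so the file is read only once but is held in memory in its entirety.

from collections import defaultdict

groups = defaultdict(list)   # txt1 -> the original lines for that txt1
seconds = defaultdict(set)   # txt1 -> set of txt2 values seen

with open('file.txt') as f:
    for line in f:
        lst = line.split()
        groups[lst[0]].append(line)
        seconds[lst[0]].add(lst[1])

with open('result1.txt', 'w') as oh1, open('result2.txt', 'w') as oh2:
    for txt1, lines in groups.items():
        out = oh1 if len(seconds[txt1]) == 1 else oh2
        out.writelines(lines)

Note that this writes each group's lines contiguously, in first-seen order, rather than in strict input order.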

Edit: there was a good point in the comments (now deleted) about the memory cost of this algorithm. To elaborate: the memory usage could be high, but it isn't as severe as storing the whole file. Rather, txt1totxt2 maps the first text of each line to the set of second texts seen with it, so its size is of the order of (number of unique first texts) * (average number of unique second texts per first text). This is likely to be a lot smaller than the file itself, but the approach may require further optimization. The idea here is to get something simple working first; it can then be iterated on to optimize further if necessary.
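
As one example of the kind of reduction meant here (the names first_seen and is_mixed are illustrative, not from the answer): since all we need to know is whether a txt1 has more than one distinct txt2, it is enough to remember the first txt2 seen plus a boolean that flips when a different value appears, instead of keeping a full set.

first_seen = {}  # txt1 -> first txt2 encountered
is_mixed = {}    # txt1 -> True once a second distinct txt2 shows up

with open('file.txt') as f:
    for line in f:
        txt1, txt2 = line.split()[:2]
        if txt1 not in first_seen:
            first_seen[txt1] = txt2
            is_mixed[txt1] = False
        elif txt2 != first_seen[txt1]:
            is_mixed[txt1] = True

with open('file.txt') as f, \
     open('result1.txt', 'w') as oh1, open('result2.txt', 'w') as oh2:
    for line in f:
        (oh2 if is_mixed[line.split()[0]] else oh1).write(line)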

OTHER TIPS

Try this...

import collections

parsed_data = collections.OrderedDict()

with open("input.txt", "r") as fd:
    for line in fd:
        line_data = line.split()
        key = line_data[0]
        key2 = line_data[1]
        if key not in parsed_data:
            parsed_data[key] = collections.OrderedDict()
        if key2 not in parsed_data[key]:
            parsed_data[key][key2] = [line]
        else:
            parsed_data[key][key2].append(line)

# now process the parsed data and write result files
fsimilar = open("similar.txt", "w")
fdifferent = open("different.txt", "w")

for key in parsed_data:
    if len(parsed_data[key]) == 1:
        f = fsimilar
    else:
        f = fdifferent
    for key2 in parsed_data[key]:
        for line in parsed_data[key][key2]:
            f.write(line)
fsimilar.close()
fdifferent.close()

Hope this helps

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow