Question

I wrote a piece of code that finds common IDs in column 1 (line[1]) of two different files. My input file is huge (2 million lines). If I split it into many small files, I get more intersecting IDs than when I run the whole file at once. I cannot figure out why. Can you suggest what is wrong and how to improve this code to avoid the problem?

fileA = open("file1.txt",'r')
fileB = open("file2.txt",'r')
output = open("result.txt",'w')

dictA = dict()
for line1 in fileA:
    listA = line1.split('\t')
    dictA[listA[1]] = listA

dictB = dict()
for line1 in fileB:
    listB = line1.split('\t')
    dictB[listB[1]] = listB

for key in dictB:
    if key in dictA:
        output.write(dictA[key][0]+'\t'+dictA[key][1]+'\t'+dictB[key][4]+'\t'+dictB[key][5]+'\t'+dictB[key][9]+'\t'+dictB[key][10])

My file1 is sorted by line[0] and has columns 0-15:

contig17    GRMZM2G052619_P03  98 109 2 0 15 67 78.8 0 127 5 420 0 304 45
contig33    AT2G41790.1        98 420 2 0 21 23 78.8 1 127 5 420 2 607 67
contig98    GRMZM5G888620_P01  87 470 1 0 17 28 78.8 1 127 7 420 2 522 18  
contig102   GRMZM5G886789_P02  73 115 1 0 34 45 78.8 0 134 5 421 0 456 50  
contig123   AT3G57470.1        83 201 2 1 12 43 78.8 0 134 9 420 0 305 50

My file2 is not sorted and has columns 0-10:

GRMZM2G052619 GRMZM2G052619_P03 4 2345 GO:0043531 ADP binding "Interacting selectively and non-covalently with ADP" [GOC:jl] molecular_function PF07525  1        
GRMZM5G888620 GRMZM5G888620_P01 1 2367 GO:0011551 DNA binding "Any molecular function by which a gene product interacts selectively and non-covalently with DNA" [GOC:jl] molecular_function PF07589  4    
GRMZM5G886789 GRMZM5G886789_P02 1 4567 GO:0055516 ADP binding "Interacting selectively and non-covalently with ADP" [GOC:jl] molecular_function PF07526 0    

My desired output,

contig17    GRMZM2G052619_P03  GO:0043531 ADP binding molecular_function PF07525
contig98    GRMZM5G888620_P01  GO:0011551 DNA binding molecular_function PF07589 
contig102   GRMZM5G886789_P02  GO:0055516 ADP binding molecular_function PF07526  

Solution

I really recommend using pandas for this kind of problem.

As proof, here is how simply it can be done with pandas:

import pandas as pd  # install this, and read the docs
from io import StringIO  # only needed to simulate files here (Python 2: from StringIO import StringIO)

# simulating reading the first file
first_file = """contig17 GRMZM2G052619_P03 x
contig33 AT2G41790.1 x
contig98 GRMZM5G888620_P01 x
contig102 GRMZM5G886789_P02 x
contig123 AT3G57470.1 x"""

#simulating reading the second file
second_file = """y GRMZM2G052619_P03 y
y GRMZM5G888620_P01 y
y GRMZM5G886789_P02 y"""

# Here is how you open the files. Instead of StringIO(...),
# simply pass the file path. Give the correct separator:
# sep="\t" for tab-separated data. Here I am using a space.
# In names, put some relevant names for your columns.
f_df = pd.read_table(StringIO(first_file), 
                     header=None, 
                     sep=" ", 
                     names=['a', 'b', 'c'])
s_df = pd.read_table(StringIO(second_file), 
                     header=None, 
                     sep=" ", 
                     names=['d', 'e', 'f'])
# This is the key step. It selects the rows of the second data frame
# whose value in column e (the ID) also appears in column b of the
# first data frame.
my_df = s_df[s_df.e.isin(f_df.b)]

Output:

    d   e                   f
0   y   GRMZM2G052619_P03   y
1   y   GRMZM5G888620_P01   y
2   y   GRMZM5G886789_P02   y
#you can save this with:
my_df.to_csv("result.txt", sep="\t")
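
Note that isin only filters the rows of the second file; it does not bring in the contig column from the first file, which the desired output needs. A minimal sketch of one way to get both, assuming an inner join with DataFrame.merge on the shared ID column is what is wanted. The column names (contig, protein_id, score, gene, go_id, go_name) and the shortened sample rows are made up for illustration and are not from the original files:

```python
from io import StringIO

import pandas as pd

# Shortened, tab-separated stand-ins for file1 and file2 (illustrative only).
file1_text = (
    "contig17\tGRMZM2G052619_P03\t98\n"
    "contig33\tAT2G41790.1\t98\n"
    "contig98\tGRMZM5G888620_P01\t87\n"
)
file2_text = (
    "GRMZM2G052619\tGRMZM2G052619_P03\tGO:0043531\tADP binding\n"
    "GRMZM5G888620\tGRMZM5G888620_P01\tGO:0011551\tDNA binding\n"
)

# Hypothetical column names; replace them with names that fit the real files.
df1 = pd.read_csv(StringIO(file1_text), sep="\t", header=None,
                  names=["contig", "protein_id", "score"])
df2 = pd.read_csv(StringIO(file2_text), sep="\t", header=None,
                  names=["gene", "protein_id", "go_id", "go_name"])

# Inner join: keep only IDs present in both frames. If an ID occurs several
# times in either frame, every matching combination is kept.
merged = df1.merge(df2, on="protein_id", how="inner")

# Pick the columns wanted in the output.
result = merged[["contig", "protein_id", "go_id", "go_name"]]
print(result.to_csv(sep="\t", index=False, header=False))
```

With real files, the StringIO(...) arguments would be replaced by the file paths, and result.to_csv("result.txt", sep="\t", index=False, header=False) would write the table without the index column.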

Cheers!

OTHER TIPS

This is almost the same but within a function.

# Creates a function to do the reading for each file
import re

def read_store(file_, dictio_):
    """Given a file name and a dictionary, store each line of the file
    in the dictionary, keyed by the value in its second column."""
    with open(file_, 'r') as file_0:
        for line in file_0:
            # Capture the second whitespace-separated field. \S+ is used
            # instead of \w+ so IDs with a dot (e.g. AT2G41790.1) match in full.
            match = re.match(r"\S+\s+(\S+)", line)
            if match:
                dictio_[match.group(1)] = line

To use it:

file1 = {}
read_store("file1.txt", file1)

Then compare the dictionaries as you normally do, but I would split on any whitespace (\s) instead of \t. That also splits between the words of a multi-word field, but those are easy to rejoin by joining the relevant slice of the split list, for example " ".join(parts[1:5]).
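
One possible explanation for the differing counts, offered as a guess since the full files are not shown: a plain dict keeps only the last line for each ID, so repeated IDs overwrite each other when the whole file is loaded at once, while split files can each contribute their own match. A minimal sketch that keeps every line per ID instead, using collections.defaultdict. The helper names index_by_id and join_on_id and the sample rows are hypothetical:

```python
from collections import defaultdict


def index_by_id(lines, id_column=1):
    """Group tab-split rows by the value in id_column, keeping every row."""
    index = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > id_column:  # skip blank or short lines
            index[fields[id_column]].append(fields)
    return index


def join_on_id(lines_a, lines_b):
    """Return every (row_a, row_b) pair whose ID columns are equal."""
    index_a = index_by_id(lines_a)
    index_b = index_by_id(lines_b)
    pairs = []
    for key, rows_b in index_b.items():
        # .get avoids creating empty entries in the defaultdict
        for row_a in index_a.get(key, []):
            for row_b in rows_b:
                pairs.append((row_a, row_b))
    return pairs


# Illustrative data: ID_1 appears twice in the second input.
sample_a = ["contig17\tID_1\t98\n", "contig33\tID_2\t98\n"]
sample_b = ["g1\tID_1\tGO:1\n", "g1\tID_1\tGO:2\n"]

pairs = join_on_id(sample_a, sample_b)
for row_a, row_b in pairs:
    print("\t".join([row_a[0], row_a[1], row_b[2]]))
```

Open file objects can be passed directly as lines_a and lines_b. If only one output line per ID is wanted, the original dict approach is fine, but then the count from a single run, not from split files, is the one to trust.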

Licensed under: CC-BY-SA with attribution