Question

I am trying to find the IDs common to two files and write the result to a new file, including the additional columns that belong to those IDs. How can I do this?

Input file1.txt

F775_23607  EMT15298  GO:0003674    molecular_function PF08268  345
F775_00510  EMT20601  GO:0005515    protein binding    PF08268  456
F775_00510  EMT23774  GO:0003674    molecular_function PF00646  134
F775_00510  EMT23774  GO:0005515    protein binding    PF03106  888
F775_23182  EMT33502  GO:0003677    DNA binding    PF03106  789

Input file2.txt

contig15    EMT15298  95.27 148
contig18    EMT04099  97.95 293 
contig18    EMT20601  92.83 293 
contig18    EMT23062  93.17 293

Desired output file (I want to be able to decide which lines to print and which not)

EMT15298  GO:0003674 molecular_function PF08268
EMT20601  GO:0005515 protein binding    PF08268

My script (which basically prints only the column that the two files have in common)

fileA = open("file1.txt",'r')
fileB = open("file2.txt",'r')
output = open("results.txt",'w')
fileA.next()

setA = set()
for line1 in fileA:
    listA = line1.split('\t')
    setA.add(listA[1])
setB = set()
for line1 in fileB:
    listB = line1.split('\t')
    setB.add(listB[1])
for key in setA & setB:
    output.writelines(key+'\n')

Solution 2

You can use dicts instead of sets:

# Map each ID (second column) to the fields of its line.
# Note: if an ID occurs more than once, the last line wins.
dictA = {}
with open("file1.txt") as fileA:
    for line1 in fileA:
        listA = line1.rstrip('\n').split('\t')
        dictA[listA[1]] = listA

dictB = {}
with open("file2.txt") as fileB:
    for line1 in fileB:
        listB = line1.rstrip('\n').split('\t')
        dictB[listB[1]] = listB

with open("results.txt", 'w') as output:
    for key in set(dictA).intersection(dictB):
        # Choose the columns to print here (index 1 is the ID itself).
        output.write('\t'.join(dictA[key][1:5]) + '\n')
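
One caveat with the dict version: an ID that occurs more than once in file1.txt (EMT23774 in the sample) keeps only one of its lines, and the output order is arbitrary. If every matching line should be kept in file order, one option is to collect the IDs of file2.txt in a set and stream through file1.txt. The following is a minimal sketch, assuming tab-delimited input with the ID in the second column; the function name filter_common and its columns/key parameters are illustrative and not part of the original answer:

```python
def filter_common(lines_a, lines_b, columns=(1, 2, 3, 4), key=1):
    """Yield the chosen columns of every line in lines_a whose ID also occurs in lines_b."""
    # IDs from the second file; a set gives constant-time membership tests.
    ids_b = set()
    for line in lines_b:
        fields = line.rstrip('\n').split('\t')
        if len(fields) > key:
            ids_b.add(fields[key].strip())
    for line in lines_a:
        fields = line.rstrip('\n').split('\t')
        if len(fields) > key and fields[key].strip() in ids_b:
            # Keep only the columns requested by the caller.
            yield '\t'.join(fields[i] for i in columns if i < len(fields))


# Sample data taken from the question (the tab delimiter is an assumption).
sample_a = [
    "F775_23607\tEMT15298\tGO:0003674\tmolecular_function\tPF08268\t345\n",
    "F775_00510\tEMT20601\tGO:0005515\tprotein binding\tPF08268\t456\n",
    "F775_00510\tEMT23774\tGO:0003674\tmolecular_function\tPF00646\t134\n",
]
sample_b = [
    "contig15\tEMT15298\t95.27\t148\n",
    "contig18\tEMT04099\t97.95\t293\n",
    "contig18\tEMT20601\t92.83\t293\n",
]
result = list(filter_common(sample_a, sample_b))
```

With real files, open file objects can be passed in place of the sample lists, and changing columns decides which fields are printed.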

OTHER TIPS

Since your first text file contains all of the "fields" needed for the output, we can reduce the logic and the number of steps slightly.

First we open the two input files and read them into lists:

with open('file1.txt', 'r') as a, open('file2.txt','r') as b:
    fileA = [l.rstrip('\n').split('\t')[1:5] for l in a.readlines()]
    fileB = [l.rstrip('\n').split('\t')[1:] for l in b.readlines()]

So now we have two lists, fileA and fileB. Note the slice notation on both of them. fileA already holds every value you want in the output, so it only needs to be filtered against the second list. The first column is dropped from both lists so that the EMT... values come first and can be used for the comparison.
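
To make the slices concrete, here is what they produce for the first line of each sample file (a small illustration, assuming the columns are separated by tabs; the variable names are only for this example):

```python
line_a = "F775_23607\tEMT15298\tGO:0003674\tmolecular_function\tPF08268\t345\n"
line_b = "contig15\tEMT15298\t95.27\t148\n"

# [1:5] keeps columns 2 to 5: the EMT ID, GO ID, GO term and PF ID.
row_a = line_a.rstrip('\n').split('\t')[1:5]
# [1:] drops only the first column (the contig name).
row_b = line_b.rstrip('\n').split('\t')[1:]

print(row_a)
print(row_b)
```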

Now we can check whether the ID of each fileA row appears in any row of fileB and write the matches to the results file:

with open('results.txt','w') as o:
    for line in fileA:
        if any(line[0] in l for l in fileB):
            o.write('%s\n' % '\t'.join(line))

results.txt is once again tab-delimited with the corresponding matches:

EMT15298    GO:0003674  molecular_function  PF08268
EMT20601    GO:0005515  protein binding PF08268
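
If the columns from file2.txt should be appended to each match as well, a dict keyed on the ID can replace the scan over fileB. This is only a sketch: join_on_id is a hypothetical helper name, the rows are shaped like the fileA and fileB lists built above, and it assumes each ID occurs at most once in file2.txt:

```python
def join_on_id(rows_a, rows_b):
    """Return each row of rows_a extended with the extra fields of its match in rows_b."""
    # Map each ID (first item after slicing) to the remaining fields of its row.
    extra = {row[0]: row[1:] for row in rows_b if row}
    return [row + extra[row[0]] for row in rows_a if row and row[0] in extra]


# Rows shaped like the fileA / fileB lists (sample values from the question).
rows_a = [
    ['EMT15298', 'GO:0003674', 'molecular_function', 'PF08268'],
    ['EMT20601', 'GO:0005515', 'protein binding', 'PF08268'],
    ['EMT23774', 'GO:0003674', 'molecular_function', 'PF00646'],
]
rows_b = [
    ['EMT15298', '95.27', '148'],
    ['EMT04099', '97.95', '293'],
    ['EMT20601', '92.83', '293'],
]
joined = join_on_id(rows_a, rows_b)
```

Each joined row can then be written with '\t'.join(row), exactly as in the loop above.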

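If you just want to do a "join" operation, you can use the Unix join command and specify the second column as the join field. For a tab-delimited file, set the field separator to a tab as well, because fields such as "protein binding" contain spaces (the $'\t' quoting is bash syntax):

join -t $'\t' -j 2 file1.txt file2.txt

Both inputs must be sorted on the join field, otherwise join will not work. You can sort them first with the sort command (file1.sorted.txt and file2.sorted.txt are just example names):

sort -t $'\t' -k2,2 file1.txt > file1.sorted.txt
sort -t $'\t' -k2,2 file2.txt > file2.sorted.txt

join prints the join field first, followed by the remaining fields of the first file and then those of the second. To select the columns you want, pipe the result to cut:

join -t $'\t' -j 2 file1.sorted.txt file2.sorted.txt | cut -f1,3,4,5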
Licensed under: CC-BY-SA with attribution