Question

I have a problem joining two large files on 5 common columns and returning the rows whose 5-tuples are identical. Here is exactly what I mean:

File1:

132.227 49202 107.21 80
132.227 49202 107.21 80
132.227 49200 107.220 80
132.227 49200 107.220 80
132.227 49222 207.171 80
132.227 49339 184.730 80
132.227 49291 930.184 80
............
............
............

The file contains many more lines, not just these.

File 2:

46.109498000 132.227 49200 107.220 80 17 48 
46.927339000 132.227 49291 930.184 80 17 48 
47.422919000 253.123 1985 224.300 1985 17 48
48.412761000 132.253 1985 224.078 1985 17 48
48.638454000 132.127 1985 232.123 1985 17 48
48.909658000 132.227 49291 930.184 80 17 65
48.911360000 132.227 49200 107.220 80 17 231
............
............
............

Output File:

46.109498000 132.227 49200 107.220 80 17 48 
46.927339000 132.227 49291 930.184 80 17 48 
48.909658000 132.227 49291 930.184 80 17 65
48.911360000 132.227 49200 107.220 80 17 231
............
............
............

Here is the code I wrote:

with open('log1', 'r') as fl1:
    f1 = [i.split(' ') for i in fl1.read().split('\n')]

with open('log2', 'r') as fl2:
    f2 = [i.split(' ') for i in fl2.read().split('\n')]

def merging(x,y):
    list=[]
    for i in x:
        for j in range(len(i)-1):
            while i[j]==[a[b] for a in y]:
                list.append(i)
                j=j+1
    return list

f3=merging(f1,f2)

for i in f3:
    print i

Solution

I think you want file2 filtered by file1, right?

I assume that file1 is not sorted. (If it is sorted, a more efficient solution exists.)

with open('file1') as file1, open('file2') as file2:
    my_filter = [line.strip().split() for line in file1]
    f3 = [line.strip() for line in filter(lambda x: x.strip().split()[1:5] in my_filter, file2)]

# to see f3
for line in f3:
    print line

First, build the filter, my_filter = [line.strip().split() for line in file1], which contains:

[['132.227', '49202', '107.21', '80'], ['132.227', '49202', '107.21', '80'], ['132.227', '49200', '107.220', '80'], ['132.227', '49200', '107.220', '80'], ['132.227', '49222', '207.171', '80'], ['132.227', '49339', '184.730', '80'], ['132.227', '49291', '930.184', '80']]

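Then use filter to keep only the lines of file2 whose columns 2 to 5 appear in my_filter. This code needs Python 2.7 or later for the two-file with statement, and print line is Python 2 syntax.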
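
Because both files are large, testing each line of file2 against a list of lists scans that list every time. A minimal sketch of the same filter using a set of tuples is shown below. It is not the code from the answer above: the function name filter_by_keys and its start and width parameters are hypothetical, and the sample rows are copied from the question. It assumes whitespace-separated columns and that the key occupies columns 2 to 5 of file2.

```python
def filter_by_keys(key_lines, data_lines, start=1, width=4):
    """Yield stripped lines of data_lines whose columns
    start .. start+width-1 equal the first width columns of a key line."""
    # A set of tuples gives average constant-time membership tests
    # and collapses the duplicate rows that file1 contains.
    keys = set()
    for line in key_lines:
        fields = line.split()
        if len(fields) >= width:  # skip blank or short lines
            keys.add(tuple(fields[:width]))
    for line in data_lines:
        fields = line.split()
        # A short line yields a shorter tuple, which is never in keys.
        if tuple(fields[start:start + width]) in keys:
            yield line.strip()


# Sample rows taken from the question (file1 has duplicates and a blank line).
file1_rows = ["132.227 49200 107.220 80", "132.227 49200 107.220 80",
              "132.227 49291 930.184 80", ""]
file2_rows = ["46.109498000 132.227 49200 107.220 80 17 48 ",
              "47.422919000 253.123 1985 224.300 1985 17 48",
              "48.909658000 132.227 49291 930.184 80 17 65"]

result = list(filter_by_keys(file1_rows, file2_rows))
for row in result:
    print(row)
```

Open file objects can be passed in place of the two lists, since the function only iterates over lines. Each matching line of file2 is emitted once, even when its key appears several times in file1.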

OTHER TIPS

I wrote these lines and they seem to work:

with open('file1', 'r') as fl1:
    f1 = [i.split(' ') for i in fl1.read().split('\n')]

with open('file2', 'r') as fl2:
    f2 = [i.split(' ') for i in fl2.read().split('\n')]

for i in f2:
    for j in f1:
        if i[1]==j[0] and i[2]==j[1] and i[3]==j[2] and i[4]==j[3]:
            print i

I tried to replace

if i[1]==j[0] and i[2]==j[1] and i[3]==j[2] and i[4]==j[3]:

with:

for k in range(4):
    if i[k+1]==j[k]:
        print i

but it gave me this error:

Traceback (most recent call last):
  File "MERGE.py", line 10, in <module>
    if i[k+1]==j[k]:
IndexError: list index out of range
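
The loop is not equivalent to the chained condition in two ways. It prints as soon as any single column matches, and it keeps indexing after a mismatch, where the and chain would have stopped. A plausible cause of the IndexError, assuming file1 ends with a newline, is that split('\n') leaves a final row [''], so j[1] does not exist. The and chain never reaches j[1] for that row because the first comparison already fails. A small sketch of a four-column comparison that keeps the short-circuit behaviour and guards against short rows follows. The helper name columns_match and its parameters are hypothetical.

```python
def columns_match(row2, row1, width=4, offset=1):
    """Return True when row2[offset:offset+width] equals row1[:width]."""
    # Guard against blank or truncated rows before indexing into them.
    if len(row1) < width or len(row2) < offset + width:
        return False
    # all() stops at the first mismatch, like a chain of `and`.
    return all(row2[k + offset] == row1[k] for k in range(width))


row2 = "46.109498000 132.227 49200 107.220 80 17 48".split(" ")
good = "132.227 49200 107.220 80".split(" ")
blank = "".split(" ")  # [''], the row a trailing newline produces

print(columns_match(row2, good))
print(columns_match(row2, blank))
```

In the nested loops, the condition would then read if columns_match(i, j): followed by the print.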

Licensed under: CC-BY-SA with attribution