Question

I have two huge text files (size from 1 to 5 GB) and I have to compute the difference between them using shell commands. The problem is that, for each line of these files, I have to ignore some characters at specific positions.

At first I used diff, cutting the characters I have to ignore out of both files:

diff <(cut -b1-58,69- FILE1.TXT) <(cut -b1-58,69- FILE2.TXT) > OUTPUT.TXT;

but the problem is that the output file then does not contain the characters that were cut.

Example


FILE1:

0123456311020130123456

aaaaaaA31102013aaaaaaA

bbbbbbb31102013bbbbbbb


FILE2:

0123456310920130123456

1111111310920131111111

aaaaaaB31092013aaaaaaB

bbbbbbb31092013bbbbbbb


The diff must ignore characters 8 to 15 of each line and return the new or changed lines of FILE2 in full:


OUTPUT:

1111111310920131111111

aaaaaaB31092013aaaaaaB


Can anybody help me?

Many thanks, Francesco


Solution 2

Thanks to the Python hint, I did it:

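            # Python 3. Columns 59-68 (1-based) are ignored, matching cut -b1-58,69-.
            def key(line):
                line = line.rstrip("\n")
                return line[:58] + line[68:]

            # Collect the comparison key of every line in the first file.
            with open("FILE1.TXT", "r") as file1:
                seen = {key(line1) for line1 in file1}

            # Write each complete line of the second file whose key is not in the first.
            with open("FILE2.TXT", "r") as file2, open("OUTPUT.TXT", "w") as out:
                for line2 in file2:
                    if key(line2) not in seen:
                        out.write(line2)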

With two big files (2.8GB) it takes about 20 sec.

Thanks folks!
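
For anyone who wants to check the idea on the small example from the question, here is a minimal sketch that ignores characters 8 to 15. The names comparison_key and lines_only_in_second are made up for this illustration, and the sample lines are held in lists instead of being read from files. The column arguments are 1-based and inclusive, so cut -b1-58,69- would correspond to first=59, last=68.

```python
def comparison_key(line, first, last):
    """Return line without columns first..last (1-based, inclusive)
    and without the trailing newline."""
    line = line.rstrip("\n")
    return line[:first - 1] + line[last:]


def lines_only_in_second(lines1, lines2, first, last):
    """Yield the complete lines of lines2 whose key does not occur in lines1."""
    seen = {comparison_key(line, first, last) for line in lines1}
    for line in lines2:
        if comparison_key(line, first, last) not in seen:
            yield line


# Sample data copied from the question.
file1 = [
    "0123456311020130123456\n",
    "aaaaaaA31102013aaaaaaA\n",
    "bbbbbbb31102013bbbbbbb\n",
]
file2 = [
    "0123456310920130123456\n",
    "1111111310920131111111\n",
    "aaaaaaB31092013aaaaaaB\n",
    "bbbbbbb31092013bbbbbbb\n",
]

result = list(lines_only_in_second(file1, file2, 8, 15))
print("".join(result), end="")
```

This prints the two lines listed under OUTPUT in the question. Note that the set keeps one key per distinct line of the first file in memory, which is worth keeping in mind for files of several GB.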

OTHER TIPS

diff probably isn't the right tool for this, since you are only interested in comparing part of each line and only want output from the second file. You'll need to write your own comparison script, which is made easier because you are only interested in differences between corresponding lines in each file. An example in Python:

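# Python 3. Columns 59-68 (1-based) are skipped, as in cut -b1-58,69-.
with open("FILE1.TXT", "r") as f1, open("FILE2.TXT", "r") as f2:
    # zip pairs the lines by position and stops at the end of the shorter file.
    for line1, line2 in zip(f1, f2):
        if (line1[:58] != line2[:58] or
                line1[68:] != line2[68:]):
            print(line2, end="")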
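
One caveat with comparing by position: it only works when the two files line up. The sketch below applies a positional comparison to the sample data from the question, ignoring characters 8 to 15. The helper names mask and changed_by_position are invented for this illustration, and itertools.zip_longest is used in place of zip so that surplus lines at the end of the second file are reported too.

```python
from itertools import zip_longest


def mask(line, first, last):
    """Drop columns first..last (1-based, inclusive) and the trailing newline."""
    line = line.rstrip("\n")
    return line[:first - 1] + line[last:]


def changed_by_position(lines1, lines2, first, last):
    """Yield each line of lines2 that differs from the line at the same
    position in lines1, plus any surplus lines at the end of lines2."""
    for line1, line2 in zip_longest(lines1, lines2):
        if line2 is None:
            break  # lines1 is longer; nothing left in lines2 to report
        if line1 is None or mask(line1, first, last) != mask(line2, first, last):
            yield line2


# Sample data copied from the question.
file1 = [
    "0123456311020130123456\n",
    "aaaaaaA31102013aaaaaaA\n",
    "bbbbbbb31102013bbbbbbb\n",
]
file2 = [
    "0123456310920130123456\n",
    "1111111310920131111111\n",
    "aaaaaaB31092013aaaaaaB\n",
    "bbbbbbb31092013bbbbbbb\n",
]

reported = list(changed_by_position(file1, file2, 8, 15))
print("".join(reported), end="")
```

Because FILE2 in the question has an extra line in the middle, every line after it is shifted by one, so this also reports the bbbbbbb line, which is not in the expected OUTPUT. If lines can be inserted or removed, a lookup by key (a set or dict built from the first file) is probably the safer choice.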
Licensed under: CC-BY-SA with attribution