문제

I have been using the Python difflib library to find where 2 documents differ. The Differ().compare() method does this, but it is very slow - atleast 100x slower for large HTML documents compared to the diff command.

How can I efficiently determine where 2 documents differ in Python? (Ideally I am after the positions rather the actual text, which is what SequenceMatcher().get_opcodes() returns.)

도움이 되었습니까?

해결책

a = open("file1.txt").readlines()
b = open("file2.txt").readlines()
count = 0
pos = 0

while 1:
    count += 1
    try:
        al = a.pop(0)
        bl = b.pop(0)
        if al != bl:
            print "files differ on line %d, byte %d" % (count,pos)
        pos += len(al)
    except IndexError:
        break

다른 팁

Google has a diff library for plain text with a python API, which should apply to the html documents you want to work with. I am not sure if it is suited for your particular use case where you are specifically interested in the location of the differences, but it is worth having a look at.

An ugly and stupid solution: If diff is faster, use it; through a call from python via subprocess, parse the command output for the information you need. This won't be as fast as just diff, but maybe faster than difflib.

라이센스 : CC-BY-SA ~와 함께 속성
제휴하지 않습니다 StackOverflow
scroll top