Determine where documents differ with Python
22-09-2019
Problem
I have been using the Python difflib library to find where two documents differ. The Differ().compare() method does this, but it is very slow: at least 100x slower than the diff command for large HTML documents.
How can I efficiently determine where two documents differ in Python? (Ideally I am after the positions of the differences rather than the actual text, i.e. the kind of information SequenceMatcher().get_opcodes() returns.)
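For reference, here is a minimal sketch of what "positions rather than text" looks like with SequenceMatcher().get_opcodes() when it is run on lists of lines. The helper name changed_ranges and the sample data are made up for illustration. Working at line level skips the intraline comparison that Differ().compare() performs, so it may be cheaper, but that has not been benchmarked here.

```python
from difflib import SequenceMatcher


def changed_ranges(a_lines, b_lines):
    """Return (tag, i1, i2, j1, j2) for every opcode that is not 'equal'.

    a_lines[i1:i2] is the affected slice of the first document and
    b_lines[j1:j2] the corresponding slice of the second one.
    """
    # autojunk=False disables difflib's "popular element" heuristic,
    # which can otherwise ignore frequently repeated lines in long inputs.
    matcher = SequenceMatcher(None, a_lines, b_lines, autojunk=False)
    return [op for op in matcher.get_opcodes() if op[0] != "equal"]


# Hypothetical sample documents, already split into lines.
old = ["<html>\n", "<p>one</p>\n", "<p>two</p>\n", "</html>\n"]
new = ["<html>\n", "<p>one</p>\n", "<p>2</p>\n", "<p>three</p>\n", "</html>\n"]

for tag, i1, i2, j1, j2 in changed_ranges(old, new):
    print(tag, (i1, i2), (j1, j2))  # prints: replace (2, 3) (2, 4)
```

The indices are 0-based and half-open, so they can be used directly as slice bounds on the two line lists.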
Solution
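# Compare two files line by line and report every line that differs.
# The files are opened in binary mode so that pos is a true byte offset.
with open("file1.txt", "rb") as fa, open("file2.txt", "rb") as fb:
    a = fa.readlines()
    b = fb.readlines()

pos = 0  # byte offset of the current line within file1.txt

# zip() stops at the end of the shorter file, as the original loop did.
for count, (al, bl) in enumerate(zip(a, b), start=1):
    if al != bl:
        print("files differ on line %d, byte %d" % (count, pos))
    pos += len(al)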
Other tips
Google has a diff library for plain text (diff-match-patch) with a Python API, which should also work for the HTML documents you want to compare. I am not sure whether it suits your particular use case, where you are specifically interested in the locations of the differences, but it is worth a look.
An ugly and stupid solution: if diff is faster, use it. Call it from Python via subprocess and parse the command output for the information you need. This won't be as fast as running diff alone, but it may be faster than difflib.
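A rough sketch of the subprocess approach, assuming a POSIX-style diff executable is on PATH and emits its default ("normal") output format, where each hunk starts with a change command such as 3c3, 5,7d4 or 8a9,10. The function names parse_normal_diff and diff_positions are invented for this example, and no timing comparison against difflib is claimed.

```python
import re
import subprocess

# A change command in diff's default output: line range in the first file,
# one of a (add), c (change), d (delete), then line range in the second file.
HUNK_RE = re.compile(r"^(\d+)(?:,(\d+))?([acd])(\d+)(?:,(\d+))?$")


def parse_normal_diff(output):
    """Extract (op, a_start, a_end, b_start, b_end) from normal-format output.

    Line numbers are 1-based and inclusive, exactly as diff prints them.
    For 'a' the first-file number is the line after which text was added;
    for 'd' the second-file number is the line after which text is missing.
    """
    hunks = []
    for line in output.splitlines():
        m = HUNK_RE.match(line)
        if m:  # content lines start with '<', '>' or '---' and never match
            a1, a2, op, b1, b2 = m.groups()
            hunks.append((op, int(a1), int(a2 or a1), int(b1), int(b2 or b1)))
    return hunks


def diff_positions(path_a, path_b):
    """Run the external diff command and return the parsed hunk positions."""
    # text=True decodes with the locale encoding; files in another encoding
    # may need a different decoding strategy.
    proc = subprocess.run(["diff", path_a, path_b],
                          capture_output=True, text=True)
    # diff exits with 0 (identical), 1 (differences found) or >1 (trouble).
    if proc.returncode > 1:
        raise RuntimeError(proc.stderr)
    return parse_normal_diff(proc.stdout)
```

Calling diff_positions("file1.txt", "file2.txt") would then return one tuple per hunk. Only the hunk headers are inspected, so the differing text itself never needs to be processed in Python.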