문제

wondering about the best way to approach this particular problem and if any libraries (python preferably, but I can be flexible if need be).

I have a file with a string on each line. I would like to find the longest common patterns and their locations in each line. I know that I can use SequenceMatcher to compare line one and two, one and three, so on and then correlate the results, but if there something that already does it?

Ideally these matches would appear anywhere on each line, but for starters I can be fine with them existing at the same offset in each line and go from there. Something like a compression library that has a good API to access its string table might be ideal, but I have not found anything so far that fits that description.

For instance with these lines:

\x00\x00\x8c\x9e\x28\x28\x62\xf2\x97\x47\x81\x40\x3e\x4b\xa6\x0e\xfe\x8b
\x00\x00\xa8\x23\x2d\x28\x28\x0e\xb3\x47\x81\x40\x3e\x9c\xfa\x0b\x78\xed
\x00\x00\xb5\x30\xed\xe9\xac\x28\x28\x4b\x81\x40\x3e\xe7\xb2\x78\x7d\x3e

I would want to see that 0-1, and 10-12 match in all lines at the same position and line1[4,5] matches line2[5,6] matches line3[7,8].

Thanks,

도움이 되었습니까?

해결책

If all you want is to find common substrings that are at the same offset in each line, all you need is something like this:

matches = []
zipped_strings = zip(s1,s2,s3)
startpos = -1
for i in len(zipped_strings):
  c1,c2,c3 = zipped_strings[i]
  # if you're not inside a match, 
  #  look for matching characters and save the match start position
  if startpos==-1 and c1==c2==c3:
    startpos = i
  # if you are inside a match, 
  #  look for non-matching characters, save the match to matches, reset startpos
  elif startpos>-1 and not c1==c2==c3:
    matches.append((startpos,i,s1[startpos:i]))
    # matches will contain (startpos,endpos,matchstring) tuples
    startpos = -1
# if you're still inside a match when you run out of string, save that match too!
if startpos>-1:
  endpos = len(zipped_strings)
  matches.append((startpos,endpos,s1[startpos:endpos]))

To find the longest common pattern regardless of location, SequenceMatcher does sound like the best idea, but instead of comparing string1 to string2 and then string1 to string3 and trying to merge the results, just get all common substrings of string1 and string2 (with get_matching_blocks), and then compare each result of that to string3 to get matches between all three strings.

다른 팁

Is your problem performance?

How big is your input?

Is the minimum strings length to match 2?

Note that your example is not correct I think as the results you expect do not match the sample strings you provided.

라이센스 : CC-BY-SA ~와 함께 속성
제휴하지 않습니다 StackOverflow
scroll top