It's not that it's "confusing the ten's and digit's place", it's that in the first matchup the ten's place isn't different, so it's considered part of the matching prefix.
For your use case, there seems to be a pretty easy solution to this ambiguity: just match all adjacent pairs, and take the minimum. Like this:
def prefix(x, y):
comp = SequenceMatcher(None, x, y)
matches = comp.get_matching_blocks()
prefix_match = matches[0]
prefix_size = prefix_match[2]
return prefix_size
pairs = zip(files, files[1:])
matches = (prefix(x, y) for x, y in pairs)
prefixlen = min(matches)
prefix = files[0][:prefixlen]
The prefix
function is pretty straightforward, except for one thing: I made it take a single tuple of two values instead of two arguments, just to make it easier to call with map
. And I used the [2]
instead of .size
because there's an annoying bug in 2.7 difflib
where the second call to get_matching_blocks
may return a tuple
instead of a namedtuple
. This won't affect the code as-is, but if you add some debugging print
s it will break.
Now, pairs
is a list of all adjacent pairs of names, created by zip
ping together names
and names[1:]
. (If this isn't clear, print(zip(names, names[1:])
. If you're using Python 3.x, you'll need to print(list(zip(names, names[1:]))
instead, because zip
returns a lazy iterator instead of a printable list.)
Now we just want to call prefix
on each of the pairs, and take the smallest value we get back. That's what min
is for. (I'm passing it a generator expression, which can be a tricky concept at first—but if you just think of it as a list comprehension that doesn't build the list, it's pretty simple.)
You could obviously compact this into two or three lines while still leaving it readable:
prefixlen = min(SequenceMatcher(None, x, y).get_matching_blocks()[0][2]
for x, y in zip(files, files[1:]))
prefix = files[0][:prefixlen]
However, it's worth considering that SequenceMatcher
is probably overkill here. It's looking for the longest matches anywhere, not just the longest prefix matches, which means it's essentially O(N^3) on the length of the strings, when it only needs to be O(NM) where M is the length of the result. Plus, it's not inconceivable that there could be, say, a suffix that's longer than the longest prefix, so it would return the wrong result.
So, why not just do it manually?
def prefixes(name):
while name:
yield name
name = name[:-1]
def maxprefix(names):
first, names = names[0], names[1:]
for prefix in prefixes(first):
if all(name.startswith(prefix) for name in names):
return prefix
prefixes(first)
just gives you 'FilePrefix10.jpg'
, 'FilePrefix10.jp',
'FilePrefix10.j, etc. down to
'F'`. So we just loop over those, checking whether each one is also a prefix of all of the other names, and return the first one that is.
And you can do this even faster by thinking character by character instead of prefix by prefix:
def maxprefix(names):
for i, letters in enumerate(zip(*names)):
if len(set(letters)) > 1:
return names[0][:i]
Here, we're just checking whether the first character is the same in all names, then whether the second character is the same in all names, and so on. Once we find one where that fails, the prefix is all characters up to that (from any of the names).
The zip
reorganizes the list of names into a list of tuples, where the first one is the first character of each name, the second is the second character of each name, and so on. That is, [('F', 'F', 'F', 'F'), ('i', 'i', 'i', 'i'), …]
.
The enumerate
just gives us the index along with the value. So, instead of getting ('F', 'F', 'F', 'F')
you get 0, ('F, 'F', F', 'F')
. We need that index for the last step.
Now, to check that ('F', 'F', 'F', 'F')
are all the same, I just put them in a set
. If they're all the same, the set will have just one element—{'F'}
, then {'i'}
, etc. If they're not, it'll have multiple elements—{'1', '2'}
—and that's how we know we've gone past the prefix.