Question

I have a list of keys: ['foo_a', 'foo_b', 'foo_c', 'fnord'], and I want to detect the prefix that most of them share ('foo_' in this case).

All of the similar solutions here assume that there are no outliers like 'fnord' in the data.

I have this code that does the job:

def detect_prefix(keys):
    PCT = 0.70  # cutoff: the prefix must cover at least this share of the keys
    pre = ''
    l = len(keys)
    for i in range(0, len(max(keys, key=len))):
        # keep only the keys that still match the prefix found so far
        keys = filter(lambda k: k.startswith(pre) and len(k) > i, keys)
        cnt = dict()
        for k in map(lambda k: k[i], keys):
            cnt.setdefault(k, 0)
            cnt[k] += 1
        if not cnt:
            break
        best = max(cnt, key=cnt.get)  # most frequent character at position i
        if cnt[best] / float(l) >= PCT:
            pre += best
        else:
            break
    return pre
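
For the list above it returns the predominant prefix:

print(detect_prefix(['foo_a', 'foo_b', 'foo_c', 'fnord']))
# -> foo_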

I have a strong suspicion that this could be done much more elegantly, but my python-fu is not strong enough right now.

I would love to hear some suggestions.

Edit: Additional background and clarification.
These are keys that other developers have put in an application, to be used for translation. They should have a common prefix, but people forget, and they copy and paste from other code. "_" as a prefix separator is just a convention; it would be best not to assume that a separator is used at all. 70% is a totally arbitrary threshold; "most prevalent" or "predominant" would have worked too.
And yes, this is Python 2.7, and the space inside the quotes is just a visual artefact.

Solution

If you know the desired threshold frequency for the common prefix:

#!/usr/bin/env python
from collections import Counter
from itertools import izip_longest

strings = ['foo_a','foo_b','foo_c','fnord']
threshold = .7 * len(strings)
prefix = []
for chars in izip_longest(*strings, fillvalue=''):
    # most frequent character in this column and how often it occurs
    char, count = Counter(chars).most_common(1)[0]
    if count < threshold:
        break
    prefix.append(char)
print(''.join(prefix))
# -> foo_

Or you could collect all common prefixes and their frequencies and decide later:

#!/usr/bin/env python
from collections import Counter
from itertools import izip_longest

strings = ['foo_a', 'foo_b','foo_c','fnord']
assert len(strings) > 1
threshold = len(strings)  # start with the strictest threshold: all strings
prefix = []
prefixes = []
for chars in izip_longest(*strings, fillvalue=''):
    char, count = Counter(chars).most_common(1)[0]
    if count == 1:
        break  # no character is shared by more than one string any more
    elif count < threshold:
        # the frequency dropped: remember the prefix collected so far
        if prefix:
            prefixes.append((''.join(prefix), threshold))
        threshold = count
    prefix.append(char)
if prefix:
    prefixes.append((''.join(prefix), threshold))
print(prefixes)
# -> [('f', 4), ('foo_', 3)]
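
To "decide later" you could, for example, pick the longest collected prefix that clears whatever cutoff you settle on. A small follow-up sketch, reusing strings and prefixes from the snippet above:

cutoff = .7 * len(strings)  # e.g. settle on the 70% threshold after inspecting the list
print(max((p for p, n in prefixes if n >= cutoff), key=len))
# -> foo_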

Both code examples assume that a predominant prefix actually exists, i.e., that the most common character at each position belongs to the most common prefix.
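
To see why that matters, here is a hypothetical counterexample (made-up strings, run through the same per-position logic and 70% cutoff as the first snippet): the column-by-column majority yields 'ax', yet only 4 of the 10 strings actually start with 'ax'.

#!/usr/bin/env python
from collections import Counter
from itertools import izip_longest

# made-up input: the per-position majorities do not form a prefix most strings share
strings = ['ax'] * 4 + ['ay'] * 3 + ['bx'] * 3
threshold = .7 * len(strings)
prefix = []
for chars in izip_longest(*strings, fillvalue=''):
    char, count = Counter(chars).most_common(1)[0]
    if count < threshold:
        break
    prefix.append(char)
print(''.join(prefix))
# -> ax
print(sum(1 for s in strings if s.startswith('ax')))
# -> 4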

OTHER TIPS

A nice way to find which things have a particular prefix is a trie. I used an implementation called pytrie, but they all work in much the same way. The only fun bit is that you still need to generate all the candidate prefixes another way: asking the trie for "all the prefixes of foo_a" only gives you "foo_a" and those of its prefixes that are themselves keys in your data, but you seem to care about "foo_" even though it isn't a key of its own. However, the trie can do it the other way around, telling you all the keys that start with a given prefix even if that prefix isn't explicitly stored.

Other than that, it's all fairly straightforward. Including the import, it comes in at five lines:

from pytrie import StringTrie as trie

data = trie.fromkeys(['foo_a','foo_b','foo_c','fnord'])
PCT = 0.70 
prefixes = (k[:i] for k in data for i,_ in enumerate(k, start=1))
print(max(filter(lambda x: len(data.keys(x)) >= PCT * len(data), prefixes), key=len))

This prints foo_.

Another way is to count every prefix of every word in one pass and then pick the longest one that clears the threshold (note the float() — on Python 2, plain integer division would silently round every fraction down to 0):

import itertools

def det_pref(words):
    cnt = {'': len(words)}
    # count every prefix of every word
    for w_pfx in itertools.chain.from_iterable([w[:i] for i in range(1, len(w) + 1)] for w in words):
        cnt[w_pfx] = 1 + cnt.get(w_pfx, 0)
    # return the longest prefix that covers more than 70% of the words
    return max([w_pfx for (w_pfx, n) in cnt.items() if n / float(len(words)) > 0.7], key=len)

Caveat: since this solution does no early-out or input reduction during the loop, it is less efficient than the original code.

Here is a more efficient approach, which is IMHO still pythonic, but harder to understand and longer than my first one:

def det_pref(words):
    cnt = dict()
    # remaining prefix length still to try for each word, longest first
    pfx_len = [len(w) for w in words]
    while any(pfx_len):
        for i, w_pfx in [(i, words[i][:l]) for i, l in enumerate(pfx_len)]:
            pfx_len[i] -= pfx_len[i] and 1 or 0  # shrink this word's prefix by one
            n = 1 + cnt.get(w_pfx, 0)
            if n / float(len(words)) > 0.7:
                return w_pfx  # prefix shared by more than 70% of the words
            cnt[w_pfx] = n
    return ''
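
For the question's sample keys, a quick sanity check (assuming the float-division fix shown above):

print(det_pref(['foo_a', 'foo_b', 'foo_c', 'fnord']))
# -> foo_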