Question

I have a list of keys: ['foo_a', 'foo_b', 'foo_c', 'fnord'], and I want to detect the prefix that most of them share ('foo_' in this case).

All of the similar solutions here assume that there are no outliers like 'fnord' in the data.

I have this code that does the job:

def detect_prefix(keys):
    PCT = 0.70  # cutoff: the prefix must cover at least this share of the keys
    pre = ''
    l = len(keys)
    for i in range(0, len(max(keys, key=len))):
        # keep only the keys that still match the prefix found so far
        keys = filter(lambda k: k.startswith(pre) and len(k) > i, keys)
        cnt = dict()
        for k in map(lambda k: k[i], keys):
            cnt.setdefault(k, 0)
            cnt[k] += 1
        if not cnt:
            break
        best = max(cnt, key=cnt.get)  # most frequent character at position i
        if cnt[best] / float(l) >= PCT:
            pre += best
        else:
            break
    return pre
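
For the list above it returns the predominant prefix:

print(detect_prefix(['foo_a', 'foo_b', 'foo_c', 'fnord']))
# -> foo_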

I have a strong suspicion that this could be done much more elegantly, but my python-fu is not strong enough right now.

I would love to hear some suggestions.

Edit: Additional background and clarification.
These are keys that other developers have put in an application, to be used for translation. They should have a common prefix, but people forget, and they copy and paste from other code. "_" as a prefix separator is just a convention; it would be best not to assume that a separator is used at all. 70% is a totally arbitrary threshold; "most prevalent" or "predominant" would have worked too.
And yes, this is Python 2.7, and the space inside the quotes is just a visual artefact.

Solution

If you know the desired threshold frequency for the common prefix:

#!/usr/bin/env python
from collections import Counter
from itertools import izip_longest

strings = ['foo_a','foo_b','foo_c','fnord']
threshold = .7 * len(strings)
prefix = []
for chars in izip_longest(*strings, fillvalue=''):
    # most frequent character in this column and how often it occurs
    char, count = Counter(chars).most_common(1)[0]
    if count < threshold:
        break
    prefix.append(char)
print(''.join(prefix))
# -> foo_

Or you could collect all common prefixes and their frequencies and decide later:

#!/usr/bin/env python
from collections import Counter
from itertools import izip_longest

strings = ['foo_a', 'foo_b','foo_c','fnord']
assert len(strings) > 1
threshold = len(strings)  # start with the strictest threshold: all strings
prefix = []
prefixes = []
for chars in izip_longest(*strings, fillvalue=''):
    char, count = Counter(chars).most_common(1)[0]
    if count == 1:
        break  # no character is shared by more than one string any more
    elif count < threshold:
        # the frequency dropped: remember the prefix collected so far
        if prefix:
            prefixes.append((''.join(prefix), threshold))
        threshold = count
    prefix.append(char)
if prefix:
    prefixes.append((''.join(prefix), threshold))
print(prefixes)
# -> [('f', 4), ('foo_', 3)]
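
To "decide later" you could, for example, pick the longest collected prefix that clears whatever cutoff you settle on. A small follow-up sketch, reusing strings and prefixes from the snippet above:

cutoff = .7 * len(strings)  # e.g. settle on the 70% threshold after inspecting the list
print(max((p for p, n in prefixes if n >= cutoff), key=len))
# -> foo_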

Both code examples assume that a predominant prefix actually exists, i.e., that the most common character at each position belongs to the most common prefix.
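
To see why that matters, here is a hypothetical counterexample (made-up strings, run through the same per-position logic and 70% cutoff as the first snippet): the column-by-column majority yields 'ax', yet only 4 of the 10 strings actually start with 'ax'.

#!/usr/bin/env python
from collections import Counter
from itertools import izip_longest

# made-up input: the per-position majorities do not form a prefix most strings share
strings = ['ax'] * 4 + ['ay'] * 3 + ['bx'] * 3
threshold = .7 * len(strings)
prefix = []
for chars in izip_longest(*strings, fillvalue=''):
    char, count = Counter(chars).most_common(1)[0]
    if count < threshold:
        break
    prefix.append(char)
print(''.join(prefix))
# -> ax
print(sum(1 for s in strings if s.startswith('ax')))
# -> 4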

OTHER TIPS

A nice way to find which things have a particular prefix is a trie. I used an implementation called pytrie, but they all work in much the same way. The only fun bit is that you still need to generate all the candidate prefixes another way: asking the trie for "all the prefixes of foo_a" only gives you "foo_a" and those of its prefixes that are themselves keys in your data, but you seem to care about "foo_" even though it isn't a key of its own. However, the trie can do it the other way around, telling you all the keys that start with a given prefix even if that prefix isn't explicitly stored.

Other than that, it's all fairly straightforward. Including the import, it comes in at five lines:

from pytrie import StringTrie as trie

data = trie.fromkeys(['foo_a','foo_b','foo_c','fnord'])
PCT = 0.70 
prefixes = (k[:i] for k in data for i,_ in enumerate(k, start=1))
print(max(filter(lambda x: len(data.keys(x)) >= PCT * len(data), prefixes), key=len))

This prints foo_.

Another way is to count every prefix of every word in one pass and then pick the longest one that clears the threshold (note the float() — on Python 2, plain integer division would silently round every fraction down to 0):

import itertools

def det_pref(words):
    cnt = {'': len(words)}
    # count every prefix of every word
    for w_pfx in itertools.chain.from_iterable([w[:i] for i in range(1, len(w) + 1)] for w in words):
        cnt[w_pfx] = 1 + cnt.get(w_pfx, 0)
    # return the longest prefix that covers more than 70% of the words
    return max([w_pfx for (w_pfx, n) in cnt.items() if n / float(len(words)) > 0.7], key=len)

Caveat: since this solution does no early-out or input reduction during the loop, it is less efficient than the original code.

Here is a more efficient approach, which is IMHO still pythonic, but harder to understand and longer than my first one:

def det_pref(words):
    cnt = dict()
    # remaining prefix length still to try for each word, longest first
    pfx_len = [len(w) for w in words]
    while any(pfx_len):
        for i, w_pfx in [(i, words[i][:l]) for i, l in enumerate(pfx_len)]:
            pfx_len[i] -= pfx_len[i] and 1 or 0  # shrink this word's prefix by one
            n = 1 + cnt.get(w_pfx, 0)
            if n / float(len(words)) > 0.7:
                return w_pfx  # prefix shared by more than 70% of the words
            cnt[w_pfx] = n
    return ''
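
For the question's sample keys, a quick sanity check (assuming the float-division fix shown above):

print(det_pref(['foo_a', 'foo_b', 'foo_c', 'fnord']))
# -> foo_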