Question

If I have a set of similar words such as:

\bigoplus
\bigotimes
\bigskip
\bigsqcup
\biguplus
\bigvee
\bigwedge
...
\zebra
\zeta 

I would like to find the shortest sequence of letters that characterizes each word uniquely, i.e.:

\bigop:
     \bigoplus
\bigot:
     \bigotimes
\bigsk:
     \bigskip

EDIT: note that the unique identifier always starts from the beginning of the word. I am writing an app that gives snippet suggestions as the user types, so in general users will start typing from the start of the word.

and so on; the sequence only needs to be as long as is necessary to characterize a word uniquely. EDIT: but it needs to start from the beginning of the word; the characterization always begins there.

My thoughts: I was thinking of sorting the words and grouping them by first letter, then using a longest-common-subsequence algorithm to find the longest subsequence in common, taking its length and using length+1 characters for the unique substring. But I'm stuck: the algorithms I know for longest common subsequence only take two parameters at a time, and I may have more than two words in a group starting with a particular letter. Am I solving an already solved problem? Google was no help.


Solution

I'm assuming you want to find the prefixes that uniquely identify the strings, because if you could pick any subsequence, then for example om would be enough to identify \bigotimes in your example.

You can make use of the fact that for a given word, the word with the longest common prefix will be adjacent to it in lexicographical order. Since your dictionary seems to be sorted already, you can figure out the solution for every word by finding the longest prefix that disambiguates it from both its neighbors.

Example:

>>> lst = r"""
... \bigoplus
... \bigotimes
... \bigskip
... \bigsqcup
... \biguplus
... \bigvee
... \bigwedge
... """.split()
>>> lst.sort()      # necessary if lst is not already sorted
>>> lst = [""] + lst + [""]
>>> import os
>>> def cp(x): return len(os.path.commonprefix(x))
... 
>>> { lst[i]: 1 + max(cp(lst[i-1:i+1]), cp(lst[i:i+2])) for i in range(1,len(lst)-1) }
{'\\bigvee': 5, 
 '\\bigsqcup': 6, 
 '\\biguplus': 5, 
 '\\bigwedge': 5, 
 '\\bigotimes': 6, 
 '\\bigoplus': 6, 
 '\\bigskip': 6}

The numbers indicate how long the minimal uniquely identifying prefix of a word is.
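
If you want the prefixes themselves rather than just their lengths, one small extension works (a sketch; lengths is a hypothetical name for the dict computed above):

>>> lengths = { lst[i]: 1 + max(cp(lst[i-1:i+1]), cp(lst[i:i+2]))
...             for i in range(1, len(lst)-1) }
>>> { w[:n]: w for w, n in lengths.items() }
{'\\bigop': '\\bigoplus', 
 '\\bigot': '\\bigotimes', 
 '\\bigsk': '\\bigskip', 
 '\\bigsq': '\\bigsqcup', 
 '\\bigu': '\\biguplus', 
 '\\bigv': '\\bigvee', 
 '\\bigw': '\\bigwedge'}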

OTHER TIPS

Thought I'd dump this here since it was the most similar to a question I was about to ask:

I'm looking for a better solution (will report back when I find one) for iterating through a sequence of strings and mapping the shortest unique string to each.

For example, in a sequence of:

['blue', 'black', 'bold']
# 'blu' --> 'blue'
# 'bla' --> 'black'
# 'bo'  --> 'bold'

Looking to improve upon my first, feeble solution. Here's what I came up with:

# Note: Iterating through the keys in a dict, mapping the shortest
#       unique string to the original string.
shortest_unique_strings = {}
for k in mydict:
    for ix in range(len(k)):
        # 'k[:ix+1]' is the current candidate prefix; keep it once
        # the list-comp matches exactly one key.
        if len([key for key in mydict if key.startswith(k[:ix+1])]) == 1:
            shortest_unique_strings[k[:ix+1]] = k
            break
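
For example, with mydict = {'blue': None, 'black': None, 'bold': None} (dummy values; only the keys matter), the loop above yields:

print(shortest_unique_strings)
# {'blu': 'blue', 'bla': 'black', 'bo': 'bold'}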

Note: on improving efficiency: we should be able to remove keys/strings whose prefix has already been found, so that successive searches don't have to repeat work on those items.

Note: I specifically refrained from creating/using any functions outside of built-ins.
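
One possible improvement (a sketch of a prefix-counting approach, still using only built-ins): count how often each prefix occurs across all words, then take each word's shortest prefix that occurs exactly once. This avoids rescanning every key for every candidate prefix. Caveat: a word that is a strict prefix of another word gets no unique prefix under this scheme.

def shortest_unique_prefixes(words):
    # Count every prefix of every word.
    counts = {}
    for w in words:
        for ix in range(len(w)):
            p = w[:ix + 1]
            counts[p] = counts.get(p, 0) + 1
    # The shortest prefix with count 1 is unique to its word.
    result = {}
    for w in words:
        for ix in range(len(w)):
            p = w[:ix + 1]
            if counts[p] == 1:
                result[p] = w
                break
    return result

print(shortest_unique_prefixes(['blue', 'black', 'bold']))
# {'blu': 'blue', 'bla': 'black', 'bo': 'bold'}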

Licensed under: CC-BY-SA with attribution