Algorithm for largest word formed from perodic table elements

Question 1

How to express a given word as a chemical compound? Here's a dynamic programming solution. The important line is "Let progress[i] be the solution to the subproblem for word[:i]". The rest follows.

elements = "H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar K Ca Sc Ti V Cr Mn Fe Co Ni Cu Zn Ga Ge As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I Xe Cs Ba La Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu Hf Ta W Re Os Ir Pt Au Hg Tl Pb Bi Po At Rn Fr Ra Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No Lr Rf Db Sg Bh Hs Mt Ds Rg Cn Uut Fl Uup Lv Uus Uuo".split()

def decompose(word):
    """Express given word as chemical compound. If there are multiple solutions, return one of minimal weight."""
    progress = [False for x in range(len(word)+1)] # solution for word[:i]
    progress[0] = []

    for i in range(1, len(word)+1):
        possibles = list()
        for j in range(max(i-3,0), i):
            if progress[j] == False:
                continue
            alchemical = word[j:i].title()
            if alchemical in elements:
                possibles.append(progress[j] + [alchemical])

        if possibles:
            # choose minimal solution
            progress[i] = min(possibles, key=len)

    if progress[-1] == False:
        return False
    return "".join(progress[-1])

assert decompose("sine") == "SiNe" # rather than S-I-Ne
assert decompose("bismuth") == "BiSmUTh"
assert decompose("jam") == False

https://gist.github.com/hickford/750135db633832913703

Ran this on a whole dictionary, the longest compounds were:

nonrepresentationalism NoNRePReSeNTaTiONaLiSm
thermophosphorescence ThErMoPHOsPHoReSCeNCe
bronchoesophagoscopy BrONCHoEsOPHAgOsCoPY
hyperphosphorescence HYPErPHOsPHoReSCeNCe
hypsibrachycephalism HYPSiBrAcHYCePHAlISm

Question 2

I think a better way is to check every word in a dictionary and see if it can be made from the names of the elements. To check every permutation of the elements would be harder to program and would be less efficient.

While I agree it is easier to produce the combinations, there are just way too many of them, and as you said tend to infinity if you don't give a limit. The production of the word with the symbols would be slightly more difficult and finicky, but I don't think it would be too difficult.

E.g. when you get a word you could search through the elements looking for elements that could make up your word, then using that set of elements try and fill in the letters from start to finish. Obviously this gets harder with elements that aren't 2 letters and words that are odd in length.

You can actually use a sort of graph. Say you have 'silicon'. You can start with the letter 'S' or 'SI' which are both in the table. From there choose 'SI' because it is closer to your solution. If 'SI' does not lead to your solution you will have to come back and see if 'S' would work.

So it works as a kind of depth first search.

Question 3

Generating all words and checking if they exist is impractical. There are 118^L words of length L, a too fast growing function. 1,643,032 three symbols words !

The other way round, as suggested by Makoto, is much more efficient. Try to reconstruct every word from a dictionary. There are on the order of 250,000 English words.

Checking a word is straightforward: see if any element symbol matches the start of the word, and continue with the remaining characters.

You must try all possible matches as a match could hide another. (I mean word ABC could be matched by symbol A and then be stuck because BC is not available, but it could be that AB and C are.) So the matching function will be recursive. I don't expect an exponential explosion in this matching process.

For maximum efficiency, you will store the symbols in a trie structure http://en.wikipedia.org/wiki/Trie.

Final hint: as you are just asked to find the longest match, try the words by decreasing lengths and stop at the first match.

Here is a simple solution in Python, without any optimization. Matching is done right to left, to allow printing the sequence of symbols in post-order:

Words= ['K', 'NaH', 'NaNaNaNa', 'HaKuNa']
Symbols= ['Na','K','H']

def Match(Word):
    # Match the end of the word with every symbol
    for S in Symbols:
        # In case of a partial match, recurse on the prefix
        if S == Word or (S == Word[-len(S):] and Match(Word[:-len(S)])):
            print S,
            return True

    # No match, report failure
    return False

for W in Words:
    if Match(W):
        print


>>> 
K
Na H
Na Na Na Na

Question 4

Generate all possible words - naive approach.

Based on Generate all permutations of all lengths:

import itertools..
symbols = ['Na','K','H']
for i in range(len(symbols)):
 for word in itertools.permutations(symbols,i+1):
  print( ''.join(word) )

You can generate every possible combination, and check against a dictionary if its an acutal word. But it is inefficient and only for if you allow no repetitions of symbols.

Check if word can be built from symbols - still not perfect.

If you allow repetitions you need to check a word against the list of symbols. I propose the following:

import itertools..
words = ['K', 'NaH', 'NaNaNaNa', 'HaKuNa']
symbols = ['Na','K','H']
for i in range(len(symbols)):
 for word in itertools.permutations(symbols,i+1):
  print( ''.join(word) )


def can_be_built(word):
  pos = 0
  ret = True
  while(pos < len(word)):
   #following loop checks if word starting form `pos`
   #can be prefixed by a symbol
   symbol_match = False
   for symbol in symbols:
    if word[pos:pos+len(symbol)] == symbol:
      symbol_match = True
      break
   if symbol_match == False:
     print('Word `{}` has no match for symbol from: {}'.format(word, word[pos:]))
     ret = False
     break
   #if it does move forward
   pos += len(symbol)

  return ret

for word in words:
   print("{} can be built? {}".format(word, can_be_built(word)))

It iteratively check if a word prefix is a symbol, then moves forward until the end of word is reached.

Output:

K can be built? True
NaH can be built? True
NaNaNaNa can be built? True
Word `HaKuNa` has no match for symbol from: aKuNa
HaKuNa can be built? False

It still is not perfect.

Correct approach

As Makoto says, the prefix-check should return a list of every possible match. The algorithm should make a queue out of those matches, and check all possible paths. It would be kind of like building a graph of prefixes that match the word. And if one builds whole word you are home.

I think it is still fairly easy to correct my 2nd example, but I am out of time for the actual code. I think it's a good place for start.

Question 5

[EDIT: This post is basically just a full explanation of a DP algorithm mentioned by Niklas B. in comments on other posts.]

Testing a particular word in linear time

In a word that can be formed from chemical element names, at least one of the following must be true:

The last letter is a single-letter chemical element name AND the prefix consisting of all letters except for the last is itself a word that can be formed from chemical element names.
The last two letters are a two-letter chemical element name AND the prefix consisting of all letters except for the last two is itself a word that can be formed from chemical element names.

This suggests a nice DP algorithm, where for a given word w of length k we compute, for all 1 <= i <= k, the largest number of letters (or chemical symbols; the question is ambiguous regarding exactly what we're trying to maximise!) that the length-i prefix of w can be built from. Let's call this f(i), and say that f(i) = -1 if the length-i prefix of w cannot be built from chemical element names at all.

Let X = 1 for maximising element names, or 2 for maximising letter counts.

The recursion for the DP can be written as:

one(i) = if (w[i] is a 1-letter element name) { f(i-1) + 1 } else { -1 }
two(i) = if ("w[i-1]w[i]" is a 2-letter element name) { f(i-2) + X } else { -1 }
f(i) = -1 if i < 0
f(0) = 0
f(i) = max(one(i), two(i)) if i > 0

Then f(k) will be what we want to know: the maximum number of (letters/elements) that w can be formed from, or -1 if this is not possible.

The max() means that if only one of the ways of handling the end of the prefix works, that way will be chosen (because the other way will have a score of -1, which will always compare less), and if neither way works, we will correctly get f(i) = -1.

Assuming constant-time testing of whether single characters or character pairs are valid chemical element names (using e.g. array or hashtable lookups), we can compute whether a given word can be represented as a sequence of chemical element names in time and space proportional to its length.

Considering all words

If we are maximising the number of letters, then it probably makes the most sense to process the dictionary in decreasing length order, since the first time we encounter a word for which f(k) is not -1, we know it must be the longest and we can stop.

This can be adapted for maximising the element name count. Although we can't stop immediately, because it might be that there is a shorter word further on that nevertheless can be formed from more element names (specifically, by using more single-character ones), we still get some useful information whenever we find a new word with a greater element count than the previously best: we can update a cutoff threshold with this element count, and stop as soon as we see a word that has fewer letters than this threshold, since a length-m word can never be formed from more than m element names.

There is different approach, which might turn out to be faster: by first sorting the dictionary in alphabetical order, we can avoid recomputing f(i) on the prefix that is shared with the previous word. E.g. if carton immediately follows cart in the dictionary, then we only need to compute f(i) on the on part.

Question 6

Just want to write down my thoughts, and verified with others who professional with Algorithm, because I didn't see such question in other place.

BTW, I use Java.And here is my pseudocode:

1, Use the element symbol of Periodic Table to form a Array of String: String[] PeriTable.

2, Based on the backtracking thoughts, build a helper function called Backtracking(String result, String[] PeriTable, String[] used);

3, We iterate each element in PeriTable, and check the String it formed.

 Backtracking void(String result, String[] PeriTable, String[] used){
   for(int i=0;i<PeriTable.length;i++){
     If(!used.contain(PeriTable[i])){
      newResult=result+PeriTable[i];
     If(isEnglishWord(newResult)) {
         used.add(PeriTable[i]);
         maxlength=Math.max(maxlength,newResult.length());
         Backtracking(newResult, PeriTable, used);
         used.remove(used.length()-1);
      }
    }
    else continue;
  }
 }

However,I don't know how to build isEnglishWord(); function.

Question 7

Sorry, I am very confused by these answers. Surely the obvious answer is to sort the dictionary in alphabetical order by using nested bucket sorts up to some depth, say 8 letters, this should give you order 1 retrieval of words starting with a given sequence of letters up to 8.

Then go through the period tables doing DFS like it is a game tree. At each step you add an element and then check down the list to see if there are any words starting with that combination.

This will be relatively memory intensive, as you probably need at least 6 layers of buckets (alphabetise by the first six letters), but there are typically only 1,000,000 or so words. Since virtually every branch in the game tree will terminate after four letters, youre DFS will search the whole game tree quite fast. If you can form every word in the dictionary, it still must be the case that there are at most the same number of branches as there are words, and in practice you will only be able to get a fraction of the words int he dictionary, so this approach must be efficient.