Question

I'm trying to resolve ambiguity in a lowercase chemical formula. Since some element names are substrings of other element names, and they're all run together, there can be multiple global matches for the same pattern.

Considering the regex /^((h)|(s)|(hg)|(ga)|(as))+$/ against the string hgas. There are two possible matches. hg, as and h, s, ga (out of order compared to input, but not an issue). Obviously a regex for all possible symbols would be longer, but this example was done for simplicity.

Regex's powerful lookahead and lookbehind allow it to conclusively determine whether even a very long string does match this pattern or there are no possible permutations of letters. It will diligently try all possible permutations of matches, and, for example, if it hits the end of the string with a leftover g, go back through and retry a different combination.

I'm looking for a regular expression, or a language with some kind of extension, that adds on the ability to keep looking for matches after one is found, in this case, finding h, s, ga as well as hg, as.

Rebuilding the complex lookahead and lookbehind functionality of regex for this problem does not seem like a reasonable solution, especially considering the final regex also includes a \d* after each symbol.

I thought about reversing the order of the regexp, /^((as)|(ga)|(hg)|(s)|(h))+$/, to find additional mappings, but at most this will only find one additional match, and I don't have the theoretical background in regex to know if it's even reasonable to try.

I've created a sample page using my existing regex which finds 1 or 0 matches for a given lowercase string and returns it properly capitalized (and out of order). It uses the first 100 chemical symbols in its matching.

http://www.ptable.com/Script/lowercase_formula.php?formula=hgas

tl;dr: I have a regex to match 0 or 1 possible chemical formula permutations in a string. How do I find more than 1 match?

Was it helpful?

Solution

I'm well-aware this answer might be off-topic (as in the approach), but I think it is quite interesting, and it solves the OP's problem.

If you don't mind learning a new language (Prolog), then it might help you generate all possible combinations:

name(X) :- member(X, ['h', 's', 'hg', 'ga', 'as']).

parse_([], []).
parse_(InList, [HeadAtom | OutTail]) :-
    atom_chars(InAtom, InList),
    name(HeadAtom),
    atom_concat(HeadAtom, TailAtom, InAtom),
    atom_chars(TailAtom, TailList),
    parse_(TailList, OutTail).

parse(In, Out) :- atom_chars(In, List), parse_(List, Out).

Sample run:

?- parse('hgas', Out).
Out = [h, ga, s] ;
Out = [hg, as] ;
false.

The improved version, which includes processing for number is a tad bit longer:

isName(X) :- member(X, ['h', 's', 'hg', 'ga', 'as', 'o', 'c']).

% Collect all numbers, since it will not be part of element name.
collect([],[],[]).
collect([H|T], [], [H|T]) :-
    \+ char_type(H, digit), !.
collect([H|T], [H|OT], L) :-
    char_type(H, digit), !, collect(T, OT, L).

parse_([], []).
parse_(InputChars, [Token | RestTokens]) :-
    atom_chars(InputAtom, InputChars),
    isName(Token),
    atom_concat(Token, TailAtom, InputAtom),
    atom_chars(TailAtom, TailChars),
    parse_(TailChars, RestTokens).

parse_(InputChars, [Token | RestTokens]) :-
    InputChars = [H|_], char_type(H, digit),
    collect(InputChars, NumberChars, TailChars),
    atom_chars(Token, NumberChars),
    parse_(TailChars, RestTokens).

parse(In, Out) :- atom_chars(In, List), parse_(List, Out).

Sample run:

?- parse('hgassc20h245o', X).
X = [h, ga, s, s, c, '20', h, '245', o] ;
X = [hg, as, s, c, '20', h, '245', o] ;
false.

?- parse('h2so4', X).
X = [h, '2', s, o, '4'] ;
false.

?- parse('hgas', X).
X = [h, ga, s] ;
X = [hg, as] ;
false.

OTHER TIPS

The reason you haven't found a generalized regex library that does this is because it's not possible with all regular expressions to do this. There are regular expressions that will not terminate.

Imagine with your example that you just added empty string to the list of terms, then

'hgas' could be:

['hg', 'as']
['hg', '', 'as']
['hg', '', '', 'as']

You'll probably just have to roll your own.

In psuedo code:

def findall(term, possible):
    out = []

    # for all the terms
    for pos in possible:

        # if it is a candidate
        if term.startswith(pos):

            # combine it with all possible endings
            for combo in findall(term.removefrombegining(pos), possible):
                newCombo = combo.prepend(out)
                out.append(newCombo)
    return out


findall('hgas', ['h', 'as', ...])

This above will run in exponential time so dynamic programming will be the way this isn't an exponentially large problem. Memoization for the win.

The last thing worth noting is the above code doesn't check that it fully matches.

ie. 'hga' might return ['hg'] as a possibility.

I'll leave the actual coding, memrization, and this last hiccup to, as my profs lovingly say, 'an exercise to the reader'

This is not a job for regexp, you need something more like a state machine.

You'd need to parse the string popping out all known symbols, stopping if there is none, and continuing. If the whole string gets consumed on one branch, you have found a possibility.

In PHP, something like:

<?php
    $Elements = array('h','he','li','s','ga','hg','as',
         // "...and there may be many others, but they haven't been discovered".
    );

    function check($string, $found = array(), $path = '')
    {
        GLOBAL $Elements, $calls;
        $calls++;
        if ('' == $string)
        {
            if (!empty($path))
                $found[] = $path;
            return $found;
        }
        $es = array_unique(array(
                  substr($string, 0, 1), // Element of 1 letter ("H")
                  substr($string, 0, 2), // Element of 2 letter ("He")
        ));
        foreach($es as $e)
            if (in_array($e, $Elements))
                    $found  = check(substr($string, strlen($e)), $found, $path.$e.',');
        return $found;
    }

    print_r(check('hgas'));
    print "in $calls calls.\n";
?>

Don't use regex. A regex matches only 1 element as you say, instead you need to find all the possible "meanings" of your string. Given the fact that each element's lenght is 1-2 chars, I'd go with this algorythm (forgive the pseudocode):

string[][] results;



void formulas(string[] elements, string formula){
    string[] elements2=elements;


    if(checkSymbol(formula.substring(0,1))){
        elements.append(formula.substring(0,1));
        if(formula.length-1 ==0){
            results.append(elements);
        } else {
            formulas(elements,formula.substring(1,formula.length);
        }
    }
    if(checkSymbol(formula.substring(0,2))){
        elements2.append(formula.substring(0,2));
        if(formula.length-2 ==0){
            results.append(elements2);
        } else {
            formulas(elements2,formula.substring(2,formula.length);
        }
    }


}

bool checkSymbol(string symbol){
    // verifies if symbol is the name of an element
}

input "hgas" (let's go depth first)

first step: checkSymbol(formula.substring(0,1)) is true for "h"

elements2 = [h]

recursive call, if(checkSymbol(formula.substring(0,1))) false

then it tests ga => true

elements2 = [h, ga]

third recursive call

test s, checksymbol returns true, elements is then [h, ga, s]. Length of the substring is 0 so it appends to results the first array: [h, ga, s]

-- let's go back to the second "branch" of first step

The test checkSymbol(formula.substring(0,2) finds that "hg" is an element as well

elements2 = [hg]

Then we call formulas([hg],"as")

The test for "a" fails (it is not an element) and the test for "as" works, the length is totally consumed, the result [hg,as] is appended to results[]

This algorythm should run in O(n^2) time in the worst case, n being the length of the string.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top