Question

I'm trying to use pyparsing to parse a chemical formula that may be nested and may have non-integer stoichiometries. What I want is a list of each element present in the formula and its corresponding total stoichiometry.

I have used the example on the pyparsing wiki as a start, and looked at fourFn.py for more ideas. I'm having trouble understanding how to use all the features in the package.

I came up with the following grammar:

from pyparsing import Word, Group, ZeroOrMore, Combine,\
     Optional, OneOrMore, ParseException, Literal, nums,\
     Suppress, Dict, Forward

caps = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
lowers = caps.lower()
digits = "0123456789"
integer = Word( digits )
parl = Literal("(").suppress()
parr = Literal(")").suppress()

element = Word( caps, lowers )
separator = Literal( "," ).setParseAction(lambda s,l,t: t[0].replace(',','.')) | Literal( "." )

nreal = (Combine( integer + Optional( separator +\
    Optional( integer ) ))\
    | Combine( separator + integer )).setParseAction( lambda s,l,t: [ float(t[0]) ] )

block = Forward()
groupElem = Group( element + Optional( nreal, default=1)) ^ \
     Group( parl + block + parr + Optional( nreal,default=1 ) )
block << groupElem + ZeroOrMore( groupElem )
formula = OneOrMore( block )

Non-nested formulas work as expected:

>>> formula.parseString('H2O')
([(['H', 2.0], {}), (['O', 1], {})], {})

Despite those empty {} fields (whose purpose I could not work out), I can extract the information I want.

But when I try something like:

>>> formula.parseString('C6H8(OH)4')
([(['C', 6.0], {}), (['H', 8.0], {}), ([(['O', 1], {}), (['H', 1], {}), 4.0], {})], {})

I can see the formula is correctly parsed, but I would like the outer '4' in (OH)4 to multiply the inner numbers, and I can't see how to do it.

How can one token change the value of another one?

Or how can I walk these results and make a function that, if a block has an outer number attached to it, I can compute the total number of each element inside the block?

Thanks in advance.

edit1: I believe I need something like: suppress the outer nreal on occurrences of "( block )nreal", and multiply all occurrences of nreal by the outer value...


Solution

Recursion is definitely needed to solve this. In pyparsing, you define a recursive grammar using the Forward class. See the annotations in this code sample:

from pyparsing import (Suppress, Word, nums, alphas, Regex, Forward, Group, 
                        Optional, OneOrMore, ParseResults)
from collections import defaultdict

"""
BNF for simple chemical formula (no nesting)

    integer :: '0'..'9'+
    element :: 'A'..'Z' 'a'..'z'*
    term :: element [integer]
    formula :: term+


BNF for nested chemical formula

    integer :: '0'..'9'+
    element :: 'A'..'Z' 'a'..'z'*
    term :: (element | '(' formula ')') [integer]
    formula :: term+

"""

LPAR,RPAR = map(Suppress,"()")
integer = Word(nums)

# add parse action to convert integers to ints, to support doing addition 
# and multiplication at parse time
integer.setParseAction(lambda t:int(t[0]))

element = Word(alphas.upper(), alphas.lower())
# or if you want to be more specific, use this Regex
# element = Regex(r"A[cglmrstu]|B[aehikr]?|C[adeflmorsu]?|D[bsy]|E[rsu]|F[emr]?|"
#                 "G[ade]|H[efgos]?|I[nr]?|Kr?|L[airu]|M[dgnot]|N[abdeiop]?|"
#                 "Os?|P[abdmortu]?|R[abefghnu]|S[bcegimnr]?|T[abcehilm]|"
#                 "Uu[bhopqst]|U|V|W|Xe|Yb?|Z[nr]")

# forward declare 'formula' so it can be used in definition of 'term'
formula = Forward()

term = Group((element | Group(LPAR + formula + RPAR)("subgroup")) + 
                Optional(integer, default=1)("mult"))

# define contents of a formula as one or more terms
formula << OneOrMore(term)


# add parse actions for parse-time processing

# parse action to multiply out subgroups
def multiplyContents(tokens):
    t = tokens[0]
    # if these tokens contain a subgroup, then use multiplier to
    # extend counts of all elements in the subgroup
    if t.subgroup:
        mult = t.mult
        for term in t.subgroup:
            term[1] *= mult
        return t.subgroup
term.setParseAction(multiplyContents)

# add parse action to sum up multiple references to the same element
def sumByElement(tokens):
    elementsList = [t[0] for t in tokens]

    # construct set to see if there are duplicates
    duplicates = len(elementsList) > len(set(elementsList))

    # if there are duplicate element names, sum up by element and
    # return a new nested ParseResults
    if duplicates:
        ctr = defaultdict(int)
        for t in tokens:
            ctr[t[0]] += t[1]
        return ParseResults([ParseResults([k,v]) for k,v in ctr.items()])
formula.setParseAction(sumByElement)


# run some tests
tests = """\
    H
    NaCl
    HO
    H2O
    HOH
    (H2O)2
    (H2O)2OH
    ((H2O)2OH)12
    C6H5OH
    """.splitlines()
for t in tests:
    if t.strip():
        results = formula.parseString(t)
        print(t.strip(), '->', dict(results.asList()))

Prints out:

H -> {'H': 1}
NaCl -> {'Na': 1, 'Cl': 1}
HO -> {'H': 1, 'O': 1}
H2O -> {'H': 2, 'O': 1}
HOH -> {'H': 2, 'O': 1}
(H2O)2 -> {'H': 4, 'O': 2}
(H2O)2OH -> {'H': 5, 'O': 3}
((H2O)2OH)12 -> {'H': 60, 'O': 36}
C6H5OH -> {'H': 6, 'C': 6, 'O': 1}
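
The grammar above accepts only integer multipliers, while the question also asks for non-integer stoichiometries such as 0.5 or 1,5. Here is a minimal sketch of one way to handle that: match the number with a regular expression and normalise a decimal comma before converting. The names NUMBER_PATTERN and to_number are hypothetical and not part of the original answer. The standard re module is used so the snippet runs on its own; in the grammar, the same pattern could presumably be passed to pyparsing's Regex in place of Word(nums), with to_number applied in the parse action.

```python
import re

# Hypothetical pattern: digits with an optional fractional part, or a bare
# fractional part; both "." and "," are accepted as the decimal separator.
NUMBER_PATTERN = r"\d+(?:[.,]\d*)?|[.,]\d+"

def to_number(text):
    """Convert a matched multiplier such as '2', '0.5' or '1,5' to a number."""
    if not re.fullmatch(NUMBER_PATTERN, text):
        raise ValueError("not a stoichiometric number: %r" % text)
    value = float(text.replace(",", "."))
    # Keep whole numbers as ints so outputs such as {'H': 2} stay unchanged.
    return int(value) if value.is_integer() else value

for sample in ["2", "0.5", "1,5", ".25", "12."]:
    print(sample, "->", to_number(sample))
```

Since sumByElement accumulates into a defaultdict(int), adding float counts to it should work without further changes.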

OTHER TIPS

I think I have found a solution myself. I had to write a recursive function that walks the results and outputs the list the way I wanted it, with each element and its stoichiometry and without nesting. I also had to slightly modify my starting code to use named results:

from pyparsing import Word, Group, ZeroOrMore, Combine,\
     Optional, OneOrMore, ParseException, Literal, nums,\
     Suppress, Dict, Forward

caps = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
lowers = caps.lower()
digits = "0123456789"
integer = Word( digits )
parl = Literal("(").suppress()
parr = Literal(")").suppress()

element = Word( caps, lowers )
separator = Literal( "," ).setParseAction(lambda s,l,t: t[0].replace(',','.')) | Literal( "." )

nreal = (Combine( integer + Optional( separator +\
    Optional( integer ) ))\
    | Combine( separator + integer )).setParseAction( lambda s,l,t: [ float(t[0]) ] )



block = Forward()
groupElem = (Group( element('elem') + Optional( nreal, default=1)('esteq') ))('dupla') | \
     Group( parl + block + parr + Optional( nreal,default=1 )('modi'))
block << groupElem + ZeroOrMore( groupElem )
formula = OneOrMore( block )

Here is my function. I hope it helps someone with a similar problem. I think this solution is very ugly... If anyone has a better, more elegant solution, I'm all ears!

def solu(formula):
    final = []

    def diver(entr,mult=1):
        resul = list()
        # If modi is not empty, it is an enclosed group
        # and we must multiply everything inside by modi
        if entr.modi != '':
            for y in entr:
                try:
                    resul.append(diver(y,entr.modi))
                except AttributeError:
                    pass
        # Else, it is just an atom, and we return it
        else:
            resul.append(entr.elem)
            resul.append(entr.esteq*mult)
        return resul

    def doubles(entr):
        resul = []
        # If entr does not contain lists
        # It is an atom
        if sum([1 for y in entr if isinstance(y,list)]) == 0:
            final.append(entr)
            return entr
        else:
            # And if it isn't an atom? We dive further
            # and call doubles until it is an atom
            for y in entr:
                doubles(y)


    for member in formula:
        # If member is already an atom, add it directly to final
        if sum([1 for x in diver(member) if isinstance(x,list)]) == 0:
            final.append(diver(member))
        else:
            # If not, call doubles on the clean member (without modi)
            # and it takes care of adding atoms to final
            doubles(diver(member))


    return final

Finally, solu does the trick:

>>> solu(formula.parseString('C6H8(OH)4'))
[['C', 6.0], ['H', 8.0], ['O', 4.0], ['H', 4.0]]
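
Note that this result still lists H twice, whereas the question asks for one total per element. A more compact alternative is sketched below, assuming the parse results have first been converted to plain nested lists (for example with asList()) in the shape printed in the question. A single recursive function carries the accumulated multiplier down through the blocks and sums the counts per element. The name total_counts is hypothetical.

```python
from collections import defaultdict

def total_counts(groups, mult=1, totals=None):
    """Sum element counts in nested [symbol, n] / [group, ..., n] lists."""
    if totals is None:
        totals = defaultdict(float)
    for group in groups:
        if isinstance(group[0], str):
            # Plain atom, e.g. ['H', 8.0]
            symbol, count = group
            totals[symbol] += count * mult
        else:
            # Parenthesised block, e.g. [['O', 1], ['H', 1], 4.0]:
            # the last item is the outer multiplier for everything inside.
            *inner, outer = group
            total_counts(inner, mult * outer, totals)
    return dict(totals)

# Nested list in the shape shown in the question for 'C6H8(OH)4'
parsed = [['C', 6.0], ['H', 8.0], [['O', 1], ['H', 1], 4.0]]
print(total_counts(parsed))  # {'C': 6.0, 'H': 12.0, 'O': 4.0}
```

Because the multiplier is passed down on each recursive call, blocks nested more than one level deep are handled by the same code path.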
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow