Question

I'm trying to use pyparsing to parse a chemical formula that may be nested and may have non-integer stoichiometries. What I want is a list of each element present in the formula and its corresponding total stoichiometry.

I have used the example on the pyparsing wiki as a start, and looked at fourFn.py for more ideas. I'm having trouble understanding how to use all the features in the package.

I came up with the following grammar:

from pyparsing import Word, Group, ZeroOrMore, Combine,\
     Optional, OneOrMore, ParseException, Literal, nums,\
     Suppress, Dict, Forward

caps = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
lowers = caps.lower()
digits = "0123456789"
integer = Word( digits )
parl = Literal("(").suppress()
parr = Literal(")").suppress()

element = Word( caps, lowers )
separator = Literal( "," ).setParseAction(lambda s,l,t: t[0].replace(',','.')) | Literal( "." )

nreal = (Combine( integer + Optional( separator +\
    Optional( integer ) ))\
    | Combine( separator + integer )).setParseAction( lambda s,l,t: [ float(t[0]) ] )

block = Forward()
groupElem = Group( element + Optional( nreal, default=1)) ^ \
     Group( parl + block + parr + Optional( nreal,default=1 ) )
block << groupElem + ZeroOrMore( groupElem )
formula = OneOrMore( block )

Non-nested formulas work as expected:

>>> formula.parseString('H2O')
([(['H', 2.0], {}), (['O', 1], {})], {})

Despite those empty {} fields (whose purpose I could not figure out), I can extract the information I want.

But when I try something like:

>>> formula.parseString('C6H8(OH)4')
([(['C', 6.0], {}), (['H', 8.0], {}), ([(['O', 1], {}), (['H', 1], {}), 4.0], {})], {})

I can see the formula is correctly parsed, but I would like the outer '4' in (OH)4 to multiply the inner numbers, and I can't see how to do it.

How can one token change the value of another one?

Or how can I walk these results and make a function that, if a block has an outer number attached to it, I can compute the total number of each element inside the block?

Thanks in advance.

edit1: I believe I need something like: suppress the outer nreal on occurrences of "( block )nreal", and multiply all occurrences of nreal by the outer value...


Solution

Recursion is definitely needed to solve this. In pyparsing, you define a recursive grammar using the Forward class. See the annotations in this code sample:

from pyparsing import (Suppress, Word, nums, alphas, Regex, Forward, Group, 
                        Optional, OneOrMore, ParseResults)
from collections import defaultdict

"""
BNF for simple chemical formula (no nesting)

    integer :: '0'..'9'+
    element :: 'A'..'Z' 'a'..'z'*
    term :: element [integer]
    formula :: term+


BNF for nested chemical formula

    integer :: '0'..'9'+
    element :: 'A'..'Z' 'a'..'z'*
    term :: (element | '(' formula ')') [integer]
    formula :: term+

"""

LPAR,RPAR = map(Suppress,"()")
integer = Word(nums)

# add parse action to convert integers to ints, to support doing addition 
# and multiplication at parse time
integer.setParseAction(lambda t:int(t[0]))

element = Word(alphas.upper(), alphas.lower())
# or if you want to be more specific, use this Regex
# element = Regex(r"A[cglmrstu]|B[aehikr]?|C[adeflmorsu]?|D[bsy]|E[rsu]|F[emr]?|"
#                 "G[ade]|H[efgos]?|I[nr]?|Kr?|L[airu]|M[dgnot]|N[abdeiop]?|"
#                 "Os?|P[abdmortu]?|R[abefghnu]|S[bcegimnr]?|T[abcehilm]|"
#                 "Uu[bhopqst]|U|V|W|Xe|Yb?|Z[nr]")

# forward declare 'formula' so it can be used in definition of 'term'
formula = Forward()

term = Group((element | Group(LPAR + formula + RPAR)("subgroup")) + 
                Optional(integer, default=1)("mult"))

# define contents of a formula as one or more terms
formula << OneOrMore(term)


# add parse actions for parse-time processing

# parse action to multiply out subgroups
def multiplyContents(tokens):
    t = tokens[0]
    # if these tokens contain a subgroup, then use multiplier to
    # extend counts of all elements in the subgroup
    if t.subgroup:
        mult = t.mult
        for term in t.subgroup:
            term[1] *= mult
        return t.subgroup
term.setParseAction(multiplyContents)

# add parse action to sum up multiple references to the same element
def sumByElement(tokens):
    elementsList = [t[0] for t in tokens]

    # construct set to see if there are duplicates
    duplicates = len(elementsList) > len(set(elementsList))

    # if there are duplicate element names, sum up by element and
    # return a new nested ParseResults
    if duplicates:
        ctr = defaultdict(int)
        for t in tokens:
            ctr[t[0]] += t[1]
        return ParseResults([ParseResults([k,v]) for k,v in ctr.items()])
formula.setParseAction(sumByElement)


# run some tests
tests = """\
    H
    NaCl
    HO
    H2O
    HOH
    (H2O)2
    (H2O)2OH
    ((H2O)2OH)12
    C6H5OH
    """.splitlines()
for t in tests:
    if t.strip():
        results = formula.parseString(t)
        print(t.strip(), '->', dict(results.asList()))

Prints out:

H -> {'H': 1}
NaCl -> {'Na': 1, 'Cl': 1}
HO -> {'H': 1, 'O': 1}
H2O -> {'H': 2, 'O': 1}
HOH -> {'H': 2, 'O': 1}
(H2O)2 -> {'H': 4, 'O': 2}
(H2O)2OH -> {'H': 5, 'O': 3}
((H2O)2OH)12 -> {'H': 60, 'O': 36}
C6H5OH -> {'H': 6, 'C': 6, 'O': 1}
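
The grammar above accepts only integer multipliers, while the question also asks for non-integer stoichiometries written with either '.' or ',' as the decimal separator. Below is a minimal sketch of one way to recognise and convert such numbers, using only the standard library so it can be checked on its own. NREAL_PATTERN and to_float are hypothetical names, not part of pyparsing or of the answer above. The pattern could presumably be handed to pyparsing's Regex class in place of Word(nums), with to_float called from the parse action, but that wiring is an assumption and is only shown as a comment.

```python
import re

# Hypothetical helper names; not part of pyparsing or of the answer above.
# Matches "2", "2.5", "2,5", "2." and ".5" (either '.' or ',' as separator).
NREAL_PATTERN = r"\d+(?:[.,]\d*)?|[.,]\d+"


def to_float(text):
    """Convert a matched number to float, accepting ',' as decimal separator."""
    return float(text.replace(",", "."))


# Assumed drop-in for the grammar above (left as a comment, not exercised here):
#   integer = Regex(NREAL_PATTERN)
#   integer.setParseAction(lambda t: to_float(t[0]))

for sample in ["2", "2.5", "2,5", ".5"]:
    match = re.fullmatch(NREAL_PATTERN, sample)
    print(sample, "->", to_float(match.group(0)))
```

With float multipliers the summed totals become floats as well; defaultdict(int) still works for that, since its default of 0 adds cleanly to a float.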

Other answers

I think I have found a solution myself. I had to write a recursive function that walks the results and outputs the list the way I wanted it: each element with its stoichiometry, without nesting. I had to slightly modify my starting code and use named results:

from pyparsing import Word, Group, ZeroOrMore, Combine,\
     Optional, OneOrMore, ParseException, Literal, nums,\
     Suppress, Dict, Forward

caps = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
lowers = caps.lower()
digits = "0123456789"
integer = Word( digits )
parl = Literal("(").suppress()
parr = Literal(")").suppress()

element = Word( caps, lowers )
separator = Literal( "," ).setParseAction(lambda s,l,t: t[0].replace(',','.')) | Literal( "." )

nreal = (Combine( integer + Optional( separator +\
    Optional( integer ) ))\
    | Combine( separator + integer )).setParseAction( lambda s,l,t: [ float(t[0]) ] )



block = Forward()
groupElem = (Group( element('elem') + Optional( nreal, default=1)('esteq') ))('dupla') | \
     Group( parl + block + parr + Optional( nreal,default=1 )('modi'))
block << groupElem + ZeroOrMore( groupElem )
formula = OneOrMore( block )

Here is my function. I hope it helps someone with a similar problem. I think this solution is very ugly... If anyone has a better, more elegant solution, I'm all ears!

def solu(formula):
    final = []

    def diver(entr,mult=1):
        resul = list()
        # If modi is set, this is a parenthesized group
        # and we must multiply everything inside by modi
        if entr.modi != '':
            for y in entr:
                try:
                    # carry along the multiplier of any enclosing group
                    resul.append(diver(y,entr.modi*mult))
                except AttributeError:
                    pass
        # Else, it is just an atom, and we return it
        else:
            resul.append(entr.elem)
            resul.append(entr.esteq*mult)
        return resul

    def doubles(entr):
        resul = []
        # If entr does not contain lists
        # It is an atom
        if sum([1 for y in entr if isinstance(y,list)]) == 0:
            final.append(entr)
            return entr
        else:
            # And if it isn't an atom? We dive further
            # and call doubles until it is an atom
            for y in entr:
                doubles(y)


    for member in formula:
        # If member is already an atom, add it directly to final
        if sum([1 for x in diver(member) if isinstance(x,list)]) == 0:
            final.append(diver(member))
        else:
            # If not, call doubles on the clean member (without modi)
            # and it takes care of adding atoms to final
            doubles(diver(member))


    return final

Finally, solu does the trick:

>>> solu(formula.parseString('C6H8(OH)4'))
[['C', 6.0], ['H', 8.0], ['O', 4.0], ['H', 4.0]]
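
For comparison, here is a shorter way to walk the nested result. This is only a sketch on plain lists, assuming the shape shown in the question's output (what asList() appears to give for these grammars): each term is either [element, count] or [term, term, ..., multiplier]. The name total_stoichiometry is made up for this example.

```python
from collections import defaultdict


def total_stoichiometry(terms, mult=1, totals=None):
    """Sum element counts over a nested list of terms.

    Each term is assumed to be [element, count] for an atom, or
    [term, term, ..., multiplier] for a parenthesized group.
    """
    if totals is None:
        totals = defaultdict(float)
    for term in terms:
        *body, count = term  # the last item is always the count/multiplier
        if len(body) == 1 and isinstance(body[0], str):
            # plain atom: scale by every enclosing multiplier
            totals[body[0]] += count * mult
        else:
            # group: recurse, carrying the accumulated multiplier
            total_stoichiometry(body, mult * count, totals)
    return totals


# Same nesting as the parse of 'C6H8(OH)4' shown in the question
parsed = [['C', 6.0], ['H', 8.0], [['O', 1], ['H', 1], 4.0]]
print(dict(total_stoichiometry(parsed)))
# -> {'C': 6.0, 'H': 12.0, 'O': 4.0}
```

Unlike solu, this also merges repeated elements (H appears once, with 12.0), and because the multiplier is passed down as mult * count, groups nested more than one level deep are scaled by every enclosing multiplier.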
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow