Pyparsing - where order of tokens is unpredictable

https://stackoverflow.com/questions/2134416

22-09-2019

Question

I want to be able to pull out the type and count of letters from a piece of text where the letters could be in any order. There is some other parsing going on which I have working, but this bit has me stumped!

input -> result
"abc" -> [['a',1], ['b',1],['c',1]]
"bbbc" -> [['b',3],['c',1]]
"cccaa" -> [['a',2],['c',3]]

I could use search or scan and repeat for each possible letter, but is there a clean way of doing it?

This is as far as I got:

from pyparsing import *


def handleStuff(string, location, tokens):
    return [tokens[0][0], len(tokens[0])]


stype = Word("abc").setParseAction(handleStuff)
section =  ZeroOrMore(stype("stype"))


print(section.parseString("abc").dump())
print(section.parseString("aabcc").dump())
print(section.parseString("bbaaa").dump())

Solution

I wasn't clear from your description whether the input characters could be mixed like "ababc", since in all your test cases, the letters were always grouped together. If the letters are always grouped together, you could use this pyparsing code:

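from pyparsing import Each, Optional, Word

def makeExpr(ch):
    # Word(ch) matches a run of one or more copies of ch; the parse action
    # replaces that run with the character and the length of the run.
    expr = Word(ch).setParseAction(lambda tokens: [ch, len(tokens[0])])
    return expr

# Each matches its sub-expressions in any order; Optional lets a letter be absent.
expr = Each([Optional(makeExpr(ch)) for ch in "abc"])

tests = ["abc", "bbbc", "cccaa"]
for t in tests:
    print("%s %s" % (t, expr.parseString(t).asList()))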

The Each construct takes care of matching out of order, and Word(ch) handles the 1-to-n repetition. The parse action converts each matched run into its character and count.
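To check whether your input really is grouped, here is a minimal standard-library sketch (no pyparsing) that counts contiguous runs. The helper name `count_runs` is made up for illustration. Grouped input gives one run per letter, which is the case Word(ch) covers; mixed input such as "ababc" gives several runs for the same letter.

```python
from itertools import groupby

def count_runs(text):
    """Return [letter, run_length] for each contiguous run of one letter."""
    return [[letter, len(list(run))] for letter, run in groupby(text)]

# Grouped input: each letter appears in exactly one run.
print(count_runs("cccaa"))
# Mixed input: the same letter shows up in more than one run.
print(count_runs("ababc"))
```

If a letter appears in more than one run, the grouped-letters assumption does not hold for that input.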

OTHER TIPS

One solution:

text = 'sufja srfjhvlasfjkhv lasjfvhslfjkv hlskjfvh slfkjvhslk'
print([(x,text.count(x)) for x in set(text)])

No pyparsing involved, but pyparsing seems like overkill for this.

I like Lennart's one-line solution.

Alex mentions another great option if you're using Python 3.1.

Yet another option is collections.defaultdict:

>>> from collections import defaultdict
>>> mydict = defaultdict(int)
>>> for c in 'bbbc':
...   mydict[c] += 1
...
>>> mydict
defaultdict(<type 'int'>, {'c': 1, 'b': 3})

If you want a pure-pyparsing approach, this feels about right:

from pyparsing import *

# helper to build a named expression for a single character
def makeExpr(ch):
    expr = Literal(ch).setResultsName(ch, listAllMatches=True)
    return expr

expr = OneOrMore(MatchFirst(makeExpr(c) for c in "abc"))
expr.setParseAction(lambda tokens: [[a,len(b)] for a,b in tokens.items()])


tests = """\
abc
bbbc
cccaa
""".splitlines()

for t in tests:
    print("%s %s" % (t, expr.parseString(t).asList()))

Prints:

abc [['a', 1], ['c', 1], ['b', 1]]
bbbc [['c', 1], ['b', 3]]
cccaa [['a', 2], ['c', 3]]

But this starts to get into obscure territory, since it relies on some of the more arcane features of pyparsing. In general, I like frequency counters that use defaultdict (haven't tried Counter yet), since it's pretty clear just what you are doing.

Pyparsing aside, in Python 3.1, collections.Counter makes such counting tasks really easy. A good version of Counter for Python 2 can be found here.
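A minimal sketch of the Counter approach, assuming you want the [letter, count] pairs sorted by letter as in the question's examples. The helper name `letter_counts` is made up for illustration.

```python
from collections import Counter

def letter_counts(text):
    """Return [[letter, count], ...] sorted by letter."""
    counts = Counter(text)  # maps each character to its number of occurrences
    return [[letter, counts[letter]] for letter in sorted(counts)]

for sample in ("abc", "bbbc", "cccaa"):
    print(sample, letter_counts(sample))
```

Because Counter tallies every character regardless of position, this works the same whether the letters are grouped or mixed.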

Licensed under: CC-BY-SA with attribution