Question

I'm trying to parse some text using pyparsing. The problem is that I have names that can contain spaces. So my input might look like this. First, a list of names:

Joe
bob
Jimmy X
grjiaer-rreaijgr Y

Then, things they do:

Joe A
bob B
Jimmy X C

The problem, of course, is that a thing they do can be the same as the end of the name:

Jimmy X X
grjiaer-rreaijgr Y Y

How can I create a parser for the action lines? The output of parsing Joe A should be [Joe, A]. The output of parsing Jimmy X C should be [Jimmy X, C], of Jimmy X X - [Jimmy X, X]. That is, [name, action] pairs.

If I create my name parser naively, meaning something like OneOrMore(Regex(r"\S+")), then it will match the entire line, giving me [Jimmy X X] followed by a parsing error for not seeing an action (since it was already consumed by the name parser).

NOTE: Sorry for the ambiguous phrasing earlier that made this look like an NLP question.

Solution

Have fun:

from pyparsing import Regex, oneOf

THE_NAMES = \
"""Joe
bob
Jimmy X
grjiaer-rreaijgr Y
"""

THE_THINGS_THEY_DO = \
"""Joe A
bob B
Jimmy X C
Jimmy X X
grjiaer-rreaijgr Y Y
"""

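# The action is whatever remains on the line once a known name has been matched.
ACTION = Regex('.*')
NAMES = THE_NAMES.splitlines()
print(NAMES)
# oneOf accepts a list of literal strings, so names containing spaces work.
GRAMMAR = oneOf(NAMES) + ACTION
for line in THE_THINGS_THEY_DO.splitlines():
    print(GRAMMAR.parseString(line))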

OTHER TIPS

You pretty much need more than a simple parser. Parsers use the symbols in a string to define which pieces of the string represent different elements of a grammar. This is why FM asked for some clue to indicate how you know what part is the name and what part is the rest of the sentence. If you could say that names are made up of one or more capitalized words, then the parser would know when the name stops and the rest of the sentence starts.

But a name like "jimmy foo decides"? How can the parser know just by looking at the symbols in "decides" whether "decides" is or is not part of the name? Even a human reading your "jimmy foo decides decides to eat" sentence would have some trouble determining where the name starts or stops, and whether this was some sort of typo.
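To make the capitalized-words idea concrete, here is a minimal pyparsing sketch. It assumes a rule that is not in the question: every word of a name is one ASCII capital letter followed by lowercase letters, and the rest of the sentence starts with a lowercase word. The names capitalized_word and split_sentence are made up for illustration.

```python
from pyparsing import OneOrMore, ParseException, Regex, Word

UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
LOWER = "abcdefghijklmnopqrstuvwxyz"

# Assumed rule (not from the question): a name word is one capital
# letter followed by zero or more lowercase letters.
capitalized_word = Word(UPPER, LOWER)

# A name is one or more such words, joined back into a single string.
name = OneOrMore(capitalized_word).setParseAction(lambda tokens: " ".join(tokens))

# Everything after the name is the rest of the sentence.
rest = Regex(r".+")

grammar = name + rest


def split_sentence(line):
    """Return [name, rest], or None if the line does not fit the assumed rule."""
    try:
        return grammar.parseString(line).asList()
    except ParseException:
        return None


parsed = split_sentence("Jimmy Foo decides to eat")
unparsed = split_sentence("jimmy foo decides to eat")
print(parsed)    # ['Jimmy Foo', 'decides to eat']
print(unparsed)  # None
```

With the capitalization rule the grammar has a symbol-level cue for where the name ends. Without it, as in the lowercase line, the parser has nothing to go on and the parse fails.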

If your input is really this unpredictable, then you need to use a tool such as the NLTK (Natural Language Toolkit). I've not used it myself, but it approaches this problem from the standpoint of parsing sentences in a language, as opposed to trying to parse structured data or mathematical formats.

I would not recommend pyparsing for this kind of language interpretation.

Looks like you need nltk, not pyparsing. Looks like you need a tractable problem to work on. How do YOU know how to parse 'jimmy foo decides decides to eat'? What rules do YOU use to deduce (contrary to what most people would assume) that "decides decides" is not a typo?

Re "names that can contain whitespaces": Firstly, I'd hope that you'd normalise that into one space. Secondly: this is unexpected?? Thirdly: names can contain apostrophes and hyphens (O'Brien, Montagu-Douglas-Scott) and may have components that aren't capitalised (e.g. Georg von und zu Hohenlohe), and we won't mention Unicode.
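As a small illustration of the first point, one way to collapse runs of whitespace in a name into single spaces uses only the standard library. The helper name normalize_name is hypothetical, and this is only a sketch: it does nothing about the apostrophe, hyphen, capitalisation, or Unicode issues mentioned above.

```python
def normalize_name(raw):
    """Collapse any run of whitespace into one space and trim both ends."""
    # str.split() with no argument splits on any whitespace run and
    # drops empty pieces, so joining with " " gives single spaces.
    return " ".join(raw.split())


examples = ["Jimmy   X", "  grjiaer-rreaijgr\tY ", "O'Brien"]
normalized = [normalize_name(name) for name in examples]
print(normalized)
```

Applying the same normalisation to both the name list and the action lines before parsing keeps the two in agreement.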

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow