pyparsing - parse xml comment

https://stackoverflow.com/questions/7825030

27-10-2019
|

Question

I need to parse a file containing xml comments. Specifically it's a c# file using the MS /// convention.

From this I'd need to pull out foobar, or /// foobar would be acceptable, too. (Note - this still doesn't work if you make the xml all on one line...)

testStr = """
    ///<summary>
    /// foobar
    ///</summary>
    """

Here is what I have:

import pyparsing as pp

_eol = pp.Literal("\n").suppress()
_cPoundOpenXmlComment = Suppress('///<summary>') + pp.SkipTo(_eol)
_cPoundCloseXmlComment = Suppress('///</summary>') + pp.SkipTo(_eol)
_xmlCommentTxt = ~_cPoundCloseXmlComment + pp.SkipTo(_eol)
xmlComment = _cPoundOpenXmlComment + pp.OneOrMore(_xmlCommentTxt) + _cPoundCloseXmlComment

match = xmlComment.scanString(testStr)

and to output:

for item,start,stop in match:
    for entry in item:
        print(entry)

But I haven't had much success with the grammer working across multi-line.

(note - I tested the above sample in python 3.2; it works but (per my issue) does not print any values)

Thanks!

Solution

How about using nestedExpr:

import pyparsing as pp

text = '''\
///<summary>
/// foobar
///</summary>
blah blah
///<summary> /// bar ///</summary>
///<summary>  ///<summary> /// baz  ///</summary> ///</summary>    
'''

comment=pp.nestedExpr("///<summary>","///</summary>")
for match in comment.searchString(text):
    print(match)
    # [['///', 'foobar']]
    # [['///', 'bar']]
    # [[['///', 'baz']]]

OTHER TIPS

I think Literal('\n') is your problem. You don't want to build a Literal with whitespace characters (since Literals by default skip over whitespace before trying to match). Try using LineEnd() instead.

EDIT 1: Just because you get an infinite loop with LineEnd doesn't mean that Literal('\n') is any better. Try adding .setDebug() on the end of your _eol definition, and you'll see that it never matches anything.

Instead of trying to define the body of your comment as "one or more lines that are not a closing line, but get everything up to the end-of-line", what if you just do:

xmlComment = _cPoundOpenXmlComment + pp.SkipTo(_cPoundCloseXmlComment) + _cPoundCloseXmlComment

(The reason you were getting an infinite loop with LineEnd() was that you were essentially doing OneOrMore(SkipTo(LineEnd())), but never consuming the LineEnd(), so the OneOrMore just kept matching and matching and matching, parsing and returning an empty string since the parsing position was at the end of line.)

You could use an xml parser to parse xml. It should be easy to extract relevant comment lines:

import re
from xml.etree import cElementTree as etree

# extract all /// lines
lines = re.findall(r'^\s*///(.*)', text, re.MULTILINE)

# parse xml
root = etree.fromstring('<root>%s</root>' % ''.join(lines))
print root.findtext('summary')
# -> foobar

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow