Non greedy dotall regex in Python

https://stackoverflow.com//questions/25055497

21-12-2019

Question

I need to parse annotations of methods written in PHP. I wrote a regex (see the simplified example below) to find them, but it doesn't work as expected. Instead of matching the shortest stretch of text between /** and */, it matches the maximum amount of source code (earlier methods along with their annotations). I'm sure I'm using the correct non-greedy .*? form of *, and I have found no evidence that DOTALL turns it off. Where could the problem be? Thank you.

import re

p = re.compile(r'(?:/\*\*.*?\*/)\n\s*public', re.DOTALL)
methods = p.findall(text)  # text holds the PHP source

Solution

Regex engines parse from left to right. A lazy quantifier will attempt to match the least it can from the current match position, but it can't push the match start forward, even if that would reduce the amount of text matched. That means rather than starting at the last /** before the public, it's going to match from the first /** to the next */ that's attached to a public.
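
A minimal sketch of this behaviour, using the question's pattern (without its redundant non-capturing group) on a made-up sample. The sample string and the names sample and lazy are assumptions for illustration, not part of the question:

```python
import re

# Hypothetical sample: one comment before a class, one before a public method.
sample = "/** first */\nclass A {\n/** second */\npublic function f() {}"

lazy = re.compile(r'/\*\*.*?\*/\n\s*public', re.DOTALL)
match = lazy.search(sample)

# Start positions are tried from left to right, so the match begins at the
# FIRST /**; the lazy .*? then has to run across the first */ to reach public.
print(match.start())
print(match.group().count("/**"))
```

The match begins at offset 0 and contains both comment openers, even though a shorter match exists further to the right.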

If you want to exclude */ from inside the comment, you'll need to group the . with a lookahead assertion:

(?:(?!\*/).)

The (?!\*/) asserts that the character we're matching is not the start of a */ sequence.
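
Put together with the pattern from the question, the guarded dot might be used roughly as follows. This is only a sketch on a made-up sample; sample and tempered are illustrative names:

```python
import re

# Hypothetical sample: one comment before a class, one before a public method.
sample = "/** first */\nclass A {\n/** second */\npublic function f() {}"

# (?:(?!\*/).)* consumes one character at a time, but only while that
# character does not begin a */ sequence, so it cannot cross a comment end.
tempered = re.compile(r'/\*\*(?:(?!\*/).)*\*/\n\s*public', re.DOTALL)

print(tempered.findall(sample))
```

Only the comment directly in front of public is returned, because the attempt starting at the first /** can no longer extend past its own */.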

OTHER TIPS

I think this is what you are trying to get:

>>> import re
>>> text = """ /** * comment */ class MyClass extends Base { /** * comment */ public function xyz """
>>> m = re.findall(r'\/\*\*(?:(?!\*\/).)*\*\/\s*public', text, re.DOTALL)
>>> m
['/** * comment */ public']

If you don't want public in the final match, use the regex below, which uses a positive lookahead:

>>> m = re.findall(r'\/\*\*(?:(?!\*\/).)*\*\/(?=\s*public)', text, re.DOTALL)
>>> m
['/** * comment */']

You should be able to use this:

\/\*\*([^*]|\*[^/])*?\*\/\s*public

That will match any symbol that isn't an asterisk (*); an asterisk is allowed only if it is not followed by a forward slash. This means it should only capture comments that are closed just before public, and not sooner.

Example: http://regexr.com/398b3

Explanation: http://tinyurl.com/lcewdmo

Disclaimer: If the comment contains */ inside it, this won't work.
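
One thing to watch when trying this pattern in Python: it contains a capturing group, so re.findall returns what that group matched last rather than the whole match. A small sketch on a made-up sample (sample and pattern are illustrative names); re.search with group(0), or a non-capturing (?:...) group, gives the full text:

```python
import re

# Hypothetical sample: one comment before a class, one before a public method.
sample = "/** first */\nclass A {\n/** second */\npublic function f() {}"

pattern = r'\/\*\*([^*]|\*[^/])*?\*\/\s*public'

# With a capturing group, findall yields only the group's last repetition.
print(re.findall(pattern, sample))

# group(0) of a match object is the entire matched text.
print(re.search(pattern, sample).group(0))
```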

# Some examples, assuming that the annotation you want to parse
# starts with /** and ends with */. It may be spread over
# several lines.

text = """
/**
 @Title(value='Welcome', lang='en')
 @Title(value='Wilkommen', lang='de')
 @Title(value='Vitajte', lang='sk')
 @Snippet
    ,*/
class WelcomeScreen {}

   /** @Target("method") */
  class Route extends Annotation {}

/** @Mapping(inheritance = @SingleTableInheritance,
    columns = {@ColumnMapping('id'), @ColumnMapping('name')}) */
public Person {}

"""

text2 = """ /** * comment */
CLASS MyClass extends Base {

/** * comment */
public function xyz
"""


import re

# Match a PHP annotation and the word following class or public
# function.
annotations = re.findall(r"""/\*\*             # Starting annotation
                                               # 
                            (?P<annote>.*?)    # Named, non-greedy match
                                               # including newline
                                               #
                             \*/               # Ending annotation
                                               #
                             (?:.*?)           # Non-capturing non-greedy
                                               # including newline
                 (?:public[ ]+function|class)  # Match either
                                               # of these
                             [ ]+              # One or more spaces
                             (?P<name>\w+)     # Match a word
                         """,
                         text + text2,
                         re.VERBOSE | re.DOTALL | re.IGNORECASE)

for txt in annotations:
    print("Annotation: ", " ".join(txt[0].split()))
    print("Name: ", txt[1])
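
Since the pattern defines named groups, re.finditer allows each match to be read by group name instead of by tuple index. A compact sketch on a shortened, made-up sample; sample and pattern are illustrative names, and the regex is the one above written on a single line:

```python
import re

# Hypothetical one-annotation sample, shortened from the text above.
sample = '/** @Target("method") */\nclass Route extends Annotation {}'

pattern = re.compile(
    r"/\*\*(?P<annote>.*?)\*/.*?(?:public[ ]+function|class)[ ]+(?P<name>\w+)",
    re.DOTALL | re.IGNORECASE,
)

for m in pattern.finditer(sample):
    # Collapse runs of whitespace in the annotation body before printing.
    print("Annotation:", " ".join(m.group("annote").split()))
    print("Name:", m.group("name"))
```

Note that the lazy .*? between the closing */ and the keyword can skip over unrelated text, so an annotation not directly followed by class or public function may be paired with a later declaration.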

Licensed under: CC-BY-SA with attribution
