Question

I want to extract the infobox block from Wikipedia. Below is a sample input file:

{{some text}}
some other text
{{Infobox President
birth|d/m/y
other_inner_text:{{may contain curly bracket}}
other text}}
some other text
or even another infobox
{{Infobox Cabinet
same structure
{{text}}also can contain {{}}
}}
can be some other text...

I want the parsing result to return the two Infobox blocks:

{{Infobox President
birth|d/m/y
other_inner_text:{{may contain curly bracket}}
other text
}}

and

{{Infobox Cabinet
same structure
{{text}}also can contain {{}}
}}

Does anyone know how to use a regular expression in Python to achieve this?

Solution

Regex

{{Infobox(?:(?!}}|{{).)*(?:{{(?:(?!}}|{{).)*}}(?:(?!}}|{{).)*)*.*?}}

And my attempt in Perl, which I'm not fluent in:

while ($subject =~ m/\{\{Infobox(?:(?!\}\}|\{\{).)*(?:\{\{(?:(?!\}\}|\{\{).)*\}\}(?:(?!\}\}|\{\{).)*)*.*?\}\}/sg) {
    # matched text = $&
}

It handles any number of inner "{{ some text }}" pairs inside the infobox, as long as they are balanced. It does not support inner pairs that are themselves nested, but that wasn't required.
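Since the question asks for Python, here is a minimal sketch of the same pattern using the standard re module. The braces are escaped for readability, and re.DOTALL plays the role of Perl's /s flag. The names INFOBOX_RE, sample, and infoboxes are only illustrative.

```python
import re

# Same pattern as above, with braces escaped; re.DOTALL lets "." match newlines.
INFOBOX_RE = re.compile(
    r"\{\{Infobox(?:(?!\}\}|\{\{).)*"                        # header up to the first {{ or }}
    r"(?:\{\{(?:(?!\}\}|\{\{).)*\}\}(?:(?!\}\}|\{\{).)*)*"   # any number of flat {{...}} pairs
    r".*?\}\}",                                              # closing }}
    re.DOTALL,
)

sample = """{{some text}}
some other text
{{Infobox President
birth|d/m/y
other_inner_text:{{may contain curly bracket}}
other text}}
some other text
or even another infobox
{{Infobox Cabinet
same structure
{{text}}also can contain {{}}
}}
can be some other text...
"""

# No capturing groups are used, so findall returns the full matched blocks.
infoboxes = INFOBOX_RE.findall(sample)
for block in infoboxes:
    print(block)
    print("-----")
```

On the sample input this should yield the two Infobox blocks and skip {{some text}}. The same limitation applies: inner pairs may not themselves contain nested pairs.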

Note that it may be better to look for an alternative if this is for more than a one-off task. Maintaining such a regex is brutal.

Other tips

To match nested structures, some regexp dialects provide recursive patterns like (?R). The (?R) construct basically means "whatever this whole expression matches", applied again at that point.

Python's standard re module doesn't support this, but the third-party regex module does. Here's a complete example.

text = """
{{some text}}
some other text
{{Infobox President
birth|d/m/y
other_inner_text:{{may contain {curly} bracket}}
other text}}
some other text
or even another infobox
{{Infobox Cabinet
same structure
{{text}}also can contain {{}}
}}
can be some other text...
"""

import regex

rx = r"""
{{                    # open
(                     # this match
    (?:               # contains...
        [^{}]         # no brackets
        |             # or
        }[^}]         # single close bracket
        |             # or
        {[^{]         # single open bracket
        |             # or
        (?R)          # the whole expression once again <-- recursion!
    )*                # zero or more times
)                     # end of match
}}                    # close
"""

rx = regex.compile(rx, regex.X | regex.S)

for p in rx.findall(text):
    print('FOUND: (((', p, ')))')

Result:

FOUND: ((( some text )))
FOUND: ((( Infobox President
birth|d/m/y
other_inner_text:{{may contain {curly} bracket}}
other text )))
FOUND: ((( Infobox Cabinet
same structure
{{text}}also can contain {{}}
)))

For a great explanation of recursive regexps see this blog entry.

That said, you'd probably be better off with a parser-based solution. See, for example, parsing nested expressions with pyparsing.
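As a rough illustration of a non-regex approach, the sketch below scans the text and tracks the nesting depth of {{ and }} by hand. It is not pyparsing and not a full wikitext parser. The function name extract_templates and its prefix parameter are hypothetical, and it assumes braces only matter in pairs.

```python
def extract_templates(text, prefix="{{Infobox"):
    """Return every balanced {{...}} block that starts with `prefix`."""
    blocks = []
    pos = 0
    while True:
        start = text.find(prefix, pos)
        if start == -1:
            break
        depth = 0
        i = start
        end = None
        while i < len(text):
            pair = text[i:i + 2]
            if pair == "{{":        # entering a (possibly nested) template
                depth += 1
                i += 2
            elif pair == "}}":      # leaving a template
                depth -= 1
                i += 2
                if depth == 0:      # the block opened at `start` is now closed
                    end = i
                    break
            else:                   # ordinary character, including a lone { or }
                i += 1
        if end is None:             # unbalanced input: stop rather than guess
            break
        blocks.append(text[start:end])
        pos = end
    return blocks


sample = """{{some text}}
{{Infobox President
other_inner_text:{{may contain {curly} bracket}}
other text}}
{{Infobox Cabinet
{{outer {{inner}} }}also can contain {{}}
}}
"""
found = extract_templates(sample)
for block in found:
    print(block)
    print("-----")
```

Unlike the flat regex, this handles arbitrarily deep nesting, at the cost of ignoring wikitext details such as nowiki sections or comments.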

It's not Python, but this answer may help you. It even includes a (not quick but dirty) regex that can handle one-level nested templates.

The general answer is no: regular expressions in the classical sense can't parse arbitrarily nested structures. See the linked answer for how to obtain a parse tree from the MediaWiki API.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow