Strange behavior of parentheses in Python regex

https://stackoverflow.com//questions/11703573

13-12-2019

Question

I'm writing a Python regex that looks through a text document for quoted strings (quotes of airline pilots recorded from black boxes). I started by trying to write a regex with the following rules:

Return what is between quotes.
If it opens with a single quote, only return it if it closes with a single quote.
If it opens with a double quote, only return it if it closes with a double quote.

For instance, I don't want to match "hi there' or 'hi there", but I do want to match "hi there" and 'hi there'.

I use a testing page which contains things like:

CA  "Runway 18, wind 230 degrees, five knots, altimeter 30."
AA  "Roger that"
18:24:10 [flap lever moving into detent]
ST: "Some passenger's pushing a switch. May I?"

So I decided to start simple:

 re.findall('("|\').*?\\1', page)
 ########## /("|').*?\1/ <-- raw regex I think I'm going for.

This regex acts very unexpectedly.
I thought it would:

( " | ' ) Match EITHER single OR double quotes, save as back reference \1.
.*? Match non-greedy wildcard.
\1 Match whatever it found in back reference \1 (step one).

Instead, it returns an array of quotes but never anything else.

['"', '"', "'", "'"]

I'm really confused because the equivalent (afaik) regex works just fine in VIM.

/\("\|'\).\{-}\1/

My question is this:
Why does it return only what is inside the parentheses as the match? Is this a flaw in my understanding of back references? If so, then why does it work in VIM?

And how do I write the regex I'm looking for in python?

Thank you for your help!

Solution

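Read the documentation. If the pattern contains groups, re.findall returns the groups rather than the entire match: a list of strings for one group, a list of tuples for several. If you want the entire match, you must wrap the whole pattern in a group, or use re.finditer and call group(0) on each match object.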
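A minimal sketch of the difference, assuming a small sample page modeled on the question's test data (the single-quoted "FO" line is made up for illustration):

```python
import re

# Hypothetical sample; the third line is invented to exercise single quotes.
page = '''CA  "Runway 18, wind 230 degrees, five knots, altimeter 30."
AA  "Roger that"
FO  'Checklist complete'
'''

pattern = r'''("|').*?\1'''

# One capturing group: findall returns only that group's text for each match.
groups_only = re.findall(pattern, page)
print(groups_only)

# finditer yields match objects; group(0) is always the entire match.
whole = [m.group(0) for m in re.finditer(pattern, page)]
print(whole)
```

The first print shows only the quote characters, which is the behavior described in the question; the second shows the full quoted strings.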

OTHER TIPS

You aren't capturing anything except for the quotes, which is what Python is returning.

If you add another group, things work much better:

for quote, match in re.findall(r'("|\')(.*?)\1', page):
  print(match)

I prefixed your string literal with an r to make it a raw string, which is useful when you need to use a ton of backslashes (\\1 becomes \1). With two groups, re.findall returns (quote, contents) tuples, so the loop can unpack each one.
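To see why the backslash handling matters, here is a short sketch; the sample text is an assumption made up for the demonstration:

```python
import re

# In a normal string literal, '\1' is an octal escape: one control character.
assert len('\1') == 1 and '\1' == chr(1)
# Doubling the backslash or using a raw string gives the two characters \ and 1.
assert '\\1' == r'\1' and len(r'\1') == 2

text = 'say "hi" or \'bye\''  # hypothetical sample text
escaped = re.findall('("|\')(.*?)\\1', text)
raw = re.findall(r'("|\')(.*?)\1', text)

# Both spellings compile to an equivalent regex and find the same pairs.
print(escaped == raw)
print(raw)
```

Each result item is a (quote character, contents) tuple, which is what the loop above unpacks.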

You need to capture everything with an extra pair of parentheses. The quote then becomes group 2, so the back reference changes from \\1 to \\2.

re.findall('(("|\').*?\\2)', page)
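With two groups, re.findall returns tuples, so the whole quote still has to be picked out of each one. A rough sketch, using one line from the question's test page plus a mismatched example; the variable names are illustrative only:

```python
import re

pattern = r'''(("|').*?\2)'''

line = 'ST: "Some passenger\'s pushing a switch. May I?"'
pairs = re.findall(pattern, line)

# Each item is a tuple: (group 1 = whole quote, group 2 = quote character).
quotes = [whole for whole, _quote_char in pairs]
print(quotes)

# Mismatched delimiters have no closing partner, so nothing is returned.
mismatched = re.findall(pattern, '"hi there\'')
print(mismatched)
```

The apostrophe in "passenger's" does not end the match, because the back reference requires the same character that opened the quote.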

Licensed under: CC-BY-SA with attribution
