Question

Can anyone explain why this regular expression (in Python):

pattern = re.compile(r"""
^
([[a-zA-Zàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]+\s{1}]+)
([a-zA-Zàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]+)   # Last word.
\.{1}                                                                                 
$
""", re.VERBOSE + re.UNICODE)

if re.match(pattern, line):

does not match "A sentence."

I would actually like to get the entire sentence (including the period) back as a captured group, but have been failing miserably.

No correct solution

OTHER TIPS

I think that maybe you meant to do this:

(([a-zA-Zàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]+\s{1})+)
 ^                                             ^

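I don't think the nested square brackets you had do what you think they do. Inside a character class a second [ is just a literal character, and the class ends at the first ], so the trailing ]+ in your pattern demands one or more literal ] characters after the whitespace.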
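To see how the bracketed version is actually parsed, here is a rough sketch. It uses a shortened ASCII-only character class instead of the full accented list, and the test string "A ]sentence." is made up for illustration; neither comes from the question. Recent Python 3 versions emit a FutureWarning for the [[ sequence, so the sketch silences it.

```python
import re
import warnings

# Shortened, ASCII-only stand-in for the class in the question (an assumption
# made to keep the example readable).
with warnings.catch_warnings():
    warnings.simplefilter("ignore", FutureWarning)  # "[[" triggers a warning
    # Parsed as: class "[[a-zA-Z]" one or more times, one whitespace,
    # then one or more literal "]" characters.
    broken = re.compile(r"^([[a-zA-Z]+\s]+)([a-zA-Z]+)\.$")

# Parentheses group the word-plus-space unit so that "+" repeats the unit.
fixed = re.compile(r"^(([a-zA-Z]+\s)+)([a-zA-Z]+)\.$")

no_match = broken.match("A sentence.")               # None: no literal "]"
odd_match = broken.match("A ]sentence.").groups()    # ('A ]', 'sentence')
good_match = fixed.match("A sentence.").groups()     # ('A ', 'A ', 'sentence')

print(no_match)
print(odd_match)
print(good_match)
```

So the original pattern only matches when a literal ] follows the space, which is why "A sentence." is rejected.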

This regex works:

import re

pattern = re.compile(r"""
^
([a-zA-Zàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]+\s{1})+
([a-zA-Zàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]+)   # Last word.
\.{1}
$
""", re.VERBOSE | re.UNICODE)

line = "A sentence."

match = re.match(pattern, line)

>>> print("'%s'" % match.group(0))
'A sentence.'
>>> print("'%s'" % match.group(1))
'A '
>>> print("'%s'" % match.group(2))
'sentence'

To return the entire match (line in this case), use match.group(0).

Because the first group is repeated (it matches once for each word except the last one), it keeps only its final iteration, so match.group(1) gives you just the next-to-last word followed by its space.
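If you want all the leading words rather than only the last repetition, one option is to wrap the repeated part in an outer group and make the inner one non-capturing. The following is a sketch under that assumption, again with an ASCII-only class and a made-up sample line:

```python
import re

line = "This is a sentence."  # sample input, chosen for illustration

# Repeated capturing group: only the last iteration is kept.
repeated = re.match(r"^([a-zA-Z]+\s)+([a-zA-Z]+)\.$", line)

# Outer group around a non-capturing repeated group: keeps the whole run.
wrapped = re.match(r"^((?:[a-zA-Z]+\s)+)([a-zA-Z]+)\.$", line)

# If the individual words are needed, findall is a simple alternative.
words = re.findall(r"[a-zA-Z]+", line)

print(repr(repeated.group(1)))  # 'a '
print(repr(wrapped.group(1)))   # 'This is a '
print(words)                    # ['This', 'is', 'a', 'sentence']
```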

Btw, the {1} notation is not necessary here: matching exactly once is the default behavior, so it can be removed.
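A quick way to convince yourself is to compare the two spellings on a few inputs. This is only a sketch; the sample strings and the ASCII-only class are my own choices:

```python
import re

# Same pattern written with and without the explicit {1} quantifiers.
explicit = re.compile(r"^[a-zA-Z]+\s{1}[a-zA-Z]+\.{1}$")
implicit = re.compile(r"^[a-zA-Z]+\s[a-zA-Z]+\.$")

samples = ["A sentence.", "A  sentence.", "A sentence..", "Asentence."]

# Both patterns should accept and reject exactly the same strings.
same = all(bool(explicit.match(s)) == bool(implicit.match(s)) for s in samples)
print(same)  # True
```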

The extra set of square brackets definitely weren't helping you :)

It turns out the following actually works and includes all the extended ASCII characters I wanted:

^
([\w+\s{1}]+\w{1}\.{1})
$
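One caveat worth checking: inside square brackets, +, {, 1 and } are literal characters rather than quantifiers, and \w also covers digits and the underscore, so this pattern accepts more than plain words. The sketch below compares it with a stricter alternative that is my own suggestion, not something from the answers above; it relies on [^\W\d_] matching roughly "any letter" for Python 3 str patterns.

```python
import re

# The pattern from the post: the class holds \w, "+", \s, "{", "1" and "}".
loose = re.compile(r"^([\w+\s{1}]+\w{1}\.{1})$")

# Hypothetical stricter variant: letters only, words separated by one
# whitespace character, ending in a period. The whole sentence is group 1.
tight = re.compile(r"^((?:[^\W\d_]+\s)+[^\W\d_]+\.)$")

samples = ["A sentence.", "Une phrase très brève.", "a{1}+b.", "x1 y_2."]

# Map each sample to (matched by loose, matched by tight).
results = {s: (bool(loose.match(s)), bool(tight.match(s))) for s in samples}
for s, r in results.items():
    print(repr(s), r)

whole = tight.match("A sentence.").group(1)
print(repr(whole))  # 'A sentence.'
```

Both patterns accept the two real sentences, but only the loose one also accepts "a{1}+b." and "x1 y_2.".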
Licensed under: CC-BY-SA with attribution