findall and regular expressions, getting the correct pattern

https://stackoverflow.com/questions/22387868

14-06-2023

Question

I'm working through Magnus Lie Hetland's book "Beginning Python" (2nd edition). On page 244 he says the first pattern in my code should produce the desired output shown at the bottom of the code, but it doesn't. I tried a couple of other patterns to get the desired output, but they don't work either. I checked the errata for the book and there are no corrections for this page. I'm using Python 2.7.6. Any suggestions?

import re

s1 = 'http://www.python.org http://python.org www.python.org python.org .python.org ww.python.org w.python.org wwww.python.org'

# choose a pattern and comment out the other two

# output using Hetland's pattern
pat = r'(http://)?(www\.)?python\.org'
''' [('http://', 'www.'), ('http://', ''), ('', 'www.'), ('', ''), ('', ''), ('', ''), ('', ''), ('', 'www.')] '''

# output using this pattern
# pat = r'http://?www\.?python\.org'
''' ['http://www.python.org'] '''

# output using this pattern
# pat = r'http://?|www\.?|python\.org'
''' ['http://', 'www.', 'python.org', 'www.', 'http://', 'python.org', 'www.', 'python.org', 'python.org', 'python.org', 'python.org', 'python.org', 'www', 'python.org'] '''

print '\n', re.findall(pat, s1)

# desired output
''' ['http://www.python.org', 'http://python.org', 'www.python.org', 'python.org'] '''

Solution

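When a pattern contains capture groups, re.findall returns the contents of the groups (as tuples, if there is more than one group) rather than the whole match. The pattern works if you make the two optional groups non-capturing groups, (?:...):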

pat = r'(?:http://)?(?:www\.)?python\.org'
matches = re.findall(pat, s1)
# ['http://www.python.org', 'http://python.org', 'www.python.org', 'python.org', 'python.org', 'python.org', 'python.org', 'www.python.org']

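That is, if that's the desired result - the changed pattern has no capture groups instead of two, so you can no longer retrieve the 'http://' and 'www.' parts separately. Note also that it still matches inside '.python.org', 'ww.python.org', 'w.python.org' and 'wwww.python.org', so the list has eight items rather than the four in your desired output.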
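If the goal is exactly the four-item list from the question, here is a rough sketch of two variations. The first keeps the original capture groups and reads the whole match from each match object via re.finditer. The second assumes the URLs in the string are separated by whitespace (an assumption on my part, not something the book states) and uses lookarounds to reject matches that start or end in the middle of a longer token. The variable names are only illustrative, and the code is written to run on both Python 2.7 and Python 3.

```python
import re

s1 = ('http://www.python.org http://python.org www.python.org python.org '
      '.python.org ww.python.org w.python.org wwww.python.org')

# Option 1: keep the capture groups, but take the whole match (group 0)
# from each match object instead of letting findall return the groups.
grouped = re.compile(r'(http://)?(www\.)?python\.org')
whole_matches = [m.group(0) for m in grouped.finditer(s1)]

# Option 2 (assumption: URLs are whitespace-separated): only accept a match
# that is not preceded or followed by a non-whitespace character.
strict = re.compile(r'(?<!\S)(?:http://)?(?:www\.)?python\.org(?!\S)')
exact_matches = strict.findall(s1)

print(whole_matches)  # same eight strings as the non-capturing pattern
print(exact_matches)  # the four strings listed as the desired output
```

With option 1 you can still call m.group(1) and m.group(2) on each match if you need the 'http://' and 'www.' parts separately.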

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow