Question

I am having trouble with regular expressions.

I have:

urls = re.findall(r'href=[\'"]?([^\'" >]+)', line)
print urls

which gives me:

['production_r1499.log']
['production_r1499.log-20140323']
['production_r1499.log-20140323.gz']

I am only interested in the .log file. How do I get the regex to only match this one?

alternatively. Could some approach similar to this work?

if str(urls).endswith('.log'):

Happy and grateful for suggestions!

Was it helpful?

Solution

Try this.

urls = re.findall(r'href=[\'"]?([^\'" >]+\.log)', line)

Strictly speaking, this should be anchored to the end of the href attribute. If you are concerned about false positives, maybe add something like [\'">] before the closing quote.

OTHER TIPS

Use lookahead to see whether there are any ", ', > or space after the .log in your match.

urls = re.findall(r'href=[\'"]?([^\'" >]+\.log)(?=[\'"> ])', line)
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top