Question

I'm having some trouble working out the logic of regular expressions in Python. I would like to write a regular expression that doesn't return a match if the string ends in a particular substring. Ultimately I'm trying to exclude any links to binary files that I find in the href attribute of <a> tags. (This is being implemented in Scrapy.)

My issue is that when my regular expression is [^ \t\n\r\f\v]+[\/]?(?<!.pdf) and it encounters a link to someDocument.pdf, it returns someDocument.pd

How can I prevent it from returning any match at all when it encounters such a string?

Solution

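If you are using this from Scrapy, you may need to add a $ at the end of your regex. Without the anchor, the engine simply backtracks one character until the lookbehind no longer sees .pdf, which is why you get someDocument.pd. With $, the lookbehind is only checked at the end of the string: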

[^ \t\n\r\f\v]+[\/]?(?<!\.pdf)$
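
To see the difference, here is a minimal sketch using only the standard re module. The helper first_match and the sample strings are made up for illustration; how Scrapy itself applies the pattern may differ.

```python
import re

# Pattern from the question (no end anchor) and the anchored variant above.
unanchored = re.compile(r"[^ \t\n\r\f\v]+[\/]?(?<!.pdf)")
anchored = re.compile(r"[^ \t\n\r\f\v]+[\/]?(?<!\.pdf)$")

def first_match(pattern, text):
    """Hypothetical helper: return the first matched substring, or None."""
    m = pattern.search(text)
    return m.group(0) if m else None

# Without $, the engine backtracks one character to satisfy the lookbehind.
print(first_match(unanchored, "someDocument.pdf"))
# With $, the match must end at the end of the string, so a .pdf link is rejected.
print(first_match(anchored, "someDocument.pdf"))
print(first_match(anchored, "someDocument.html"))
```

The first call reproduces the truncated someDocument.pd from the question; the anchored pattern returns no match for the .pdf link and the full string for the .html one.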

If you can use BeautifulSoup in your project, you can filter on the href attribute directly:

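import re
from bs4 import BeautifulSoup

htmls = '''<a href="adssad/asdasd/asd.pdf">M</a> <a href='asdasdasdas/asdasd/asdasd.doc'></a>'''
soup = BeautifulSoup(htmls, "html.parser")
# Keep only <a> tags whose href does not end in ".pdf"
for link in soup.find_all("a", {"href": re.compile(r"(?<!\.pdf)$")}):
    print(link['href'])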
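
If you need to exclude several binary types rather than just .pdf, comparing file extensions may be easier to maintain than a regex. This is only a sketch using the standard library: the name is_binary_link and the extension set BINARY_EXTENSIONS are assumptions for illustration, not part of Scrapy or BeautifulSoup.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical set of extensions to treat as binary; adjust to your needs.
BINARY_EXTENSIONS = {".pdf", ".doc", ".zip", ".exe"}

def is_binary_link(href):
    """Return True if the URL path ends in one of BINARY_EXTENSIONS."""
    path = urlparse(href).path  # drops any query string or fragment
    ext = posixpath.splitext(path)[1].lower()
    return ext in BINARY_EXTENSIONS

hrefs = [
    "adssad/asdasd/asd.pdf",
    "asdasdasdas/asdasd/asdasd.doc",
    "page.html?file=x.pdf",
]
kept = [h for h in hrefs if not is_binary_link(h)]
print(kept)
```

Parsing the URL first means a query string such as ?file=x.pdf does not cause an HTML page to be excluded by mistake.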
Licensed under: CC-BY-SA with attribution