BeautifulSoup and SoupStrainer for getting links don't work with hasattr, which always returns True

https://stackoverflow.com/questions/17943992

04-06-2022

Question

I am using BeautifulSoup 4 and SoupStrainer with Python 3.3 to get all links from a webpage. The following is the relevant code snippet:

r = requests.get(adress, headers=headers)
for link in BeautifulSoup(r.text, parse_only=SoupStrainer('a')):
    if hasattr(link, 'href'):

I tested it on several webpages and it worked very well, but today, when using

adress = 'http://www.goldentigercasino.de/'

I noticed that hasattr(link, 'href') always returns True, even when there is no 'href' attribute, as in the goldentigercasino.de example. Because of that I run into trouble later when using link['href'], because it is simply not there.

I also tried a workaround like this:

test = requests.get('http://www.goldentigercasino.de/')
for link in BeautifulSoup(test.text, parse_only=SoupStrainer('a',{'href': not None})):

That works as intended, except that it also returns the doctype:

HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"

This also causes trouble, for the same reason as above.

My question: why does hasattr always return True, and how can I fix that? And if this is not possible with hasattr, how can I fix my workaround so that it does not return the DOCTYPE?

Many thanks and best regards!

Solution

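hasattr() is the wrong test. It checks whether the link object has a Python attribute named href, and BeautifulSoup dynamically turns attribute access into a search for a child tag of that name. That search returns None when nothing matches instead of raising AttributeError, so hasattr() always returns True. HTML tag attributes are not translated into Python attributes.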
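The mechanism can be illustrated without BeautifulSoup at all. The sketch below uses a hypothetical FakeTag class (a stand-in written for this example, not BeautifulSoup's real Tag) whose __getattr__ performs a child lookup and returns None on a miss. Because hasattr() only reports False when AttributeError is raised, it reports True for any name:

```python
class FakeTag:
    """Hypothetical stand-in for a parsed tag; NOT BeautifulSoup's real class."""

    def __init__(self, attrs, children=None):
        self.attrs = dict(attrs)               # HTML attributes live in a dict
        self.children = dict(children or {})   # child tags, keyed by tag name

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails. Like a child search,
        # it returns None on a miss instead of raising AttributeError.
        return self.children.get(name)


# An <a name="top"> element: no href attribute and no children.
link = FakeTag(attrs={"name": "top"})

print(hasattr(link, "href"))   # True, although there is no href anywhere
print(link.href)               # None, the result of the failed child lookup
print("href" in link.attrs)    # False, the dictionary test gives the real answer
```

Under this assumption about the lookup behaviour, the only reliable check is against the attribute dictionary, not against Python attributes.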

Use dictionary-style testing instead. You loop over all elements, which can include the Doctype instance, so I use getattr() to avoid breaking on objects that have no attrs dictionary:

if 'href' in getattr(link, 'attrs', {}):
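As a minimal sketch of that check in context, the example below parses a small inline HTML string instead of fetching a page. The markup, the variable names and the choice of the built-in "html.parser" backend are assumptions made for illustration only:

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical sample markup: a doctype, two real links and one anchor without href.
html = """<!DOCTYPE html>
<html><body>
<a href="/home">Home</a>
<a name="anchor">No link here</a>
<a href="/about">About</a>
</body></html>"""

soup = BeautifulSoup(html, "html.parser", parse_only=SoupStrainer("a"))

hrefs = []
for link in soup:
    # Non-tag objects such as a Doctype have no attrs dict, so fall back to {}.
    if "href" in getattr(link, "attrs", {}):
        hrefs.append(link["href"])

print(hrefs)
```

Only the two anchors that carry an href end up in the list; the anchor without one and any non-tag objects are skipped without raising KeyError.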

You can also instruct SoupStrainer to match only a tags that have an href attribute by passing href=True as a keyword argument filter (not None simply evaluates to True anyway):

for link in BeautifulSoup(test.text, parse_only=SoupStrainer('a', href=True)):
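The remark about not None can be checked in plain Python. The expression is evaluated before the dictionary is built, so the strainer in the question's workaround only ever received True. The variable name below is illustrative:

```python
# The attribute filter from the question's workaround.
workaround_filter = {"href": not None}

# "not None" is an ordinary boolean expression, evaluated immediately.
print(not None)            # True
print(workaround_filter)   # {'href': True}
```

So {'href': not None} and href=True express the same filter; the keyword form just states it directly.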

This still includes the HTML declaration, of course; search for just the a links:

soup = BeautifulSoup(test.text, parse_only=SoupStrainer('a', href=True))
for link in soup.find_all('a'):
    print(link)
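For a version that runs without network access, here is a hedged sketch that applies the same two steps to an inline HTML string. The sample markup and the "html.parser" backend are assumptions for illustration, not part of the original code:

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical sample markup with a doctype and one anchor lacking href.
html = """<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html><body>
<a href="/games">Games</a>
<a name="top">Top of page</a>
<a href="/contact">Contact</a>
</body></html>"""

# Step 1: parse only <a> tags that carry an href attribute.
soup = BeautifulSoup(html, "html.parser", parse_only=SoupStrainer("a", href=True))

# Step 2: search for <a> tags explicitly, so a doctype at the top level
# of the soup can never be returned.
links = [link["href"] for link in soup.find_all("a")]

print(links)
```

Since find_all('a') returns only tags, link['href'] is safe for every result here.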

Licensed under: CC-BY-SA with attribution

Not affiliated with Stack Overflow