BeautifulSoup functionality not working properly in a specific scenario

https://stackoverflow.com/questions/16134384

11-04-2022

Question

I am trying to read the following URL with urllib2, http://frcwest.com/, and then search the response for the meta redirect.

It reads the following data in:

    <!--?xml version="1.0" encoding="UTF-8"?--><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml"><head><title></title><meta content="0;url= Home.html" http-equiv="refresh"/></head><body></body></html>

Reading it into BeautifulSoup works fine. However, for some reason none of the functionality works in this specific scenario, and I don't understand why. BeautifulSoup has worked great for me in all other scenarios. Yet even a simple call such as:

    soup.findAll('meta')

produces no results.

My eventual goal is to run:

    soup.find("meta",attrs={"http-equiv":"refresh"})

But if:

    soup.findAll('meta')

isn't even working, then I'm stuck. Any insight into this mystery would be appreciated, thanks!

Solution

It's the comment and doctype that throw off the parser here, and subsequently BeautifulSoup.

Even the HTML tag seems 'gone':

    >>> soup.find('html') is None
    True

Yet it is still there in the .contents iterable. You can find things again with:

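    # Walk the top-level nodes and rebind soup to the <html> element
    for elem in soup:
        if getattr(elem, 'name', None) == u'html':
            soup = elem
            break

    # Searches now start from the <html> element and work again
    soup.find_all('meta')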

Demo:

    >>> for elem in soup:
    ...     if getattr(elem, 'name', None) == u'html':
    ...         soup = elem
    ...         break
    ...
    >>> soup.find_all('meta')
    [<meta content="0;url= Home.html" http-equiv="refresh"/>]
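Once the tag is found, the redirect target still has to be pulled out of its content attribute (in BeautifulSoup that value is available as tag['content']). The following is only a minimal sketch of that step, under some assumptions: it targets Python 3 and so uses urllib.parse instead of the urllib2 from the question, parse_meta_refresh is a hypothetical helper name that is not part of BeautifulSoup, and the base URL is simply the one from the question.

```python
from urllib.parse import urljoin


def parse_meta_refresh(content, base_url):
    """Split a meta refresh content value into (delay_seconds, absolute_url).

    Hypothetical helper for illustration. Returns (delay, None) when the
    value carries no "url=" part.
    """
    delay_part, _, rest = content.partition(";")
    delay = float(delay_part.strip() or 0)
    rest = rest.strip()
    if not rest.lower().startswith("url"):
        return delay, None
    # Drop the "url" keyword, the "=" sign, and any surrounding whitespace or quotes.
    target = rest[3:].lstrip().lstrip("=").strip().strip("'\"")
    # Resolve a relative target such as "Home.html" against the page URL.
    return delay, urljoin(base_url, target)


# The content value is copied from the tag found in the demo above.
print(parse_meta_refresh("0;url= Home.html", "http://frcwest.com/"))
# prints (0.0, 'http://frcwest.com/Home.html')
```

The stray space after "url=" in this page's markup is why the value is stripped before it is joined to the base URL.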

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow