Question

I'm trying to parse this site and for reasons I can't understand, nothing is happening.

url = 'http://www.zap.com.br/imoveis/rio-de-janeiro+rio-de-janeiro/apartamento-padrao/venda/'
response = urllib2.urlopen(url).read()
doc = BeautifulSoup(response)
divs = doc.findAll('div')
print len(divs) # prints 0.

This site lists real-estate ads in Rio de Janeiro, Brazil. I can't find anything in the HTML source that would prevent BeautifulSoup from working. Could it be the size?

I'm using Enthought Canopy Python 2.7.6, IPython Notebook 2.0, Beautifulsoup 4.3.2.


Solution

This is because you are letting BeautifulSoup choose the most suitable parser for you, and that choice depends on which modules are installed in your Python environment.

According to the documentation:

The first argument to the BeautifulSoup constructor is a string or an open filehandle–the markup you want parsed. The second argument is how you’d like the markup parsed.

If you don’t specify anything, you’ll get the best HTML parser that’s installed. Beautiful Soup ranks lxml’s parser as being the best, then html5lib’s, then Python’s built-in parser.

So, different parsers - different results:

>>> from bs4 import BeautifulSoup
>>> url = 'http://www.zap.com.br/imoveis/rio-de-janeiro+rio-de-janeiro/apartamento-padrao/venda/'
>>> import urllib2
>>> response = urllib2.urlopen(url).read()
>>> len(BeautifulSoup(response, 'lxml').find_all('div'))
558
>>> len(BeautifulSoup(response, 'html.parser').find_all('div'))
558
>>> len(BeautifulSoup(response, 'html5lib').find_all('div'))
0

The solution is to explicitly specify a parser that can handle this particular page; you may need to install lxml or html5lib first.
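As a minimal sketch of pinning the parser (using an inline HTML snippet standing in for the downloaded page; 'html.parser' ships with Python, while 'lxml' and 'html5lib' each need a separate pip install):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the downloaded page
html = "<html><body><div>ad 1</div><div>ad 2</div></body></html>"

# Naming the parser explicitly removes the dependence on whatever
# happens to be installed in the environment
doc = BeautifulSoup(html, "html.parser")
div_count = len(doc.find_all("div"))  # 2 for this snippet
```

Once the same parser name is used everywhere, the same markup will always yield the same tree, regardless of which optional parsers a given machine has.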

Also see: Differences between parsers.
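The practical upshot is that the same invalid markup can produce different trees. A small example, adapted from the fragment used in the Beautiful Soup documentation's parser comparison:

```python
from bs4 import BeautifulSoup

# Invalid HTML: a stray closing tag with no matching opening <p>
broken = "<a></p>"

# Python's built-in parser simply drops the stray </p>, so no <p>
# element ends up in the tree at all
stdlib_p_tags = BeautifulSoup(broken, "html.parser").find_all("p")

# html5lib, if installed, would instead repair the document by inserting
# an empty <p></p> inside the <a>, so the same query would find one tag
```

On strict real-world pages all parsers tend to agree; it is malformed markup, as on this page, that exposes the differences.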

OTHER TIPS

Something is wrong with your environment; here is the output I get:

>>> url = 'http://www.zap.com.br/imoveis/rio-de-janeiro+rio-de-janeiro/apartamento-padrao/venda/'
>>> from bs4 import BeautifulSoup
>>> import urllib2
>>> response = urllib2.urlopen(url).read()
>>> doc = BeautifulSoup(response)
>>> divs = doc.findAll('div')
>>> print len(divs)
558
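Since the two environments disagree, it can help to check which parser Beautiful Soup actually picked. The `builder.NAME` attribute used below is an internal attribute rather than documented public API, so treat this as a diagnostic sketch:

```python
from bs4 import BeautifulSoup

# No parser argument: BeautifulSoup auto-selects the "best" installed one
# (recent versions warn about this for exactly the reason in this question)
soup = BeautifulSoup("<div>hello</div>")

# builder.NAME (internal) names the selected tree builder,
# e.g. 'lxml', 'html5lib' or 'html.parser'
chosen = soup.builder.NAME
```

If the two machines report different names here, that explains the different div counts.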
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow