Question

I've rechecked my code and compared it with similar examples of opening a URL and passing the web data into Beautiful Soup, but for some reason my code doesn't return anything, even though it looks correct:

>>> from bs4 import BeautifulSoup

>>> from urllib3 import poolmanager

>>> connectBuilder = poolmanager.PoolManager()

>>> content = connectBuilder.urlopen('GET', 'http://www.crummy.com/software/BeautifulSoup/')

>>> content
<urllib3.response.HTTPResponse object at 0x00000000032EC390>

>>> soup = BeautifulSoup(content)

>>> soup.title
>>> soup.title.name
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'name'
>>> soup.p
>>> soup.get_text()
''

>>> content.data
a stream of data follows...

As shown, it's clear that urlopen() returns an HTTP response which is captured by the variable content. It makes sense that it can read the status of the response, but after it's passed into Beautiful Soup, the web data doesn't get converted into a Beautiful Soup object (variable soup). You can see that I've tried to read a few tags and the text; get_text() returns an empty string, which is strange.

Strangely, when I access the web data via content.data, the data shows up but it's not useful since I can't use Beautiful Soup to parse it. What is my problem? Thanks.


Solution

If you just want to scrape the page, requests will get the content you need:

from bs4 import BeautifulSoup
import requests

# r.content holds the raw response body as bytes, which BeautifulSoup accepts directly.
r = requests.get('http://www.crummy.com/software/BeautifulSoup/')
soup = BeautifulSoup(r.content)

In [59]: soup.title
Out[59]: <title>Beautiful Soup: We called him Tortoise because he taught us.</title>

In [60]: soup.title.name
Out[60]: 'title'

OTHER TIPS

urllib3 returns a response object whose .data attribute holds the preloaded body payload.

Per the top quickstart usage example here, I would do something like this:

import urllib3
http = urllib3.PoolManager()
response = http.request('GET', 'http://www.crummy.com/software/BeautifulSoup/')

from bs4 import BeautifulSoup
soup = BeautifulSoup(response.data)  # Note the use of the .data property
...

The rest should work as intended.

--

A little about what went wrong in your original code:

You passed the entire response object rather than the body payload. This should normally be fine because the response object is a file-like object, except that in this case urllib3 has already consumed the whole response and parsed it for you, so there is nothing left to .read(). It's like passing a file pointer which has already been read. .data, on the other hand, gives you access to the already-read data.
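
The "already-read file pointer" analogy can be reproduced without any network access. This is a minimal sketch that uses the standard library's io.BytesIO as a stand-in for the response object; the HTML bytes are made up for illustration, and it shows general file-like behaviour rather than urllib3's internals.

```python
import io

# Stand-in for an HTTP response body (illustrative bytes, not the real page).
stream = io.BytesIO(b"<html><head><title>Example</title></head></html>")

# The first read consumes the whole stream, much like preloading does.
first_read = stream.read()

# A second read starts at the end of the stream, so nothing is left.
second_read = stream.read()

print(len(first_read))  # length of the full payload
print(second_read)      # b''
```

Anything that calls .read() on the exhausted stream, a parser included, sees the same empty bytes.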

If you want to use urllib3 response objects as file-like objects, you'll need to disable content preloading, like this:

response = http.request('GET', 'http://www.crummy.com/software/BeautifulSoup/', preload_content=False)
soup = BeautifulSoup(response)  # We can pass the original `response` object now.

Now it should work as you expected.

I understand that this is not very obvious behaviour, and as the author of urllib3 I apologize. :) We plan to make preload_content=False the default someday. Perhaps someday soon (I opened an issue here).

--

A quick note on .urlopen vs .request:

.urlopen assumes that you will take care of encoding any parameters passed to the request. In this case it's fine to use .urlopen because you're not passing any parameters to the request, but in general .request will do all the extra work for you so it's more convenient.
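
To illustrate what "taking care of encoding" means, here is a minimal sketch that builds a query string by hand with the standard library, which is roughly the work you would have to do yourself before calling .urlopen. The base URL and the parameter names (q, page) are hypothetical and chosen only for illustration.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and parameters, for illustration only.
base_url = "http://example.com/search"
params = {"q": "beautiful soup", "page": 2}

# Percent-encode the parameters and append them as a query string.
encoded = urlencode(params)
full_url = f"{base_url}?{encoded}"

print(full_url)  # http://example.com/search?q=beautiful+soup&page=2
```

With .request, passing the same dictionary as the fields argument of a GET request should give an equivalent URL without the manual step.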

If anyone would be up for improving our documentation to this effect, that would be greatly appreciated. :) Please send a PR to https://github.com/shazow/urllib3 and add yourself as a contributor!

As shown, it's clear that urlopen() returns an HTTP response which is captured by the variable content…

What you've called content isn't the content, but a file-like object that you can read the content from. BeautifulSoup is perfectly happy taking such a thing, but it's not very helpful to print it out for debugging purposes. So, let's actually read the content out of it to make this easier to debug:

>>> response = connectBuilder.urlopen('GET', 'http://www.crummy.com/software/BeautifulSoup/')
>>> response
<urllib3.response.HTTPResponse object at 0x00000000032EC390>
>>> content = response.read()
>>> content
b''

This should make it pretty clear that BeautifulSoup is not the problem here. But continuing on:

… but after it's passed into Beautiful Soup, the web data doesn't get converted into a Beautiful Soup object (variable soup).

Yes it does. The fact that soup.title gave you None instead of raising an AttributeError is pretty good evidence, but you can test it directly:

>>> type(soup)
bs4.BeautifulSoup

That's definitely a BeautifulSoup object.

When you pass BeautifulSoup an empty string, exactly what you get back will depend on which parser it's using under the covers. For example, html5lib builds an html node with an empty head, an empty body, and nothing else, while Python's built-in html.parser gives you an empty document. Either way, when you look for a title node, there isn't one, and you get None.
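
The debugging step used above (read the body yourself and confirm it is not empty before blaming the parser) can be wrapped in a small helper. This is only a sketch: read_nonempty_body is a hypothetical name, and io.BytesIO stands in for a real response object so the example runs offline.

```python
import io


def read_nonempty_body(response):
    """Read a file-like response and fail loudly if the body is empty."""
    body = response.read()
    if not body:
        raise ValueError("response body is empty; was it already consumed?")
    return body


# A fresh stream yields its payload.
fresh = io.BytesIO(b"<title>Example</title>")
print(read_nonempty_body(fresh))  # b'<title>Example</title>'

# The same stream is now exhausted, so the helper raises instead of
# silently handing an empty document to the parser.
try:
    read_nonempty_body(fresh)
except ValueError as exc:
    print(exc)
```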


So, how do you fix this?

As the documentation says, you're using "the lowest level call for making a request, so you’ll need to specify all the raw details." What are those raw details? Honestly, if you don't already know, you shouldn't be using this method. Teaching you how to deal with the under-the-hood details of urllib3 before you even know the basics would not be doing you a service.

In fact, you really don't need urllib3 here at all. Just use the modules that come with Python:

>>> # on Python 2.x, instead do: from urllib2 import urlopen 
>>> from urllib.request import urlopen
>>> r = urlopen('http://www.crummy.com/software/BeautifulSoup/')
>>> soup = BeautifulSoup(r)
>>> soup.title.text
'Beautiful Soup: We called him Tortoise because he taught us.'

My Beautiful Soup code was working in one environment (my local machine) and returning an empty list in another one (an Ubuntu 14 server).

I resolved my problem by changing the installation. Details are in another thread:

Html parsing with Beautiful Soup returns empty list

Licensed under: CC-BY-SA with attribution