Why is my link extraction not working?

https://stackoverflow.com/questions/18289295

24-06-2022

Question

I am learning Beautiful Soup and trying to extract all links from the page http://www.popsci.com ... but I am getting a TypeError.

This code should work, but it doesn't for any page I try it on. I am trying to find out why exactly it is not working.

Here's my code:

from BeautifulSoup import BeautifulSoup
import urllib2

url="http://www.popsci.com/"

page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

sci=soup.findAll('a')

for eachsci in sci:
    print eachsci['href']+","+eachsci.string

... and here's the error I get:

Traceback (most recent call last):
  File "/root/Desktop/3.py", line 12, in <module>
    print eachsci['href']+","+eachsci.string
TypeError: coercing to Unicode: need string or buffer, NoneType found
[Finished in 1.3s with exit code 1]

Solution

When an a element does not contain exactly one text child (for example, it is empty or wraps only an image), eachsci.string is None - and you can't concatenate None with a string using the + operator, as you're trying to do.

If you replace eachsci.string with eachsci.text, that error is solved, because eachsci.text contains the empty string '' when the a element is empty, and there's no problem concatenating that with another string.

However, you'll run into another problem when you hit an a element with no href attribute - when that happens, you'll get a KeyError.

You can solve that using dict.get(), which is able to return a default value if a key isn't in a dictionary (the a element is pretending to be a dictionary, so this works).
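The difference between the two lookups can be seen with a plain dict standing in for a tag's attributes. This is only a sketch in Python 3 syntax; attrs and describe_href are hypothetical names used for illustration, not part of Beautiful Soup.

```python
# A plain dict standing in for the attributes of an <a> tag that has
# no href (hypothetical data, for illustration only).
attrs = {"name": "top"}


def describe_href(attributes):
    """Return the href value, or a placeholder when the key is missing."""
    return attributes.get("href", "[no href found]")


# Square-bracket lookup raises KeyError when the key is missing.
try:
    attrs["href"]
except KeyError as exc:
    missing_key = exc.args[0]

print(missing_key)
# .get() returns the default instead of raising.
print(describe_href(attrs))
print(describe_href({"href": "/science"}))
```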

Putting all that together, here's a variation on your for loop that works:

for eachsci in sci:
    print eachsci.get('href', '[no href found]') + "," + eachsci.text
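The code above is Python 2 with BeautifulSoup 3. For readers on Python 3, the same fix might look roughly like the sketch below. It assumes the bs4 package (Beautiful Soup 4) is installed, and it parses a small inline HTML sample (hypothetical, made up for this example) instead of fetching http://www.popsci.com/, so it runs without network access. SAMPLE_HTML and extract_links are illustrative names, not anything from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup covering the three cases discussed above:
# a normal link, a link with no text, and an anchor with no href.
SAMPLE_HTML = """
<a href="/science">Science</a>
<a href="/logo"><img src="logo.png"></a>
<a name="top">Back to top</a>
"""


def extract_links(markup):
    """Return one 'href,text' string per <a> element in the markup."""
    soup = BeautifulSoup(markup, "html.parser")
    rows = []
    for anchor in soup.find_all("a"):
        # .get() falls back to a placeholder instead of raising KeyError.
        href = anchor.get("href", "[no href found]")
        # get_text() returns '' rather than None when there is no text.
        text = anchor.get_text()
        rows.append(href + "," + text)
    return rows


for row in extract_links(SAMPLE_HTML):
    print(row)
```

In Beautiful Soup 4, find_all and get_text are the documented spellings of findAll and .text; the older names still work, so the loop from the answer carries over almost unchanged apart from print becoming a function.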

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow