Question

So I've scraped websites before, but this time I am stumped. I am attempting to search for a person on Biography.com and retrieve his/her biography. But whenever I search the site using urllib2 and query the URL: http://www.biography.com/search/ I get a blank page with no data in it.

When I look into the source generated in the browser by clicking View Source, I still do not see any data. When I use Chrome's developer tools, I find some data but still no links leading to the biography.

I have tried changing the User-Agent, adding a Referer header, and using cookies in Python, but to no avail. If someone could help me out with this task it would be really helpful.

I am planning to use this text for my NLP project and worst case, I'll have to manually copy-paste the text. But I hope it doesn't come to that.


Solution

Chrome/Chromium's Developer Tools (or Firebug) is definitely your friend here. I can see that the initial search on Biography's site is made via a call to a Google API, e.g.

https://www.googleapis.com/customsearch/v1?q=Barack%20Obama&key=AIzaSyCMGfdDaSfjqv5zYoS0mTJnOT3e9MURWkU&cx=011223861749738482324%3Aijiqp2ioyxw&num=8&callback=angular.callbacks._0

The search term I used is in the q= part of the query string: q=Barack%20Obama.
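As a rough illustration of building that query string, here is a minimal sketch using urllib.parse from Python 3 (the successor to the urllib2 module mentioned in the question). The function name build_search_url and its parameter names are made up for this example, and the key and cx values are whatever shows up in your own Developer Tools session.

```python
from urllib.parse import quote, urlencode

# Endpoint taken from the request observed in Developer Tools.
SEARCH_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(name, key, cx, num=8):
    """Return the search URL for `name` (hypothetical helper).

    `key` and `cx` are the values seen in the browser's request.
    The callback parameter is left out on the assumption that the
    endpoint then answers with plain JSON instead of a JSONP wrapper.
    """
    params = {"q": name, "key": key, "cx": cx, "num": num}
    # quote_via=quote encodes spaces as %20, matching the observed URL.
    return SEARCH_ENDPOINT + "?" + urlencode(params, quote_via=quote)
```

Calling build_search_url("Bill Clinton", key, cx) would produce a URL whose query string starts with q=Bill%20Clinton.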

This returns JSON in which the key link holds the URL of the article of interest:

"link": "http://www.biography.com/people/barack-obama-12782369"
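A minimal sketch of pulling those link values out of the response might look like the following. It assumes the results sit in a top-level "items" list (only the "link" key is confirmed above), and it tolerates a JSONP wrapper in case the callback parameter is kept in the request. The helper names strip_jsonp and result_links are hypothetical.

```python
import json


def strip_jsonp(text):
    """Return the JSON payload from `text`, unwrapping callback(...) if present."""
    text = text.strip()
    if text.startswith(("{", "[")):
        return text  # already plain JSON
    start = text.find("(")
    end = text.rfind(")")
    if start == -1 or end <= start:
        raise ValueError("response is neither JSON nor JSONP")
    return text[start + 1:end]


def result_links(text):
    """List the "link" values of the search results.

    Assumption: results live under a top-level "items" list.
    """
    data = json.loads(strip_jsonp(text))
    return [item["link"] for item in data.get("items", []) if "link" in item]
```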

Visiting that page shows me that this is generated by a request to:

http://api.saymedia-content.com/:apiproxy-anon/content-sites/cs01a33b78d5c5860e/content-customs/@published/@by-custom-type/ContentPerson/@by-slug/barack-obama-12782369

which returns JSON containing HTML.

So, replacing the final path segment of the saymedia-content URL (barack-obama-12782369) with the corresponding segment from the biography.com link for the person of interest may well pull out what you want.
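The substitution could be sketched like this. CONTENT_BASE is copied from the request shown above; slug_from_link and content_url are hypothetical helper names, and the sketch assumes the slug is always the last path segment of the biography.com link.

```python
from urllib.parse import urlsplit

# Everything up to and including "@by-slug/" from the observed request.
CONTENT_BASE = (
    "http://api.saymedia-content.com/:apiproxy-anon/content-sites/"
    "cs01a33b78d5c5860e/content-customs/@published/@by-custom-type/"
    "ContentPerson/@by-slug/"
)


def slug_from_link(link):
    """Return the last path segment of a biography.com people link."""
    path = urlsplit(link).path
    return path.rstrip("/").rsplit("/", 1)[-1]


def content_url(link):
    """Build the saymedia-content URL for a biography.com people link."""
    return CONTENT_BASE + slug_from_link(link)
```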

To implement:

  1. You'll need to use urllib2 (or requests) to do the search via their Google API call, using urllib2.urlopen(url) or requests.get(url). Replace the Barack%20Obama with a URL-escaped search string, e.g. Bill%20Clinton.
  2. Parse the JSON using Python's json module to extract the string that gives you the http://www.biography.com/people link. From this, extract the final segment of the link (barack-obama-12782369 in the example above).
  3. Use urllib2 or requests to do a saymedia-content API request replacing barack-obama-12782369 after @by-slug/ with whatever you extract from 2; i.e. do another urllib2.urlopen on this URL.
  4. Parse the JSON from the response of this second request to extract the content you want.

(Caveat: This is provided that there are no session-based strings in those two API calls that might expire.)
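For steps 3 and 4, a minimal Python 3 sketch could look like the following. The layout of the saymedia-content JSON is not documented here, so rather than guessing key names the sketch walks the whole decoded tree and collects every string that looks like HTML, then reduces it to plain text with the standard library's HTMLParser. All function names are hypothetical, and fetch_json is only one way to make the request.

```python
import json
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def fetch_json(url, timeout=10):
    """Download `url` and decode the body as JSON (assumes UTF-8)."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def html_strings(node):
    """Yield every string in a decoded JSON tree that appears to hold markup."""
    if isinstance(node, dict):
        for value in node.values():
            yield from html_strings(value)
    elif isinstance(node, list):
        for value in node:
            yield from html_strings(value)
    elif isinstance(node, str) and "<" in node and ">" in node:
        yield node


class _TextExtractor(HTMLParser):
    """Collect the text between tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(fragment):
    """Strip tags from an HTML fragment and collapse whitespace."""
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()
    return " ".join("".join(parser.parts).split())
```

With the saymedia-content URL for a person in hand, something like `for fragment in html_strings(fetch_json(url)): print(html_to_text(fragment))` would print the text of each HTML fragment, which should be a reasonable starting point for an NLP corpus.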

Alternatively, you can use Selenium to visit the website, do the search and then extract the content.

Other tips

You will most likely need to manually copy and paste, as biography.com is a completely JavaScript-based site, so it can't be scraped with traditional methods.

You can discover an API URL with HttpFox (a Firefox add-on). For example, http://www.biography.com/.api/item/search?config=published&query=marx returns JSON that you can search for /people/ to retrieve biography links. Alternatively, you can use a browser-automation tool such as Selenium.
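A minimal sketch of that idea follows. Because the structure of the returned JSON is not described here, it simply scans the raw response text for /people/ paths instead of relying on particular keys. The names search_url and people_paths are hypothetical, and the regular expression assumes slugs consist of letters, digits, and hyphens.

```python
import re
from urllib.parse import urlencode

SEARCH_BASE = "http://www.biography.com/.api/item/search"

# Assumption: slugs are made of letters, digits, and hyphens.
PEOPLE_PATH = re.compile(r"/people/[A-Za-z0-9-]+")


def search_url(query):
    """Build the search URL for `query` using the parameters seen above."""
    return SEARCH_BASE + "?" + urlencode({"config": "published", "query": query})


def people_paths(raw_json):
    """Return the unique /people/ paths found in the raw response text."""
    # JSON encoders may escape "/" as "\/", so undo that before matching.
    text = raw_json.replace("\\/", "/")
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(PEOPLE_PATH.findall(text)))
```

Each returned path can be appended to http://www.biography.com to get a biography link.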

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow