Scraping Biography.com using urllib2

Question 1

Chrome/Chromium's Developer Tools (or Firebug) is definitely your friend here. I can see that the initial search on Biography's site is made via a call to a Google API, e.g.

https://www.googleapis.com/customsearch/v1?q=Barack%20Obama&key=AIzaSyCMGfdDaSfjqv5zYoS0mTJnOT3e9MURWkU&cx=011223861749738482324%3Aijiqp2ioyxw&num=8&callback=angular.callbacks._0

The search term I used is in the q= part of the query string: q=Barack%20Obama.

This returns JSON inside of which there is a key link with the value of the article of interest's URL.

"link": "http://www.biography.com/people/barack-obama-12782369"

Visiting that page shows me that this is generated by a request to:

http://api.saymedia-content.com/:apiproxy-anon/content-sites/cs01a33b78d5c5860e/content-customs/@published/@by-custom-type/ContentPerson/@by-slug/barack-obama-12782369

which returns JSON containing HTML.

So, replacing the last part of the link barack-obama-12782369 with the relevant info for the person of interest in the saymedia-content link may well pull out what you want.

To implement:

You'll need to use urllib2 (or requests) to do the search via their Google API call, using urllib2.urlopen(url) or requests.get(url). Replace the Barack%20Obama with a URL escaped search string, e.g. Bill%20Clinton.
Parse the JSON using Python's json module to extract the string that gives you the http://www.biography.com/people link. From this, extract the part of this link of interest (as barack-obama-12782369 above).
Use urllib2 or requests to do a saymedia-content API request replacing barack-obama-12782369 after @by-slug/ with whatever you extract from 2; i.e. do another urllib2.urlopen on this URL.
Parse the JSON from the response of this second request to extract the content you want.

(Caveat: This is provided that there are no session-based strings in those two API calls that might expire.)

Alternatively, you can use Selenium to visit the website, do the search and then extract the content.

Question 2

You will most likely need to manually copy and paste, as biography.com is a completely javascript-based site, so it can't be scraped with traditional methods.

Question 3

You can discover an api url with httpfox (firefox addon). f.e. http://www.biography.com/.api/item/search?config=published&query=marx brings you a json you can process searching for /people/ to retrive biography links. Or you can use an screen crawler like selenium