Question

I have to scrape all the info on former US governors from this site. However, to read out the results and then follow the links, I need to access the different results pages, or, preferably, simply set the results limit shown per page to the maximum of 100 (I don't think there are more than 100 results for any state). The problem is that the page-size control seems to use JavaScript: it is not part of a form, and it seems I cannot access it as a mechanize control.

Any advice on how to proceed? I'm pretty new to Python and only use it for tasks like this from time to time. Here is some simple code that iterates through the main form.

import mechanize
import lxml.html
import csv

site = "http://www.nga.org/cms/FormerGovBios"
output = csv.writer(open(r'output.csv','wb'))
br = mechanize.Browser()

response = br.open(site)
br.select_form(name="governorsSearchForm")
states = br.find_control(id="states-field", type="select").items
for pos, item in enumerate(states[1:2]): 
    statename = str([label.text for label in item.get_labels()])
    print pos, item.name, statename, len(states)
    br.select_form(name="governorsSearchForm")
    br["state"] = [item.name]
    response = br.submit(name="submit", type="submit")
    # now set the page limit to 100, get links and descriptions,
    # and follow each link to get the information
    for form in br.forms():
        print "Form name:", form.name
        print form, "\n"
    for link in br.links():
        print link.text, link.url

Solution 3

OK, this is a screwball approach. Playing around with the different search settings, I found that the number of results to display is in the URL, so I changed it to 3000 per page; that way everything fits on one page.

http://www.nga.org/cms/FormerGovBios?begincac77e09-db17-41cb-9de0-687b843338d0=0&higherOfficesServed=&lastName=&sex=Any&honors=&submit=Search&state=Any&college=&party=&inOffice=Any&biography=&race=Any&birthState=Any&religion=&militaryService=&firstName=&nbrterms=Any&warsServed=&&pagesizecac77e09-db17-41cb-9de0-687b843338d0=3000

After it loads (which does take a while), I right-click and choose "View Page Source", then copy that into a text file on my computer. Then I can scrape the info I need from the file without going back to the server and having to process the JavaScript.
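The URL rewrite described above can also be done programmatically instead of by hand. This is a minimal sketch (Python 3 standard library only; the GUID-suffixed parameter name is copied from the URL above, and the shortened `search_url` is just an illustration) that bumps the page-size parameter before fetching:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def set_query_param(url, name, value):
    """Return `url` with the query parameter `name` set to `value`."""
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    params[name] = str(value)
    return urlunsplit(parts._replace(query=urlencode(params)))

# Abbreviated version of the search URL, for illustration.
search_url = ("http://www.nga.org/cms/FormerGovBios"
              "?state=Any&sex=Any&submit=Search"
              "&pagesizecac77e09-db17-41cb-9de0-687b843338d0=50")

# Bump the page size so all results fit on one page.
big_page = set_query_param(
    search_url, "pagesizecac77e09-db17-41cb-9de0-687b843338d0", 3000)
print(big_page)
```

The resulting URL can then be fetched once and the response saved locally for parsing.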

May I recommend BeautifulSoup for getting around in the HTML file.
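Once the full results page is saved locally, extracting the bio links with BeautifulSoup might look like the sketch below. The sample markup is invented for illustration; the real page's structure and `href` patterns will differ, so inspect the actual source first.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the saved results page; the real markup will differ.
html = """
<table>
  <tr><td><a href="/cms/FormerGovBios/gov-smith">John Smith</a></td></tr>
  <tr><td><a href="/cms/FormerGovBios/gov-doe">Jane Doe</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# Collect (link text, href) pairs for every anchor that has an href.
governors = [(a.get_text(strip=True), a["href"])
             for a in soup.find_all("a", href=True)]
for name, url in governors:
    print(name, url)
```

With the real file, replace the `html` string with `open("results.html").read()` and narrow the `find_all` filter to match the actual link pattern.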

OTHER TIPS

I solved this problem with Selenium. It drives a complete Firefox (or other) browser, which you can manipulate from code.

You can use PySide, which provides bindings for QtWebKit. With QtWebKit you can retrieve a page that uses JavaScript and parse it once the JavaScript has populated the HTML, so you don't need to know any JavaScript yourself. Other alternatives are Selenium and PhantomJS.

I would do that with PhantomJS (http://phantomjs.org/, JavaScript); see https://github.com/ariya/phantomjs/wiki/Page-Automation

Note that the select element on that page changes the window.location.

I think you can construct an appropriate URI to load the page simply by replacing the value of $('#pageSizeSelector....-..-..-..-....').val() with the value you need.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow