Question

I have to scrape all the info on former US governors from this site. However, to read out the results and then follow the links, I need to access the different results pages, or, preferably, simply set the results limit shown per page to the maximum of 100 (I don't think there are more than 100 results for any state). The problem is that the page-size control seems to use JavaScript: it is not part of a form, and it seems I cannot access it as a mechanize control.

Any advice on how to proceed? I'm pretty new to Python and only use it for tasks like this from time to time. Here is some simple code that iterates through the main form.

import mechanize
import lxml.html
import csv

site = "http://www.nga.org/cms/FormerGovBios"
output = csv.writer(open(r'output.csv','wb'))
br = mechanize.Browser()

response = br.open(site)
br.select_form(name="governorsSearchForm")
states = br.find_control(id="states-field", type="select").items
for pos, item in enumerate(states[1:2]): 
    statename = str([label.text for label in item.get_labels()])
    print pos, item.name, statename, len(states)
    br.select_form(name="governorsSearchForm")
    br["state"] = [item.name]
    response = br.submit(name="submit", type="submit")
    # now set the page limit to 100, get links and descriptions,
    # and follow each link to get the information
    for form in br.forms():
        print "Form name:", form.name
        print form, "\n"
    for link in br.links():
        print link.text, link.url

Solution 3

OK, this is a screwball approach. Playing around with the different search settings, I found that the number of results to display is in the URL, so I changed it to 3000 per page; that way everything fits on one page.

http://www.nga.org/cms/FormerGovBios?begincac77e09-db17-41cb-9de0-687b843338d0=0&higherOfficesServed=&lastName=&sex=Any&honors=&submit=Search&state=Any&college=&party=&inOffice=Any&biography=&race=Any&birthState=Any&religion=&militaryService=&firstName=&nbrterms=Any&warsServed=&&pagesizecac77e09-db17-41cb-9de0-687b843338d0=3000

After it loads (which does take a while), I right-click and choose "View Page Source", then copy that into a text file on my computer. Then I can scrape the info I need from the file without going back to the server and having to process the JavaScript.
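The URL rewrite described above can also be done programmatically instead of by hand. This is a minimal sketch (Python 3 standard library only; the GUID-suffixed parameter name is copied from the URL above, and the shortened `search_url` is just an illustration) that bumps the page-size parameter before fetching:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def set_query_param(url, name, value):
    """Return `url` with the query parameter `name` set to `value`."""
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    params[name] = str(value)
    return urlunsplit(parts._replace(query=urlencode(params)))

# Abbreviated version of the search URL, for illustration.
search_url = ("http://www.nga.org/cms/FormerGovBios"
              "?state=Any&sex=Any&submit=Search"
              "&pagesizecac77e09-db17-41cb-9de0-687b843338d0=50")

# Bump the page size so all results fit on one page.
big_page = set_query_param(
    search_url, "pagesizecac77e09-db17-41cb-9de0-687b843338d0", 3000)
print(big_page)
```

The resulting URL can then be fetched once and the response saved locally for parsing.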

May I recommend BeautifulSoup for getting around in the HTML file.
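Once the full results page is saved locally, extracting the bio links with BeautifulSoup might look like the sketch below. The sample markup is invented for illustration; the real page's structure and `href` patterns will differ, so inspect the actual source first.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the saved results page; the real markup will differ.
html = """
<table>
  <tr><td><a href="/cms/FormerGovBios/gov-smith">John Smith</a></td></tr>
  <tr><td><a href="/cms/FormerGovBios/gov-doe">Jane Doe</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# Collect (link text, href) pairs for every anchor that has an href.
governors = [(a.get_text(strip=True), a["href"])
             for a in soup.find_all("a", href=True)]
for name, url in governors:
    print(name, url)
```

With the real file, replace the `html` string with `open("results.html").read()` and narrow the `find_all` filter to match the actual link pattern.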

OTHER TIPS

I solved this problem with Selenium. It drives a complete Firefox (or other) browser, which you can manipulate from code.

You can use PySide, which provides bindings for QtWebKit. With QtWebKit you can retrieve a page that uses JavaScript and parse it once the JavaScript has populated the HTML, so you don't need to know any JavaScript yourself. Other alternatives are Selenium and PhantomJS.

I would do that with PhantomJS (http://phantomjs.org/, JavaScript); see https://github.com/ariya/phantomjs/wiki/Page-Automation

Note that the select element on that page changes the window.location.

I think you can construct an appropriate URI to load the page simply by replacing the value of $('#pageSizeSelector....-..-..-..-....').val() with the value you need.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow