I have to scrape all the information on former US governors from this site. To read the results and then follow the links, I need to either access each results page or, preferably, set the number of results shown per page to the maximum of 100 (I don't think any state has more than 100 results). However, the page-size control seems to be driven by JavaScript and is not part of a form, so I cannot access it as a control.
Any advice on how to proceed? I am fairly new to Python and only use it for tasks like this from time to time. Here is some simple code that iterates through the main form:
import mechanize
import lxml.html
import csv

site = "http://www.nga.org/cms/FormerGovBios"
output = csv.writer(open(r'output.csv', 'wb'))
br = mechanize.Browser()

# open the search page and read the list of states from the dropdown
response = br.open(site)
br.select_form(name="governorsSearchForm")
states = br.find_control(id="states-field", type="select").items

# only one state for now while testing
for pos, item in enumerate(states[1:2]):
    statename = str([label.text for label in item.get_labels()])
    print pos, item.name, statename, len(states)
    br.select_form(name="governorsSearchForm")
    br["state"] = [item.name]
    response = br.submit(name="submit", type="submit")
    # now set page limit to 100, get links and descriptions
    # and follow each link to get information
    for form in br.forms():
        print "Form name:", form.name
        print form, "\n"
    for link in br.links():
        print link.text, link.url
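
For the "get links and descriptions" step mentioned in the comment, here is a minimal sketch of what pulling link text and absolute URLs out of one page of HTML could look like. It does not solve the page-limit problem. It is written for Python 3 using only the standard library (`html.parser` and `urllib.parse`) rather than mechanize or lxml, so it runs on its own. The names `LinkCollector` and `extract_links` and the `SAMPLE` markup are made up for illustration and are not taken from the real results page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (text, absolute_url) pairs for every <a href=...> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # resolve relative links against the page URL
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # collapse runs of whitespace in the link text
            text = " ".join("".join(self._text).split())
            self.links.append((text, self._href))
            self._href = None


def extract_links(html, base_url):
    """Return a list of (link text, absolute URL) tuples found in html."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample markup, NOT copied from the real results page.
SAMPLE = """
<ul>
  <li><a href="/cms/example/one.html">First  Example</a></li>
  <li><a href="two.html"><b>Second</b> Example</a></li>
  <li><a name="no-href">ignored</a></li>
</ul>
"""

for text, url in extract_links(SAMPLE, "http://www.nga.org/cms/FormerGovBios"):
    print(text, url)
```

In the script above, the HTML from `response.read()` and the current URL from `br.geturl()` could presumably be passed to `extract_links` in the same way, although I have not tried that against the real pages.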