Question

I am trying to submit a form on a webpage and get the HTML of the next page after the form is submitted. I found two ways of doing this, using either requests or urllib.

import urllib
import urllib2
import webbrowser

url = 'https://fcq.colorado.edu/UCBdata.htm'

def main():
    data = urllib.urlencode({'subj': 'CSCI', 'crse': '1300'})
    results = urllib2.urlopen(url, data)
    html = results.read()  # read the response once; calling read() again would return an empty string
    print html
    with open("results.html", "w") as f:
        f.write(html)

    webbrowser.open("results.html")
    return 0

if __name__ == '__main__':
    main()

or:

import requests

url = 'https://fcq.colorado.edu/UCBdata.htm'

def main():     
    payload = {'subj': 'CSCI', 'crse': '1300'}
    r = requests.post(url, payload)
    with open("requests_results.html", "w") as f:
        f.write(r.content)  
    return 0

if __name__ == '__main__':
    main()

When I get the page back after the request, it is just the same page that has the form on it, for both methods. I was wondering if I maybe have to do something with the submit button? I am new to web stuff and Python, so any tips or ideas would be greatly appreciated. Thanks!

Here is the html of the submit button:

<input type="submit" name="sub" value="Submit Request" onclick="this.disabled=true,this.form.submit();">


The solution

The form on that page will actually be submitted by JavaScript, so just looking at the <form /> element is not (necessarily) enough. You can use e.g. Firebug's network tab or the Chrome developer tools to inspect the POST request after you submit the form, in order to see what is actually submitted.
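If you want to verify what your own script is sending before comparing it with the browser, requests can build the request without actually firing it. A minimal sketch (Python 2, to match the rest of this post), using only the two visible form fields:

import requests

url = 'https://fcq.colorado.edu/scripts/broker.exe'
payload = {'subj': 'CSCI', 'crse': '1300'}

# Build the POST request without sending it, then inspect the encoded body and
# headers.  Comparing this with the request shown in the browser's network tab
# makes it easy to spot fields that are missing from your payload.
prepared = requests.Request('POST', url, data=payload).prepare()
print prepared.body     # e.g. subj=CSCI&crse=1300 (field order may vary)
print prepared.headers  # includes Content-Type: application/x-www-form-urlencoded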

This seems to work:

import requests

url = 'https://fcq.colorado.edu/scripts/broker.exe'

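# Hidden fields observed by inspecting the POST request in the browser's network tab: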
payload = {
    "_PROGRAM": "fcqlib.fcqdata.sas",
    "_SERVICE": "fcq",
    "camp": "BD",
    "fileFrmt": "HTM",
    "ftrm": "1",
    "fyr": "2007",
    "grp1": "ALL",
    "jjj": "mytst",
    "ltrm": "7",
    "lyr": "2013",
    "sort": "descending YEARTERM SUBJECT COURSE SECTION",
}

payload.update({
    'subj': 'CSCI',
    'crse': '1300',
})


def main():
    r = requests.post(url, payload)
    with open("requests_results.html", "w") as f:
        f.write(r.content)
    return 0

if __name__ == '__main__':
    main()

Other tips

The target URL of the form is https://fcq.colorado.edu/scripts/broker.exe (see the action attribute of the <form> tag). So you need to replace:

url = 'https://fcq.colorado.edu/UCBdata.htm'

with

url = 'https://fcq.colorado.edu/scripts/broker.exe'
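If you would rather not hard-code that, you can also read the action attribute out of the form page itself. A minimal sketch (Python 2, standard library plus requests), assuming the form you care about is the first <form> on the page; HTMLParser can be picky about malformed markup, so treat this as a sketch rather than a robust scraper:

from HTMLParser import HTMLParser
from urlparse import urljoin

import requests

FORM_PAGE = 'https://fcq.colorado.edu/UCBdata.htm'

class FormActionParser(HTMLParser):
    """Collects the action attribute of every <form> tag on the page."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.actions = []

    def handle_starttag(self, tag, attrs):
        if tag == 'form':
            attrs = dict(attrs)
            if 'action' in attrs:
                self.actions.append(attrs['action'])

page = requests.get(FORM_PAGE)
parser = FormActionParser()
parser.feed(page.text)

# Resolve a relative action such as /scripts/broker.exe against the page URL.
post_url = urljoin(FORM_PAGE, parser.actions[0])
print post_url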

I don't know if you're still interested, but I actually did something similar with the exact same site, and had to deal with the exact same problem. I used the Mechanize and Requests libraries to build a Python scraper API of sorts.

You can see my code for it on GitHub, and I welcome any pull requests if you feel like you can do something better.
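For reference, here is a rough sketch of what the Mechanize side of that looks like. This is not the code from that repository; it assumes the search form is the first <form> on the page and that subj and crse are plain text inputs (select-style controls take a list of values instead):

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)   # skip robots.txt handling for this quick test

br.open('https://fcq.colorado.edu/UCBdata.htm')
br.select_form(nr=0)          # assumption: the search form is the first form on the page
br['subj'] = 'CSCI'
br['crse'] = '1300'

# Mechanize submits all of the form's fields (including hidden inputs) for
# you, but it does not run any JavaScript on the page.
response = br.submit()

with open('mechanize_results.html', 'w') as f:
    f.write(response.read())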

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow