Scraping a page from a secure URL which is possibly using a session ID
25-09-2019
Question
How can I scrape a page like this: https://www.procom.ca/JobList.aspx?keywords=&Cities=&reference=&JobType=0
It is served over HTTPS and looks like it requires a referrer. I can't get anything back using wget or httplib2.
If you reach the list through this page, https://www.procom.ca/jobsearch.aspx, it works in a browser but not from the command line.
I am interested in fetching it from the command line.
Solution
As you suspect, it requires a referer. This works:
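import urllib.request

url = 'https://www.procom.ca/JobList.aspx?keywords=&Cities=&reference=&JobType=0'
# The server returns nothing useful without a Referer header, so send one explicitly.
headers = {'Referer': 'http://www.stackoverflow.com'}
req = urllib.request.Request(url, None, headers)
handle = urllib.request.urlopen(req)
# read() returns bytes; decode them before printing.
print(handle.read().decode('utf-8', errors='replace'))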
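If the site also turns out to track a session ID in a cookie (the question only suspects this, it is not confirmed), one option is to visit the search page first with a cookie-aware opener and then request the list with that page as the Referer. The following is a minimal Python 3 sketch under that assumption; the helper names make_opener, make_list_request and fetch_job_list are made up for illustration, and only the two URLs come from the question.

```python
import http.cookiejar
import urllib.request

SEARCH_URL = 'https://www.procom.ca/jobsearch.aspx'
LIST_URL = 'https://www.procom.ca/JobList.aspx?keywords=&Cities=&reference=&JobType=0'


def make_opener():
    """Return (opener, jar); the opener stores cookies and resends them."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def make_list_request(referer=SEARCH_URL):
    """Build the job-list request with the Referer a browser would send."""
    return urllib.request.Request(LIST_URL, headers={'Referer': referer})


def fetch_job_list():
    """Visit the search page, then fetch the list with the same cookies."""
    opener, _jar = make_opener()
    # First request: picks up any session cookie the site sets (assumed).
    opener.open(SEARCH_URL).read()
    # Second request: cookies are sent automatically, Referer is set by hand.
    with opener.open(make_list_request()) as resp:
        return resp.read().decode('utf-8', errors='replace')
```

Calling print(fetch_job_list()) performs both requests. If the plain Referer version above already returns the list, the cookie handling is unnecessary.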
OTHER TIPS
What data are you sending in the POST or GET request? I would recommend looking through the POST/GET messages in the Firebug Net panel. That page contains many hidden values which I think are time-dependent, change on each page load, and may be valid only once. So load the page, extract those values, and send them back with your POST request. For example, see these:
<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />
<input type="hidden" name="__EVENTARGUMENT" id="__EVENTARGUMENT" value="" />
<input type="hidden" name="__LASTFOCUS" id="__LASTFOCUS" value="" />
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwULLTEwODIzNjMxMzEPFgIeEUdyaWRTb3J0RGlyZWN0aW9uCyo..." />
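The load-extract-resend idea above can be sketched with the Python 3 standard library alone. This is an illustration rather than a tested recipe for this site: the names HiddenInputParser, hidden_fields and build_post_body are hypothetical, and which extra form fields the server expects is an assumption you would need to confirm in the browser's network panel.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenInputParser(HTMLParser):
    """Collect the name/value pair of every <input type="hidden"> tag."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <input ... />.
        if tag != 'input':
            return
        attr = dict(attrs)
        if (attr.get('type') or '').lower() == 'hidden' and attr.get('name'):
            self.fields[attr['name']] = attr.get('value') or ''


def hidden_fields(html):
    """Return a dict of the hidden form fields found in an HTML string."""
    parser = HiddenInputParser()
    parser.feed(html)
    parser.close()
    return parser.fields


def build_post_body(html, extra=None):
    """URL-encode the page's hidden fields plus any extra form values."""
    fields = hidden_fields(html)
    fields.update(extra or {})  # extra: your own search values (assumed names)
    return urlencode(fields).encode('ascii')
```

The resulting bytes can be passed as the data argument of urllib.request.Request to turn the request into a POST. Because the hidden values may be valid only once, fetch the page again before each submission instead of reusing an old body.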