Question

I am trying to download the CSV files from this page, via a python script.

But when I try to access a CSV file directly via its link in my browser, an agreement form is displayed, and I have to accept it before I am allowed to download the file.

The exact URLs of the CSV files can't be retrieved; instead, a value such as PERIOD_ID=2013-0 is sent to the backend database, which fetches the file:

https://www.paoilandgasreporting.state.pa.us/publicreports/Modules/DataExports/ExportProductionData.aspx?PERIOD_ID=2013-0

I've tried urllib2.urlopen() and read() on the response, but that returns the HTML content of the agreement form, not the file content.
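For reference, the kind of call described above looks roughly like this (a sketch only; the question doesn't include the exact code):

import urllib2

# Sketch of the failing attempt (assumed): the response body turns out to be
# the agreement form's HTML rather than the CSV data.
url = ('https://www.paoilandgasreporting.state.pa.us/publicreports/Modules/'
       'DataExports/ExportProductionData.aspx?PERIOD_ID=2013-0')
response = urllib2.urlopen(url)
html = response.read()
print html[:200]  # Shows the start of the Agreement.aspx page, not CSV rows.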

How do I write a Python script that handles this redirect, fetches the CSV file, and saves it to disk?


Solution 2

Here's my suggestion for automatically applying the server's cookies, basically mimicking standard client session behavior.

(Shamelessly inspired by @pope's answer 554580.)

import urllib2
import urllib
from lxml import etree

_TARGET_URL = 'https://www.paoilandgasreporting.state.pa.us/publicreports/Modules/DataExports/ExportProductionData.aspx?PERIOD_ID=2013-0'
_AGREEMENT_URL = 'https://www.paoilandgasreporting.state.pa.us/publicreports/Modules/Welcome/Agreement.aspx'
_CSV_OUTPUT = 'urllib2_ProdExport2013-0.csv'


class _MyHTTPRedirectHandler(urllib2.HTTPRedirectHandler):

    def http_error_302(self, req, fp, code, msg, headers):
        print 'Follow redirect...'  # Any cookie manipulation in-between redirects should be implemented here.
        return urllib2.HTTPRedirectHandler.http_error_302(self, req, fp, code, msg, headers)

    http_error_301 = http_error_303 = http_error_307 = http_error_302

cookie_processor = urllib2.HTTPCookieProcessor()

opener = urllib2.build_opener(_MyHTTPRedirectHandler, cookie_processor)
urllib2.install_opener(opener)

response_html = urllib2.urlopen(_TARGET_URL).read()

print 'Cookies collected:', cookie_processor.cookiejar

page_node, submit_form = etree.HTML(response_html), {}  # ElementTree node + dict for storing hidden input fields.
for input_name in ['ctl00$MainContent$AgreeButton', '__EVENTVALIDATION', '__VIEWSTATE']:  # Form `input` fields used on the ``Agreement.aspx`` page.
    submit_form[input_name] = page_node.xpath('//input[@name="%s"][1]' % input_name)[0].attrib['value']
    print 'Form input \'%s\' found (value: \'%s\')' % (input_name, submit_form[input_name])

# Submits the agreement form back to ``_AGREEMENT_URL``, which redirects to the CSV download at ``_TARGET_URL``.
csv_output = opener.open(_AGREEMENT_URL, data=urllib.urlencode(submit_form)).read()
print csv_output

with open(_CSV_OUTPUT, 'wb') as f:  # Dumps the CSV output to ``_CSV_OUTPUT``; the ``with`` block closes the file.
    f.write(csv_output)

Good luck!

[Edit]

On the why of things, I think @Steinar Lima is correct that a session cookie is required. However, unless you've already visited the Agreement.aspx page and submitted a response via the provider's website, the cookie you copy from the browser's web inspector will only result in another redirect to the "Welcome to the PA DEP Oil & Gas Reporting Website" welcome page, which of course defeats the whole point of having a Python script do the job for you.
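If you want to verify that the scripted flow above really did pick up the session cookie (instead of relying on a value copied from the browser), a check along these lines could be added after the agreement form is submitted; the ASP.NET_SessionId name is taken from the other answer:

# Hypothetical check, reusing ``cookie_processor`` from the script above:
# confirm the server-issued session cookie is present in the jar.
for cookie in cookie_processor.cookiejar:
    if cookie.name == 'ASP.NET_SessionId':
        print 'Session cookie set by server:', cookie.value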

Other tips

You need to set the ASP.NET_SessionId cookie. You can find this by using Chrome's Inspect element option in the context menu, or by using Firefox and the Firebug extension.

With Chrome:

  1. Right-click on the webpage (after you've agreed to the terms) and select Inspect element
  2. Click Resources -> Cookies
  3. Select the only element in the list
  4. Copy the Value of the ASP.NET_SessionId element

With Firebug:

  1. Right-click on the webpage (after you've agreed to the terms) and click Inspect Element with Firebug
  2. Click Cookies
  3. Copy the Value of the ASP.NET_SessionId element

In my case, I got ihbjzynwfcfvq4nzkncbviou; it might work for you, but if not, you'll need to perform the procedure above yourself.

Add the cookie to your request, and download the file using the requests module (based on an answer by eladc):

import requests

cookies = {'ASP.NET_SessionId': 'ihbjzynwfcfvq4nzkncbviou'}
r = requests.get(
    url=('https://www.paoilandgasreporting.state.pa.us/publicreports/Modules/'
         'DataExports/ExportProductionData.aspx?PERIOD_ID=2013-0'),
    cookies=cookies
)

with open('2013-0.csv', 'wb') as ofile:
    for chunk in r.iter_content(chunk_size=1024):
        ofile.write(chunk)
        ofile.flush()
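For completeness, the two approaches can be combined: a requests.Session collects the server's cookies automatically, so the agreement form can be submitted in code instead of copying ASP.NET_SessionId by hand. A rough sketch, assuming the same hidden form fields as in the urllib2 answer above:

import requests
from lxml import etree

TARGET_URL = ('https://www.paoilandgasreporting.state.pa.us/publicreports/Modules/'
              'DataExports/ExportProductionData.aspx?PERIOD_ID=2013-0')
AGREEMENT_URL = ('https://www.paoilandgasreporting.state.pa.us/publicreports/Modules/'
                 'Welcome/Agreement.aspx')

session = requests.Session()  # Keeps ASP.NET_SessionId and any other cookies across requests.

# The first request is redirected to the agreement page; parse its hidden form fields.
agreement_html = session.get(TARGET_URL).content
page = etree.HTML(agreement_html)
form = {}
for name in ['ctl00$MainContent$AgreeButton', '__EVENTVALIDATION', '__VIEWSTATE']:
    form[name] = page.xpath('//input[@name="%s"][1]' % name)[0].attrib['value']

# Submitting the form should redirect to the CSV download, as in the urllib2 answer.
r = session.post(AGREEMENT_URL, data=form)
with open('2013-0.csv', 'wb') as ofile:
    for chunk in r.iter_content(chunk_size=1024):
        ofile.write(chunk)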
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow