Pergunta

I am looking for a simple way to save a csv file originating from a published Google Sheets document? Since it's published, it's accessible through a direct link (modified on purpose in the example below).

All my browsers will prompt me to save the csv file as soon as I launch the link.

Neither:

DOC_URL = 'https://docs.google.com/spreadsheet/ccc?key=0AoOWveO-dNo5dFNrWThhYmdYW9UT1lQQkE&output=csv'    

f = urllib.request.urlopen(DOC_URL)
cont = f.read(SIZE)
f.close()
cont = str(cont, 'utf-8')
print(cont)

, nor:

req = urllib.request.Request(DOC_URL)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.13 (KHTML, like Gecko) Chrome/24.0.1284.0 Safari/537.13')
f = urllib.request.urlopen(req)
print(f.read().decode('utf-8'))

print anything but html content.

(Tried the 2nd version after reading this other post: Download google docs public spreadsheet to csv with python .)

Any idea on what I am doing wrong? I am logged out of my Google account, if that worths to anything, but this works from any browser that I tried. As far as I understood, the Google Docs API is not yet ported on Python 3 and given the "toy" magnitude of my little project for personal use, it would not even make too much sense to use it from the get-go, if I can circumvent it.

In the 2nd attempt, I left the 'User-Agent', as I was thinking that maybe requests thought as coming from scripts (b/c no identification info is present) might be ignored, but it didn't make a difference.

Foi útil?

Solução 2

Google responds to the initial request with a series of cookie-setting 302 redirects. If you don't store and resubmit the cookies between requests, it redirects you to the login page.

So, the problem is not with the User-Agent header, it's the fact that by default, urllib.request.urlopen doesn't store cookies, but it will follow the HTTP 302 redirects.

The following code works just fine on a public spreadsheet available at the location specified by DOC_URL:

>>> from http.cookiejar import CookieJar
>>> from urllib.request import build_opener, HTTPCookieProcessor
>>> opener = build_opener(HTTPCookieProcessor(CookieJar()))
>>> resp = opener.open(DOC_URL)
>>> # should really parse resp.getheader('content-type') for encoding.
>>> csv_content = resp.read().decode('utf-8')

Having shown you how to do it in vanilla python, I'll now say that the Right Way™ to go about this is to use the most-excellent requests library. It is extremely well documented and makes these sorts of tasks incredibly pleasant to complete.

For instance, to get the same csv_content as above using the requests library is as simple as:

>>> import requests
>>> csv_content = requests.get(DOC_URL).text

That single line expresses your intent more clearly. It's easier to write and easier to read. Do yourself - and anyone else who shares your codebase - a favor and just use requests.

Outras dicas

While the requests library is the gold standard for HTTP requests from Python, this style of download is (while not deprecated yet) not likely to last, specifically referring to the use of links, managing cookies & redirects, etc. One of the reasons for not preferring links is that it's less secure and generally such access should require authorization. Instead, the currently accepted way of exporting Google Sheets as CSV is by using the Google Drive API.

So why the Drive API? Isn't this supposed to be something for the Sheets API instead? Well, the Sheets API is for spreadsheet-oriented functionality, i.e., data formatting, column resize, creating charts, cell validation, etc., while the Drive API is for file-oriented functionality, i.e., import/export, copy, rename, etc.

Below is a complete cmd-line solution. (If you don't do Python, you can use it as pseudocode and pick any language supported by the Google APIs Client Libraries.) For the code snippet, assume the most current Sheet named inventory (older files with that name are ignored) and DRIVE is the API service endpoint:

FILENAME = 'inventory'
SRC_MIMETYPE = 'application/vnd.google-apps.spreadsheet'
DST_MIMETYPE = 'text/csv'

# query for latest file named FILENAME
files = DRIVE.files().list(
    q='name="%s" and mimeType="%s"' % (FILENAME, SRC_MIMETYPE),
    orderBy='modifiedTime desc,name').execute().get('files', [])

# if found, export Sheets file as CSV
if files:
    fn = '%s.csv' % os.path.splitext(files[0]['name'].replace(' ', '_'))[0]
    print('Exporting "%s" as "%s"... ' % (files[0]['name'], fn), end='')
    data = DRIVE.files().export(fileId=files[0]['id'], mimeType=DST_MIMETYPE).execute()

    # if non-empty file
    if data:
        with open(fn, 'wb') as f:
            f.write(data)
        print('DONE')

If your Sheet is large, you may have to export it in chunks -- see this page on how to do that. If you're generally new to Google APIs, I have a (somewhat dated but) user-friendly intro video for you. (There are 2 videos after that which maybe useful too.)

Licenciado em: CC-BY-SA com atribuição
Não afiliado a StackOverflow
scroll top