Question

I recently wrote a Python script that uploads local, newline-delimited JSON files to a BigQuery table. It's very similar to the example provided in the official documentation here. The problem I'm having is that non-ASCII characters in the file I'm trying to upload are making my POST request fail.

Here's the relevant part of the script...

def upload(dataFilePath, loadJob, recipeJSON, logger):
    body = '--xxx\n'
    body += 'Content-Type: application/json; charset=UTF-8\n\n'
    body += loadJob
    body += '\n--xxx\n' 
    body += 'Content-Type: application/octet-stream\n\n'

    dataFile = io.open(dataFilePath, 'r', encoding = 'utf-8')
    body += dataFile.read()
    dataFile.close()

    body += '\n--xxx--\n'

    credentials = buildCredentials(recipeJSON['keyPath'], recipeJSON['accountEmail'])
    http = httplib2.Http()
    http = credentials.authorize(http)
    service = build('bigquery', 'v2', http=http)

    projectId = recipeJSON['projectId']

    url = BIGQUERY_URL_BASE + projectId + "/jobs"

    headers = {'Content-Type': 'multipart/related; boundary=xxx'}
    response, content = http.request(url, method="POST", body=body, headers=headers)

...and here's the stack trace I get when it runs...

Traceback (most recent call last):
  File "/usr/local/uploader/upload_data.py", line 179, in <module>
    main(sys.argv)
  File "/usr/local/uploader/upload_data.py", line 170, in main
    if (upload(unprocessedFile, loadJob, recipeJSON, logger)):
  File "/usr/local/uploader/upload_data.py", line 100, in upload
    response, content = http.request(url, method="POST", body=body, headers=headers)
  File "/usr/local/lib/python2.7/site-packages/oauth2client/util.py", line 128, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/oauth2client/client.py", line 490, in new_request
redirections, connection_type)
  File "/usr/local/lib/python2.7/site-packages/httplib2/__init__.py", line 1570, in request
(response, content) = self._request(conn, authority, uri, request_uri, method, body, headers, redirections, cachekey)
  File "/usr/local/lib/python2.7/site-packages/httplib2/__init__.py", line 1317, in _request
(response, content) = self._conn_request(conn, request_uri, method, body, headers)
  File "/usr/local/lib/python2.7/site-packages/httplib2/__init__.py", line 1253, in _conn_request
conn.request(method, request_uri, body, headers)
  File "/usr/local/lib/python2.7/httplib.py", line 973, in request
    self._send_request(method, url, body, headers)
  File "/usr/local/lib/python2.7/httplib.py", line 1007, in _send_request
    self.endheaders(body)
  File "/usr/local/lib/python2.7/httplib.py", line 969, in endheaders
    self._send_output(message_body)
  File "/usr/local/lib/python2.7/httplib.py", line 833, in _send_output
    self.send(message_body)
  File "/usr/local/lib/python2.7/httplib.py", line 805, in send
    self.sock.sendall(data)
  File "/usr/local/lib/python2.7/ssl.py", line 229, in sendall
    v = self.send(data[count:])
  File "/usr/local/lib/python2.7/ssl.py", line 198, in send
    v = self._sslobj.write(data)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 4586-4611: ordinal not in range(128)

I'm using Python 2.7 and the following libraries: distribute (0.6.36), google-api-python-client (1.1), httplib2 (0.8), oauth2client (1.1), pyOpenSSL (0.13), python-gflags (2.0), wsgiref (0.1.2).

Has anyone else had this problem?

It seems like httplib2's request method takes "body" as a string, which means that it later needs to be encoded before being sent over the wire. I've been searching for a way to override the encoding to UTF-8, but no luck so far.

Thanks in advance!

EDIT:

I was able to resolve this by doing two things:

1. Reading the contents of my file raw, with no decoding. (I could have also just encoded the "body" from my first attempt above; see the sketch after the code below.)
2. Encoding the URL and headers to bytes.

The code ended up looking like this:

def upload(dataFilePath, loadJob, recipeJSON, logger):
    part_one = '--xxx\n'
    part_one += 'Content-Type: application/json; charset=UTF-8\n\n'
    part_one += loadJob
    part_one += '\n--xxx\n'
    part_one += 'Content-Type: application/octet-stream\n\n'

    dataFile = io.open(dataFilePath, 'rb')
    part_two = dataFile.read()
    dataFile.close()

    part_three = '\n--xxx--\n'

    body = part_one.encode('utf-8')
    body += part_two
    body += part_three.encode('utf-8')

    credentials = buildCredentials(recipeJSON['keyPath'], recipeJSON['accountEmail'])
    http = httplib2.Http()
    http = credentials.authorize(http)
    service = build('bigquery', 'v2', http=http)

    projectId = recipeJSON['projectId']

    url = BIGQUERY_URL_BASE + projectId + "/jobs"

    headers = {'Content-Type'.encode('utf-8'): 'multipart/related; boundary=xxx'.encode('utf-8')}
    response, content = http.request(url.encode('utf-8'), method="POST", body=body, headers=headers)
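
For completeness, the alternative mentioned in point 1 would have been to keep the Unicode body from the first version and encode it once, just before the request. A rough sketch of that variant (not the code I actually ran):

# Alternative sketch: the body assembled from io.open(..., encoding='utf-8')
# is unicode, so encode it to UTF-8 bytes before handing it to httplib2.
body = body.encode('utf-8')
response, content = http.request(url, method="POST", body=body, headers=headers)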

Solution

io.open() will open the file as unicode text. Either use plain open(), or use binary mode:

dataFile = io.open(dataFilePath, 'rb')

You are sending the file contents straight out over the network, so you need to send bytes, not unicode. As you found out, mixing the two leads to painful errors: concatenating a byte string with unicode silently promotes the whole body to unicode, and when httplib later writes that body to the socket, Python implicitly encodes it back to bytes with the ASCII codec and fails on the first non-ASCII character. There is no need to decode to Unicode at all here.
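
To make that concrete, here is a minimal Python 2 sketch of the failure mode and of the fix; the literals are invented purely for illustration:

# -*- coding: utf-8 -*-
# Minimal Python 2 sketch of the failure mode (values invented for illustration).

part_one = '--xxx\nContent-Type: application/octet-stream\n\n'  # plain byte string (str)
file_contents = u'{"city": "Z\u00fcrich"}\n'  # unicode, e.g. from io.open(..., encoding='utf-8')

body = part_one + file_contents  # str + unicode: the whole body is silently promoted to unicode

try:
    body.encode('ascii')  # what httplib effectively does when writing the body to the socket
except UnicodeEncodeError as e:
    print e  # 'ascii' codec can't encode character u'\xfc' ...

# The fix: stay in bytes end to end.
file_contents = '{"city": "Z\xc3\xbcrich"}\n'  # raw bytes, as open(path, 'rb').read() would return
body = part_one + file_contents  # str + str, no implicit codec involved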

Licensed under: CC-BY-SA with attribution