Question

I'm trying to create a simple table replicator from MySQL to Redshift using Python. The way I'm doing this is to query tables in MySQL and write the output to CSVs using Python (2.7), then ship those up to S3 and do a COPY on them into their respective target tables.
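For context, the load step looks roughly like this. This is only a sketch: I'm assuming psycopg2 as the Redshift client, and the host, bucket, table, and IAM role values are placeholders.

# A minimal sketch of the S3 -> Redshift load step, assuming psycopg2;
# the host, bucket, table name, and IAM role below are all placeholders.
import psycopg2

conn = psycopg2.connect(host='my-cluster.example.redshift.amazonaws.com',
                        port=5439, dbname='warehouse',
                        user='loader', password='secret')
cur = conn.cursor()
cur.execute("""
    COPY target_table
    FROM 's3://my-bucket/table.txt'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
    CSV;
""")
conn.commit()
conn.close()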

I'm running into a problem with Unicode characters. Specifically, I get the following error:

String contains invalid or unsupported UTF8 codepoints. Bad UTF8 hex sequence: e9 20 50 (error 4)

My question here is whether this is a Python problem or an S3/Redshift problem. Here's what I'm doing in Python:

import unicodecsv as csv

# `a` holds the rows fetched from MySQL; `config`, `directory`, and
# `table_name` are defined earlier in the script.
file_ext = 0
dest = open(config['data_dir'] + directory + '/' + table_name + '.txt.' + str(file_ext), 'wb')
csv_writer = csv.writer(dest, encoding='utf-8')
for index, line in enumerate(a):
    # Halfway through the rows, close the current file and roll over
    # to a new one so each table is split across two output files.
    if index == len(a) / 2:
        file_ext += 1
        dest.close()
        dest = open(config['data_dir'] + directory + '/' + table_name + '.txt.' + str(file_ext), 'wb')
        csv_writer = csv.writer(dest, encoding='utf-8')
    csv_writer.writerow(line)
dest.close()

So from what I understand, Python is writing things correctly. Indeed, if I open the CSV in vi, I can see this: "Fjällräven Canvas Black Kanken 15\ Laptop Bag""" So that looks right to me (the \ and extra " are junk from the source). However, if I run the file command against the CSV, I get: ASCII text, with very long lines, with CRLF line terminators. After moving the file to S3 and running a COPY, I end up with the Redshift error above.
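As a sanity check, trying to decode the file as UTF-8 in Python flags the first bad byte, which mirrors the validation that COPY performs. A sketch (the file name is a placeholder):

# A UTF-8 sanity check; 'table.txt.0' is a placeholder file name.
with open('table.txt.0', 'rb') as f:
    data = f.read()
try:
    data.decode('utf-8')
    print 'valid UTF-8'
except UnicodeDecodeError as e:
    # Print the offset and the offending bytes, e.g. 'e92050'
    print 'invalid UTF-8 at byte %d: %s' % (
        e.start, data[e.start:e.start + 3].encode('hex'))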

And so, back to the question: I suspect this has something to do with how the file is encoded rather than its content, but my searches haven't turned up anything definitive. Has anyone encountered this and found a solution? Thanks for the help.


Solution

Turns out that everything shown above was fine, but MySQL wasn't returning UTF-8 data in the first place. It was fixed by adding the following two parameters to my connection settings:

'use_unicode': True,
'charset': 'utf8'
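For what it's worth, 0xE9 is 'é' in Latin-1, so the bad sequence e9 20 50 is just 'é P' in the connection's default character set (historically latin1) written to the file verbatim; in UTF-8, 'é' would be the two bytes c3 a9. With the two parameters in place, the connection looks something like this. A sketch assuming mysql.connector (MySQLdb accepts the same two flags); every value other than use_unicode and charset is a placeholder:

# Corrected connection, assuming mysql.connector; all values other
# than use_unicode and charset are placeholders, not from the post.
import mysql.connector

conn = mysql.connector.connect(
    host='localhost',
    user='replicator',
    password='secret',
    database='source_db',
    use_unicode=True,   # hand rows back as unicode objects
    charset='utf8',     # decode column data as UTF-8 on the wire
)
cursor = conn.cursor()
cursor.execute('SELECT * FROM source_table')  # placeholder query
a = cursor.fetchall()  # rows now decode correctly before unicodecsv writes them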