Question

I believe my issue is that python does not play nicely with the character encoding of a column in a SQL table:

| column | varchar(255) | latin1_swedish_ci | YES  |     | NULL              |                             | select,insert,update,references |    | 

The above shows the output for this column. It has type varchar(255) and collation latin1_swedish_ci, so its character set is latin1.

Now when I try to make python play with this data, I am getting the following error:

 dictionary = gs.corpora.Dictionary(tweets)
  File "/usr/local/lib/python2.7/dist-packages/gensim-0.9.1-py2.7.egg/gensim/corpora/dictionary.py", line 50, in __init__
    self.add_documents(documents)
  File "/usr/local/lib/python2.7/dist-packages/gensim-0.9.1-py2.7.egg/gensim/corpora/dictionary.py", line 97, in add_documents
    _ = self.doc2bow(document, allow_update=True) # ignore the result, here we only care about updating token ids
  File "/usr/local/lib/python2.7/dist-packages/gensim-0.9.1-py2.7.egg/gensim/corpora/dictionary.py", line 121, in doc2bow
    document = sorted(utils.to_utf8(token) for token in document)
  File "/usr/local/lib/python2.7/dist-packages/gensim-0.9.1-py2.7.egg/gensim/corpora/dictionary.py", line 121, in <genexpr>
    document = sorted(utils.to_utf8(token) for token in document)
  File "/usr/local/lib/python2.7/dist-packages/gensim-0.9.1-py2.7.egg/gensim/utils.py", line 164, in any2utf8
    return unicode(text, encoding, errors=errors).encode('utf8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 0: invalid start byte

gs is the gensim topic modeling library. I believe that the problem is that gensim requires unicode encodings.

  1. How can I change the character encoding (collation?) for this column in my database?
  2. Is there an alternative solution?

Thanks for all the help!

Solution

I think your MySQLdb Python library doesn't know it is supposed to use utf8 for the connection and is instead using its default charset, latin1.

When you connect() to your database, pass the charset='utf8' parameter. This should also make a manual SET NAMES statement unnecessary.
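
A minimal sketch of what passing the charset could look like. The helper names utf8_connect_kwargs and fetch_first_column are made up for illustration, and the host, credentials and query in the usage note are placeholders; only MySQLdb.connect() and its charset parameter come from the answer above.

```python
def utf8_connect_kwargs(host, user, passwd, db):
    """Build keyword arguments for MySQLdb.connect() with a utf8 connection.

    Hypothetical helper: it only assembles the arguments, so it can be
    inspected without a running database.
    """
    return {
        "host": host,
        "user": user,
        "passwd": passwd,
        "db": db,
        # Ask the driver to talk to the server in utf8 on this connection.
        "charset": "utf8",
    }


def fetch_first_column(conn_kwargs, query):
    """Run a query and return the first column of every row (hypothetical helper)."""
    import MySQLdb  # imported here so the helper above works without the driver

    conn = MySQLdb.connect(**conn_kwargs)
    try:
        cursor = conn.cursor()
        cursor.execute(query)
        return [row[0] for row in cursor.fetchall()]
    finally:
        conn.close()
```

It might be used as fetch_first_column(utf8_connect_kwargs("localhost", "me", "secret", "mydb"), "SELECT col FROM t"), where every argument is a placeholder for your own setup.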

OTHER TIPS

For question 1, you'll need to use

alter table t 
modify col varchar(255) 
character set utf8
collate utf8_unicode_ci

I don't know about question 2.
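
One possible approach to question 2, sketched under assumptions: leave the column alone and decode the byte strings yourself before handing them to gensim. MySQL's latin1 corresponds to Windows cp1252, where byte 0x96 is an en dash; that byte is not valid UTF-8 on its own, which matches the error in the traceback. The function name to_text and the sample tokens are invented for illustration.

```python
def to_text(token, encoding="cp1252"):
    """Return token as text, decoding byte strings with the given encoding."""
    if isinstance(token, bytes):
        # "replace" substitutes U+FFFD for the few bytes cp1252 leaves undefined
        # instead of raising an exception.
        return token.decode(encoding, "replace")
    return token


# Made-up sample: one tokenized tweet as raw latin1/cp1252 bytes.
raw_tweets = [[b"caf\xe9", b"\x96", b"bar"]]

# Decode every token so downstream code receives text rather than raw bytes.
tweets = [[to_text(tok) for tok in doc] for doc in raw_tweets]
```

The decoded tweets list could then be passed to gs.corpora.Dictionary(tweets), assuming your gensim version accepts unicode tokens.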

I tried @saudi_Dev's solution with MySQLdb v1.2.5. The table I query was created with DEFAULT CHARSET=utf8. Even so, before applying the fix, cursor.fetchall() returned latin1-encoded strings for some reason. After passing the charset='utf8' parameter, cursor.fetchall() returns Unicode strings (technically not utf8-encoded byte strings) instead of latin1.

The strings come back as Unicode because, according to the User's Guide at http://mysql-python.sourceforge.net/MySQLdb.html, using the charset parameter implies use_unicode=True. The same guide shows that you can also pass use_unicode=False if you want utf8-encoded byte strings instead.
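
To illustrate the difference between the two modes, here is a small simulation that does not touch a database. The sample text is made up, and the two variables only mimic what I would expect the driver to hand back in each mode.

```python
# Made-up sample of what the column is meant to contain: "café", an en dash, "bar".
stored_text = u"caf\xe9 \u2013 bar"

# With charset='utf8' (which implies use_unicode=True): a Unicode text object.
as_unicode = stored_text

# With charset='utf8' and use_unicode=False: the same text as utf8 bytes.
as_utf8_bytes = stored_text.encode("utf-8")

# The byte string is longer because the accented letter and the dash
# each take more than one byte in utf8.
lengths = (len(as_unicode), len(as_utf8_bytes))  # (10, 13)
```

Either form can be converted to the other with encode("utf-8") or decode("utf-8"), so the choice mostly depends on what the consuming library expects.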

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow