How does pyodbc determine the encoding?

https://stackoverflow.com/questions/5884504

28-10-2019

Question

I've been fighting Sybase SQL Anywhere 12 together with Python (and Twisted) for several weeks now, and I've even got my stuff working.

There's only one annoyance left: If I run my script on CentOS 5 with a custom Python 2.7.1, which is the deployment platform, I get my results as UTF-8.

If I run it on my Ubuntu box (Natty Narwhal) I get them in latin1.

Needless to say, I would prefer to get all my data as Unicode, but that's not the point of this question. :)

Both are 64-bit boxes, and both have a custom Python 2.7.1 built with UCS4 and a custom-built unixODBC 2.3.0.

I'm at a loss here. I can't find any documentation on that. What makes pyodbc or unixODBC behave differently on the two boxes?

Hard facts:

Python: 2.7.1
DB: SQL Anywhere 12
unixODBC: 2.3.0 (2.2.14 did behave the same), self compiled with identical flags
ODBC driver: original from Sybase.
CentOS 5 gives me UTF-8, Ubuntu Natty Narwhal gives me latin1.

My odbc.ini looks like this:

[sybase]
Uid             = user
Pwd             = password
Driver          = /opt/sqlanywhere/lib64/libdbodbc12_r.so
Threading       = True
ServerName      = dbname
CommLinks       = tcpip(host=the-host;DoBroadcast=None)

I connect just by using DSN='sybase'.

TIA!

Solution

I can't tell you why it's different, but if you add "Charset=utf-8" to your DSN, you should get the results you want on both machines.
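
As a rough sketch of that suggestion: the setting can go into odbc.ini as an extra "Charset = utf-8" line under [sybase], or, assuming the driver accepts the same keyword in a connection string, it can be appended when connecting. The helper name build_connection_string below is hypothetical and exists only for illustration; the pyodbc call is left commented out because it needs a reachable database.

```python
def build_connection_string(dsn, charset=None):
    """Build an ODBC connection string for a DSN, optionally forcing a charset.

    Assumption: the driver honours a 'Charset' keyword in the connection
    string the same way it does in odbc.ini.
    """
    parts = ["DSN=%s" % dsn]
    if charset:
        parts.append("Charset=%s" % charset)
    return ";".join(parts)


conn_str = build_connection_string("sybase", charset="utf-8")
print(conn_str)  # DSN=sybase;Charset=utf-8

# With a database available, the string would be passed to pyodbc like this:
# import pyodbc
# cnxn = pyodbc.connect(conn_str)
```

Setting the charset explicitly on both machines removes the dependency on whatever default each box happens to pick.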

Disclaimer: I work for Sybase in SQL Anywhere engineering.

OTHER TIPS

pyodbc uses the ODBC specification, which only supports 2 encodings. All ODBC functions that end with 'W' are the wide character versions that use SQLWCHAR. This is defined by the ODBC headers and is usually UCS2 but is occasionally UCS4. The non-wide versions use SQLCHAR and are always(?) single-byte ANSI/ASCII.

There is absolutely no support in ODBC for variable width encodings such as UTF8. If ODBC drivers supply that, it is absolutely incorrect. Even if data is stored in UTF8, it must be converted into ANSI or UCS2 by the driver. Unfortunately most ODBC drivers are completely incorrect.
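
To make the fixed-width versus variable-width distinction concrete, here is a small sketch using Python's own codecs. It does not touch ODBC at all; it uses "utf-16-le" and "utf-32-le" as stand-ins for UCS2 and UCS4 (they agree for characters in the Basic Multilingual Plane) and counts the bytes each character needs. The function name bytes_per_char is made up for this example.

```python
# Two non-ASCII characters: e-acute (U+00E9) and the euro sign (U+20AC).
text = u"\u00e9\u20ac"


def bytes_per_char(text, encoding):
    """Return the number of bytes each character occupies in the given encoding."""
    return [len(ch.encode(encoding)) for ch in text]


print(bytes_per_char(text, "utf-16-le"))  # [2, 2]  fixed width, like UCS2
print(bytes_per_char(text, "utf-32-le"))  # [4, 4]  fixed width, like UCS4
print(bytes_per_char(text, "utf-8"))      # [2, 3]  variable width
```

A buffer sized in SQLWCHAR units maps cleanly onto the first two encodings, but not onto the third.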

When sending to the driver, pyodbc will use ANSI if the data is a 'str' object and will use UCS2/UCS4 (whatever SQLWCHAR is defined to be on your platform) if the data is a 'unicode' object. When returning data, the driver determines whether it is SQLCHAR or SQLWCHAR, and pyodbc does not have any say in the matter. If it is SQLCHAR, it is converted to a 'str' object; if it is SQLWCHAR, it is converted to a 'unicode' object.
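
Since SQLCHAR data arrives as a plain byte string, the caller has to know which encoding the driver used. The sketch below shows, with hard-coded sample bytes rather than real query results, how the same text looks in UTF-8 and in latin1 and what happens when the wrong codec is assumed. The helper to_text is hypothetical and not part of pyodbc.

```python
def to_text(value, encoding):
    """Decode a byte string to Unicode; pass Unicode values through untouched."""
    if isinstance(value, bytes):  # 'bytes' is 'str' on Python 2
        return value.decode(encoding)
    return value


utf8_bytes = b"caf\xc3\xa9"  # "cafe" with e-acute, as a UTF-8 driver would return it
latin1_bytes = b"caf\xe9"    # the same text as a latin1 driver would return it

# Decoded with the matching codec, both yield the same Unicode string.
print(to_text(utf8_bytes, "utf-8") == to_text(latin1_bytes, "latin-1"))  # True

# Decoded with the wrong codec, the UTF-8 bytes turn into two junk characters.
print(repr(to_text(utf8_bytes, "latin-1")))
```

This is why the same script can appear to work on one box and produce garbled text on the other.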

This will be slightly different for 3.x versions which will convert both SQLCHAR & SQLWCHAR to Unicode by default.

Licensed under: CC-BY-SA with attribution