Question

I can't seem to store unicode characters correctly in Postgres. They are shown as code representations, e.g. <C3><A5> instead of å for example.

My database was created with UTF-8 as the encoding. I have tried storing the unicode strings with psycopg2 like so:

field = myUnicodeString.encoding('utf-8')
cursor.execute("INSERT INTO mytable (column1) VALUES (%s)", (field,))

field = myUnicodeString
cursor.execute("INSERT INTO mytable (column1) VALUES (%s)", (field,))

but both alternatives store incorrect characters. Do I need to set the character set for the table as well, or what could the problem be here?

UPDATE 1:

I have discovered that I can't even type non-ascii characters – like å, ä and ö – in my terminal. I'm on an Ubuntu 12.04 server. Could this in any way be related to the language settings of the server itself?

UPDATE 2

I am now able to type non-ascii characters in my terminal during an SSH session. I changed the locale settings and rebooted the server. Moreover, I am able to manually store non-ascii characters in my UTF-8 database (in psql: INSERT INTO table (column) VALUES ('ö')). The character is displayed correctly in psql.

When I SELECT convert_to(column, 'utf-8') FROM table with the manually inserted row in the table, the char ö is displayed as \xc383c2b6 in psql.

When I do print repr('ö') in Python, I get '\xc3\xb6'. I'm really trying to understand how to debug this, but I'm not sure what to look for.

Was it helpful?

The solution

It isn't clear that you've confirmed whether the characters are stored incorrectly, or retrieved and displayed incorrectly. Nor is it clear whether the problem is in working with them in PostgreSQL, or before that in Python.

In this case, "å" is Unicode code point U+00E5, encoded in utf-16 BE as 0x00E5 or in utf-8 as 0xc3 0xa5. That matches what you're seeing - a utf-8 byte sequence representation of "å" - so I suspect your terminal is misconfigured and just can't display it, or is trying to interpret the bytes as latin-1 and doesn't have the right characters in the font for the resulting mangled text:

>>> print u'å'.encode("utf-8").decode("latin-1")
Ã¥

so it's showing the individual bytes instead.
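The failure mode above can be sketched end to end (Python 3 syntax; 'å' is just the example character from the question):

```python
# A sketch of the mis-decoding described above: correct UTF-8 bytes,
# decoded as Latin-1, become the familiar "Ã¥" mojibake.
s = 'å'                                  # U+00E5
utf8_bytes = s.encode('utf-8')
print(utf8_bytes)                        # b'\xc3\xa5'
mojibake = utf8_bytes.decode('latin-1')
print(mojibake)                          # Ã¥
# Re-encoding the mojibake as UTF-8 double-encodes the text:
print(mojibake.encode('utf-8'))          # b'\xc3\x83\xc2\xa5'
```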

It doesn't help that your Python code is nonsense:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'encoding'

I think you meant "encode". It's unnecessary to do this anyway; Python's psycopg2 happily works with unicode string objects directly:

>>> conn = psycopg2.connect("dbname=regress")
>>> curs = conn.cursor()
>>> curs.execute("SELECT %s", (u'áéíóú',))
>>> print curs.fetchone()[0]
áéíóú

With encoding problems you need to trace things through, step by step, to determine where the text encoding is being mishandled.

There's nowhere near enough information to answer a question like this. All I can really offer is general advice. At every step, confirm that you respect the encoding of the input, and that the output from one step is in the same encoding the next step expects as input.

First, you need to make sure that your unicode strings are correct in Python. print repr(mystring) will be useful for this, to see the string data. Then you should stop explicitly encoding them when passing them to psycopg2; just let psycopg2 deal with it.
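For instance, a quick way to inspect a string before it goes anywhere near psycopg2 (Python 3 syntax; 'ö' is just an example value):

```python
# A sketch of checking a string's content and bytes in Python itself,
# before blaming the database layer.
s = 'ö'
print(repr(s))               # 'ö'  -- a one-character str, as expected
print(len(s))                # 1
print(s.encode('utf-8'))     # b'\xc3\xb6'  -- the correct UTF-8 encoding
# A value that repr() shows as 'Ã¶' (two characters) was already
# mis-decoded somewhere upstream, before it reached this point.
```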

Next step will be to examine them in the database using psql. Even if they don't display correctly on your terminal you can check if they're right in the database with the convert_to function, which takes a database field or string literal as input and outputs the byte sequence in the desired encoding. So, eg:

SELECT convert_to(column1, 'utf-8') FROM mytable;

and make sure that the hex byte sequence returned matches what it should be for the utf-8 encoding of the text you sent.
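For reference, the expected hex sequence can be computed in Python and compared against what psql prints (Python 3 syntax; 'ö' matches the value from the question's second update, where the \xc383c2b6 output suggests the row was already double-encoded):

```python
# A sketch of computing the byte sequences to compare against
# psql's convert_to output.
good = 'ö'.encode('utf-8')
print(good.hex())            # c3b6 -- what a correctly stored 'ö' yields
# The value seen in the question's UPDATE 2:
bad = 'ö'.encode('utf-8').decode('latin-1').encode('utf-8')
print(bad.hex())             # c383c2b6 -- the tell-tale sign of double encoding
```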

Continue with this process. At each step, examine the string bytes to make sure they match what they should be, until you find the stage that's mishandling the text.

I assure you that neither PostgreSQL nor psycopg2 stores Unicode characters incorrectly. In this case it could be as simple an issue as your terminal being set up wrong, or it could be that something in the text handling path is using the wrong input encoding, so you encode something as utf-8 and then decode it as latin-1 (for example).

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow