Python unicode issues (2.6)

https://stackoverflow.com/questions/2547517

23-09-2019

Question

I'm currently working on an IRC bot for a multilingual channel, and I'm running into some unicode issues that are proving nearly impossible to solve.

No matter which combination of unicode encoding calls I try, the list function that contains the code below either does nothing at all (c.notice is a method which sends a NOTICE command to the IRC server) or, when it does do something, sends text which obviously isn't encoded correctly.

The command should be sending 天子, but instead it seems hellbent on sending å¤©å with a previous configuration of the same commands. The one I have specified below is of the 'send nothing' variety. I haven't worked with unicode before this, and thus I am quite stuck. I'm also positive that I'm doing this completely wrong as a consequence.

(compileCMD just takes a list and spits out a single string of all the elements within the list)

uk = self.compileCMD(self.faq.keys(),0)
ukeys = unicode(uk,"utf-8").encode("utf-8")
c.notice(nick, u"Current list of faq entries: %s" % (uk))

Solution

A few points:

- The characters "å¤©å" are what the UTF-8 bytes of "天子" look like when they are read as Latin-1, so are you sure the wrong data is being sent? Does the program that processes or displays the data use UTF-8, or does it interpret the input as a different encoding such as Latin-1?
- unicode(uk,"utf-8").encode("utf-8"): Decoding UTF-8 and then re-encoding as UTF-8 doesn't change anything.
- ukeys = unicode(uk,"utf-8").encode("utf-8"): The ukeys variable that holds the re-encoded data is never used afterwards.
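
The first two points can be checked in an interpreter. The following is only a sketch: the variable names are made up for illustration, and the text is written with escape sequences so that the snippet should run unchanged on Python 2.6+ and Python 3.

```python
text = u"\u5929\u5b50"  # the two characters from the question

# Encoding to UTF-8 produces six bytes: e5 a4 a9 e5 ad 90.
utf8_bytes = text.encode("utf-8")

# What a client displays if it wrongly reads those bytes as Latin-1.
# The last two bytes map to non-printing characters, so only four
# visible characters show up.
misread = utf8_bytes.decode("latin-1")

# Decoding UTF-8 and re-encoding as UTF-8 returns the same bytes.
round_trip = utf8_bytes.decode("utf-8").encode("utf-8")

# Reversing the misreading recovers the original text, which suggests
# the bytes on the wire were valid UTF-8 all along.
recovered = misread.encode("latin-1").decode("utf-8")
```

If misread matches what the receiving client shows, the data being sent is probably fine and the client's encoding setting is the thing to look at.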

OTHER TIPS

Turns out the issue was with the client I was using to test the output - it wasn't handling unicode properly itself!

Change this:

u"Current list of faq entries: %s" % (uk)

into this:

"Current list of faq entries: %s" % (uk)

and try again. Make sure that uk is already a UTF-8 encoded string (not unicode).

I assume that the c.notice method takes an encoded string as argument, since it's got to send an encoded string over the wire. If the channel is multilingual, it's a safe bet that it expects it to be encoded as UTF-8. Also, drop the useless ukeys = unicode(uk,"utf-8").encode("utf-8") line.
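
A rough sketch of the suggestion above, assuming compileCMD returns a UTF-8 byte string and c.notice accepts bytes (neither is confirmed in the question). The stand-in value for uk and the other variable names are hypothetical. In Python 2, combining a unicode template with a non-ASCII byte string triggers an implicit ASCII decode that raises UnicodeDecodeError, which may be why the version in the question sent nothing. The snippet should run on Python 2.6+ and Python 3.5+.

```python
# Stand-in for the UTF-8 byte string that compileCMD is assumed to return.
uk = u"\u5929\u5b50".encode("utf-8")

# The suggestion above: a byte-string template plus UTF-8 bytes
# yields UTF-8 bytes that can go on the wire as they are.
payload = b"Current list of faq entries: %s" % (uk,)

# Alternative: decode once, format as text, encode once before sending.
payload_via_text = (
    u"Current list of faq entries: %s" % uk.decode("utf-8")
).encode("utf-8")

# Python 2 decodes byte strings as ASCII when they meet a unicode
# template; this explicit call emulates that implicit step and fails
# in the same way for non-ASCII bytes.
try:
    uk.decode("ascii")
    implicit_decode_ok = True
except UnicodeDecodeError:
    implicit_decode_ok = False
```

Both approaches produce the same bytes. The second keeps all formatting in unicode and encodes only at the point of sending, which tends to be easier to reason about as the bot grows.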

Licensed under: CC-BY-SA with attribution
