Question

I have a simple IRC bot written in Python.

Since I migrated it to Python 3.0, which distinguishes between bytes and Unicode strings, I have been running into encoding issues, specifically with clients that do not send UTF-8.

Now, I could just tell everyone to send UTF-8 (which they should regardless), but a better solution would be to have Python fall back to some other encoding.

So far the code looks like this:

data = str(irc.recv(4096), "UTF-8", "replace")

That at least doesn't throw exceptions. However, I want to go further: I want my bot to fall back to another encoding, or to detect "troublesome characters" somehow.

Additionally, I need to figure out what this mysterious encoding that mIRC uses actually is, since other clients appear to work fine and send UTF-8 like they should.

How should I go about doing those things?

Solution 4

OK, after some research it turns out that chardet has trouble with Python 3. The solution turned out to be simpler than I thought. I chose to fall back on CP1252 when UTF-8 decoding fails:

data = irc.recv(4096)
try:
    data = str(data, "UTF-8")
except UnicodeDecodeError:
    data = str(data, "CP1252")

This seems to work, though it doesn't actually detect the encoding, so if somebody shows up with an encoding that is neither UTF-8 nor CP1252, I will have a problem again.

This is really just a temporary solution.
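
A minimal sketch of the same fallback wrapped in a function, to show why trying UTF-8 first is reasonable: strict UTF-8 decoding rejects typical CP1252 text, so the fallback only runs when it is needed. The name `decode_line` is made up for illustration, and passing `errors="replace"` on the CP1252 step is an added assumption (CP1252 leaves a few byte values undefined and would otherwise raise on them).

```python
def decode_line(raw: bytes) -> str:
    """Decode one received chunk: strict UTF-8 first, then CP1252."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # CP1252 leaves bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D undefined;
        # "replace" turns those into U+FFFD instead of raising again.
        return raw.decode("cp1252", errors="replace")


# The same word as sent by a UTF-8 client and by a CP1252 client.
utf8_bytes = "café".encode("utf-8")
cp1252_bytes = "café".encode("cp1252")

print(decode_line(utf8_bytes))
print(decode_line(cp1252_bytes))
```

Both calls should print the same word, because the CP1252 bytes are not valid UTF-8 and therefore take the fallback path.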

OTHER TIPS

chardet should help - it's the canonical Python library for detecting unknown encodings.

chardet will probably be your best solution, as RichieHindle mentioned. However, if you want to cover about 90% of the text you'll see, you can use what I use:

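def decode(data):
    # Try strict UTF-8 first, since it rejects most non-UTF-8 input.
    try:
        text = data.decode('utf-8')
    except UnicodeDecodeError:
        try:
            # CP1252 goes before ISO-8859-1: ISO-8859-1 accepts every byte
            # sequence, so anything placed after it would never be reached.
            text = data.decode('cp1252')
        except UnicodeDecodeError:
            text = data.decode('iso-8859-1')
    return text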
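
One detail worth checking about the order of the fallbacks: ISO-8859-1 assigns a character to every byte value, so decoding with it never raises, while CP1252 leaves a handful of byte values undefined. The snippet below is only a quick check of that behaviour using the standard codecs; the variable names are illustrative.

```python
every_byte = bytes(range(256))

# ISO-8859-1 maps each of the 256 byte values to a code point, so this never raises.
latin1_text = every_byte.decode("iso-8859-1")

# Collect the byte values that CP1252 refuses to decode.
cp1252_undefined = []
for value in range(256):
    try:
        bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        cp1252_undefined.append(value)

print(len(latin1_text))
print([hex(v) for v in cp1252_undefined])

# The two encodings differ in the 0x80-0x9F range: 0x93 is a curly quote
# in CP1252 but a control character in ISO-8859-1.
print(repr(b"\x93".decode("cp1252")), repr(b"\x93".decode("iso-8859-1")))
```

In practice this means that any encoding tried after ISO-8859-1 in a fallback chain is never reached.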


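def encode(text):
    # UTF-8 can encode any str (apart from lone surrogates), so the
    # fallbacks below are rarely, if ever, reached.
    try:
        data = text.encode('utf-8')
    except UnicodeEncodeError:
        try:
            data = text.encode('iso-8859-1')
        except UnicodeEncodeError:
            data = text.encode('cp1252')
    return data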

Using chardet alone gives poor results when messages are short, which is the case on IRC.

Combining chardet with remembering the encoding of each user across messages could make sense. However, for simplicity I'd try a few likely encodings first (which encodings are likely depends on culture and era, see http://en.wikipedia.org/wiki/Internet_Relay_Chat#Character_encoding), and only if they all fail would I fall back to chardet (this helps if someone uses one of the East Asian encodings).

For example:

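import chardet

def decode_irc(raw, preferred_encs=("UTF-8", "CP1252", "ISO-8859-1")):
    # Try the likely encodings in order; the first strict decode that works wins.
    # Note: ISO-8859-1 accepts every byte sequence, so chardet is only reached
    # when the caller passes a list that does not include it.
    for enc in preferred_encs:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            pass
    # None of the preferred encodings worked: ask chardet for a guess.
    # chardet may report None as the encoding, so fall back to UTF-8 then.
    enc = chardet.detect(raw)['encoding'] or "UTF-8"
    try:
        return raw.decode(enc)
    except UnicodeDecodeError:
        return raw.decode(enc, 'ignore')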
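
The idea of remembering the encoding per user could be sketched roughly as follows, using only the standard library. Everything here is an assumption for illustration: the names `CANDIDATES`, `last_encoding` and `decode_for`, the candidate list, and the choice to key on the nick.

```python
CANDIDATES = ("utf-8", "cp1252")  # assumed candidate list, adjust to your channel

last_encoding = {}  # nick -> encoding that last decoded a non-ASCII message


def decode_for(nick: str, raw: bytes) -> str:
    # Try the encoding remembered for this nick first, then the other candidates.
    remembered = last_encoding.get(nick)
    order = [remembered] if remembered else []
    order += [enc for enc in CANDIDATES if enc != remembered]
    for enc in order:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            continue
        # Pure-ASCII messages decode identically in all candidates,
        # so they say nothing about the sender's encoding.
        if not raw.isascii():
            last_encoding[nick] = enc
        return text
    # Last resort: ISO-8859-1 accepts any byte sequence.
    return raw.decode("iso-8859-1")
```

The trade-off in this sketch is that a nick remembered as CP1252 stays that way, since CP1252 decodes almost any input; a real bot would probably want to expire or re-check the remembered value.
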
Licensed under: CC-BY-SA with attribution