Question

I am obtaining text/HTML BODY parts of email messages using the IMAP protocol.

For this, what I do is use the BODYSTRUCTURE call to obtain the BODY index and the charset of a part, then use the BODY[INDEX] call, obtain the raw text, and try to decode it using the Python decode function.

Now my problem is, even after decoding some text parts with the given charsets (charset obtained from the BODYSTRUCTURE call together with that part), they are still encoded with some unknown encoding.

Only Portuguese, Spanish, and other Latin-script text shows this problem, so I assume this is some kind of Portuguese/Spanish encoding.

My question is: how do I detect this and properly decode it? I assume decoding the text with the given charset should leave no encoded characters, but when some remain, as is happening right now, is there a universal way to decode them?

I assume I could just try a list of common charsets and do a try: except: cycle for all of those to try and decode the given text, but I would honestly prefer a better solution.

Pseudocode is something like this:

# Fetch the BODYSTRUCTURE to find the part index and its charset
# (assuming imaplib, whose uid() returns a (status, data) pair)
result, data = imap_instance.uid('fetch', email_uid, '(BODYSTRUCTURE)')
part_body_index, part_charset = parse_BODY_index_and_charset_from_response(data)

# Fetch the raw bytes of that part
result, data = imap_instance.uid('fetch', email_uid, '(BODY[' + str(part_body_index) + '])')
text_part = data[0][1]

if part_charset:
    try:
        text_part = text_part.decode(part_charset, 'ignore')
    except (LookupError, UnicodeDecodeError):
        pass

# Content of "text_part" variable after this should be text with no encoded characters...
# But that's not the case

Examples of encoded text:

A 05/04/2013, =E0s 11:09, XYZ escreveu:>

This text was encoded with iso-8859-1; I decoded it and it still looks like this. The =E0 in the string is the character "à".

In=EDcio da mensagem reenviada:

This text was encoded with windows-1252; I decoded it and it still looks like this. The =ED in the string is the character "í".


Solution

You need to look at the Content-Transfer-Encoding of each part, which is returned in the BODYSTRUCTURE response alongside the charset. You'll need to support both base64 and quoted-printable decoding. These encodings transform binary data (such as the UTF-8 or ISO-8859-1 bytes of a given text) into a 7-bit form that is safe for e-mail transfer. Only after you've undone the content transfer encoding should you decode the bytes from their character encoding (UTF-8, windows-1250, ISO-8859-x, ...) into the Unicode representation that you work with.
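The two-step order described above could be sketched roughly as follows. This is only a minimal sketch: the function name decode_part and its parameters are hypothetical, and it assumes you have already pulled the transfer encoding and charset for the part out of the BODYSTRUCTURE response yourself.

```python
import base64
import quopri


def decode_part(raw: bytes, transfer_encoding: str, charset: str) -> str:
    """Undo the content transfer encoding first, then decode the charset.

    Hypothetical helper: `transfer_encoding` and `charset` are assumed to
    come from your own parsing of the BODYSTRUCTURE response.
    """
    encoding = (transfer_encoding or "7bit").lower()
    if encoding == "quoted-printable":
        payload = quopri.decodestring(raw)
    elif encoding == "base64":
        payload = base64.b64decode(raw)
    else:
        # 7bit, 8bit and binary carry the bytes unchanged.
        payload = raw
    # Assumption: fall back to us-ascii when no charset was given, and
    # replace undecodable bytes instead of silently dropping them.
    return payload.decode(charset or "us-ascii", errors="replace")
```

Note that an unknown charset name would still raise LookupError here, so you may want to catch that and fall back to a default of your choice.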

Both of your examples are encoded using quoted-printable.
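As a quick check, here is a small sketch that runs the two samples from the question through Python's standard quopri module before applying the charset. The variable names are just for illustration, and the samples are typed in as byte literals rather than fetched over IMAP.

```python
import quopri

# The two samples from the question, as the raw bytes BODY[n] would return.
sample_latin1 = b"A 05/04/2013, =E0s 11:09, XYZ escreveu:"
sample_cp1252 = b"In=EDcio da mensagem reenviada:"

# Step 1: undo quoted-printable. Step 2: decode with the part's charset.
decoded_latin1 = quopri.decodestring(sample_latin1).decode("iso-8859-1")
decoded_cp1252 = quopri.decodestring(sample_cp1252).decode("windows-1252")

print(decoded_latin1)
print(decoded_cp1252)
```

With the transfer encoding undone first, =E0 becomes "à" and =ED becomes "í", and no =XX sequences are left in the text.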

Licensed under: CC-BY-SA with attribution