I am obtaining text/HTML BODY parts of email messages using the IMAP protocol.
For this, I use the BODYSTRUCTURE call to obtain the BODY index and the charset of a part, then use the BODY[INDEX] call to obtain the raw text, and try to decode it using Python's decode function.
My problem is that, even after I decode some text parts with the given charset (the one returned by the BODYSTRUCTURE call for that part), they still contain characters in some unknown encoding.
Only Portuguese, Spanish, and other accented Latin-script text shows this problem, so I assume it is some kind of Portuguese/Spanish encoding.
How do I detect this occurrence and properly decode it? I would expect decoding the text with the given charset to leave no encoded characters. When that is not the case, as is happening right now, is there a universal way to decode these characters?
I assume I could just take a list of common charsets and run a try: except: cycle over all of them to try to decode the given text, but I would honestly prefer a better solution.
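To make the try/except idea concrete, here is a minimal sketch of what I mean. The function name `decode_with_fallback` and the list of fallback charsets are my own placeholders, not anything taken from the IMAP response.

```python
def decode_with_fallback(raw, declared_charset=None,
                         fallbacks=("utf-8", "windows-1252", "iso-8859-1")):
    """Return (text, charset_used), trying the declared charset first.

    `raw` is the bytes payload of the part. The fallback list is an
    assumption; adjust it to the mail you actually receive.
    """
    candidates = ([declared_charset] if declared_charset else []) + list(fallbacks)
    for charset in candidates:
        try:
            return raw.decode(charset), charset
        except (LookupError, UnicodeDecodeError):
            # LookupError: unknown charset name; UnicodeDecodeError: wrong charset
            continue
    # Last resort: never raises, undecodable bytes become U+FFFD
    return raw.decode("utf-8", "replace"), None
```

Note that a literal sequence such as =E0 consists only of ASCII characters, so it decodes without error under every charset in the list. A loop like this would therefore not detect the problem described here on its own.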
Pseudocode is something like this:
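# Fetch the BODYSTRUCTURE to find the part index and its charset
# (imap_instance is assumed to be an imaplib connection, whose uid() returns (status, data))
result, data = imap_instance.uid('fetch', email_uid, '(BODYSTRUCTURE)')
part_body_index, part_charset = parse_BODY_index_and_charset_from_response(data)
# Fetch the raw content of that part
result, data = imap_instance.uid('fetch', email_uid, '(BODY[' + str(part_body_index) + '])')
text_part = data[0][1]  # literal payload of the first response item
if part_charset:
    try:
        text_part = text_part.decode(part_charset, 'ignore')
    except (LookupError, UnicodeError):
        # Unknown charset name or undecodable data: keep the raw text
        pass
# The content of "text_part" after this should be text with no encoded characters...
# But that's not the case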
Examples of encoded text:
A 05/04/2013, =E0s 11:09, XYZ escreveu:>
This part was declared as iso-8859-1. I decoded it with that charset and it still looks like this. The sequence =E0 in the string stands for the character "à".
In=EDcio da mensagem reenviada:
This part was declared as windows-1252. I decoded it with that charset and it still looks like this. The sequence =ED in the string stands for the character "í".
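For what it is worth, the =E0 and =ED sequences above have the same shape as quoted-printable escapes, which are a Content-Transfer-Encoding and not a charset. The following is only a sketch under that assumption. The function name `decode_part` is hypothetical, and it presumes the transfer encoding has been read from the encoding field that BODYSTRUCTURE reports for each part, next to the charset.

```python
import base64
import quopri


def decode_part(raw, charset, transfer_encoding):
    """Undo the transfer encoding first, then apply the charset.

    Assumes `raw` is bytes and `transfer_encoding` is the per-part
    encoding string from BODYSTRUCTURE (e.g. "quoted-printable").
    """
    encoding = (transfer_encoding or "").lower()
    if encoding == "quoted-printable":
        raw = quopri.decodestring(raw)   # turns b"=E0" into b"\xe0"
    elif encoding == "base64":
        raw = base64.b64decode(raw)
    # "7bit", "8bit" and "binary" need no transformation
    return raw.decode(charset or "ascii", "replace")


# The two samples from above, assuming they are quoted-printable
print(decode_part(b"A 05/04/2013, =E0s 11:09, XYZ escreveu:",
                  "iso-8859-1", "quoted-printable"))
print(decode_part(b"In=EDcio da mensagem reenviada:",
                  "windows-1252", "quoted-printable"))
```

If that assumption holds, the charset step was never the problem: the bytes simply have to be unwrapped from their transfer encoding before the charset is applied.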