Question

I'm obtaining emails from IMAP using Python and imaplib. In this specific case I'm having a problem with the To: addresses.

I extract the encoded To: field, split it into individual addresses, and then try to decode each one. One specific string is giving me trouble. I'm using Python's decode_header function to decode the Quoted-Printable/Base64 encoding. I start with the encoded string:

'=?utf-8?b?vmfzy28gugf0csoty2lv?= <vasco.patricio.pessoal@gmail.com>'

It's supposed to be Vasco Patrício <vasco.patricio.pessoal@gmail.com> (my name and email). As expected, decode_header returns the substrings paired with their charsets, which here is this list of two tuples:

[('\xbeg\xf3\xcbo \xba\x07\xf4r\xca-\xcbio', 'utf-8'), ('<vasco.patricio.pessoal@gmail.com>', None)]

However, when I try to decode the first tuple using this very simple code:

for part in decoded_parts:
    if part[1]:
        part_text = part[0].decode(part[1])
    else:
        part_text = part[0]

I obtain a UnicodeDecodeError:

UnicodeDecodeError at /api/refresh/emails/
'utf8' codec can't decode byte 0xbe in position 0: invalid start byte

I confirm that trying to decode it via the console results in the same exception.

Isn't decode_header supposed to return valid, decodable strings together with their encodings?

Thank you

Solution

You've lost the capitalization somewhere.

The proper encoded string is =?utf-8?b?VmFzY28gUGF0csOtY2lv?=. Yours appears to be the same, but all lowercase.

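Base64 is case sensitive: its 64-symbol alphabet consists of 26 lowercase letters, 26 uppercase letters, 10 digits, and two other characters. Lowercasing the text therefore breaks it completely. The lowercased text is still valid Base64, so decode_header decodes it without complaint, but it yields different bytes, and those bytes are not valid UTF-8. That is why the error only shows up at the .decode() step.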
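
To see the difference for yourself, here is a minimal sketch. It assumes Python 3 (where decode_header returns bytes for encoded words), and decode_first_part is just a hypothetical helper name for this illustration, not part of the email package:

```python
from email.header import decode_header


def decode_first_part(header_value):
    """Decode the first part of a header value to text (illustrative helper)."""
    raw, charset = decode_header(header_value)[0]
    if isinstance(raw, bytes):
        # Encoded words come back as bytes plus a charset name.
        return raw.decode(charset or "ascii")
    # Plain, unencoded headers may already be str.
    return raw


correct = "=?utf-8?b?VmFzY28gUGF0csOtY2lv?="
lowercased = correct.lower()  # the same string as in the question

# The correctly cased payload decodes cleanly.
name = decode_first_part(correct)
print(name)

# The lowercased payload is still valid Base64, so decode_header succeeds,
# but the resulting bytes are not valid UTF-8.
try:
    decode_first_part(lowercased)
    lowercase_failed = False
except UnicodeDecodeError as exc:
    lowercase_failed = True
    print("lowercased payload failed:", exc)
```

If this reproduces your error, look for whatever step lowercases the raw header before it reaches decode_header. Only the address part of the header is reasonably safe to normalize that way, never the encoded word.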

Licensed under: CC-BY-SA with attribution