Question

I have implemented an approach, but I am not sure whether it is a correct one or whether it could give me problems in the future.
Given this piece of email:

Date: Mon, 17 Sep 2012 04:14:36 +0200   
Content-Type: text/plain;
    charset="utf-7"   
Content-Transfer-Encoding: 7bit
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 6.00.2600.0000
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2600.0000
To: user@address.com

Dear Sir/madam, ... etc

And this piece of code:

MimePart part; // the email part
if (part.isMimeType("text/plain")) {
    String plainContent = part.getContent().toString();
}

The exception was:

java.io.UnsupportedEncodingException: utf-7

I have made this modification, so that the charset is always utf-8 and the encoding is quoted-printable:

part.setHeader("Content-Transfer-Encoding", "quoted-printable");
part.setHeader("Content-Type", "text/plain; charset=utf-8");

The exception is not thrown anymore and plainContent is correct. But it seems too easy a solution... What problems could I run into in the future? Is there a better way to avoid the exception and get the email content without forcing a charset and encoding?


Solution

If somebody really sends UTF-7, you will cause the client to decode it incorrectly. But it's quite rare; most sites send UTF-8 if they use Unicode at all. The sample content you posted is pure ASCII, so it's valid as both UTF-7 and UTF-8. (UTF-7 assigns special semantics to + and -, so for a message which contains sequences of these characters, even ASCII is not safe. That is, UTF-7 incorrectly labeled as US-ASCII, or vice versa, will decode incorrectly.)

Labeling as Quoted-Printable content which really isn't is similarly haphazard; any equals sign in the message has special meaning in QP. I think you should just leave that header alone.

The proper solution is to really recode the message body, i.e. translate from UTF-7 to UTF-8 (and possibly wrap it in quoted-printable), then assign the correct content-type header; or, convince whatever is sending these messages to stick to plain old US-ASCII or switch to UTF-8. (Or, find out how to teach Java to handle UTF-7 encoding; but that's outside my competence.)

See also http://en.wikipedia.org/wiki/UTF-7
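
As a sketch of that recoding idea in JavaMail terms (the class and method names here are made up for illustration, and it assumes some UTF-7 Charset implementation, e.g. the third-party jutf7 provider, is on the classpath; a stock JRE does not ship one):

import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.mail.internet.MimePart;

public class Utf7Recode {

    // Read the body with the Content-Transfer-Encoding already undone,
    // decode the bytes ourselves, then store the text back as UTF-8.
    static String recodeToUtf8(MimePart part) throws Exception {
        byte[] raw;
        try (InputStream in = part.getInputStream()) {  // 7bit/QP/base64 is decoded here
            raw = in.readAllBytes();                    // Java 9+
        }
        // Assumption: a UTF-7 Charset implementation (e.g. jutf7) is
        // available; otherwise forName() throws UnsupportedCharsetException.
        String text = new String(raw, Charset.forName("UTF-7"));
        // Re-label the part with a charset every JRE understands.
        part.setText(text, StandardCharsets.UTF_8.name());
        return text;
    }
}

The point is that getInputStream() only undoes the transfer encoding, so you get the raw bytes and can choose how to interpret them instead of letting getContent() fail on the unknown charset name.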


Basic RFC822 email was purely 7-bit. In order to enable rich content and different character sets, MIME was developed in the early 1990s. Central to your question are two MIME headers, Content-Type: and Content-Transfer-Encoding:. These are both used to identify the type of a MIME part, but they are distinct concepts. The Content-Type describes what the data is (text/html, audio/midi, application/octet-stream for untyped binary data, etc). The Content-Transfer-Encoding: indicates how it has been encoded for transmission over email (or another MIME conduit).

Content-Transfer-Encoding: basically defines two encodings and three unencoded types. CTE: 7bit indicates that the data, by itself, is suitable for transmission over a 7-bit channel (there is also a line-length restriction); 8bit is not, and will need to be re-encoded if the channel cannot accommodate 8-bit data. binary is also 8-bit but in addition has no guarantee on line length (i.e. it may contain lines longer than approximately 1,000 characters). So to transmit binary or 8-bit data across a 7-bit channel, you need to recode the content as base64 or quoted-printable. Both of these encodings replace 8-bit bytes with sequences of 7-bit characters; the recipient is expected to perform the reverse substitution in order to decode and extract the original data.
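
A small standalone illustration of the difference between the two encodings, using java.util.Base64 for base64 and JavaMail's MimeUtility for quoted-printable (a sketch, not part of the poster's code):

import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import javax.mail.internet.MimeUtility;

public class TransferEncodingDemo {
    public static void main(String[] args) throws Exception {
        // 8-bit data: contains bytes above 0x7F, so it cannot cross a 7-bit channel as-is.
        byte[] eightBit = "Grüße aus dem Café".getBytes(StandardCharsets.UTF_8);

        // base64: every 3 input bytes become 4 ASCII characters.
        System.out.println("base64:           " + Base64.getMimeEncoder().encodeToString(eightBit));

        // quoted-printable: '=' and the non-ASCII bytes are escaped as =XX, the rest stays readable.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (OutputStream qp = MimeUtility.encode(buf, "quoted-printable")) {
            qp.write(eightBit);
        }
        System.out.println("quoted-printable: " + buf.toString("US-ASCII"));
    }
}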

Once the extraction happens, the data is basically ready for use at the recipient end. However, for text types, there is also the matter of character set encoding. Many character sets are simply 7-bit or 8-bit, and so a byte in the stream corresponds to a character. But multibyte character sets do not behave like this, and so they, too, need to be encoded somehow. But this is distinct from the MIME 7bit/8bit thing described above. A character encoding tells you how the byte stream encodes multi-byte characters.

UTF-8 encodes a non-ASCII character as a sequence of 8-bit bytes (while, conveniently, ASCII characters are encoded identically to US-ASCII). The encoding has some nice properties which you can read about in Wikipedia.
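
One of those properties in miniature: ASCII stays one byte per character, anything else becomes a multi-byte sequence (plain JDK code, nothing mail-specific):

import java.nio.charset.StandardCharsets;

public class Utf8Demo {
    public static void main(String[] args) {
        // 'A' is ASCII, 'é' is Latin-1, '漢' is CJK.
        for (String s : new String[] { "A", "é", "漢" }) {
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
            System.out.printf("%s -> %d byte(s):", s, utf8.length);
            for (byte b : utf8) {
                System.out.printf(" %02X", b);
            }
            System.out.println();
        }
        // Prints: A -> 1 byte(s): 41
        //         é -> 2 byte(s): C3 A9
        //         漢 -> 3 byte(s): E6 BC A2
    }
}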

UTF-7 was never formally accepted as an official Unicode encoding, and is not in widespread use. It is not entirely compatible with US-ASCII, because the + and - characters are used to encode multibyte character sequences.

If you wish to decode UTF-7 and your language does not support the encoding, you will have to write your own decoder. The alternative is not to decode at all and to leave decoding to the downstream consumer; take care to somehow relay the character encoding downstream in that case. However, because UTF-7 is not widely supported, I would recommend recoding to UTF-8, which is widely supported and understood (and also, as mentioned, transparently compatible with US-ASCII if no multibyte characters are present).
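
If you do go the recode-to-UTF-8 route, here is a minimal sketch of the support check and the round-trip; the UTF-7 support itself would have to come from a third-party Charset provider such as jutf7, which is an assumption here:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf7Check {
    public static void main(String[] args) {
        // A stock JRE normally answers "false"; a third-party provider (e.g. jutf7)
        // on the classpath flips this to "true".
        boolean supported = Charset.isSupported("UTF-7");
        System.out.println("UTF-7 supported: " + supported);

        if (supported) {
            // "Caf+AOk-" is UTF-7 for "Café": the +...- run is a base64-coded UTF-16 code unit.
            byte[] utf7 = "Caf+AOk-".getBytes(StandardCharsets.US_ASCII);
            String decoded = new String(utf7, Charset.forName("UTF-7"));
            byte[] utf8 = decoded.getBytes(StandardCharsets.UTF_8);  // recode for downstream consumers
            System.out.println(decoded + " -> " + utf8.length + " UTF-8 bytes");
        }
    }
}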

So, just to summarize: if you change the headers, you also have to change the encoding of the content. If you are lucky (and your example is representative), the text doesn't contain any actual encoded UTF-7 multibyte characters, in which case you can safely relabel it as US-ASCII. If it does contain + characters, they start UTF-7 escape sequences which need to be decoded (though again, you could be lucky, and a sequence may just be the escape +-, which encodes a literal plus sign).
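
A minimal sketch of that "am I lucky?" test on the raw, transfer-decoded bytes; the class and helper names are made up for illustration:

public class Utf7LuckCheck {

    // True if the bytes are plain ASCII with no '+' in them: under UTF-7 rules
    // '+' is the only character that opens an escape sequence, so such text
    // can be relabeled as US-ASCII (or UTF-8) without decoding anything.
    static boolean safeToRelabel(byte[] raw) {
        for (byte b : raw) {
            if (b == '+' || (b & 0x80) != 0) {  // '+' starts an escape; a set high bit is not valid UTF-7 anyway
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(safeToRelabel("Dear Sir/madam, ...".getBytes()));  // true
        System.out.println(safeToRelabel("Caf+AOk-".getBytes()));             // false: contains a UTF-7 escape
    }
}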
