Question

Our application takes text from a web form and sends it via email to an appropriate user. However, when someone copy/pastes in the infamous "smart quotes" or other special characters from Word, things get hairy.

The user types in

he said “hello” to me—isn’t that nice?

But when the message appears in Outlook 2003, it comes out like this:

he said hello to meisnt that nice?

The code for this was:

Session session = Session.getInstance(props, new MailAuthenticator());
Message msg = new MimeMessage(session);

//removed setting to/from addresses to simplify

msg.setSubject(subject);
msg.setText(text);
msg.setHeader("X-Mailer", MailSender.class.getName());
msg.setSentDate(new Date());
Transport.send(msg);

After a little research, I figured this was probably a character encoding issue and attempted to move things to UTF-8. So, I updated the code thusly:

Session session = Session.getInstance(props, new MailAuthenticator());
MimeMessage msg = new MimeMessage(session);

//removed setting to/from addresses to simplify

msg.setHeader("X-Mailer", MailSender.class.getName());
msg.addHeader("Content-Type", "text/plain");
msg.addHeader("charset", "UTF-8");
msg.setSentDate(new Date());
Transport.send(msg);

This got me closer, but no cigar:

he said “hello” to me—isn’t that nice?

I can't imagine this is an uncommon problem--what have I missed?


Solution

Is the page with your form also using UTF-8, or a different charset? If you don't specify the webpage charset, the format of data coming to your script is anyone's guess.
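To see why a charset mismatch produces exactly the kind of garbage shown in the question, here is a minimal sketch using only the JDK. The class and method names (MojibakeDemo, misdecode) are made up for illustration, and it assumes the reading side falls back to windows-1252, which is a guess about Outlook's behaviour rather than something confirmed here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of JavaMail: simulates text that is encoded
// with one charset and then decoded by a reader that assumes another.
class MojibakeDemo {

    static String misdecode(String text, Charset sentAs, Charset readAs) {
        byte[] wire = text.getBytes(sentAs); // bytes as they travel in the message
        return new String(wire, readAs);     // what the reader reconstructs
    }

    public static void main(String[] args) {
        // Assumption: the mail client guesses windows-1252 when no charset is declared.
        Charset cp1252 = Charset.forName("windows-1252");
        String typed = "\u201Chello\u201D"; // curly quotes around "hello"

        // UTF-8 bytes read as windows-1252: each quote turns into several junk characters.
        System.out.println(misdecode(typed, StandardCharsets.UTF_8, cp1252).equals(typed)); // false

        // Same charset on both sides: the text survives.
        System.out.println(misdecode(typed, StandardCharsets.UTF_8, StandardCharsets.UTF_8).equals(typed)); // true

        // A charset that cannot represent the quotes loses them at encoding time.
        System.out.println(misdecode(typed, StandardCharsets.US_ASCII, StandardCharsets.US_ASCII)); // ?hello?
    }
}
```

The same reasoning applies at every hop: the browser form, the servlet that reads the parameter, and the mail message each need to agree on (and declare) one charset.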


Edit: the charset in the message should be set like this:

msg.addHeader("Content-Type", "text/plain; charset=UTF-8");

since charset is not a separate header but a parameter of the Content-Type header.

OTHER TIPS

Why don't you replace the nice quotes with regular prime quotes?
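If you go that route, a rough sketch could look like the following. The class name SmartQuotes and the choice of replacements (for instance, a single hyphen for a dash) are assumptions for illustration, and the list of characters is not exhaustive.

```java
// Hypothetical helper: maps the most common Word "smart" punctuation to
// plain ASCII. Unicode escapes are used so the source file's own encoding
// cannot interfere.
class SmartQuotes {

    static String toAscii(String text) {
        return text
                .replace('\u2018', '\'')   // left single quotation mark
                .replace('\u2019', '\'')   // right single quotation mark / apostrophe
                .replace('\u201C', '"')    // left double quotation mark
                .replace('\u201D', '"')    // right double quotation mark
                .replace('\u2013', '-')    // en dash
                .replace('\u2014', '-')    // em dash
                .replace("\u2026", "..."); // horizontal ellipsis
    }

    public static void main(String[] args) {
        String typed = "he said \u201Chello\u201D to me\u2014isn\u2019t that nice?";
        System.out.println(toAscii(typed));
    }
}
```

Note that this is lossy and only covers the characters listed; accented letters and other non-ASCII input would still break, so fixing the charset is the more general cure.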

I would check that the data being received from the browser is correct - dump the Unicode code points and check them against the charts:

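  // Prints each Unicode code point in s as hex, one per line.
  public static void printCodepoints(char[] s) {
    for (int i = 0; i < s.length; i++) {
      int codePoint = s[i];
      // Combine a surrogate pair into one code point; the bounds check
      // avoids reading past the end on a dangling high surrogate.
      if (Character.isHighSurrogate(s[i]) && i + 1 < s.length
          && Character.isLowSurrogate(s[i + 1])) {
        codePoint = Character.toCodePoint(s[i], s[++i]);
      }
      System.out.println(Integer.toHexString(codePoint));
    }
  }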

For example, the symbol LEFT DOUBLE QUOTATION MARK (“) is character U+201C.
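On Java 8 or later, the same check can be written against a String and return the values instead of printing them, which makes it easier to log or compare in a test. This is only a sketch; the names CodePointDump and hexCodePoints are invented here.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical variant of the dump above, using String.codePoints(),
// which already pairs up surrogates.
class CodePointDump {

    // Returns each code point formatted like the Unicode charts, e.g. "U+201C".
    static List<String> hexCodePoints(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // If the form data arrived intact, the first entry should be U+201C.
        System.out.println(hexCodePoints("\u201Chi"));
    }
}
```

If the first entry comes back as U+00E2 or U+003F instead, the text was already damaged before it reached the mail code.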

It has been a long time since I used the mail API, but the MimeMessage.setText(text, charset) method might be worth a look. The documentation on setText(String) says it uses the default character set (probably windows-1252 if you're using English/Latin-1 Windows).

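IIRC, the MS Office quotes come from the windows-1252 character set (bytes 0x91 to 0x94), which is often mislabeled as "iso-8859-1"; ISO-8859-1 itself has no curly quotes.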

Licensed under: CC-BY-SA with attribution