Question

I am trying to read an email from POP3 and change to the correct encoding when I find the charset in the headers.

I use a TCP Client to connect to the POP3 server.

Below is my code :

    public string ReadToEnd(POP3Client pop3client, out System.Text.Encoding messageEncoding)
    {
        messageEncoding = TCPStream.CurrentEncoding;
        if (EOF)
            return ("");

        System.Text.StringBuilder sb = new System.Text.StringBuilder(m_bytetotal * 2);
        string st = "";
        string tmp;

        do
        {
            tmp = TCPStream.ReadLine();
            if (tmp == ".")
                EOF = true;
            else
                sb.Append(tmp + "\r\n");

            //st += tmp + "\r\n";

            m_byteread += tmp.Length + 2; // CRLF discarded by read

            FireReceived();

            if (tmp.ToLower().Contains("content-type:") && tmp.ToLower().Contains("charset="))
            {
                try
                {
                    string charSetFound = tmp.Substring(tmp.IndexOf("charset=") + "charset=".Length).Replace("\"", "").Replace(";", "");
                    var realEnc = System.Text.Encoding.GetEncoding(charSetFound);

                    if (realEnc != TCPStream.CurrentEncoding)
                    {
                        TCPStream = new StreamReader(pop3client.m_tcpClient.GetStream(), realEnc);
                    }
                }
                catch { }
            }                
        } while (!EOF);

        messageEncoding = TCPStream.CurrentEncoding;

        return (sb.ToString());
    }

If I remove this line:

TCPStream = new StreamReader(pop3client.m_tcpClient.GetStream(), realEnc);

Everything works fine except that when the e-mail contains different charset characters I get question marks as the initial encoding is ASCII.

Any suggestions on how to change the encoding while reading data from the Network Stream?

Was it helpful?

Solution 2

There are some ways you can detect the encoding by looking at the Byte Order Mark, which are the firts few bytes of the stream. These will tell you the encoding. However, the stream might not have a BOM, and in these cases it could be ASCII, UTF without BOM, or others.

You can convert your stream from one encoding to another with the Encoding Class:

Encoding textEncoding = Encoding.[your detected encoding here];
byte[] converted = Encoding.UTF8.GetBytes(textEncoding.GetString(TCPStream.GetBuffer()));

You may select your preferred encoding when converting.

Hope it answers your question.

edit
You may use this code to read your stream in blocks.

MemoryStream st = new MemoryStream();
int numOfBytes = 1024;
int reads = 1;
while (reads > 0)
{
    byte[] bytes = new byte[numOfBytes];
    reads = yourStream.Read(bytes, 0, numOfBytes);
    if (reads > 0)
    {
        int writes = ( reads < numOfBytes ? reads : numOfBytes);
        st.Write(bytes, 0, writes);
    }
}

OTHER TIPS

You're doing it wrong (tm).

Seriously, though, you are going about trying to solve this problem in completely the wrong way. Don't use a StreamReader for this. And especially don't read 1 byte at a time (as you said you needed to do in a comment on an earlier "solution").

For an explanation of why not to use a StreamReader, besides the obvious "because it isn't designed to switch between encodings during the process of reading", feel free to read over another answer I gave about the inefficiencies of using a StreamReader here: Reading an mbox file in C#

What you need to do is buffer your reads (such as a 4k buffer should be fine). Then, as you are already having to do anyway, scan for the '\n' byte to extract content on a line-by-line basis, combining header lines that were folded.

Each header may have multiple encoded-word tokens which may each be in a separate charset, assuming they are properly encoded, otherwise you'll have to deal with undeclared 8-bit data and try to massage that into unicode somehow (probably by having a set of fallback charsets). I'd recommend trying UTF-8 first followed by a selection of charsets that the user of your library has provided before finally trying iso-8859-1 (make sure not to try iso-8859-1 until you've tried everything else, because any sequence of 8-bit text will convert properly to unicode using the iso-8859-1 character encoding).

When you get to text content of the message, you'll want to check the Content-Type header for a charset parameter. If no charset parameter is defined, it should be US-ASCII, but in practice it could be anything. Even if the charset is defined, it might not match the actual character encoding used in the text body of the message, so once again you'll probably want to have a set of fallbacks.

As you've probably guessed by this point, this is very clearly not a trivial task as it requires the parser to do on-the-fly character conversion as it goes (and the character conversion requires internal parser state about what the expected charset is at any given time).

Since I've already done the work, you should really consider using MimeKit which will parse the email and properly do charset conversion on the headers and the content using the appropriate charset encoding.

I've also written a Pop3Client class that is included in my MailKit library.

If your goal is to learn and write your own library, I'd still highly recommend reading over my code because it is highly efficient and does things in a proper way.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top