Question

HtmlEncode does not reverse what HtmlDecode does. I find this surprising, and rather annoying.

Example:

Decode(&rdquo;) => ”
Encode(”)       => ”

Relevant classes:

.NET 4:   System.Net.WebUtility
.NET 3.5: System.Web.HttpUtility

I can understand that a web page can be Unicode, but in my case the output cannot be UTF-8.

Is there something (perhaps a HtmlWriter class) that could do this without me having to re-invent the wheel?

Alternative solution:

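// Requires: using System.Text;
// Replaces every non-ASCII character with a hexadecimal character reference.
// It does not escape <, > or &; run HtmlEncode first if that is needed.
static string HtmlUnicodeEncode(string input)
{
    var sb = new StringBuilder();

    for (int i = 0; i < input.Length; i++)
    {
        char c = input[i];
        if (c <= 127)
        {
            sb.Append(c);
            continue;
        }

        // Combine a surrogate pair into one code point, so that characters
        // outside the BMP get a single reference instead of two invalid ones.
        int codePoint = c;
        if (char.IsHighSurrogate(c) && i + 1 < input.Length && char.IsLowSurrogate(input[i + 1]))
        {
            codePoint = char.ConvertToUtf32(c, input[++i]);
        }
        sb.AppendFormat("&#x{0:X4};", codePoint);
    }

    return sb.ToString();
}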

Solution

It is impossible to produce an isomorphic HTML codec pair. Consider:

HtmlDecode("&rdquo;&#8221;&#x201D;&#x201d;&#x0201D;") -> ”””””

How do you get back from ””””” to the original string?
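The many-to-one mapping can be checked directly. The following is a minimal sketch, assuming .NET 4's System.Net.WebUtility; the class and member names (EntityForms, Spellings, DecodeAll) are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Net;

static class EntityForms
{
    // Several spellings of U+201D (right double quotation mark).
    public static readonly string[] Spellings =
    {
        "&rdquo;", "&#8221;", "&#x201D;", "&#x201d;"
    };

    // Decodes every spelling and returns the distinct results.
    public static string[] DecodeAll()
    {
        return Spellings
            .Select(s => WebUtility.HtmlDecode(s))
            .Distinct()
            .ToArray();
    }
}
```

If all spellings collapse to a single decoded string, no encoder can know which of them the original text used.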

HtmlEncode has to pick one encoding for ”, and it goes for the literal character ” as the shortest, most readable alternative. As long as you've got working Unicode, that's almost certainly the best choice.

If you don't, that's another argument... the advantage of &rdquo; is that it's slightly more readable than &#8221;, but it only works in HTML (not XML) and you still have to fall back to character references for all the Unicode characters that don't have built-in entity names, so it's less consistent. For a character-reference encoder, create an XmlTextWriter using the ASCII encoding and call WriteString on it.
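A character-reference encoder along those lines might look like the sketch below. Note the substitution: instead of constructing an XmlTextWriter directly, it uses XmlWriter.Create with XmlWriterSettings, on the assumption that a writer created this way escapes characters the target encoding cannot represent. The class name CharRefEncoder is made up for illustration.

```csharp
using System.IO;
using System.Text;
using System.Xml;

static class CharRefEncoder
{
    // Writes the input as XML text into an ASCII-encoded stream and
    // returns the resulting bytes as a string.
    public static string Encode(string input)
    {
        var settings = new XmlWriterSettings
        {
            Encoding = Encoding.ASCII,                    // target encoding
            ConformanceLevel = ConformanceLevel.Fragment, // allow bare text
            OmitXmlDeclaration = true                     // no <?xml ...?> header
        };

        using (var stream = new MemoryStream())
        {
            using (var writer = XmlWriter.Create(stream, settings))
            {
                writer.WriteString(input);
            }
            return Encoding.ASCII.GetString(stream.ToArray());
        }
    }
}
```

WriteString also escapes <, > and &, so the input should be raw text rather than markup that has already been encoded.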

Licensed under: CC-BY-SA with attribution