Why are HtmlEncode and HtmlDecode not isomorphic in .NET?

https://stackoverflow.com/questions/16057398

04-04-2022

Question

I find this surprising, and rather annoying.

Example:

Decode(&rdquo;) => ”
Encode(”)       => ”

Relevant classes:

.NET 4:   System.Net.WebUtility
.NET 3.5: System.Web.HttpUtility

I can understand that a web page can be Unicode, but in my case the output cannot be UTF-8.

Is there something (perhaps a HtmlWriter class) that could do this without me having to re-invent the wheel?

Alternative solution:

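// Requires: using System.Text;
string HtmlUnicodeEncode(string input)
{
    var sb = new StringBuilder();

    for (int i = 0; i < input.Length; i++)
    {
        char c = input[i];

        if (c <= 127)
        {
            // ASCII passes through unchanged (markup characters are not escaped here).
            sb.Append(c);
        }
        else if (char.IsSurrogatePair(input, i))
        {
            // Emit one reference for the whole code point, not one per UTF-16 surrogate.
            sb.AppendFormat("&#x{0:X};", char.ConvertToUtf32(input, i));
            i++; // skip the low surrogate
        }
        else
        {
            sb.AppendFormat("&#x{0:X4};", (int)c);
        }
    }

    return sb.ToString();
}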

Solution

It is impossible to produce an isomorphic HTML codec pair. Consider:

HtmlDecode("&rdquo;”&#x201D;&#x201d;&#8221;") -> ”””””

how do you get back from ””””” to the original string?
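
A minimal sketch of why the decode step loses information, using System.Net.WebUtility (the .NET 4 class named in the question). The DecodeDemo class and its Spellings field are names made up for this illustration. Every spelling from the example above decodes to the same single character, so nothing in the decoded string records which spelling was used.

```csharp
using System;
using System.Net;

public static class DecodeDemo
{
    // Five spellings of U+201D, taken from the example above (illustrative field name).
    public static readonly string[] Spellings =
    {
        "&rdquo;", "\u201D", "&#x201D;", "&#x201d;", "&#8221;"
    };

    public static void Main()
    {
        foreach (string s in Spellings)
        {
            string decoded = WebUtility.HtmlDecode(s);

            // Print the code point rather than the glyph so the result
            // is readable even on a console that cannot display U+201D.
            Console.WriteLine("{0,-10} -> U+{1:X4}", s, (int)decoded[0]);
        }
    }
}
```

Because decoding is many-to-one, any encoder can at best return one canonical spelling per character.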

HtmlEncode has to pick one encoding for ”, and it goes for the literal character ” as the shortest, most readable alternative. As long as you've got working Unicode, that's almost certainly the best choice.

If you don't, that's another argument... the advantage of &rdquo; is that it's slightly more readable than &#8221;, but it only works in HTML (not XML) and you still have to fall back to character references for all the Unicode characters that don't have built-in entity names, so it's less consistent. For a character-reference encoder, create an XmlTextWriter using the ASCII encoding and call WriteString on it.
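
A hedged sketch of such a character-reference encoder follows. Note that it swaps in XmlWriter.Create with XmlWriterSettings instead of constructing an XmlTextWriter directly, and that the AsciiMarkup class and its Encode method are hypothetical names. It assumes the writer replaces characters that the ASCII encoding cannot represent with numeric character references in text content, which is worth verifying on your target framework.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;
using System.Xml;

public static class AsciiMarkup
{
    // Hypothetical helper: escapes markup characters in text content and
    // relies on the writer to emit non-ASCII characters as character references.
    public static string Encode(string input)
    {
        var settings = new XmlWriterSettings
        {
            Encoding = Encoding.ASCII,                    // target encoding cannot hold U+201D
            ConformanceLevel = ConformanceLevel.Fragment, // allow bare text with no root element
            OmitXmlDeclaration = true
        };

        using (var stream = new MemoryStream())
        {
            using (var writer = XmlWriter.Create(stream, settings))
            {
                writer.WriteString(input);
            } // disposing the writer flushes it into the stream

            return Encoding.ASCII.GetString(stream.ToArray());
        }
    }

    public static void Main()
    {
        string encoded = Encode("He said \u201Chi\u201D & left");
        Console.WriteLine(encoded);

        // Decoding should give back the original text.
        Console.WriteLine(WebUtility.HtmlDecode(encoded));
    }
}
```

WriteString escapes text content only, so quotes are left alone; a value destined for an attribute would need separate handling. Characters that are not legal in XML 1.0, such as most control characters, may be rejected by the writer.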

Licensed under: CC-BY-SA with attribution
