C# HtmlEncode - ISO-8859-1 Entity Names vs Numbers

https://stackoverflow.com/questions/4853920

27-10-2019
|

문제

According to the following table for the ISO-8859-1 standard, there seems to be an entity name and an entity number associated with each reserved HTML character.

So for example, for the character é :

Entity Name : é

Entity Number : é

Similarly, for the character > :

Entity Name : >

Entity Number : >

For a given string, the HttpUtility.HtmlEncode returns an HTML encoded String, but I can't figure out how it works. Here is what I mean :

Console.WriteLine(HtmlEncode("é>"));
//Outputs &#233;&gt;

It seems to be using the entity number for the é character but the entity name for the > character.

So does the HtmlEncode method really work with the ISO-8859-1 standard? If it does, is there a reason why it sometimes uses the entity name and other times the entity number? More importantly, can I force it to give me the entity name reliably?

EDIT : Thanks for the answers guys. I cannot decode the string before I perform the search though. Without getting into too many details, the text is stored in a SharePoint List and the "search" is done by SharePoint itself (using a CAML query). So basically, I can't.

I'm trying to think of a way to convert the entity numbers into names, is there a function in .NET that does that? Or any other idea?

해결책

That's how the method has been implemented. For some known characters it uses the corresponding entity and for everything else it uses the corresponding hex value and there is not much you could do to modify this behavior. Excerpt from the implementation of System.Net.WebUtility.HtmlEncode (as seen with reflector):

...
if (ch <= '>')
{
    switch (ch)
    {
        case '&':
        {
            output.Write("&amp;");
            continue;
        }
        case '\'':
        {
            output.Write("&#39;");
            continue;
        }
        case '"':
        {
            output.Write("&quot;");
            continue;
        }
        case '<':
        {
            output.Write("&lt;");
            continue;
        }
        case '>':
        {
            output.Write("&gt;");
            continue;
        }
    }
    output.Write(ch);
    continue;
}
if ((ch >= '\x00a0') && (ch < 'Ā'))
{
    output.Write("&#");
    output.Write(((int) ch).ToString(NumberFormatInfo.InvariantInfo));
    output.Write(';');
}
...

This being said you shouldn't care as this method will always produce valid, safe and correctly encoded HTML.

다른 팁

HtmlEncode is following the spec. The ISO standard specifies both a name and a number for every entity, and the name and the number are equivalent. Therefore, a conforming implementation of HtmlEncode is free to encode all points as numbers, or all as names, or some mixture of the two.

I suggest that you approach your problem from the other direction: call HtmlDecode on the target text, then search through the decoded text using the raw string.

ISO-8859-1 is not really relevant to HTML character encoding. From Wikipedia:

Numeric references always refer to Unicode code points, regardless of the page's encoding.

Only for undefined Unicode code points ISO-8859-1 is often used:

Using numeric references that refer to permanently undefined characters and control characters is forbidden, with the exception of the linefeed, tab, and carriage return characters. That is, characters in the hexadecimal ranges 00–08, 0B–0C, 0E–1F, 7F, and 80–9F cannot be used in an HTML document, not even by reference, so "™", for example, is not allowed. However, for backward compatibility with early HTML authors and browsers that ignored this restriction, raw characters and numeric character references in the 80–9F range are interpreted by some browsers as representing the characters mapped to bytes 80–9F in the Windows-1252 encoding.

Now to answer your question: For search to work best, you should really search the unencoded HTML (stripping the HTML tags first) using an unencoded search string. Matching encoded strings will lead to unexpected results, like hits based on HTML tags or comments, and hits missing because of differences in the HTML that are invisible in the text.

I made this function, I think it will help

        string BasHtmlEncode(string x)
        {
           StringBuilder sb = new StringBuilder();
           foreach (char c in x.ToCharArray())
               sb.Append(String.Format("&#{0};", Convert.ToInt16(c)));
           return(sb.ToString());
        }

I developed following code to keep a-z,A-Z and 0-1 not encoded but rest:

public static string Encode(string source)
{
    if (string.IsNullOrEmpty(source)) return string.Empty;

    var sb = new StringBuilder(source.Length);
    foreach (char c in source)
    {
        if (c >= 'a' && c <= 'z')
        {
            sb.Append(c);
        }
        else if (c >= 'A' && c <= 'Z')
        {
            sb.Append(c);
        }
        else if (c >= '0' && c <= '9')
        {
            sb.Append(c);
        }
        else
        {
            sb.AppendFormat("&#{0};",Convert.ToInt32(c));
        }
    }

    return sb.ToString();
}

라이센스 : CC-BY-SA ~와 함께 속성

제휴하지 않습니다 StackOverflow