Question

I have looked around a lot but have not been able to find a built-in .NET method that escapes the special XML characters (<, >, &, ' and ") only when they are not part of a tag.

For example, take the following text:

Test& <b>bold</b> <i>italic</i> <<Tag index="0" />

I want it to be converted to:

Test&amp; <b>bold</b> <i>italic</i> &lt;<Tag index="0" />

Notice that the tags are not escaped. I need to assign this value to the InnerXml of an XmlElement, so those tags must be preserved.

I have looked into implementing my own parser, using a StringBuilder to optimize it as much as I can, but it gets pretty nasty.

I also know which tags are acceptable, which may simplify things (only: br, b, i, u, blink, flash, Tag). In addition, these tags can be self-closing tags (e.g. <u />) or container tags (e.g. <u>...</u>).

Solution

NOTE: This could probably be optimised; it is just something I knocked up quickly. Also note that I am not validating the tags themselves: the expression simply treats anything wrapped in angle brackets as a tag, and it leaves existing entities such as &amp; alone. It will fail if an angle bracket appears inside a tag (e.g. <sometag label="I put an > here">). Other than that, I think it should do what you're asking for.

namespace ConsoleApplication1
{
    using System;
    using System.Text.RegularExpressions;

    class Program
    {
        static void Main(string[] args)
        {
            // This is the test string.
            const string testString = "Test& <b>bold</b> <i>italic</i> <<Tag index=\"0\" />";

            // Do a regular expression search and replace. The alternatives are tried in order:
            //   Tag     - a complete tag, which is returned unchanged;
            //   Ampy    - an existing entity or character reference (e.g. &amp; or &#60;), also returned unchanged;
            //   Special - a single character that needs escaping.
            string result = Regex.Replace(testString, @"(?'Tag'<[^<>]*>)|(?'Ampy'&#?[A-Za-z0-9]+;)|(?'Special'[<>""'&])", (match) =>
                {
                    // If a special (escapable) character was found, replace it.
                    if (match.Groups["Special"].Success)
                    {
                        switch (match.Groups["Special"].Value)
                        {
                            case "<":
                                return "&lt;";
                            case ">":
                                return "&gt;";
                            case "\"":
                                return "&quot;";
                            case "\'":
                                return "&apos;";
                            case "&":
                                return "&amp;";
                            default:
                                return match.Groups["Special"].Value;
                        }
                    }

                    // Otherwise, just return what was found.
                    return match.Value;
                });

            // Show the result.
            Console.WriteLine("Test String: " + testString);
            Console.WriteLine("Result     : " + result);
            Console.ReadKey();
        }
    }
}
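
Since the question lists the acceptable tags, the pattern above could be tightened so that only those tags are passed through and anything else in angle brackets gets escaped. What follows is a minimal sketch under that assumption, not part of the original answer. The names WhitelistEscaper, Escape and AllowedNames are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: escapes XML special characters except inside whitelisted tags.
public static class WhitelistEscaper
{
    // Tag names taken from the question. Matching is case-sensitive, as in XML.
    private const string AllowedNames = "br|blink|flash|Tag|b|i|u";

    private static readonly Regex Pattern = new Regex(
        // An opening, closing or self-closing tag whose name is on the whitelist...
        @"(?'Tag'</?(?:" + AllowedNames + @")(?:\s[^<>]*)?/?>)" +
        // ...or an existing entity or character reference...
        @"|(?'Ampy'&#?[A-Za-z0-9]+;)" +
        // ...or a single character that needs escaping.
        @"|(?'Special'[<>""'&])");

    public static string Escape(string input)
    {
        return Pattern.Replace(input, match =>
        {
            // Tags and existing entities are returned unchanged.
            if (!match.Groups["Special"].Success)
            {
                return match.Value;
            }

            switch (match.Value)
            {
                case "<": return "&lt;";
                case ">": return "&gt;";
                case "\"": return "&quot;";
                case "'": return "&apos;";
                default: return "&amp;"; // the only remaining character in the class is &
            }
        });
    }
}
```

With this, WhitelistEscaper.Escape("<script>x</script> <u />") should give "&lt;script&gt;x&lt;/script&gt; <u />". Attribute values are still not inspected, so a > inside an attribute will break it just as before.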

OTHER TIPS

I personally don't think it is possible, because you are really trying to fix malformed HTML, and therefore there are no rules which you can use to determine what is to be encoded and what isn't.

Any which way you look at it, something like <<Tag index="0" /> is not valid HTML.

If you know the actual tags, you may be able to create a whitelist, which could simplify things. But you will have to attack your problem more specifically; I do not think you will be able to solve this for every scenario.

In fact, chances are you haven't actually got any random < or > lying around in your text, which would (probably) greatly simplify the problem. But if you are really trying to come up with a generic solution, I wish you luck.

Here's a regular expression you can use that will match any invalid < or >.

(\<(?! ?/?(?:b|i|br|u|blink|flash|Tag[^>]*))|(?<! ?/?(?:b|i|br|u|blink|flash|Tag[^>]*))\>)

I suggest putting the valid tag-test expression into a variable and then constructing the rest around it.

var validTags = "b|i|br|u|blink|flash|Tag[^>]*";
var startTag = @"\<(?! ?/?(?:" + validTags + "))";
var endTag = @"(?<! ?/?(?:" + validTags + @"))\>";

Then just do Regex.Replace on these.
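
To make that last step concrete, here is a rough sketch of the two replacements applied one after the other. The closing pattern ends in \> as in the combined expression. The class name StrayBracketEscaper and the method name Escape are invented for this example, and & is deliberately left alone because the patterns only cover < and >.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the two patterns described above.
public static class StrayBracketEscaper
{
    private const string ValidTags = "b|i|br|u|blink|flash|Tag[^>]*";

    // A < that is not followed by one of the valid tag names.
    private static readonly Regex StrayOpen =
        new Regex(@"\<(?! ?/?(?:" + ValidTags + "))");

    // A > that is not preceded by one of the valid tag names.
    private static readonly Regex StrayClose =
        new Regex(@"(?<! ?/?(?:" + ValidTags + @"))\>");

    public static string Escape(string input)
    {
        // First pass: escape stray opening brackets.
        string result = StrayOpen.Replace(input, "&lt;");

        // Second pass: escape stray closing brackets.
        return StrayClose.Replace(result, "&gt;");
    }
}
```

For the question's sample text this leaves the tags alone and turns the extra < into &lt;. Any & would have to be escaped in a separate step before these two passes, otherwise the new &lt; and &gt; would be escaped again. By my reading there are two further limits. The lookbehind expects a tag name directly before the >, so a self-closing <u /> comes out as <u /&gt;. The names are also matched as prefixes, so the < of something like <blah> is left alone.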

Licensed under: CC-BY-SA with attribution