Question

The .NET web system I'm working on allows the end user to input HTML formatted text in some situations. In some of those places, we want to leave all the tags, but strip off any trailing break tags (but leave any breaks inside the body of the text.)

What's the best way to do this? (I can think of ways to do this, but I'm sure they're not the best.)

Was it helpful?

Solution

As @Mitch said,

//  using System.Text.RegularExpressions;

/// <summary>
///  Regular expression built for C# on: Thu, Sep 25, 2008, 02:01:36 PM
///  Using Expresso Version: 2.1.2150, http://www.ultrapico.com
///  
///  A description of the regular expression:
///  
///  Match expression but don't capture it. [\<br\s*/?\>], any number of repetitions
///      \<br\s*/?\>
///          <
///          br
///          Whitespace, any number of repetitions
///          /, zero or one repetitions
///          >
///  End of line or string
///  
///  
/// </summary>
public static Regex regex = new Regex(
    @"(?:\<br\s*/?\>)*$",
    RegexOptions.IgnoreCase
    | RegexOptions.CultureInvariant
    | RegexOptions.IgnorePatternWhitespace
    | RegexOptions.Compiled
    );
regex.Replace(text, string.Empty);

OTHER TIPS

Small change to bdukes code, which should be faster as it doesn't backtrack.

public static Regex regex = new Regex(
    @"(?:\<br[^>]*\>)*$",
    RegexOptions.IgnoreCase
    | RegexOptions.CultureInvariant
    | RegexOptions.IgnorePatternWhitespace
    | RegexOptions.Compiled
);
regex.Replace(text, string.Empty);

I'm sure this isn't the best way either, but it should work unless you have trailing spaces or something.

while (myHtmlString.EndsWith("<br>"))
{
    myHtmlString = myHtmlString.SubString(0, myHtmlString.Length - 4);
}

I'm trying to ignore the ambiguity in your original question, and read it literally. Here is an extension method that overloads TrimEnd to take a string.

static class StringExtensions
{
    public static string TrimEnd(this string s, string remove)
    {
        if (s.EndsWith(remove))
        {
            return s.Substring(0, s.Length - remove.Length);
        }
        return s;
    }
}

Here are some tests to show that it works:

        Debug.Assert("abc".TrimEnd("<br>") == "abc");
        Debug.Assert("abc<br>".TrimEnd("<br>") == "abc");
        Debug.Assert("<br>abc".TrimEnd("<br>") == "<br>abc");

I want to point out that this solution is easier to read than regex, probably faster than regex (you should use a profiler, not speculation, if you're concerned about performance), and useful for removing other things from the ends of strings.

regex becomes more appropriate if your problem is more general than you stated (e.g., if you want to remove <BR> and </BR> and deal with trailing spaces or whatever.

You can use a regex to find and remove the text with the regex match set to anchor at the end of the string.

You could also try (if the markup is likely to be a valid tree) something similar to:

string s = "<markup><div>Text</div><br /><br /></markup>";

XmlDocument doc = new XmlDocument();
doc.LoadXml(s);

Console.WriteLine(doc.InnerXml);

XmlElement markup = doc["markup"];
int childCount = markup.ChildNodes.Count;
for (int i = childCount -1; i >= 0; i--)
{
    if (markup.ChildNodes[i].Name.ToLower() == "br")
    {
        markup.RemoveChild(markup.ChildNodes[i]);
    }
    else
    {
        break;
    }
}
Console.WriteLine("---");
Console.WriteLine(markup.InnerXml); 
Console.ReadKey();

The code above is a bit "scratch-pad" but if you cut and paste it into a Console application and run it, it does work :=)

you can use RegEx or check if the trailing string is a break and remove it

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top