Question

So I just got my site kicked off the server today and I think this function is the culprit. Can anyone tell me what the problem is? I can't seem to figure it out:

Public Function CleanText(ByVal str As String) As String
    'removes HTML tags and other characters that title tags and descriptions don't like
    If Not String.IsNullOrEmpty(str) Then
        'mini db of extended tags to get rid of
        Dim indexChars() As String = {"<a", "<img", "<input type=""hidden"" name=""tax""", "<input type=""hidden"" name=""handling""", "<span", "<p", "<ul", "<div", "<embed", "<object", "<param"}

        For i As Integer = 0 To indexChars.GetUpperBound(0) 'loop through indexchars array
            Dim indexOfInput As Integer = 0
            Do 'get rid of links
                indexOfInput = str.IndexOf(indexChars(i)) 'find instance of indexChar
                If indexOfInput <> -1 Then
                    Dim indexNextLeftBracket As Integer = str.IndexOf("<", indexOfInput) + 1
                    Dim indexRightBracket As Integer = str.IndexOf(">", indexOfInput) + 1
                    'check to make sure a right bracket hasn't been left off a tag
                    If indexNextLeftBracket > indexRightBracket Then 'normal case
                        str = str.Remove(indexOfInput, indexRightBracket - indexOfInput)
                    Else
                        'add the right bracket right before the next left bracket, just remove everything
                        'in the bad tag
                        str = str.Insert(indexNextLeftBracket - 1, ">")
                        indexRightBracket = str.IndexOf(">", indexOfInput) + 1
                        str = str.Remove(indexOfInput, indexRightBracket - indexOfInput)
                    End If
                End If
            Loop Until indexOfInput = -1
        Next
    End If
    Return str
End Function

Solution

Wouldn't something like this be simpler? (OK, I know it's not identical to the posted code):

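using System.Text.RegularExpressions;

public string StripHTMLTags(string text)
{
    // Lazily match from each '<' to the nearest '>', across newlines, and drop it.
    return Regex.Replace(text, @"<(.|\n)*?>", string.Empty);
}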

(Conversion to VB.NET should be trivial!)

Note: if you are running this often, there are two performance improvements you can make to the Regex.

One is to use a pre-compiled expression (RegexOptions.Compiled), which requires rewriting the method slightly so that the Regex object is created once and reused.

The second is to use a non-capturing form of the regular expression; .NET regular expressions implement the (?:) syntax, which allows for grouping to be done without incurring the performance penalty of captured text being remembered as a backreference. Using this syntax, the above regular expression could be changed to:

@"<(?:.|\n)*?>"
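
For what it's worth, here is a rough sketch of how the two tweaks might look together. The names HtmlStripper and TagPattern are made up for illustration, and returning an empty string for null input is my own assumption, not something the original method did.

```csharp
using System.Text.RegularExpressions;

public static class HtmlStripper
{
    // Built once and reused. RegexOptions.Compiled trades a slower start
    // for faster matching; (?:...) groups without recording a capture.
    private static readonly Regex TagPattern =
        new Regex(@"<(?:.|\n)*?>", RegexOptions.Compiled);

    public static string StripHtmlTags(string text)
    {
        // Assumption: treat null as empty instead of letting Replace throw.
        return string.IsNullOrEmpty(text)
            ? string.Empty
            : TagPattern.Replace(text, string.Empty);
    }
}
```

Note that a fragment with no closing bracket, such as <a on its own, does not match this pattern and is left in place.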

OTHER TIPS

This line is also wrong:

Dim indexNextLeftBracket As Integer = str.IndexOf("<", indexOfInput) + 1

It's guaranteed to always set indexNextLeftBracket equal to indexOfInput + 1, because at this point the character at the position referred to by indexOfInput is already always a '<', so the search just finds the tag's own bracket. Do this instead:

Dim indexNextLeftBracket As Integer = str.IndexOf("<", indexOfInput+1) + 1

Also add a clause to the If statement to make sure your string is long enough for that expression, and handle the case where IndexOf returns -1 because there is no later '<'.
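
If you do want to keep the IndexOf approach, something along these lines might be a starting point (in C#, but it maps directly to VB.NET). It is only a sketch: TagPrefixRemover is a made-up name, the case-insensitive comparison is my assumption, and a prefix like "<a" will still match longer tag names such as <abbr, just as in the posted code.

```csharp
using System;

public static class TagPrefixRemover
{
    // Removes every tag that starts with one of the given prefixes, e.g. "<a".
    public static string RemoveTags(string text, params string[] prefixes)
    {
        if (string.IsNullOrEmpty(text)) return text;

        foreach (string prefix in prefixes)
        {
            if (string.IsNullOrEmpty(prefix)) continue; // IndexOf("") is always 0

            int start = text.IndexOf(prefix, StringComparison.OrdinalIgnoreCase);
            while (start != -1)
            {
                // Search from start + 1 so the tag's own '<' is skipped.
                int nextLeft = text.IndexOf('<', start + 1);
                int right = text.IndexOf('>', start + 1);

                int end; // exclusive end of the span to remove
                if (right != -1 && (nextLeft == -1 || right < nextLeft))
                    end = right + 1;   // well-formed tag: remove through '>'
                else if (nextLeft != -1)
                    end = nextLeft;    // '>' missing: remove up to the next '<'
                else
                    end = text.Length; // '>' missing and nothing follows

                // end > start always holds, so every pass shortens the
                // string and the loop is guaranteed to finish.
                text = text.Remove(start, end - start);
                start = text.IndexOf(prefix, start, StringComparison.OrdinalIgnoreCase);
            }
        }
        return text;
    }
}
```

With this version RemoveTags("x<a href>y", "<a") gives "xy", and both "<a" and "<a<a<a" come back as an empty string instead of looping.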

Finally, as others have said, this code will be a beast to maintain, if you can get it working at all. Best to look for another solution, like a regex or even just replacing all '<' with &lt;.
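
A minimal sketch of the encoding route, assuming .NET 4.0 or later where System.Net.WebUtility is available. TitleText and Escape are names I made up, and mapping null to an empty string is my own choice.

```csharp
using System.Net;

public static class TitleText
{
    // Encodes markup characters instead of trying to strip tags, so any
    // tag in the input ends up as harmless text.
    public static string Escape(string text)
    {
        return WebUtility.HtmlEncode(text ?? string.Empty);
    }
}
```

For example, Escape("<b>") returns "&lt;b&gt;". The trade-off is that the markup stays visible as text rather than disappearing.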

In addition to the other good answers, you might read up a little on loop invariants. Pulling stuff out of, and putting stuff back into, the very string you check to terminate your loop should set off all manner of alarm bells. :)

Just a guess, but is this line the culprit?

indexOfInput = str.IndexOf(indexChars(i)) 'find instance of indexChar

Per the Microsoft docs, the return value is "the index position of value if that string is found, or -1 if it is not. If value is Empty, the return value is 0."

So perhaps indexOfInput is being set to 0?

What happens if your code tries to clean the string <a?

As I read it, it finds the indexChar at position 0. indexNextLeftBracket then becomes 1 (the search starts at the tag's own <, so IndexOf returns 0, plus 1) and indexRightBracket becomes 0 (there is no >, so IndexOf returns -1, plus 1). That sends you into the "normal case" branch, which calls str.Remove(0, 0) and removes nothing, leaving you with <a. Then the code finds the <a in the string again, and you're off to the races with an infinite loop.

Even if I'm wrong, you need to get yourself some unit tests to reassure yourself that these edge cases work properly. That should also help you find the actual looping code if I'm off-base.
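
As a rough idea of what such tests could look like without pulling in a test framework, here is a small harness. EdgeCaseChecks is a made-up name, the inputs are the cases mentioned in this thread, the two-second limit is arbitrary, and checking only for a leftover "<a" is a simplification.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class EdgeCaseChecks
{
    private static readonly string[] Inputs =
    {
        "<a", "<a<a<a", "<a>Test</a>", "x<a href=\"y\">z", "", "plain text"
    };

    // Returns one message per failed case; an empty list means all passed.
    public static List<string> Run(Func<string, string> clean)
    {
        var failures = new List<string>();
        foreach (string input in Inputs)
        {
            Task<string> work = Task.Run(() => clean(input));
            bool finished;
            try
            {
                // A cleaner that never returns is reported, not waited on.
                finished = work.Wait(TimeSpan.FromSeconds(2));
            }
            catch (AggregateException ex)
            {
                failures.Add("threw " + ex.InnerException.GetType().Name + " on: " + input);
                continue;
            }
            if (!finished)
            {
                failures.Add("timed out on: " + input);
                continue;
            }
            string output = work.Result ?? string.Empty;
            if (output.IndexOf("<a", StringComparison.OrdinalIgnoreCase) != -1)
                failures.Add("left a tag in: " + input);
        }
        return failures;
    }
}
```

You would pass your cleaning function in as the delegate. A timed-out call keeps spinning on a background thread, so this is only suitable for a quick check, not for production monitoring.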

Generally speaking though, even if you fix this particular bug, it's never going to be very robust. Parsing HTML is hard, and HTML blacklists are always going to have holes. For instance, if I really want to get a <input type="hidden" name="tax" tag in, I'll just write it as <input name="tax" type="hidden" and your code will ignore it. Your better bet is to get an actual HTML parser involved, and to only allow the (very small) subset of tags that you actually want. Or even better, use some other form of markup, and strip all HTML tags (again using a real HTML parser of some description).
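
To make the blacklist hole concrete, this tiny check mirrors the literal prefix search the posted function relies on. BlacklistGap and ContainsListedPrefix are hypothetical names used only for this illustration.

```csharp
using System;

public static class BlacklistGap
{
    // True when the literal prefix occurs anywhere in the markup.
    public static bool ContainsListedPrefix(string html, string prefix)
    {
        return html.IndexOf(prefix, StringComparison.Ordinal) != -1;
    }
}
```

With the prefix <input type="hidden" name="tax" this returns true for a tag written in that attribute order and false for <input name="tax" type="hidden">, so the reordered tag would pass through untouched.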

I'd have to run it through a real compiler, but the mindpiler tells me that the str = str.Remove(indexOfInput, indexRightBracket - indexOfInput) line is regenerating an invalid tag, so that when you loop through again it finds the same mistake, "fixes" it, tries again, finds the mistake, "fixes" it, and so on.

FWIW, here's a snippet of code that removes unwanted HTML tags from a string (it's in C#, but the concept translates):

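using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Built once and reused. TagName captures the element name without the "/"
// of a closing tag, so <a> and </a> are judged by the same allow-list entry.
private static readonly Regex TagRegex = new Regex(
    @"</?(?<TagName>[a-z][a-z0-9]*)[^<>]*>",
    RegexOptions.Compiled | RegexOptions.IgnoreCase );

public static string RemoveTags( string html, params string[] allowList )
{
    if( html == null ) return null;
    return TagRegex.Replace( 
               html, 
               new MatchEvaluator( 
                   new TagMatchEvaluator( allowList ).Replace ) );
}

MatchEvaluator class

private class TagMatchEvaluator
{
    // Case-insensitive, so "A" and "a" match the same allow-list entry.
    private readonly HashSet<string> _allowed;

    public TagMatchEvaluator( string[] allowList ) 
    { 
        _allowed = new HashSet<string>( allowList, StringComparer.OrdinalIgnoreCase ); 
    }

    public string Replace( Match match )
    {
        // Keep allowed tags verbatim; drop everything else.
        if( _allowed.Contains( match.Groups[ "TagName" ].Value ) )
            return match.Value;
        return "";
    }
}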

That doesn't seem to work for a simplistic <a<a<a case, or even <a>Test</a>. Did you test this at all?

Personally, I hate string parsing like this - so I'm not going to even try figuring out where your error is. It'd require a debugger, and more headache than I'm willing to put in.

Licensed under: CC-BY-SA with attribution