Question

I'm working on a search function for an MVC C# app that will place a (possibly large) chunk of text through a filter, and given the search query, will place an html <span> with a highlighted style before and after each search term.

I've got a simple algorithm working, but I've got a feeling it will be slow, probably because of the amount of strings that will need to be created (2 * the number of matches).

public static string Surround(string original, string head, string tail, string match, StringComparison comparer)
{
    var ret = original;

    if (ret.IndexOf(match, 0, comparer) != -1)
    {
        var lastIndex = 0;

        while ((lastIndex = ret.IndexOf(match, lastIndex, comparer)) != -1)
        {
            ret = ret.Insert(lastIndex, head);
            var tailIndex = lastIndex + match.Length + head.Length;
            lastIndex = tailIndex;
            ret = ret.Insert(tailIndex, tail);
        }
    }

    return ret;
}

I'm wondering if anyone can give some hints for a better algorithm that would perform better for large chunks of text? I was thinking of using a stringbuilder, but it's also occurred to me that I could be approaching this from entirely the wrong way. Any insight would be greatly appreciated.

Was it helpful?

Solution

A regular expression will do the job, and the code should be a lot simpler. However you'd need to test to determine if it actually delivers better performance. Something like this:

public static string Surround(
    string original, string head, string tail, string match)
{
    return Regex.Replace(
        original, match, head + "$0" + tail, RegexOptions.IgnoreCase);
}

Even better if you can pass in the replacer whole as you save 2N string concats:

public static string Surround(string original, string replacer, string match)
{
    return Regex.Replace(original, match, replacer, RegexOptions.IgnoreCase);
}

Surround("foo bar baz", "<span>$&</span>", "bar");  //call like so

OTHER TIPS

A StringBuilder will of course be much faster, but never the less when you do an .Insert you will be moving around a whole lot of data each time. So it would be better to build up the result in the StringBuilder using only .Append. Not .Insert.

However, you should also be able to use a RegularExpression for this, although I don't know the syntax by heart (I use a wonderful tool called RegEx Buddy to build my regular expressions when I have the need.).

EDIT: Very few people in this world have the ability to distinguish a regular expression from tranismission line noise. :-)

Actually RegEx are not working well on large chunks of text. Consider using NFAs instead.

http://swtch.com/~rsc/regexp/regexp1.html

Luke's answer is good, but you might want to escape any special characters in "match" first in case someone uses a .\^$? etc in their search (unless you want to allow that of course).

ETA: Allowing special characters would allow for a more powerful search, but would result in unexpected output as well, since the found text would be replaced with the pattern rather than the actual matching text.

The RegEx solution is definitely more elegant than what I've produced. However, using StringBuilder is generally a little faster and doesn't need the search term and the pre-/post-fixes to be regex-escaped.

private static string Surround(string original, string head, string tail, string match, StringComparison comparisonType)
{
  if (string.IsNullOrEmpty(original) || string.IsNullOrEmpty(match) || (string.IsNullOrEmpty(head) && string.IsNullOrEmpty(tail)))
    return original;

  var resultBuilder = new StringBuilder();
  int matchLength = match.Length;
  int lastIdx = 0;

  for (;;)
  {
    int curIdx = original.IndexOf(match, lastIdx, comparisonType);

    if (curIdx > -1)
      resultBuilder
        .Append(original, lastIdx, curIdx - lastIdx)
        .Append(head)
        .Append(original, curIdx, matchLength)
        .Append(tail);
    else
      return resultBuilder.Append(original.Substring(lastIdx)).ToString();

    lastIdx = curIdx + matchLength;
  }
}

I hope you can use it.

Update Doing a little perf test it seems that my solution is only faster if you are searching for "short" words. If the word is long (ie. length > 7) Regex wins, but if the word is short (ie. length < 7) then my solution wins. Anyone know why this is? What operation is so sensitive to length? Is it IndexOf(string, int, int) or maybe Append(string, int, int)?

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top