Question

I'm looking to implement a word count algorithm. However, anything that appears within ~two tildes~ is regarded as 1 word. For this reason I think regex is probably the best way to go?

The count must be calculated on every keypress on sample sets of about 10000 characters so it's good to get it right.

Was it helpful?

Solution

str = str.Trim() + " ";

var count = 0;
var inWord = false;
var inTilde = false;

foreach (var c in str)
{
    switch (c)
    {
        case ' ':
        case '\t':
        case '\r':
        case '\n':
            if (!inTilde && inWord)
            {
                inWord = false;
                count++;
            }
        case '~':
            if (inTilde)
            {
                count++;
                inWord = false;
            }

            inTilde = !inTilde;
        default:
            inWord = true;
    }
}

Not tested, but pretty straight forward...

Also, note that ~hi one~two~three four~ will count as hi one, two, three four, as well as ~hi one~two~three four, even though there isn't a closing tilde.

OTHER TIPS

Do you really need to recalculate the whole thing on every keypress? It seems that unless you're in between two spaces, no key but ~ or space can change the number of words. And for those special keys, you should usually be able to determine the changes to the number of words locally without re-processing the whole buffer.

Anyway, you don't need regex. Just flip a flag every time you see a ~.

A simple finite state automata coupled to a numeric counter should work fine.

Say that we have the following states:

OUTSIDE
WORD
TILDEWORD

and we start out in OUTSIDE. Then we can start processing each character, and figure out which state to go to next.

If we're in OUTSIDE:

  1. If we reach the end of the file, stay where we are.

  2. If we see a tilde character, go to the TILDEWORD state and bump up the word counter.

  3. If we see a word character, go to the WORD state and bump up the word counter.

  4. Otherwise, stay where we are.

The case analysis for the other two states should be similar. The whole thing looks almost like a board game.

------> OUTSIDE <----------> WORD
           ^                   ^
           |                   |
           |                   V
           +-------------> TILDEWORD

and writing the C program to keep track of where we are in the finite state automaton is direct.

The arrows might be bidirectional. Consider this input:

hello~happy fabulous world~testing is good

The problem statement is a little fuzzy on what happens when we see a tilde as we're scanning another word; I suspect we must count it as a separate word, so that the above is a sequence of five words. Your interpretation may vary, of course. Consider edge cases!

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top