Question

I'm working on a state machine which is supposed to extract function calls of the form

/* I am a comment */
//I am a comment
pref("this.is.a.string.which\"can have QUOTES\"", 123456);

where the extracted data would be pref("this.is.a.string.which\"can have QUOTES\"", 123456); from a file. Currently, to process a 41kb file, this process is taking close to a minute and a half. Is there something I'm seriously misunderstanding here about this finite state machine?

#include <boost/algorithm/string.hpp>
std::vector<std::string> Foo()
{
    std::string fileData;
    //Fill filedata with the contents of a file
    std::vector<std::string> results;
    std::string::iterator begin = fileData.begin();
    std::string::iterator end = fileData.end();
    std::string::iterator stateZeroFoundLocation = fileData.begin();
    std::size_t state = 0;
    for(; begin < end; begin++)
    {
        switch (state)
        {
        case 0:
            if (boost::starts_with(boost::make_iterator_range(begin, end), "pref(")) {
                stateZeroFoundLocation = begin;
                begin += 4;
                state = 2;
            } else if (*begin == '/')
                state = 1;
            break;
        case 1:
            state = 0;
            switch (*begin)
            {
            case '*':
                begin = boost::find_first(boost::make_iterator_range(begin, end), "*/").end();
                break;
            case '/':
                begin = std::find(begin, end, L'\n');
            }
            break;
        case 2:
            if (*begin == '"')
                state = 3;
            break;
        case 3:
            switch(*begin)
            {
            case '\\':
                state = 4;
                break;
            case '"':
                state = 5;
            }
            break;
        case 4:
            state = 3;
            break;
        case 5:
            if (*begin == ',')
                state = 6;
            break;
        case 6:
            if (*begin != ' ')
                state = 7;
            break;
        case 7:
            switch(*begin)
            {
            case '"':
                state = 8;
                break;
            default:
                state = 10;
                break;
            }
            break;
        case 8:
            switch(*begin)
            {
            case '\\':
                state = 9;
                break;
            case '"':
                state = 10;
            }
            break;
        case 9:
            state = 8;
            break;
        case 10:
            if (*begin == ')')
                state = 11;
            break;
        case 11:
            if (*begin == ';')
                state = 12;
            break;
        case 12:
            state = 0;
            results.push_back(std::string(stateZeroFoundLocation, begin));
        };
    }
    return results;
}

Billy3

EDIT: Well this is one of the strangest things I've ever seen. I just rebuilt the project and it's running reasonably again. Odd.

Was it helpful?

Solution

Unless your 41 kb file is mostly comments or prefs, it's going to spend most of its time in state 0. And for each character in state 0, you make a minimum of two function calls.

if (boost::starts_with(boost::make_iterator_range(begin, end), "pref(")) {

You can speed this up by pre-testing to see if the current character is 'p'

if (*begin == 'p' && boost::starts_with(boost::make_iterator_range(begin, end), "pref(")) {

If the character isn't 'p' then there is no need to make any function calls. In particular not creating a iterator, which is probably where the time is being spent.

OTHER TIPS

I don't know if this is part of the problem, but you have a typo in case 0, "perf" is misspelled as "pref".

Well it's hard to say just by looking at it...but I'm guessing the find algorithms are doing it. Why are you searching within a FSM? By definition you're supposed to giving them one character at a time....Add more states. Also try making results a list, not a vector. A lot of copying going on with

vector<string>

But mostly: Profile it!

Finite state machines are a workable solution, but for text processing, its best to use a highly optimized finite state machine generator. In this case, a regular expression. Here it is as Perl regex:

# first clean the comments
$source =~ s|//.*$||;      # replace "// till end of line" with nothing
$source =~ s|/\*.*?\*/||s; # replace "/* any text until */" with nothing
                           # depending on your data, you may need a few other
                           # rules here to avoid blanking data, you could replace
                           # the comments with a unique identifier, and then
                           # expand any identifiers that the regex below returns

# then find your data
while ($source =~ /perf\(\s*"(.+?)",\s*(\d+)\s*\);/g) { 
   # matches your function signature and moves along source
   # do something with the captured groups, in this case $1 and $2
}

Since most regex libraries are Perl compatible, it shouldn't be hard to translate the syntax. If your search becomes more complicated, then a parser is in order.

If you are doing parsing, why not use a parser library.

I typically have Boost.Spirit.Qi in mind.

  • You express your grammar with EBNF-like expressions which certainly makes it easier for maintenance.
  • It's a header only library, so you don't have any problem of bringing a whole binary in the mix.

While I can appreciate the minimalist approach, I am afraid that your idea of coding a finite state machine on your own is not that wise. It works well with a toy example, but as the requirements add up you'll have one monstrous switch and it's going to become more and more complicated to understand what's going on.

And please, don't tell me you know it's not going to evolve: I don't believe in oracles ;)

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top