Question

I want to create a grammar and lexer to parse the string below:

100 reason phrase

The regular expression would be: "\d{3} [^\r\n]*"

Token definition:

template <typename Lexer>
struct custom_tokens : lex::lexer<Lexer>
{
    custom_tokens()
    {
        this->self.add_pattern
            ("STATUSCODE", "\\d{3}")                
            ("SP", " ")
            ("REASONPHRASE", "[^\r\n]*")
            ;                

        this->self.add                          
            ("{STATUSCODE}", T_STATUSCODE)
            ("{SP}", T_SP)
            ("{REASONPHRASE}", T_REASONPHRASE)
            ;
    }   
};

Grammar:

template <typename Iterator>
struct custom_grammar : qi::grammar<Iterator >
{
    template <typename TokenDef>
    custom_grammar(TokenDef const& tok)
        : custom_grammar::base_type(start)            
    {            
        start = (qi::token(T_STATUSCODE) >> qi::token(T_SP) >> qi::token(T_REASONPHRASE));
    }

    qi::rule<Iterator> start;
};

However, I realized that I can't define the token T_REASONPHRASE this way, because its pattern matches everything, including the text that T_STATUSCODE should match. What are my options?

  1. Drop T_REASONPHRASE and write a rule inside custom_grammar using qi::lexeme instead?

  2. Use lexer states? E.g. define T_REASONPHRASE in a second state, and once T_STATUSCODE has been seen in the first state, tokenize the rest of the input in the second state? Please give an example.


Solution

I don't think there really is a problem, because when several token definitions match, the one that was added first to the token definitions (for a given lexer state) wins.

So, given

    this->self.add                          
        ("{STATUSCODE}", T_STATUSCODE)
        ("{SP}", T_SP)
        ("{REASONPHRASE}", T_REASONPHRASE)
        ;

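T_STATUSCODE is preferred over T_REASONPHRASE whenever both match the same input. One caveat: as far as I know, the underlying DFA-based lexer (lexertl) prefers the longest match and uses definition order only to break ties, so a pattern as broad as [^\r\n]* may still consume the whole line. If you run into that, lexer states are the way to keep the two patterns apart.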


As for using separate lexer states, here's an excerpt of a tokenizer I once had in a toy project:

this->self = fileheader     [ lex::_state = "GT" ];

this->self("GT") =
    gametype_label |
    gametype_63000 | gametype_63001 | gametype_63002 |
    gametype_63003 | gametype_63004 | gametype_63005 |
    gametype_63006 |
    gametype_eol            [ lex::_state = "ML" ];

this->self("ML") = mvnumber [ lex::_state = "MV" ];

this->self("MV") = piece | field | op | check | CASTLEK | CASTLEQ 
         | promotion
         | Checkmate | Stalemate | EnPassant
         | eol              [ lex::_state = "ML" ]
         | space            [ lex::_pass = lex::pass_flags::pass_ignore ];

(The purpose should be fairly clear if you read GT as game type, ML as move line and MV as move. Note the presence of both eol and gametype_eol here: Lex does not allow adding the same token to different states.)
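
Applied to the status line from the question, the same idea can be sketched without any library at all. The code below is not Spirit.Lex code: it is a hand-written two-state tokenizer built on std::regex, meant only to illustrate how switching state after the separator keeps the reason-phrase pattern from ever competing with the status-code pattern. The names Token, State and tokenize are made up for this illustration, and the token ids simply reuse the names from the question.

```cpp
#include <regex>
#include <string>
#include <vector>

// Token ids named after the ones in the question (values are arbitrary).
enum TokenId { T_STATUSCODE, T_SP, T_REASONPHRASE };

struct Token {
    TokenId id;
    std::string text;
};

// Two lexer states: INITIAL only knows the status code and the separator,
// REASON only knows the reason phrase.
enum class State { INITIAL, REASON };

// Tokenize one status line. Returns an empty vector on a lexing error.
std::vector<Token> tokenize(const std::string& input)
{
    static const std::regex status_code("\\d{3}");
    static const std::regex sp(" ");
    static const std::regex reason_phrase("[^\\r\\n]*");

    // match_continuous anchors every match at the current position.
    const auto anchored = std::regex_constants::match_continuous;

    std::vector<Token> tokens;
    State state = State::INITIAL;
    auto it = input.cbegin();
    const auto end = input.cend();
    std::smatch m;

    while (it != end) {
        if (state == State::INITIAL) {
            if (std::regex_search(it, end, m, status_code, anchored)) {
                tokens.push_back({T_STATUSCODE, m.str()});
            } else if (std::regex_search(it, end, m, sp, anchored)) {
                tokens.push_back({T_SP, m.str()});
                // Plays the role of [ lex::_state = "..." ] in the excerpt.
                state = State::REASON;
            } else {
                return {};  // nothing matches in this state
            }
        } else {
            // Reject an empty match so the loop always makes progress.
            if (!std::regex_search(it, end, m, reason_phrase, anchored)
                || m.length() == 0) {
                return {};
            }
            tokens.push_back({T_REASONPHRASE, m.str()});
        }
        it = m[0].second;  // continue right after the matched text
    }
    return tokens;
}
```

Because the reason-phrase pattern only exists in the second state, a phrase that itself starts with three digits is still tokenized as a reason phrase. In Spirit.Lex the equivalent would presumably be to attach [ lex::_state = "..." ] to the separator token and to define the reason-phrase token only in that second state, following the pattern of the excerpt above; I have not compiled that variant.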

Licensed under: CC-BY-SA with attribution