Frage

I've been studying basics recently and as a practice I've decided to implement a DFA in the context of C++. So basically it's regular expressions. It works well when I construct the tree from scratch however I'm not sure how to deal with regular expressions.

What I mean is if I have a regex, for example (test)* I have to convert it into a DFA. The problem is that in order to do that I have to parse the regular expression. This seems to be a vicious circle (it's even worse because I actually need a bracket-aware parser, regular expressions won't work here).

So how to deal with it? I completely understand that we have tools now to do that (e.g. Flex & Bison) but these tools are based on regular expressions (well, at least tokenizers are). So what happened at the begining? How to write a regex parser from scratch? Any reference to a book/article appreciated.

War es hilfreich?

Lösung

I once wrote my own version of Flex, which generated a set of classes instead of the whole program. Firstly, I had to parse the regular expressions by hand, but when I finally wrote it, I replaced the regular expression parsing mechanism with one generated by the program itself.

Manual parsing of the regular expression is actually quite simple. Firstly, you have to specify the result you want to achieve. For example in my case:

[abc]+test

Is being interpreted as:

[abc]@[abc]*@[t]@[e]@[s]@[t]

Which are actually equivalent (@ is an artificially added concatenation operator).

Then you have to create a set of rules, eg.

'[' spotted:
    - (optionally) expect '^' character;
    - repeat:
        - expect a non-special character;
            - If it is not last character and is succeeded by '-', expect another character
    - until `]` is spotted
    - Return a character set
'(' spotted:
    - Return a block-begin
')' spotted:
    - Return a block-end
'*' spotted:
    - Return a star-operator
'+' spotted:
    - Return a plus-operator
'.' spotted:
    - Return a whole character set
Any other char spotted:
    - Return a character set consisting of this single character

Algorithm written like this will give you a tokenizer - routine, which breaks elements into logical tokens. Then you'll have to process them into an expression tree and that may be solved by implementing a Reverse Polish Notation algorithm.

You can check my parser generator here, though it generates a Delphi code. Unfortunately, readme is in Polish, but there are a few examples inside. Try for instance:

Number=[0-9]+
Operator=[\+\-\*/]

And

SpkParserGenerator -i myfile.regex -mc -sg

By the way, you can generate a parser for yourself and then simply translate it from Delphi to C++, it's actually quite simple even if you don't know Delphi well.

This is a set of rules I used to generate parser for the parser generator:

SetRange=\{([0-9]*,[0-9]+)|([0-9]+,[0-9]*)|([0-9]+)\}
Star=\*
Plus=\+
QMark=\?
CharRange=\[\^?((\\.)|(\#[0-9]{3})|([^\\\#\]]))+\]
AnyChar=\.
EscapedChar=\\.
AsciiChar=\#[0-9]{3}
Char=[^\[\]\{\}\.\(\)\#\*\+\?\|\\]
OpenParenthesis=\(
CloseParenthesis=\)
Alternative=\|
Lizenziert unter: CC-BY-SA mit Zuschreibung
Nicht verbunden mit StackOverflow
scroll top