Question

I need to match very short segments of text (1 to 7 letters), and I know how to specify the acceptable strings in a finite state machine. I think that building a regex for this application would get too messy and difficult to maintain. I need an easy to use module that I can write an FSM in, and it would be a big bonus if the module could produce a regular expression for me to use. Does anyone know of a module that can do this simply?

Was it helpful?

Solution

Writing a FSM is quite easy, even without a module.

my $state = 0;
while (my $token = shift @input) {
  $state = $state_table{$state}->($token);
}

where the state table is a hash with states as keys and anonymous subs as values:

0 => sub {
  my $nextstate;
  given (shift) {
    when ('a') {print "its an a"; $nextstate = 2}
    default    {$nextstate = 1}
  }
 return $nextstate;
},
1 => sub {
  my $token = shift;
  my $nextstate = ({
    a => sub {print "Its an a"; 2}
  }->{$token} // sub {1})->();
  return $nextstate;
},
2 => ...

(note that state 0 and 1 are equivalent in this case)

etc. To write a switch in Perl, you can choose between source filters, the given-when construct and hashes, which should make your task easy. Especially hashes (of hashes of hashes ... of subroutines) can make table-driven programming quite easy.

However, if the problem can be easily expressed with a regex, it might be worth doing so instead. Beware that Perl regexes are not limited to regular expression, so you would have to be careful about what features you use. The main advantage of regexes over state machines is execution speed, as the regex engine is highly optimized. The syntax is also much terser, which makes them easier to write. Don't forget that you can include non-semantic whitespace with the /x modifier:

m{
   .*?    # match anything
   (?:
       a  # followed by an a
     | b  # or a b
   )
}x

is absolutely equivalent to (but better readable than)

/.*?(?:a|b)/

So hand-writing regexes is not only potentially much shorter (and every good programmer is lazy), but also a lot of fun.

You could define your states like this:

my $state_machine = qr{
   ^(?&STATESTART)$

   (?(DEFINE)
     (?<STATESTART> (?&STATE0)    )
     (?<STATE0>     a (?&STATE2)
              |     . (?&STATE1) )
     (?<STATE1>     a (?&STATE2)
              |     . (?&STATE1) )
     (?<STATE2>     ...  )
   )
}x;
print "->$x<- is part of the language" if $x =~ $state_machine;

which would correspond to the above sketch of a state machine using hashes. Note that you should only tail recurse when modelling a FSM with named regex patterns.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top