Question

I would like to be able to compute the set of all characters which may be matched as the first character in a string by a given instance of java.util.regex.Pattern. More formally, given the DFA equivalent to a certain regular expression, I want the set of all outgoing transitions from the start state.

An example:

Pattern p = Pattern.compile("[abc]def|daniel|chris|\\s+");
Set<Character> first = getFirstSet(p);

The set first should contain the following elements:

{ 'a', 'b', 'c', 'd', ' ', '\n', '\r', '\t' }

Any ideas? I'm well aware that I could construct the DFA myself and determine the relevant states that way, but I'd like to avoid that kind of hassle (read: it's not worth that much to me). Note that my host language is actually Scala, so I have access to all of the core Scala libs (for what it's worth).

Was it helpful?

Solution

I think you could parse the regular expression and define some recursive function which operates on the parsed regular expression in a left-to-right-manner, building up such a set of firsts.

Some things are simple:

  • Sequence: first(r1r2) = first(r1) + ( if '' in first(r1) first(r2) else empty set )
  • Alternation: first(r1|r2) = first(r1) + first(r2)
  • Iteration: first(r*) = first(r) + ''
  • Characters: first(c) = c
  • Characterclasses: first([c1-cn]) = set(c1, c2, ..., cn) ...

Extend this to all primitives and special flags your regular expression dialect knows and you are good to go.

OTHER TIPS

You could solve it recursivly ...

  • Strip of enclosing parenthesis and call recursivly.
  • Split at toplevel alternatives and call recursivly for each part.
  • If there are no alternatives,
    • output all symbols starting from the left up to the first none optional symbol.
    • If there are charachter groups, output all symbols.

There are probably a lot of errors in this idea, but this is what I would try. You have to strip out assertion, group names and thousand other things. And if you find an inverted character class like [^0-9] you have to output a lot of characters.

So I assume it is really a complex problem.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top