Pergunta

Let's say that I have a list of regular expressions (read from an external source - file, database etc). I want to check which of these regular expressions a string matches.

I can create iterate through all these regular expression and match them, but the list could be a huge one and this is a critical operation.

I can combine all these regular expressions into one (with | between them), but then the problem is that I can identify only the first matched regular expression, not all.

Another idea could be to create an automaton for all these regular expression and to mark the final states with, let's say, the indexes of the corresponding regular expression. I was looking at http://cs.au.dk/~amoeller/automaton/, a library which seems capable to work with regular expressions and automaton, but not sure it can be extended to solve my problem.

Do you have any other ideas?

To clarify some comments, I've added a code sample:

import java.util.regex.Matcher;
import java.util.regex.Pattern;


public class PatternTest {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("(a(?:b|c)a)|((?:a|b)ba)|(ab(?:a|c))");     
        Matcher m = p.matcher("aba");
        System.out.println(m.matches());
        System.out.println(m.groupCount());
        for (int i = 0, n = m.groupCount(); i < n; i++) {
            System.out.println(m.group(i));
        }
    }
}

will print out

true
3
aba
aba
null

As you can see only the first group is matched and I don't see a way to match the other two ones.

More findings - Using the automaton library above, the problem will reduce to the following: if you concatenate two or more automatons, how you can identify for a final state to which of the original automatons is corresponding?

Foi útil?

Solução 2

dk.brics.automaton does not directly support this, but you can generalize the representation of automata (and the relevant automata operations) to distinguish between different kinds of accept states. Start by adding an int field, for example, to the State class and use this field whenever 'accept' is set.

Outras dicas

I implemented such a solution based on dk.brics.automaton, you can find it here. https://github.com/fulmicoton/multiregexp

For a definitive answer (if there is one) we would need some more information, like:

  1. You say the list of regexes is huge; can you be more specific? Thousands? Millions? Billions and billions?

  2. Who wrote these regexes, and do they know what they're doing? Are the regexes thoroughly tested (for correctness and performance) before being added to the list?

  3. In your sample code you use the matches() method, which requires the regex to describe the whole string. It acts as if the regex were really
    \A(?:(a(?:b|c)a)|((?:a|b)ba)|(ab(?:a|c)))\z
    which matches "aba" but not "aaba" or "abaa". If you've used regexes in other tools or languages before coming to Java, this behavior might surprise you. Traditionally, a string has always been said to "match" a regex if the regex describes any substring within the string, even a zero-length substring. To get that behavior in Java, you have to use the Matcher's find() method.

  4. Are there any common factors you can pull out of all the regexes in the list, like minimum or maximum length, common substrings, or shared character subsets? For example, any string that matches one of your sample patterns must also match [abc]{3}. If there are, you might want to create filters based on them (maybe regex, maybe not) to run before the serious matching gets underway. (I wouldn't suggest this if you were using Perl, which is choc-a-bloc with optimizations like that already, but Java's not too proud to accept a little help. ☺)

But I feel pretty safe advising you to go with separate regexes rather than concatenating them all into one. The Frankenregex wouldn't necessarily perform better, and troubleshooting it would be a nightmare! You can pre-compile all the Pattern objects, and you can create a Matcher object ahead of time and reuse it for all the matches, like so:

m.reset(s).usePattern(p);

Here's a demo. I can't make any guarantees (you're still at the mercy of whoever wrote the regexes, for one thing), but if a solution is possible, I think this is the most promising approach.

Licenciado em: CC-BY-SA com atribuição
Não afiliado a StackOverflow
scroll top