Regex to String generation

https://softwareengineering.stackexchange.com/questions/243347

04-10-2020
|

Question

Let's say that we have a regex and an index i. If we suppose that the set of strings that match our regex are sorted in a lexicographical order, how could we get the i element of this list?

Edit : I added this simple example for more explanation:

input :

regex="[a-c]{1,2}";
index=4;

In this case the ordered list of strings that this regex matches contains the following elements:

a
aa
ab
ac
b
ba
bb
bc
c
ca
cb
cc

output :

The 4th element, which is ac

ps: It's known that a given regex could match an infinite number of strings. This shouldn't have an impact on the process of extracting the element in the given finite index.

Solution 3

kevin's answer was a good start to know how we can deal with regular expression from a theoritical point of view, which would help building a java library that let users generate String based on a given regex . Here is a what i've build , a java library that provide many features for using regex to generate String (random generation ,generate String based on it's index, generate all String..) check it out here .

Example :

    Generex generex = new Generex("[0-3]([a-c]|[e-g]{1,2})");

    // generate the second String in lexicographical order that match the given Regex.
    String secondString = generex.getMatchedString(2);
    System.out.println(secondString);// it print '0b'

    // Generate all String that matches the given Regex.
    List<String> matchedStrs = generex.getAllMatchedStrings();

    // Using Generex iterator
    Iterator iterator = generex.iterator();
    while (iterator.hasNext()) {
        System.out.print(iterator.next() + " ");
    }
    // it print 0a 0b 0c 0e 0ee 0e 0e 0f 0fe 0f 0f 0g 0ge 0g 0g 1a 1b 1c 1e
    // 1ee 1e 1e 1f 1fe 1f 1f 1g 1ge 1g 1g 2a 2b 2c 2e 2ee 2e 2e 2f 2fe 2f 2f 2g
    // 2ge 2g 2g 3a 3b 3c 3e 3ee 3e 3e 3f 3fe 3f 3f 3g 3ge 3g 3g 1ee

    // Generate random String
    String randomStr = generex.random();
    System.out.println(randomStr);// a random value from the previous String list

OTHER TIPS

A regular expression maps directly to a finite state machine. Generating strings from the FSM is straightforward with a backtracking algorithm. The example is for the regular expression /aa?b/

(start) 0 -- a --> 1 -- a --> 2 -- b --> 3 (accepting and terminal)
                   |                     ^
                   |---------b ----------|

State   Previous char   Input char    Input string     History
  0        null                           --             []
                            a              a         
  1        null                                        [{a,1}]
                            a              aa          [{a,1},{a,2}]
  2        null
                            b             aab          [{a,1},{a,2},{b,3}]
  3        null
         Accepting state, emit 'aab' as valid input
         Terminal state, backtrack
  2          b                             aa          [{a,1},{a,2}]
         No further valid inputs, backtrack
  1          a                             a           [{a,1}]
                            b              ab
  3        null                                        [{a,1},{b,3}]
         Accepting state, emit 'ab' as valid input
         Terminal state, backtrack
  1          b                             a           [{a,1}]
         No further valid inputs, backtrack
  0          a                             --          []
         No further valid inputs, stack empty, terminate

On their own, regexes are serving the purpose of answering the question: "Given this list of strings as input and this pattern, which subset mathces it?". It does not work backwards i.e. from this regex, start generating strings that match it.

Having said that, take a look at this: https://code.google.com/p/xeger/, that is supposed to do what you want to achieve (with some limitations as described in https://code.google.com/p/xeger/wiki/XegerLimitations).

If your regex is accepted, I guess the anwer to your question is pretty straightforward i.e. generate the list and take the item at index i

I would like to propose this answer just in case it can help you some how. In case there are some details or some small errors, please leave comment. I will try to correct.

First look at the minimum finite state automata FSA of the regex
Find the first matching string of the FSA. This can be done by creating an auxiliary FSA in which you remove all paths leading to the reject state, then start walking through the auxiliary FSA. At each transition, choose the first character from the input alphabet to make sure you generate the first matching string.
Now we need to get the second matching string of the original regex, so we modify the auxiliary FSA so that in the final state (before the accep state), the last character input (of the first matching string in 2 above) would not lead to the accept state. This can be done by removing an edge (and a state) or modify the edge leading to the accepting state.
Generate the first matching string of the modified auxiliary FSA, this would give you the second matching string of the original regex
Repeat modifying the auxiliary FSA until you get to the i-th matching string. This would be the i-the matching string of the original regex.

Licensed under: CC-BY-SA with attribution

Not affiliated with softwareengineering.stackexchange