Question

How would you go about creating a random alpha-numeric string that matches a certain regular expression?

This is specifically for creating initial passwords that fulfill regular password requirements.

Was it helpful?

Solution 9

Use the accepted answer at Generating Random Passwords until it matches your regexp.

OTHER TIPS

Welp, just musing, but the general question of generating random inputs that match a regex sounds doable to me for a sufficiently relaxed definition of random and a sufficiently tight definition of regex. I'm thinking of the classical formal definition, which allows only ()|* and alphabet characters.

Regular expressions can be mapped to formal machines called finite automata. Such a machine is a directed graph with a particular node called the final state, a node called the initial state, and a letter from the alphabet on each edge. A word is accepted by the regex if it's possible to start at the initial state and traverse one edge labeled with each character through the graph and end at the final state.

One could build the graph, then start at the final state and traverse random edges backwards, keeping track of the path. In a standard construction, every node in the graph is reachable from the initial state, so you do not need to worry about making irrecoverable mistakes and needing to backtrack. If you reach the initial state, stop, and read off the path going forward. That's your match for the regex.

There's no particular guarantee about when or if you'll reach the initial state, though. One would have to figure out in what sense the generated strings are 'random', and in what sense you are hoping for a random element from the language in the first place.

Maybe that's a starting point for thinking about the problem, though!

Now that I've written that out, it seems to me that it might be simpler to repeatedly resolve choices to simplify the regex pattern until you're left with a simple string. Find the first non-alphabet character in the pattern. If it's a *, replicate the preceding item some number of times and remove the *. If it's a |, choose which of the OR'd items to preserve and remove the rest. For a left paren, do the same, but looking at the character following the matching right paren. This is probably easier if you parse the regex into a tree representation first that makes the paren grouping structure easier to work with.

To the person who worried that deciding if a regex actually matches anything is equivalent to the halting problem: Nope, regular languages are quite well behaved. You can tell if any two regexes describe the same set of accepted strings. You basically make the machine above, then follow an algorithm to produce a canonical minimal equivalent machine. Do that for two regexes, then check if the resulting minimal machines are equivalent, which is straightforward.

String::Random in Perl will generate a random string from a subset of regular expressions:

#!/usr/bin/perl

use strict;
use warnings;

use String::Random qw/random_regex/;

print random_regex('[A-Za-z]{3}[0-9][A-Z]{2}[!@#$%^&*]'), "\n";

If you have a specific problem, you probably have a specific regular expression in mind. I would take that regular expression, work out what it means in simple human terms, and work from there.

I suspect it's possible to create a general regex random match generator, but it's likely to be much more work than just handling a specific case - even if that case changes a few times a year.

(Actually, it may not be possible to generate random matches in the most general sense - I have a vague memory that the problem of "does any string match this regex" is the halting problem in disguise. With a very cut-down regex language you may have more luck though.)

I have written Parsley, which consist of a Lexer and a Generator.

  • Lexer is for converting a regular expression-like string into a sequence of tokens.
  • Generator is using these tokens to produce a defined number of codes.
$generator = new \Gajus\Parsley\Generator();

/**
 * Generate a set of random codes based on Parsley pattern.
 * Codes are guaranteed to be unique within the set.
 *
 * @param string $pattern Parsley pattern.
 * @param int $amount Number of codes to generate.
 * @param int $safeguard Number of additional codes generated in case there are duplicates that need to be replaced.
 * @return array
 */
$codes = $generator->generateFromPattern('FOO[A-Z]{10}[0-9]{2}', 100);

The above example will generate an array containing 100 codes, each prefixed with "FOO", followed by 10 characters from "ABCDEFGHKMNOPRSTUVWXYZ23456789" haystack and 2 numbers from "0123456789" haystack.

This PHP library looks promising: ReverseRegex

Like all of these, it only handles a subset of regular expressions but it can do fairly complex stuff like UK Postcodes:

([A-PR-UWYZ]([0-9]([0-9]|[A-HJKSTUW])?|[A-HK-Y][0-9]([0-9]|[ABEHMNPRVWXY])?) ?[0-9][ABD-HJLNP-UW-Z]{2}|GIR0AA)

Outputs

D43WF
B6 6SB
MP445FR
P9 7EX
N9 2DH
GQ28 4UL
NH1 2SL
KY2 9LS
TE4Y 0AP

You'd need to write a string generator that can parse regular expressions and generate random members of character ranges for random lengths, etc.

Much easier would be to write a random password generator with certain rules (starts with a lower case letter, has at least one punctuation, capital letter and number, at least 6 characters, etc) and then write your regex so that any passwords created with said rules are valid.

Presuming you have both a minimum length and 3-of-4* (or similar) requirement, I'd just be inclined to use a decent password generator.

I've built a couple in the past (both web-based and command-line), and have never had to skip more than one generated string to pass the 3-of-4 rule.

  • 3-of-4: must have at least three of the following characteristics: lowercase, uppercase, number, symbol

It is possible (for example, Haskell regexp module has a test suite which automatically generates strings that ought to match certain regexes).

However, for a simple task at hand you might be better off taking a simple password generator and filtering its output with your regexp.

Why not work the regexp backwards? A simple example: if your regexp is

/[a-zA-Z]{6}/

then you know you need 6 letters a-z or A-Z, so generate them. This can get fancier, of course and, depending on your needs, you may end up reverse-writing an entire regexp parser, but you can stop adding features when you've fulfilled your need.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top