Question

I've just seen a huge regex for Java that made me think a little about maintainability of regular expressions in general. I believe that most people - except some badass perl mongers - would agree that regular expressions are hardly maintainable.

I was thinking about how this situation could be fixed. So far, the most promising idea I have is using a fluent interface. To give an example, instead of:

Pattern pattern = Pattern.compile("a*|b{2,5}");

one could write something like this

import static util.PatternBuilder.*

Pattern pattern = string("a").anyTimes().or().string("b").times(2,5).compile();

Pattern alternative = 
  or(
    string("a").anyTimes(),
    string("b").times(2,5)
  )
  .compile();

In this very short example, the common way of creating regular expression is still quite readable to any mediocre talented developer. However, think about those spooky expressions that fill two or more lines with 80 characters each. Sure, the (verbose) fluent interface would require several lines instead of only two, but I'm sure it would be much more readable (hence maintainable).

Now my questions:

  1. Do you know of any similar approach to regular expressions?

  2. Do you agree that this approach could be better than using simple strings?

  3. How would you design the API?

  4. Would you use such a neat utility in your projects?

  5. Do you think this would be fun implementing? ;)

EDIT: Imagine that there could be methods that are on a higher level than simple constructs we all no from regex, e.g.

// matches aaaab@example.com - think of it as reusable expressions
Pattern p = string{"a").anyTimes().string("b@").domain().compile();

EDIT - short summary of comments:

It's interesting to read that most people think that regular expressions are here to stay - although it takes tools to read them and smart guys to think of ways to make them maintainable. While I'm not sure that a fluent interface is the best way to go, I'm sure that some smart engineers - we? ;) - should spend some time to make regular expressions a thing of the past - it's enough that they've been with us for 50 years know, don't you think?

OPEN BOUNTY

The bounty will be awarded to the best idea (no code required) for a new approach to regular expressions.

EDIT - A NICE EXAMPLE:

here's the kind of pattern I'm talking about - extra kudos to the first one who's able to translate it - RegexBuddies allowed (it's from an Apache project btw) extra kudos awarded to chii and mez: it's a RFC compliant email address validation pattern - although its RFC822 (see ex-parrot.com), not 5322 - not sure if there is a difference though - if yes, I'll award next extra kudos for a patch to fit 5322 ;)

private static final String pattern = "(?:(?:\\r\\n)?[ \\t])*(?:(?:(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t]"
    + ")+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:"
    + "\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:("
    + "?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ "
    + "\\t]))*\"(?:(?:\\r\\n)?[ \\t])*))*@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\0"
    + "31]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\"
    + "](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+"
    + "(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:"
    + "(?:\\r\\n)?[ \\t])*))*|(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z"
    + "|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)"
    + "?[ \\t])*)*\\<(?:(?:\\r\\n)?[ \\t])*(?:@(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\"
    + "r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?["
    + " \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)"
    + "?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t]"
    + ")*))*(?:,@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?["
    + " \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*"
    + ")(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t]"
    + ")+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*)"
    + "*:(?:(?:\\r\\n)?[ \\t])*)?(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+"
    + "|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r"
    + "\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:"
    + "\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t"
    + "]))*\"(?:(?:\\r\\n)?[ \\t])*))*@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031"
    + "]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\]("
    + "?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?"
    + ":(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?"
    + ":\\r\\n)?[ \\t])*))*\\>(?:(?:\\r\\n)?[ \\t])*)|(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?"
    + ":(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?"
    + "[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)*:(?:(?:\\r\\n)?[ \\t])*(?:(?:(?:[^()<>@,;:\\\".\\[\\] "
    + "\\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|"
    + "\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>"
    + "@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\""
    + "(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*))*@(?:(?:\\r\\n)?[ \\t]"
    + ")*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\"
    + "\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?"
    + ":[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\["
    + "\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*|(?:[^()<>@,;:\\\".\\[\\] \\000-"
    + "\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|("
    + "?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)*\\<(?:(?:\\r\\n)?[ \\t])*(?:@(?:[^()<>@,;"
    + ":\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[(["
    + "^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\""
    + ".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\"
    + "]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*(?:,@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\"
    + "[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\"
    + "r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] "
    + "\\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]"
    + "|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*)*:(?:(?:\\r\\n)?[ \\t])*)?(?:[^()<>@,;:\\\".\\[\\] \\0"
    + "00-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\"
    + ".|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,"
    + ";:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?"
    + ":[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*))*@(?:(?:\\r\\n)?[ \\t])*"
    + "(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\"."
    + "\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:["
    + "^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]"
    + "]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*\\>(?:(?:\\r\\n)?[ \\t])*)(?:,\\s*("
    + "?:(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\"
    + "\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:("
    + "?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=["
    + "\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t"
    + "])*))*@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t"
    + "])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?"
    + ":\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|"
    + "\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*|(?:"
    + "[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\"
    + "]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)*\\<(?:(?:\\r\\n)"
    + "?[ \\t])*(?:@(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\""
    + "()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)"
    + "?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>"
    + "@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*(?:,@(?:(?:\\r\\n)?["
    + " \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,"
    + ";:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t]"
    + ")*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\"
    + "\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*)*:(?:(?:\\r\\n)?[ \\t])*)?"
    + "(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\"."
    + "\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:"
    + "\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\["
    + "\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])"
    + "*))*@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])"
    + "+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\"
    + ".(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z"
    + "|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*\\>(?:("
    + "?:\\r\\n)?[ \\t])*))*)?;\\s*)";
Was it helpful?

Solution

How would you design the API?

I would borrow a page from the Hibernate criteria API. Instead of using:

string("a").anyTimes().or().string("b").times(2,5).compile()

Use a pattern like:

Pattern.or(Pattern.anyTimes("a"), Pattern.times("b", 2, 5)).compile()

This notation is a bit more concise, and I feel that it's easier to understand the hierarchy/structure of the pattern. Each method could accept either a string, or a pattern fragment, as the first argument.

Do you know of any similar approach to regular expressions?

Not offhand, no.

Do you agree that this approach could be better than using simple strings?

Yes, absolutetely... if you are using regex's for anything remotely complicated. For very short and simple cases, a string is more convenient.

Would you use such a neat utility in your projects?

Possibly, as it became proven/stable... rolling it into a larger utility project like Apache Commons might be a plus.

Do you think this would be fun implementing? ;)

+1

OTHER TIPS

Martin Fowler suggest another strategy. Namely taking meaningful parts of the regex and replace them by variables. He uses following example:

 "^score\s+(\d+)\s+for\s+(\d+)\s+nights?\s+at\s+(.*)" 

becomes

   String scoreKeyword = "^score\s+";
   String numberOfPoints = "(\d+)";
   String forKeyword = "\s+for\s+";
   String numberOfNights = "(\d+)";
   String nightsAtKeyword = "\s+nights?\s+at\s+";
   String hotelName = "(.*)";

   String pattern = scoreKeyword + numberOfPoints +
      forKeyword + numberOfNights + nightsAtKeyword + hotelName;

Which is much more readable and maintainable.

The goal of the previous example was to parse numberOfPoints, numberOfNights and hotelName out of a List of Strings like:

score 400 for 2 nights at Minas Tirith Airport

It might be slightly easier for someone without any regex experience, but after someone learns your system, he still won't be able to read a normal regex elsewhere.

Also, I think your version is harder to read for a regex expert.

I would recommend annotating the regex like this:

Pattern pattern = Pattern.compile(
  "a*     # Find 0 or more a        \n" +
  "|      # ... or ...              \n" +
  "b{2,5} # Find between 2 and 5 b  \n",
Pattern.COMMENTS);

This is easy to read for any experience level, and for the inexperienced, it teaches regex at the same time. Also, the comments can be tailored to the situation to explain the business rules behind the regex rather than just the structure.

Also, a tool like RegexBuddy can take your regex and translate it into:

Match either the regular expression below (attempting the next alternative only if this one fails) «a*»
   Match the character "a" literally «a*»
      Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Or match regular expression number 2 below (the entire match attempt fails if this one fails to match) «b{2,5}»
   Match the character "b" literally «b{2,5}»
      Between 2 and 5 times, as many times as possible, giving back as needed (greedy) «{2,5}»

This is an intriguing concept, but as it is presented there are a few flaws.

But first answers to the key questions asked:

Now my questions:

1. Do you know of any similar approach to regular expressions?

None that haven't been mentioned already. And those I found out about by reading the question and answers.

2. Do you agree that this approach could be better than using simple strings?

If it works as advertised, it would definitely makes things much easier to debug.

3. How would you design the API?

See my notes in the next section. I take your examples and the linked .NET library as a starting point.

4. Would you use such a neat utility in your projects?

Undecided. I have no problem working with the current version of cryptic regularly expressions. And I would need a tool to convert existing regular expressions to the fluent language version.

5. Do you think this would be fun implementing? ;)

I'd enjoy working out the higher level how, than writing the actual code. Which explains the wall of text that is this answer.


Here are a couple of problems I noticed, and the way I would deal with it.

Unclear structure.

Your example seems to create a regex by concatenating on to strings. This is not very robust. I believe that these methods should be added to String and Patern/Regex objects, because it will make implementation and code cleaner. Besides it's similar to the way Regular Expressions are classically defined.

Just because I can't see it working any other way, the rest of my annotations to the proposed scheme will assume that all methods act on and return Pattern objects.

Edit: I seem to have used the following conventions throughout. So I've clarified them and move them up here.

  • Instance methods: pattern augmentation. eg: capturing, repetition, look around assertions.

  • Operators: order of operations. alternation, concatenation

  • Constants: character classes, boundaries (inplace of \w, $, \b, etc)

How will capturing/clustering be handled?

Capturing is a huge part of regular expressions.

I see each Pattern object being stored internally as a cluster. (?: pattern) in Perl terms. Allowing for pattern tokens to easily be mixed and mingled without interfering with other pieces.

I expect capturing to be done as an instance method on a Pattern. Taking a variable to store the matching string[s] in.

pattern.capture(variable) would store pattern in variable. In the event that the capture is part of an expression to be matched multiple times, variable should contain an array of strings of all matches for pattern.

Fluent languages can be very ambiguous.

Fluent languages are not well suited to the recursive nature of regular expressions. So consideration needs to be given to the order of operations. Just chaining methods together does not allow for very complex regular expressions. Exactly the situation where such a tool would be useful.

Does

Pattern pattern = string("a").anyTimes().or().string("b").times(2,5).compile();

produce /a*|b{2,5}/ or /(a*|b){2,5}/ ?

How would such a scheme handle nested alternation? Eg: /a*|b(c|d)|e/

I see three ways of handling alternation in regular expressions

  1. As an operator: pattern1 or pattern2 => pattern # /pattern1|pattern2/
  2. As a class method: Pattern.or( pattern1, pattern2[, pattern3]*) => pattern # /pattern1|patern2|patern3|...|/
  3. As a instance method: pattern1.or(pattern2) => pattern # /pattern1|patern2/

I would handle concatenation the same way.

  1. As an operator: pattern1 + pattern2 => pattern # /pattern1pattern2/
  2. As a class method: Pattern.concatenate( pattern1, pattern2[, pattern3]*) => pattern # /pattern1patern2patern3.../
  3. As a instance method: pattern1.then(pattern2) => pattern # /pattern1patern2/

How to extend for commonly used patterns

The proposed scheme used .domain() which seems to be a common regex. To treat user defined patterns as a methods doesn't make it easy to add new patterns. In a language like Java users of the library would have to override the class to add methods for commonly used patterns.

This is addressed by my suggestion to treat every piece as an object. A pattern object can be created for each commonly used regex like matching a domain for example. Given my previous thoughts about capturing it wouldn't too be hard to ensure that capturing works for multiple copies of the same common pattern that contains a captured section.

There should also be constants for patterns matching various character classes.

Zero width look around assertions

Expanding on my thoughts that all pieces should be implicitly clustered. Look around assertions shouldn't be too hard too do with an instance method.

pattern.zeroWidthLookBehind() would produce (?<patten).


Things that still need to be considered.

  • Backreferences: Hopefully not too hard with the named capturing discussed earlier
  • How to actually implement it. I haven't given the internals too much thought. It's where the real magic is going to happen.
  • Translation: There really should be a tool translate to and from classic regexs (say Perl dialect) and the new scheme. Translating from the new scheme could be a part of the package

Putting it all together, my proposed version of a pattern that matches an email address:

Pattern domain_label = LETTER_CHARACTER + (LETTER_CHARACTER or "-" or DIGIT_CHARACTER).anyTimes()
Pattern domain = domain_label + ("." + domain_label).anyTimes()
Pattern pattern = (LETTER_CHARACTER + ALPHANUMERIC_CHARACTER + "@" + domain).compile

In hindsight, my scheme borrows heavily from Martin Fowler's approach in use. Although I didn't intend for things to go this way, it definitely makes using such a system more maintainable. It also addresses a problem or two with Fowler's approach (capturing order).

My own humble attempt can be found on GitHub. Although I don't think it's worth using for simple expressions, it does provide a couple of advantages apart from the readability improvement:

  • It takes care of bracket matching
  • It handles escaping of all the 'special' characters which can quickly lead to backslash hell

A few simple examples:

 // Matches a single digit
    RegExBuilder.build(anyDigit()); // "[0-9]"

 // Matches exactly 2 digits
    RegExBuilder.build(exactly(2).of(anyDigit())); // "[0-9]{2}"

 // Matches between 2 and 4 letters
    RegExBuilder.build(between(2,4).of(anyLetter())); // "[a-zA-Z]{2,4}"

And a more complicated one (which more or less validates email addresses):

final Token ALPHA_NUM = anyOneOf(range('A','Z'), range('a','z'), range('0','9'));
final Token ALPHA_NUM_HYPEN_UNDERSCORE = anyOneOf(characters('_','-'), range('A','Z'), range('a','z'), range('0','9'));

String regexText = RegExBuilder.build(
 // Before the '@' symbol we can have letters, numbers, underscores and hyphens anywhere
    oneOrMore().of(
        ALPHA_NUM_HYPEN_UNDERSCORE
    ),
    zeroOrMore().of(
        text("."), // Periods are also allowed in the name, but not as the initial character
        oneOrMore().of(
            ALPHA_NUM_HYPEN_UNDERSCORE
        )
    ),
    text("@"),
 // Everything else is the domain name - only letters, numbers and periods here
    oneOrMore().of( 
        ALPHA_NUM
    ),
    zeroOrMore().of(
        text("."), // Periods must not be the first character in the domain
        oneOrMore().of(
            ALPHA_NUM
        )
    ),
    text("."), // At least one period is required
    atLeast(2).of( // Period must be followed by at least 2 letters (this is the TLD)
        anyLetter()
    )
);

There's a fluent regexps library for .NET.

Short answer: I have seen it approached from a linting and compiling angle, which I think is something to consider for this.

Long answer: The company I work for makes hardware-based regular expression engines for enterprise content filtering applications. Think running anti-virus or firewall applications at 20GB/sec speeds right in the network routers rather than taking up valuable server or processor cycles. Most anti-virus, anti-spam or firewall apps are a bunch of regex expressions at the core.

Anyway, the way the regex's are written has a huge impact on the performance of the scanning. You can write regexs in several different ways to do the same thing, and some will have drastically faster performance. We've written compilers and linters for our customers to help them maintain and tune their expressions.

Back to the OP's question, rather than defining a whole new syntax, I would write a linter (sorry, ours is proprietary) cut and paste regex's into that will break down legacy regex's and output "fluent english" for someone to understand better. I'd also add relative performance checks and suggestions for common modifications.

The short answer, for me, is that, once you get to regular expressions (or other pattern matching that does the same thing) that are long enough to cause a problem... you should probably be considering if they're the right tool for the job in the first place.

Honestly, any fluent interface seems like it would be harder to read than a standard regular expression. For really short expressions, the fluent version is verbose, but not too long; it's readable. But so is the regular expression for something that long.

For a medium sized regular expression, a fluent interface becomes unwieldy; long enough that it's hard, if not impossible, to read.

For a long regular expression (ie, the email address one), where the regular expression is actually hard (if not impossible) to read, the fluent version became impossible to read 10 pages ago.

Do you know of any similar approach to regular expressions?

No, except for the previous answer

Do you agree that this approach could be better than using simple strings?

Sort of - I think instead of single characters to represent constructs, we could use more descriptive markup, but I doubt it would make complex logic any clearer.

How would you design the API?

Translate each construct into a method name, and allow nested function calls so that it's very easy to take a string and substitute method names into it.

I think the majority of the value will be in defining a robust library of utility functions, like matching "emails", "phone numbers", "lines that don't contain X", etc. that can be customized.

Would you use such a neat utility in your projects?

Maybe - but probably only for the longer ones, where it would be easier to debug function calls than debugging string editing, or where there is a nice utility function ready to use.

Do you think this would be fun implementing? ;)

Of course!

In answer to the last part of the question (for Kudos)

private static final String pattern = "(?:(?:\\r\\n)?[ \\t])*(?:(?:(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t]"
    + ")+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:"
    + "\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:("
    + "?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ "
    + "\\t]))*\"(?:(?:\\r\\n)?[ \\t])*))*@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\0"
    + "31]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\"
    + "](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+"
    + "(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:"
    + "(?:\\r\\n)?[ \\t])*))*|(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z"
    + "|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)"
    + "?[ \\t])*)*\\<(?:(?:\\r\\n)?[ \\t])*(?:@(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\"
    + "r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?["
    + " \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)"
    + "?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t]"
    + ")*))*(?:,@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?["
    + " \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*"
    + ")(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t]"
    + ")+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*)"
    + "*:(?:(?:\\r\\n)?[ \\t])*)?(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+"
    + "|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r"
    + "\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:"
    + "\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t"
    + "]))*\"(?:(?:\\r\\n)?[ \\t])*))*@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031"
    + "]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\]("
    + "?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?"
    + ":(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?"
    + ":\\r\\n)?[ \\t])*))*\\>(?:(?:\\r\\n)?[ \\t])*)|(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?"
    + ":(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?"
    + "[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)*:(?:(?:\\r\\n)?[ \\t])*(?:(?:(?:[^()<>@,;:\\\".\\[\\] "
    + "\\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|"
    + "\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>"
    + "@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\""
    + "(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*))*@(?:(?:\\r\\n)?[ \\t]"
    + ")*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\"
    + "\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?"
    + ":[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\["
    + "\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*|(?:[^()<>@,;:\\\".\\[\\] \\000-"
    + "\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|("
    + "?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)*\\<(?:(?:\\r\\n)?[ \\t])*(?:@(?:[^()<>@,;"
    + ":\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[(["
    + "^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\""
    + ".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\"
    + "]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*(?:,@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\"
    + "[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\"
    + "r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] "
    + "\\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]"
    + "|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*)*:(?:(?:\\r\\n)?[ \\t])*)?(?:[^()<>@,;:\\\".\\[\\] \\0"
    + "00-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\"
    + ".|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,"
    + ";:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?"
    + ":[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*))*@(?:(?:\\r\\n)?[ \\t])*"
    + "(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\"."
    + "\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:["
    + "^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]"
    + "]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*\\>(?:(?:\\r\\n)?[ \\t])*)(?:,\\s*("
    + "?:(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\"
    + "\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:("
    + "?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=["
    + "\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t"
    + "])*))*@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t"
    + "])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?"
    + ":\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|"
    + "\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*|(?:"
    + "[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\"
    + "]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)*\\<(?:(?:\\r\\n)"
    + "?[ \\t])*(?:@(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\""
    + "()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)"
    + "?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>"
    + "@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*(?:,@(?:(?:\\r\\n)?["
    + " \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,"
    + ";:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t]"
    + ")*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\"
    + "\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*)*:(?:(?:\\r\\n)?[ \\t])*)?"
    + "(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\"."
    + "\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:"
    + "\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\["
    + "\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])"
    + "*))*@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])"
    + "+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\"
    + ".(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z"
    + "|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*\\>(?:("
    + "?:\\r\\n)?[ \\t])*))*)?;\\s*)";

matches RFC compliant email addresses :D

A regular expression is a description of a finite state machine. The classic textual representation isn't necessarily bad. It is compact, it is relatively unambigous and it is fairly well adopted.

It MAY be that a better representation would be a state transition diagram, but that would probably be hard to use in source code.

One possibility would be to build it from a variety of container and combiner objects.

Something along the line of the following (turning this from pseudo-code to language of choice is left as an exercise for the eager):

domainlabel = oneormore(characterclass("a-zA-Z0-9-"))
separator = literal(".")
domain = sequence(oneormore(sequence(domainlabel, separator)), domainlabel)
localpart = oneormore(characterclassnot("@"))
emailaddress = sequence(localpart, literal("@"), domain)

Note that the above will incorrectly classify arbritarily many email addresses as being valid that do NOT conform to the grammar they're required to follow, as that grammar requires more than a simple FSM for full parsing. I don't believe it'd misclassify a valid address as invalid, though.

It should correspond to [^@]+@([a-zA-Z0-9-]+.)+.([a-zA-Z0-9-]+)

4. Would you use such a neat utility in your projects?

I would most likely not. I think this is a case of using the right tool for the job. There are some great answers here such as: 1579202 from Jeremy Stein. I have recently "gotten" regular expressions on a recent project and found them to be very useful just as they are, and when properly commented they are understandable if you know the syntax.

I think the "knowing the syntax" part is what turns people off to Regular Expressions, to those who don't understand they look arcane and cryptic, but they are a powerful tool and in applied Computer Science (e.g. writing software for a living) I feel like intelligent professionals should and should be able to learn to use them and us them appropriately.

As they say "With great power comes great responsibility." I have seen people use regular expressions everywhere for everything, but used judiciously by someone who has taken the time to learn the syntax thoroughly, they are incredibly helpful; to me, adding another layer would in a way defeat their purpose, or at a minimum take away their power.

This just my own opinion and I can understand where people are coming from who would desire a framework like this, or who would avoid regular expressions, but I have hard time hearing "Regular Expressions are bad" from those who haven't take the time to learn them and make an informed decision.

To make everybody happy (regex masters and fluid interface proponents), make sure the fluid interface can output an appropriate raw regex pattern, and also take a regular regex using a Factory method and generate fluid code for it.

What you are looking for can be found here:. It is a regular expression buillder which follows the Wizard Design Pattern

I recently had this same idea.

Thought of implementing it myself, but then I found VerbalExpressions.

Let's compare: I have worked often with (N)Hibernate ICriteria queries, which can be considered a Fluent mapping to SQL. I was (and still am) enthusiastic about them, but did they make the SQL queries more legible? No, more to the contrary, but another benefit rose: it became much easier to programmatically build statements, to subclass them and create your own abstractions etc.

What I'm getting at is that using a new interface for a given language, if done right, can prove worthwhile, but don't think too highly of it. In many cases it won't become easier to read (nested subtraction character classes, Captures in look-behind, if-branching to name a few advanced concepts that will be hard to combine fluently). But in just as many cases, the benefits of greater flexibility outweigh the added overhead of syntax complexity.

To add to your list of possible alternative approaches and to take this out of the context of only Java, consider the LINQ syntax. Here's what it could look like (a bit contrived) (from, where and select are keywords in LINQ):

// for ^str(aa|bb){3}
from part in mystring
where part startswith "str"
and part hasgroups "aa" or "bb" as first    /* "aa" or "bb" in group 'first' */
and part repeats first 3                    /* repeat group 'first' 3 times */
select part + "extra"                       /* can contain complete statement block */

just a rough idea, I know. The good thing of LINQ is that it is checked by the compiler, a kind-of language in a language. By default, LINQ can also be expressed as fluent chained syntax, which makes it, if well designed, compatible with other OO languages.

I say go for it, I'm sure it's fun to implement.

I suggest using a query model (similar to jQuery, django ORM), where each function returns a query object, so you can chain them together.

any("a").some("b").one("@").some(chars).one(".").some(chars) //a*b+@\w+\.\w+

where chars is predefined to fit any character.

or can be achieved by using choices:

any("a").choice("x", "z") // a(x|z)

The argument to each function can be a string or another query. For example, the chars variable mentioned above can be defined as a query:

//this one is ascii only
chars = raw("a-zA-Z0-9")

And so, you can have a "raw" function that accepts a regex string as input if it feels to cumbersome to use the fluent query system.

Actually, these functions can just return raw regex, and chaining them is simply concatenating these raw regex strings.

any("a") ---> "a*"
raw("b+") ----> "b+"
one(".") ---> "\."
choice("a", "b") ----> (a|b)

I am not sure that replacing regexp with a fluent API would bring much.

Note that I am not a regexp wizard (I have to re-read the doc nearly every time I need to create a regexp).

A fluent API would make any medium-complexity regexp (let's say ~50 characters) even more complex than required and not easier to read in the end, although it may improve the creation of a regexp in an IDE, thanks to code completion. But code maintenance generally represents a higher cost than code development.

In fact, I am not even sure it would be possible to have an API smart enough to really provide enough guidance to the developer when creating a new regexp, not talking about ambiguous cases, as mentioned in a previous answer.

You mentioned a regexp example for an RFC. I am 99% sure (there is still 1% hope;-)) that any API would not make that example any simpler, but conversely that would only make it more complex to read! That's a typical example where you don't want to use regexp anyway!

Even regarding regexp creation, due to the problem of ambiguous patterns, it is probable that with a fluent API, you would never come with the right expression the first time, but would have to change several times until you get what you really want.

Make no mistake, I do love fluent interfaces; I have developed some libraries that use them, and I use several 3rd-party libraries based on them (e.g. FEST for Java testing). But I don't think they can be the golden hammer for any problem.

If we consider Java exclusively, I think the main problem with regexps is the necessary escaping of backslashes in Java string constants. That's one point that makes it incredibly difficult to create and understand regexp in Java. Hence, the first step to enhance Java regexp would, for me, be a language change, a la Groovy, where string constants don't need to escape backslashes.

So far that would be my only proposal to improve regexp in Java.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top