Question

I'm writing a simplified SQL parser that's using regexes to match each valid command. I'm stuck on matching the following:

attribute1 type1, attribute2 type2, attribute3 type3, ...

Where attributes are names of table columns, and types can be a CHAR(size), INT, or DEC. This is used in a CREATE TABLE statement:

CREATE TABLE student (id INT, name CHAR(20), gpa DEC);

To debug it, I'm trying to match this:

id INT, name CHAR(20), gpa DEC

with this:

(?<attributepair>[A-Za-z0-9_]+ (INT|(CHAR\([0-9]{1,3}\))|DEC))(, \k<attributepair>)*

I even tried it without naming the backreference:

([A-Za-z0-9_]+ (INT|(CHAR\([0-9]{1,3}\))|DEC))(, \1)*

I tested the latter regex with RegexPal and it matched, but neither one matches when I try it in my Java program. Is there something I'm missing? How can I make this work? Perhaps it has something to do with how I'm calling Pattern.compile(), such as a missing flag. I'm on JDK 7.

Update: I've found that although matches() returns false, lookingAt() and find() return true. It's matching each individual attribute. I want to craft my regex so it matches the whole expression rather than each attribute.


Solution

There is no "match as many times as possible and join all the groups together" in Java.
You either have to do it yourself using:

while(matcher.find()) {
    // ...
}

... or use a regex that already matches everything in a single call to find().


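For example, you could try the following regex (written as a Java string literal) instead, which matches all of your attributes at once. Note that it is lenient: both the size after CHAR and the ", " separator are optional.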

(?:\\w+ (?:INT|CHAR(?:\\(\\d{1,3}\\))?|DEC)(?:, )?)+

Here is a working example:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

final String str = "CREATE TABLE student (id INT, name CHAR(20), gpa DEC);";
// One or more "name type" pairs, each optionally followed by ", "
final Pattern p = Pattern.compile("(?:\\w+ (?:INT|CHAR(?:\\(\\d{1,3}\\))?|DEC)(?:, )?)+");
final Matcher m = p.matcher(str);
if (m.find()) {
    System.out.println(m.group());  // prints "id INT, name CHAR(20), gpa DEC"
}

Output:

id INT, name CHAR(20), gpa DEC
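
The other option mentioned above, looping with find() yourself, could look roughly like the sketch below. The class name AttributeLoop, the helper parsePairs, and the "name:type" output format are made up for illustration, and the pattern here requires a size after CHAR, unlike the lenient regex above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class AttributeLoop {
    // One "name type" pair: group 1 is the column name, group 2 is the type.
    private static final Pattern PAIR =
            Pattern.compile("(\\w+) (INT|CHAR\\(\\d{1,3}\\)|DEC)");

    // Collects every pair that find() locates, formatted as "name:type".
    static List<String> parsePairs(String input) {
        List<String> pairs = new ArrayList<String>();
        Matcher m = PAIR.matcher(input);
        while (m.find()) {
            pairs.add(m.group(1) + ":" + m.group(2));
        }
        return pairs;
    }

    public static void main(String[] args) {
        // prints [id:INT, name:CHAR(20), gpa:DEC]
        System.out.println(parsePairs("id INT, name CHAR(20), gpa DEC"));
    }
}
```

Keep in mind that find() silently skips any text that does not match, so this loop extracts the pairs but does not by itself prove that the whole attribute list is well formed.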

OTHER TIPS

When you write something like ([A-Za-z0-9_]+ (INT|(CHAR\([0-9]{1,3}\))|DEC))(, \1)* the backreference \1 matches only the exact text that the first group actually matched, not the group's pattern.

For example, in id INT, id INT, name CHAR(20), gpa DEC the backreference lets id INT, id INT become part of the same match, while name CHAR(20) and gpa DEC are matched separately. (If you paste that into RegexPal, the highlighting shows the difference quite clearly.)
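
To get matches() to accept the whole attribute list, one way is to repeat the sub-pattern itself instead of using a backreference. The following is a minimal sketch of that idea; the class name AttributeListMatch and the constants PAIR, WITH_BACKREF, and WITH_REPEAT are names invented for this example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class comparing a backreference with a repeated sub-pattern.
class AttributeListMatch {
    // Regex source for a single "name type" pair.
    private static final String PAIR = "\\w+ (?:INT|CHAR\\(\\d{1,3}\\)|DEC)";

    // Backreference version: \1 must repeat the exact text that group 1 matched.
    static final Pattern WITH_BACKREF =
            Pattern.compile("(" + PAIR + ")(?:, \\1)*");

    // Repeated-subpattern version: the pair pattern is spelled out a second time.
    static final Pattern WITH_REPEAT =
            Pattern.compile(PAIR + "(?:, " + PAIR + ")*");

    public static void main(String[] args) {
        String attrs = "id INT, name CHAR(20), gpa DEC";
        System.out.println(WITH_BACKREF.matcher(attrs).matches());            // false
        System.out.println(WITH_BACKREF.matcher("id INT, id INT").matches()); // true
        System.out.println(WITH_REPEAT.matcher(attrs).matches());             // true
    }
}
```

Building the pattern from a shared PAIR string keeps the two copies of the sub-pattern in sync, which is the usual workaround because java.util.regex has no syntax for reusing a group's pattern.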

Licensed under: CC-BY-SA with attribution