Question

I am trying to extract table names from a SQL script. I have the following string:

from db1.tableX tbx1 --quantity table
inner join tableY tbx2  on tbx1.xyz=tbx2.xyz

I don't want to match this string because tabley is not prefixed with a database name. The way I am trying to do this is by detecting the word "on" in the string before the database name.

My regex is matching tablex but also matches tbx1, which I dont want my expression to match.

I would like it to match only in this situation.

from db1.tableX tbx1 --quantity table
inner join db1.tableY tbx2 on tbx1.xyz = tbx2.xyz

My regex should give me tableX and tableY because they are prefixed by a database name.

Here is my expression:

(insert\s+into|from|inner\s+join|left\s+outer\s+join|join)\s+[\sa-zA-Z0-9_$#-]*\.\s*(?<table>[a-zA-Z0-9_]+)(?=\s+[a-zA-Z0-9_$#-]+)*
Was it helpful?

Solution

You can simplify your expression

(?<=\b(from|join)\s+[a-zA-Z0-9_$#-]*\.\s*)[a-zA-Z0-9_]+

It will yield the table name directly, without having to use named groups. I am using the pattern

(?<=prefix)find

This returns only the part "find" as match value. As prefix I use "from" or "join" followed by the database name the dot and possibly spaces.

\b is the beginning or end of a word.

The last part [a-zA-Z0-9_]+ is the table name.

It makes no difference whether you have an inner join, outer join etc., so I dropped this distinction.

Comments are ugly to detect, beacuse they can contain about anything and can appear about everywhere. Also you have two types of comments -- Until end of line and \* ... *\.

You could try to detect comments and spaces like this

(\s|--.*?$|\\\*.*?\*\\)+

Note that the backspace and the star have to be escaped, so \* becomes \\\* and *\ becomes \*\\.

$ denotes the end of line. The ? after .* ensures that not several comments are skipped at once, thus skipping meaningful text between them.

This will work in most cases; however, comment like structures could be included within string literals: SELECT 'hello -- world' AS greeting FROM .... This is not a comment! Without a full syntax analysis of the SQL text you will not be able to cover all possible cases. But you cannot do this with Regex. Regex has its limits.

OTHER TIPS

Step by step:

1) http://regexr.com?33tto

I don't know why you've put that lookahead at the end of the match. I don't think you need it, so I removed it.

2) http://regexr.com?33ttr

Now we get to the problem that you've indicated. The problem is that in your database name matching [\sa-zA-Z0-9_$#-]* you've included whitespace as a valid char. I've removed that too.

3) http://regexr.com?33ttu

Is this the desired outcome?

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top