Regex help: My regex pattern will match invalid strings
22-09-2019
Question
The text string I want to validate consists of what I call "segments". A single segment might look like this:
[A-Z,S,3]
So far I managed to build this regex pattern
(?:\[(?<segment>[^,\]\[}' ]+?,[S|D],\d{1})\])+?
It works, but it returns matches even when the text string as a whole contains invalid text. I guess I need to use ^ and $ somewhere in my pattern, but I can't figure out how.
I would like my pattern to produce the following results:
[A-Z,S,3][A-Za-z0-9åäöÅÄÖ,D,4] -> OK (two segments)
[A-Z,S,3]aaaa[A-Za-z0-9åäöÅÄÖ,D,4] -> No match
crap[A-Z,S,3][A-Za-z0-9åäöÅÄÖ,D,4] -> No match
[A-Z,S,3][] -> No match
[A-Z,S,3][klm,D,4][0-9,S,1] -> OK (three segments)
Solution
Use ^ to anchor the start and $ to anchor the end. E.g.: ^(abc)*$ matches zero or more repetitions of the group ("abc" in this example), and the whole run must start at the start of the input string and end at the end of it.
Applied to your pattern:
^(?:\[(?<segment>[^,\]\[}' ]+?,[S|D],\d{1})\])+$
Using an ungreedy +? doesn't matter, as you require it to match until the end anyway. However, your regex has a few issues; this seems more like what you want:
^(?:\[[^,]+,[SD],\d\])+$
- I couldn't decipher what you meant by the first part, so my regex is more general than required. [^,]+, will match any sequence of non-commas followed by a comma, and in fact you should probably add ] to this negated character class.
- [S|D] is a character class of three characters, as | doesn't mean alternation here ((S|D) would mean the same as [SD] though).
- {1} is the default for any atom; you don't need to specify it.
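As a quick check of the points above, here is a minimal Python sketch that runs the anchored pattern (with ] added to the negated class, as suggested) against the five inputs from the question. The names SEGMENTS, CASES and is_valid are made up for this illustration and do not come from the question or the answer.

```python
import re

# Anchored pattern from the answer, with ] added to the negated class.
SEGMENTS = re.compile(r"^(?:\[[^],]+,[SD],\d\])+$")

# Inputs and expected outcomes copied from the question.
CASES = [
    ("[A-Z,S,3][A-Za-z0-9åäöÅÄÖ,D,4]", True),
    ("[A-Z,S,3]aaaa[A-Za-z0-9åäöÅÄÖ,D,4]", False),
    ("crap[A-Z,S,3][A-Za-z0-9åäöÅÄÖ,D,4]", False),
    ("[A-Z,S,3][]", False),
    ("[A-Z,S,3][klm,D,4][0-9,S,1]", True),
]

def is_valid(text):
    """True if the text consists of one or more well-formed segments."""
    return SEGMENTS.match(text) is not None

for text, expected in CASES:
    print(text, "->", is_valid(text), "(expected %s)" % expected)

# [S|D] is a three-character class, so it also accepts a literal "|".
pipe_in_three_char_class = bool(re.fullmatch(r"[S|D]", "|"))
pipe_in_two_char_class = bool(re.fullmatch(r"[SD]", "|"))
print(pipe_in_three_char_class, pipe_in_two_char_class)
```

One caveat for Python specifically: $ also matches just before a trailing newline, so if that matters, re.fullmatch without the anchors is the stricter choice.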
Python example (run it at codepad.org):
import re

def find_segments(input_string):
    """Return a list of (chars, type, digit) tuples, or None if the string is invalid."""
    results = []
    # Matches exactly one [...] segment; ] and , are excluded from the first field.
    regex = re.compile(r"\[([^],]+),([SD]),(\d)\]")
    start = 0
    while True:
        # match() only matches at position `start`, so segments must be contiguous.
        m = regex.match(input_string, start)
        if not m:  # no match
            return None  # whole string didn't match, do another action as appropriate
        results.append(m.group(1, 2, 3))
        start = m.end(0)
        if start == len(input_string):
            break
    return results

print(find_segments("[A-Z,S,3][klm,D,4][0-9,S,1]"))
# output:
# [('A-Z', 'S', '3'), ('klm', 'D', '4'), ('0-9', 'S', '1')]
The big difference here is that the expression matches only one complete [...] part, but it is applied in succession, so each match must start where the previous one ended, and the last one must end at the end of the string.
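One reason to apply a single-segment expression repeatedly, at least in Python's re module, is that a capture group inside a repeated group only keeps its last repetition. The sketch below illustrates this with the three-segment input from the question; the variable names are hypothetical, and the validate-then-finditer step is an alternative to the loop above, not the answer's own method.

```python
import re

text = "[A-Z,S,3][klm,D,4][0-9,S,1]"

# One pattern for the whole string: the captures inside the repeated
# group are overwritten on each repetition, so only the last survives.
whole = re.fullmatch(r"(?:\[([^],]+),([SD]),(\d)\])+", text)
last_only = whole.groups()
print(last_only)  # ('0-9', 'S', '1')

# Alternative: once the whole string is known to be valid,
# collect every segment with a single-segment pattern.
segment = re.compile(r"\[([^],]+),([SD]),(\d)\]")
all_segments = [m.groups() for m in segment.finditer(text)]
print(all_segments)
```

Note that finditer on its own skips over invalid text between segments, so it is only safe after the full-string check has succeeded.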
OTHER TIPS
You want something like this:
/^(\[[^],]+,[SD],\d\])+$/
Here is an example of how you could use this regular expression in C#:
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        string[] tests = {
            "[A-Z,S,3][A-Za-z0-9,D,4]",
            "[A-Z,S,3]aaaa[A-Za-z0-9,D,4]",
            "crap[A-Z,S,3][A-Za-z0-9,D,4]",
            "[A-Z,S,3][]",
            "[A-Z,S,3][klm,D,4][0-9,S,1]"
        };
        string segmentRegex = @"\[([^],]+,[SD],\d)\]";
        string lineRegex = "^(" + segmentRegex + ")+$";
        foreach (string test in tests)
        {
            bool isMatch = Regex.Match(test, lineRegex).Success;
            if (isMatch)
            {
                Console.WriteLine("Successful match: " + test);
                foreach (Match match in Regex.Matches(test, segmentRegex))
                {
                    Console.WriteLine(match.Groups[1]);
                }
            }
        }
    }
}
Output:
Successful match: [A-Z,S,3][A-Za-z0-9,D,4]
A-Z,S,3
A-Za-z0-9,D,4
Successful match: [A-Z,S,3][klm,D,4][0-9,S,1]
A-Z,S,3
klm,D,4
0-9,S,1