Question

I've run into strange behavior in java.util.regex.Matcher. Consider this example:

    Pattern p = Pattern.compile("\\d*");
    String s = "a1b";
    Matcher m = p.matcher(s);
    while(m.find())
    {
        System.out.println(m.start()+" "+m.end());
    }

It produces output:

0 0
1 2
2 2
3 3

I can understand all the lines except the last. The matcher reports an extra match (3,3) past the last character of the string. But the Javadoc for the start() method says:

start() Returns the start index of the previous match.

The same happens with the dot-star pattern:

    Pattern p = Pattern.compile(".*");
    String s = "a1b";
    Matcher m = p.matcher(s);
    while(m.find())
    {
        System.out.println(m.start()+" "+m.end());
    }

Output:

0 3
3 3

But if I specify line boundaries:

    Pattern p = Pattern.compile("^.*$");

The output is "right":

0 3

Can someone explain the reason for this behavior?


Solution

The pattern "\\d*" matches zero or more digits. The same holds for ".*", which matches zero or more occurrences of any character except a line terminator.

The last match that you get is the empty string at the end of your input, after "b". The empty string satisfies the pattern \\d*. If you change the pattern to \\d+, which requires at least one digit, you'll get the expected result.
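To compare the two quantifiers side by side, here is a minimal sketch. The class name EmptyMatchDemo and the helper spans are made up for illustration; only Pattern and Matcher come from java.util.regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmptyMatchDemo {
    // Hypothetical helper: collects "start end" for every match find() reports.
    public static List<String> spans(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start() + " " + m.end());
        }
        return out;
    }

    public static void main(String[] args) {
        // \d* accepts zero digits, so empty matches appear at 0, 2 and 3.
        System.out.println(spans("\\d*", "a1b")); // [0 0, 1 2, 2 2, 3 3]
        // \d+ needs at least one digit, so only "1" is reported.
        System.out.println(spans("\\d+", "a1b")); // [1 2]
    }
}
```

With \\d+ the empty matches at positions 0 and 2 disappear as well, not only the one at the end.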

Similarly, the pattern .* matches everything from the first character to the last. Thus it first matches "a1b". After that, the cursor is after b: "a1b|". When matcher.find() runs again, it finds a zero-length string at the cursor, which satisfies the pattern .*, so it reports it as a match.
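Printing the matched text next to the indices makes the zero-length second match visible. This is only a sketch; DotStarDemo and describe are illustrative names, not part of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DotStarDemo {
    // Hypothetical helper: returns each match as "start end [text]".
    public static List<String> describe(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start() + " " + m.end() + " [" + m.group() + "]");
        }
        return out;
    }

    public static void main(String[] args) {
        // First match is the whole input, second is the empty string at index 3.
        System.out.println(describe(".*", "a1b")); // [0 3 [a1b], 3 3 []]
    }
}
```

If empty matches are not wanted, one option is to skip any match where m.start() == m.end() inside the loop.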

The reason it gives the expected output with "^.*$" is that the trailing empty string doesn't satisfy the ^ anchor. Position 3 is not at the beginning of the input, so the match fails there.
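To check that it is the ^ anchor, and not $, that rules out the trailing empty match, the anchors can be tried one at a time. This is a sketch assuming the default flags (no MULTILINE); AnchorDemo and spans are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnchorDemo {
    // Hypothetical helper: collects "start end" for every match find() reports.
    public static List<String> spans(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start() + " " + m.end());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(spans("^.*$", "a1b")); // [0 3]
        // ^ alone is enough: it cannot match at index 3.
        System.out.println(spans("^.*", "a1b"));  // [0 3]
        // $ alone is not: the empty string at index 3 is still at the end.
        System.out.println(spans(".*$", "a1b"));  // [0 3, 3 3]
    }
}
```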

Licensed under: CC-BY-SA with attribution