Question

Directly from the Java API documentation (Ctrl+F for "Group name"):

The captured input associated with a group is always the subsequence that the group most recently matched. If a group is evaluated a second time because of quantification then its previously-captured value, if any, will be retained if the second evaluation fails. Matching the string "aba" against the expression (a(b)?)+, for example, leaves group two set to "b". All captured input is discarded at the beginning of each match.

I know how capturing groups work and how they are used with backreferences. However, I do not understand the part of the API documentation quoted above. Could somebody put it in other words?

Thanks in advance.


Solution

That quote says that:

If you apply a quantifier (+, *, ? or {m,n}) to a capture group and the group matches more than once, then only the last match stays associated with the capture group; all previous matches are overwritten.

For example, if you match (a)+ against the string "aaaaaa", capture group 1 will refer to the last a.
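A minimal sketch of that example, assuming the standard java.util.regex classes; the class name LastCaptureDemo and the helper describeGroupOne are hypothetical names chosen for illustration. It reports the text of group 1 together with its start and end offsets, which shows that the group refers to the final a rather than the first.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LastCaptureDemo {
    // Hypothetical helper: matches (a)+ against the whole input and
    // returns group 1 as "text@start-end", or "no match".
    public static String describeGroupOne(String input) {
        Matcher m = Pattern.compile("(a)+").matcher(input);
        if (!m.matches()) {
            return "no match";
        }
        return m.group(1) + "@" + m.start(1) + "-" + m.end(1);
    }

    public static void main(String[] args) {
        // The group matched six times; only the last iteration is kept.
        System.out.println(describeGroupOne("aaaaaa")); // prints a@5-6
    }
}
```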

Now consider the case, where you have a nested capture group as in the example shown in your quote:

`(a(b)?)+`

When this regex is matched against the string "aba", the quantified outer group is evaluated in the following 2 iterations:

  • "ab" - Capture Group 1 = "ab" (due to the outer parentheses), Capture Group 2 = "b" (due to the inner parentheses)
  • "a" - Capture Group 1 = "a", Capture Group 2 matches nothing in this iteration. (This is because the inner group (b) is made optional by the ?, so the outer group still successfully matches the last a.)

So, in the end, capture group 1 contains "a", which overrides the earlier capture "ab", and capture group 2 still contains "b": it is not overridden, because the inner group did not match in the second iteration and its previous value is retained.
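The walk-through above can be checked with a short sketch, again assuming java.util.regex; NestedGroupDemo and groups are hypothetical names. The second call uses "aa" as a contrast: there the inner group never matches at all, so group 2 is null rather than a retained value.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NestedGroupDemo {
    // Hypothetical helper: matches (a(b)?)+ against the whole input and
    // returns {group 1, group 2}, or null if the input does not match.
    public static String[] groups(String input) {
        Matcher m = Pattern.compile("(a(b)?)+").matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        // Group 1 is overwritten by the last iteration; group 2 keeps "b".
        System.out.println(Arrays.toString(groups("aba"))); // prints [a, b]
        // The inner group never participates, so group 2 is null.
        System.out.println(Arrays.toString(groups("aa")));  // prints [a, null]
    }
}
```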

Other tips

Whether the capture groups are named or not is irrelevant in this case.

Consider this input text:

foo-bar-baz

and this regex:

[a-z]+(-[a-z]+)*

Now the question is: what is captured by group 1?

As the regex progresses through the text, [a-z]+ first consumes foo; the group then matches -bar, which becomes the content of group 1; but then the engine goes on in the text and the group matches -baz, which is now the new content of group 1.

Therefore, -bar is "lost": the regex engine has discarded it because further text in the input matched the capturing group. This is what is meant by this:

[t]he captured input associated with a group is always the subsequence that the group most recently matched [emphasis mine]
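A minimal sketch of this behaviour, assuming java.util.regex; RepeatedGroupDemo, lastSegment and allSegments are hypothetical names. The second helper is not part of the answer above: it shows one possible way to keep every repetition, by scanning with find() using only the group's subpattern instead of relying on the quantified group.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RepeatedGroupDemo {
    // Hypothetical helper: returns what group 1 holds after a full match,
    // i.e. only the most recent repetition (null if there is no match).
    public static String lastSegment(String input) {
        Matcher m = Pattern.compile("[a-z]+(-[a-z]+)*").matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    // Hypothetical helper: collects every "-word" segment by calling
    // find() repeatedly, so earlier segments are not lost.
    public static List<String> allSegments(String input) {
        List<String> segments = new ArrayList<>();
        Matcher m = Pattern.compile("-[a-z]+").matcher(input);
        while (m.find()) {
            segments.add(m.group());
        }
        return segments;
    }

    public static void main(String[] args) {
        System.out.println(lastSegment("foo-bar-baz")); // prints -baz
        System.out.println(allSegments("foo-bar-baz")); // prints [-bar, -baz]
    }
}
```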

Licensed under: CC-BY-SA with attribution