Question

I was reading the Groovy tutorial, which explains that you can create non-matching (that is, non-capturing) groups by starting the group with ?:. That way the group does not show up in the matcher. What I don't understand is why you would want to explicitly say "do not capture this group". Wouldn't it be simpler to just not put it in a group at all?


Solution

?: is used when you need to group part of a pattern but do not want to capture what it matches. This keeps code concise and is sometimes outright necessary. It also avoids storing text that you will not need after matching, which saves space.

Non-capturing groups are most often used together with the | (alternation) operator.

The alternation operator has the lowest precedence of all regex operators. That is, it tells the regex engine to match either everything to the left of the vertical bar, or everything to the right of the vertical bar. If you want to limit the reach of the alternation, you need to use parentheses for grouping. (http://www.regular-expressions.info/alternation.html).

In this case, you cannot simply leave the group out. You will need the alternation operator in many common regexes, such as those for email addresses and URLs. Hope that helps.

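/(?:http|ftp):\/\/([^\/\r\n]+)(\/[^\r\n]*)?/g is a sample URL regex in JavaScript that needs both the alternation operator and grouping. Without the group, the alternation would split the whole pattern in two (http on one side, ftp:// and the rest on the other), so for every http URL the match would be just http.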
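To make the difference concrete, here is a minimal sketch in Python rather than JavaScript. The URL is a made-up sample, and the second pattern is simply the regex above with the non-capturing group removed.

```python
import re

# Hypothetical sample input, chosen only for illustration.
url = "http://example.com/index.html"

# Grouped: the alternation is confined to the scheme.
grouped = re.compile(r"(?:http|ftp)://([^/\r\n]+)(/[^\r\n]*)?")

# Ungrouped: | splits the whole pattern into "http" OR "ftp://...".
ungrouped = re.compile(r"http|ftp://([^/\r\n]+)(/[^\r\n]*)?")

grouped_match = grouped.match(url)
print(grouped_match.group(0))  # http://example.com/index.html
print(grouped_match.group(1))  # example.com
print(grouped_match.group(2))  # /index.html

ungrouped_match = ungrouped.match(url)
print(ungrouped_match.group(0))  # http
print(ungrouped_match.group(1))  # None (the host group never took part)
```

Note that the (?:...) group does not shift the numbering: the host is still group 1 and the path is still group 2.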

OTHER TIPS

There are at least four reasons for using a non-capturing group:

1) Save Memory: When a capturing group matches, its content is stored separately in memory, whether you need it or not. That memory can add up quickly when you run a regex over a large data set and keep the results. For instance, [0-9]+(, [0-9]+)* will match a series of integers separated by commas and spaces, like 15, 13, 14. Let's assume you only need the whole matching string from the result (group 0). With the capturing version, you are really storing both "15, 13, 14" and ", 14", since the latter is the last text matched by the capturing group. You can save memory and time by using [0-9]+(?:, [0-9]+)* instead. It might not matter for such a short and simple example, but with more complicated regexes the extra memory usage adds up fast. As a bonus, non-capturing groups are generally also a little faster to process.

2) Simpler Code: If you've got a regex like ([a-z]+)( \.)* ([a-z]+) ([a-z]+) and want to extract the three words, you need to use groups 1, 3, and 4. That's not terribly difficult, but imagine that you later need to add another group between the last two words, as in ([a-z]+)( \.)* ([a-z]+)( \.)* ([a-z]+). The words are now in groups 1, 3, and 5, and if you use these group numbers in several places in your code, it may be hard to track them all down. Instead, you can write ([a-z]+)(?: \.)* ([a-z]+) ([a-z]+) at first, and then change it to ([a-z]+)(?: \.)* ([a-z]+)(?: \.)* ([a-z]+). Both versions put the words in groups 1, 2, and 3 respectively.

3) External Dependencies: You might have a function or library that expects a regex match with exactly n groups. This is an unusual situation, but making all the other groups non-capturing will satisfy the requirement.

4) Group Count Limits: Some languages limit the total number of capturing groups in a regex (Python before version 3.5 capped it at 100, for instance). It's unusual to need that many groups, but it is possible. Non-capturing groups do not count toward this limit, so using them wherever you don't need the captured text lets you run up against it less often. For instance:

((one|1), )((two|2), )…((nine_hundred_ninety_nine|999), )

where the … stands for all the in-between groups, would not work in some languages because it has too many capturing groups. But:

(?:(one|1), )(?:(two|2), )…(?:(nine_hundred_ninety_nine|999), )

would match and still return all the groups like one or 22.
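The points above about stored captures, stable group numbers, and group counts can be checked with a short Python sketch. The sample strings and the three-item number pattern are made up for illustration. The number pattern is a deliberately shortened stand-in, not the full regex from point 4.

```python
import re

# Reason 1: a repeated capturing group stores its last repetition.
numbers = "15, 13, 14"
cap_match = re.fullmatch(r"[0-9]+(, [0-9]+)*", numbers)
non_match = re.fullmatch(r"[0-9]+(?:, [0-9]+)*", numbers)
print(cap_match.group(0))  # 15, 13, 14
print(cap_match.groups())  # (', 14',)
print(non_match.groups())  # ()

# Reason 2: non-capturing separators keep the word groups at 1, 2, 3.
before = re.fullmatch(r"([a-z]+)(?: \.)* ([a-z]+) ([a-z]+)", "foo . bar baz")
after = re.fullmatch(r"([a-z]+)(?: \.)* ([a-z]+)(?: \.)* ([a-z]+)", "foo . bar . baz")
print(before.groups())  # ('foo', 'bar', 'baz')
print(after.groups())   # ('foo', 'bar', 'baz')

# With capturing separators, the words land in groups 1, 3 and 4.
shifted = re.fullmatch(r"([a-z]+)( \.)* ([a-z]+) ([a-z]+)", "foo . bar baz")
print(shifted.groups())  # ('foo', ' .', 'bar', 'baz')

# Reason 4: non-capturing wrappers halve the number of capturing groups.
# Hypothetical three-item version of the pattern, for illustration only.
all_capturing = re.compile(r"((one|1), )((two|2), )((three|3), )")
wrapped = re.compile(r"(?:(one|1), )(?:(two|2), )(?:(three|3), )")
print(all_capturing.groups)  # 6
print(wrapped.groups)        # 3

wrapped_match = wrapped.fullmatch("one, 2, three, ")
print(wrapped_match.groups())  # ('one', '2', 'three')
```

The groups attribute of a compiled pattern reports how many capturing groups it contains, which makes it a quick way to see how close a large pattern is to an engine's limit.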

Licensed under: CC-BY-SA with attribution