Question

On my OS X 10.5.8 machine, using the regcomp and regexec C functions to match the extended regex "(()|abc)xyz", I find a match for the string "abcxyz" but only from offset 3 to offset 6. My expectation was that the entire string would be matched and that I would see a submatch for the initial "abc" part of the string.

When I try the same pattern and text with awk on the same machine, it shows a match for the entire string as I would expect.

I expect that my limited experience with regular expressions may be the problem. Can somebody explain what is going on? Is my regular expression valid? If so, why doesn't it match the entire string?

I understand that "((abc){0,1})xyz" could be used as an alternative, but the pattern of interest is being automatically generated from another pattern format and eliminating instances of "()" is extra work I'd like to avoid if possible.

For reference, the flags I'm passing to regcomp consist only of REG_EXTENDED. I pass an empty set of flags (0) to regexec.

Was it helpful?

Solution

The POSIX standard says:

9.4.3 ERE Special Characters

An ERE special character has special properties in certain contexts. Outside those contexts, or when preceded by a <backslash>, such a character shall be an ERE that matches the special character itself. The extended regular expression special characters and the contexts in which they shall have their special meaning are as follows:

.[\(

The <period>, <left-square-bracket>, <backslash>, and <left-parenthesis> shall be special except when used in a bracket expression (see RE Bracket Expression ). Outside a bracket expression, a <left-parenthesis> immediately followed by a <right-parenthesis> produces undefined results.

What you are seeing is the result of invoking undefined behaviour - anything goes.

If you want reliable, portable results, you will have to eliminate the empty '()' notations.

OTHER TIPS

If you iterate over all matches, and don't get both [3,6) and [0,6), then there's a bug. I'm not sure what posix mandates as far as order in which matches are returned.

Try (abc|())xyz - I bet it'll produce the same result in both places. I can only presume that the C version is trying to match xyz wherever it can, and if that fails, it tries to match abcxyz wherever it can (but, as you see, it doesn't fail, so we never bother with the "abc" part) whereas awk must be using it's own regex engine that performs the way you expect.

Your regex is valid. I think the problem is either a) POSIX isn't very clear about how the regex should work, or b) awk isn't using 100% POSIX-compliant regexes (probably because it appears OS X ships with a more original version of awk). Whichever problem it is, it's probably caused because this is somewhat of an edge case and most people wouldn't write the regex that way.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top