Question

Is there a way to use a non-greedy regular expression in C, like one can in Perl? I tried several things, but none of them worked.

I'm currently using this regex, which matches an IP address and the corresponding HTTP request, but it is still greedy even though I'm using *?:

([0-9]{1,3}(\\.[0-9]{1,3}){3})(.*?)HTTP/1.1

In this example, it always matches the whole string:

#include <regex.h>
#include <stdio.h>

int main() {

    int a, i;
    regex_t re;
    regmatch_t pm;
    char *mpages = "TEST 127.0.0.1 GET /test.php HTTP/1.1\" 404 525 \"-\" \"Mozilla/5.0 (Windows NT  HTTP/1.1 TEST";

    a = regcomp(&re, "([0-9]{1,3}(\\.[0-9]{1,3}){3})(.*?)HTTP/1.1", REG_EXTENDED);

    if(a!=0)
        printf(" -> Error: Invalid Regex");

    a = regexec(&re, &mpages[0], 1, &pm, REG_EXTENDED);

    if(a==0) {

        for(i = pm.rm_so; i < pm.rm_eo; i++)
            printf("%c", mpages[i]);
        printf("\n");
    }
    return 0;
}

$ ./regtest

127.0.0.1 GET /test.php HTTP/1.1" 404 525 "-" "Mozilla/5.0 (Windows NT HTTP/1.1

Solution

No, POSIX regular expressions have no non-greedy quantifiers. But there is a library that provides Perl-like regular expressions for C: http://www.pcre.org/

OTHER TIPS

As I said earlier in a comment, use grep -E to test POSIX regexes; it speeds up development. Either way, it seems your problem is with the regular expression itself rather than with the missing feature.

I'm not quite clear on what you want to grab from the request. Supposing you just want the IP address, the HTTP verb and the resource, you could end up with the following regex:

regcomp(&re, "\\b(\\.?[0-9])+\\s+(GET|POST|PUT)\\s+([^ ]+)", REG_EXTENDED);

Be aware that several assumptions have been made. For example, this regex assumes the IP address is well formed, and that the request uses one of the HTTP verbs GET, POST, or PUT. Adjust it to your needs.

The brute-force method of getting a regex to match up to the next occurrence of a word is:

"([^H]|H[^T]|HT[^T]|HTT[^P]|HTTP[^/]|HTTP/[^1]|HTTP/1[^.]|HTTP/1\\.[^1])*HTTP/1\\.1"

unless you can get smarter about your match -- which you can: HTTP requests are

Request-Line   = Method SP Request-URI SP HTTP-Version CRLF

and none of the nonterminals on the right match embedded spaces. So:

"[0-9]{1,3}(\\.[0-9]{1,3}){3} [^ ]* [^ ]* HTTP/1\\.1"

This version drops the capturing parentheses, since you're only allocating space for the whole-expression match; put the parens back in to get the pieces.
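As a minimal sketch of how that space-delimited pattern could be used with regcomp/regexec: the helper name match_request and its buffer-based interface are made up for illustration, and the sample line is the one from the question.

```c
#include <regex.h>
#include <string.h>

/* Hypothetical helper: copies the first "IP METHOD URI HTTP/1.1" match found
   in line into out. Returns 0 on a match, -1 otherwise. */
int match_request(const char *line, char *out, size_t outsz)
{
    /* [^ ]* cannot cross a space, so the match cannot run past the
       HTTP/1.1 that follows the method and URI */
    const char *pattern = "[0-9]{1,3}(\\.[0-9]{1,3}){3} [^ ]* [^ ]* HTTP/1\\.1";
    regex_t re;
    regmatch_t pm[1];   /* one slot: only the whole-expression match */
    int rc = -1;

    if (outsz == 0 || regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    /* The last argument takes execution flags (0 here), not REG_EXTENDED */
    if (regexec(&re, line, 1, pm, 0) == 0) {
        size_t len = (size_t)(pm[0].rm_eo - pm[0].rm_so);
        if (len < outsz) {
            memcpy(out, line + pm[0].rm_so, len);
            out[len] = '\0';
            rc = 0;
        }
    }
    regfree(&re);
    return rc;
}
```

Called on the question's test string with a buffer of, say, 128 bytes, this should leave only the request part in the buffer, without the trailing user-agent text.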

a = regcomp(&re, "([0-9]{1,3}(\\.[0-9]{1,3}){3})(.*?)HTTP/1.1",  REG_EXTENDED|REG_ENHANCED);  

REG_ENHANCED is an Apple extension, and older SDKs do not define the macro. In Apple's regex.h it is guarded like this:

#if __MAC_OS_X_VERSION_MIN_REQUIRED  >= __MAC_10_8 \
 || __IPHONE_OS_VERSION_MIN_REQUIRED >= __IPHONE_6_0
#define REG_ENHANCED    0400    /* Additional (non-POSIX) features */
#endif
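Because the macro only exists on some platforms, one option is to test for it at compile time. The following is a rough sketch, not a tested recipe for Apple systems: the helper name compile_request_re is hypothetical, it assumes enhanced mode honours the lazy .*? quantifier, and the fallback pattern is the space-delimited one suggested in another answer.

```c
#include <regex.h>

/* Hypothetical helper: compile a pattern for "IP ... HTTP/1.1" into re.
   Returns the regcomp() result (0 on success). */
int compile_request_re(regex_t *re)
{
#ifdef REG_ENHANCED
    /* Apple's regex.h: enhanced mode is assumed to support the lazy .*? */
    return regcomp(re, "([0-9]{1,3}(\\.[0-9]{1,3}){3})(.*?)HTTP/1\\.1",
                   REG_EXTENDED | REG_ENHANCED);
#else
    /* Plain POSIX ERE: [^ ]* stands in for the lazy quantifier */
    return regcomp(re, "([0-9]{1,3}(\\.[0-9]{1,3}){3})( [^ ]* [^ ]* )HTTP/1\\.1",
                   REG_EXTENDED);
#endif
}
```

Either branch keeps the same three capture groups, so code that reads the match array does not need to know which pattern was compiled.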

In your code, pm should be an array of regmatch_t with 2 to 4 elements, depending on which () sub-expressions you want to capture, and the third argument to regexec (nmatch) should be that element count.

You have only one element. The first element, pm[0], always receives the text matched by your entire RE, so that is all you get. pm[1] receives the text of the first () sub-expression (the IP address), and pm[3] receives the text matching your (.*?) term.

But even so, as stated above (by Wumbley, W. Q.) the POSIX regex library may not support non-greedy quantifiers.
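To illustrate the array layout, here is a small sketch with four regmatch_t slots. The helper names copy_match and split_request are hypothetical, and because plain POSIX has no lazy quantifier, the third group uses the space-delimited pattern from another answer in place of (.*?).

```c
#include <regex.h>
#include <string.h>

/* Copy the text of one sub-match into out; leaves "" if the group did not
   participate in the match or does not fit. */
static void copy_match(const char *s, regmatch_t m, char *out, size_t outsz)
{
    size_t len;

    if (outsz == 0)
        return;
    out[0] = '\0';
    if (m.rm_so < 0)
        return;
    len = (size_t)(m.rm_eo - m.rm_so);
    if (len >= outsz)
        return;
    memcpy(out, s + m.rm_so, len);
    out[len] = '\0';
}

/* Hypothetical helper: split a log line into the IP and the text between
   the IP and HTTP/1.1. Returns 0 on a match, -1 otherwise. */
int split_request(const char *line, char *ip, size_t ipsz,
                  char *req, size_t reqsz)
{
    /* Three () groups, so four slots: pm[0] is always the whole match */
    regmatch_t pm[4];
    regex_t re;
    int rc = -1;

    if (regcomp(&re, "([0-9]{1,3}(\\.[0-9]{1,3}){3})( [^ ]* [^ ]* )HTTP/1\\.1",
                REG_EXTENDED) != 0)
        return -1;

    /* nmatch tells regexec how many slots pm has; eflags is 0 */
    if (regexec(&re, line, 4, pm, 0) == 0) {
        copy_match(line, pm[1], ip, ipsz);   /* group 1: the IP address */
        copy_match(line, pm[3], req, reqsz); /* group 3: method and URI */
        rc = 0;
    }
    regfree(&re);
    return rc;
}
```

pm[2] is skipped here because it only holds the last repetition of the inner (\.[0-9]{1,3}) group.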

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow