Question

I'm having an issue where my regex is matching too much. I've tried making it as non-greedy as possible. My regex is:

 define host( |\t)*{(.*\n)*?( |\t)*host_name( |\t)*HOST_B(.*\n)*?( |\t)*}

meaning

"define host" followed by any spaces or tabs followed by "{". Any text and newlines until any number of spaces or tabs followed by "host_name" followed by any number of spaces or tabs followed by "HOST_B". Any text plus newlines until any spaces or tabs followed by "}"

My text is

define host{
    field stuff
        }

define timeperiod{
        sunday          00:00-03:00,07:00-24:00
        }

define stuff{
        hostgroup_name                  things
        service_description             load
        dependent_service_description   cpu_util
        execution_failure_criteria      n
        notification_failure_criteria   w,u,c
        }

define host{
        use                     things
        host_name               HOST_A
        0alias                  stuff 
       }

define host{
        use                     things
        host_name               HOST_B
        alias                   ughj
        address                 1.6.7.6
       }

define host{
        use                     things
        host_name               HOST_C
       }

The match runs from the first define to HOST_B's closing bracket. It does not include HOST_C's group (which is correct), but I would like only HOST_B's group and not everything before it.

Any help? My regex is rusty. You can test on http://regexpal.com/


Solution

I have not tested it, but I guess you need to replace .* with [^{]*. This way your regex does not eat the next "{".

This looks strange to me: (.*\n)*? Have a look at DOTALL: if you set this flag, the dot also matches newlines.
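As a rough sketch of that suggestion in Python: both (.*\n)*? groups are swapped for [^{]*, which already matches newlines, so DOTALL is not needed for this variant. The shortened sample text and the variable names are assumptions made for illustration.

```python
import re

# Shortened sample in the same shape as the question's input (assumed).
text = """define host{
    field stuff
        }

define host{
        use                     things
        host_name               HOST_A
       }

define host{
        use                     things
        host_name               HOST_B
        alias                   ughj
       }

define host{
        use                     things
        host_name               HOST_C
       }
"""

# [^{]* can never cross an opening brace, so the match cannot begin in an
# earlier block and cannot run into the next "define host{".
pattern = re.compile(r"define host[ \t]*\{[^{]*host_name[ \t]*HOST_B[^{]*\}")

match = pattern.search(text)
block = match.group(0) if match else None
print(block)
```

On this sample the search returns only the HOST_B group. Braces inside field values would break the approach.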

OTHER TIPS

It's a bit different than what you asked for, but I think you may like the results. This will parse all your structures and load them into python dictionaries. From there, manipulation should be really nice and easy for you.

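import re

# `a` holds the full config text.
# re.S lets . match newlines, so (.*?) captures one block body up to its first "}".
mDefHost = re.findall(r"define host\{(.*?)\}", a, re.S)
# Each field line is a key, then whitespace, then a single-word value.
mInHost  = re.compile(r"(\S+)\s+(\S+)")
hostDefs = []

for item in mDefHost:
    hostDefs.append( dict(mInHost.findall(item)) )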

Example output:

>>> m = re.findall(r"define host\{(.*?)\}",a,re.S)
>>> m
['\n        use                     things\n        host_name               HOST_B\n            alias                   ughj\n        address                 1.6.7.6\n       ']
>>> item = m[0]
>>> item
'\n        use                     things\n        host_name               HOST_B\n            alias                   ughj\n        address                 1.6.7.6\n       '
>>> results = re.findall("(\S+)\s+(\S+)",item)
>>> results
[('use', 'things'), ('host_name', 'HOST_B'), ('alias', 'ughj'), ('address', '1.6.7.6')]
>>> dict(results)
{'alias': 'ughj', 'use': 'things', 'host_name': 'HOST_B', 'address': '1.6.7.6'}
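To get from those dictionaries back to the original goal, you can filter on host_name. This is a minimal sketch: the config string and the names block_re, pair_re and host_b are made up for the example, and it assumes every value is a single word with no spaces.

```python
import re

# Two host blocks in the same shape as the question's input (assumed).
config = """define host{
        use                     things
        host_name               HOST_A
       }

define host{
        use                     things
        host_name               HOST_B
        alias                   ughj
        address                 1.6.7.6
       }
"""

block_re = re.compile(r"define host\s*\{(.*?)\}", re.S)  # body of each host block
pair_re = re.compile(r"(\S+)\s+(\S+)")                   # key, whitespace, value

# One dict per "define host" block.
hosts = [dict(pair_re.findall(body)) for body in block_re.findall(config)]

# Pick the block whose host_name is HOST_B, or None if there is no such block.
host_b = next((h for h in hosts if h.get("host_name") == "HOST_B"), None)
print(host_b)
```

Comparing parsed values also avoids matching a longer name that merely starts with HOST_B.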

The problem is that you're using regex to search the entire string, but you're trying to find a substring that starts in a way indistinguishable from the start of the entire string. You can't use non-greedy matching to ensure that your starting point is as late as possible; the non-greedy modifier only affects how far ahead the Regex engine will look to find a match.
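A tiny illustration of that point, using a made-up one-line string rather than the real config: the lazy quantifier still lets the match begin at the first "define host", because the engine tries start positions from left to right and keeps the first one that can succeed.

```python
import re

# Made-up miniature input with two blocks.
text = "define host{ A } define host{ B }"

# .*? expands one character at a time, but only from the leftmost start that works.
lazy = re.search(r"define host\{.*?B \}", text)
print(lazy.group(0))  # define host{ A } define host{ B }
```

The match spans both blocks even though the quantifier is non-greedy.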

What you need is to make sure that you have no closing brackets between your define host and your HOST_B. Try this (untested):

define host\s*{[^}]*HOST_B.*?}

(Make sure you use a flag to allow . to match newlines.)
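In Python this idea might look like the sketch below. It assumes the intended pattern has [^}]* (zero or more non-brace characters) between the opening brace and HOST_B, escapes the braces for clarity, and uses re.DOTALL as the newline flag. The shortened sample text and the variable names are invented for the example.

```python
import re

# Shortened sample in the same shape as the question's input (assumed).
text = """define host{
    field stuff
        }

define host{
        use                     things
        host_name               HOST_A
       }

define host{
        use                     things
        host_name               HOST_B
        alias                   ughj
        address                 1.6.7.6
       }

define host{
        use                     things
        host_name               HOST_C
       }
"""

# [^}]* keeps the part before HOST_B inside a single block;
# .*? with DOTALL then stops at that block's own closing brace.
pattern = re.compile(r"define host\s*\{[^}]*HOST_B.*?\}", re.DOTALL)

match = pattern.search(text)
block = match.group(0) if match else None
print(block)
```

Note that this would also match a host whose name merely starts with HOST_B, such as HOST_B2. Adding \b after HOST_B is one way to tighten it.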

Licensed under: CC-BY-SA with attribution