Regex Python / group quantifiers

https://stackoverflow.com/questions/6806811

25-10-2019
|

문제

I want to match a list of variables which look like directories, e.g.:

Same/Same2/Foot/Ankle/Joint/Actuator/Sensor/Temperature/Value=4.123
Same/Same2/Battery/Name=SomeString
Same/Same2/Home/Land/Some/More/Stuff=0.34

The length of the "subdirectories" is variable having an upper bound (above it's 9). I want to group every subdirectory except the 1st one which I named "Same" above.

The best I could come up with is:

^(?:([^/]+)/){4,8}([^/]+)=(.*)

It already looks for 4-8 subdirectories but only groups the last one. Why's that? Is there a better solution using group quantifiers?

Edit: Solved. Will use split() instead.

해결책

import re

regx = re.compile('(?:(?<=\A)|(?<=/)).+?(?=/|\Z)')


for ss in ('Same/Same2/Foot/Ankle/Joint/Actuator/Sensor/Temperature/Value=4.123',
           'Same/Same2/Battery/Name=SomeString',
           'Same/Same2/Home/Land/Some/More/Stuff=0.34'):

    print ss
    print regx.findall(ss)
    print

Edit 1

Now you have given more info on what you want to obtain ( _"Same/Same2/Battery/Name=SomeString becoming SAME2_BATTERY_NAME=SomeString"_ ) better solutions can be proposed: either with a regex or with split() , + replace()

import re
from os import sep

sep2 = r'\\' if sep=='\\' else '/'

pat = '^(?:.+?%s)(.+$)' % sep2
print 'pat==%s\n' % pat

ragx = re.compile(pat)

for ss in ('Same\Same2\Foot\Ankle\Joint\Actuator\Sensor\Temperature\Value=4.123',
           'Same\Same2\Battery\Name=SomeString',
           'Same\Same2\Home\Land\Some\More\Stuff=0.34'):

    print ss
    print ragx.match(ss).group(1).replace(sep,'_')
    print ss.split(sep,1)[1].replace(sep,'_')
    print

result

pat==^(?:.+?\\)(.+$)

Same\Same2\Foot\Ankle\Joint\Actuator\Sensor\Temperature\Value=4.123
Same2_Foot_Ankle_Joint_Actuator_Sensor_Temperature_Value=4.123
Same2_Foot_Ankle_Joint_Actuator_Sensor_Temperature_Value=4.123

Same\Same2\Battery\Name=SomeString
Same2_Battery_Name=SomeString
Same2_Battery_Name=SomeString

Same\Same2\Home\Land\Some\More\Stuff=0.34
Same2_Home_Land_Some_More_Stuff=0.34
Same2_Home_Land_Some_More_Stuff=0.34

Edit 2

Re-reading your comment, I realized that I didn't take in account that you want to upper the part of the strings that lies before the '=' sign but not after it.

Hence, this new code that exposes 3 methods that answer this requirement. You will choose which one you prefer:

import re

from os import sep
sep2 = r'\\' if sep=='\\' else '/'



pot = '^(?:.+?%s)(.+?)=([^=]*$)' % sep2
print 'pot==%s\n' % pot
rogx = re.compile(pot)

pet = '^(?:.+?%s)(.+?(?==[^=]*$))' % sep2
print 'pet==%s\n' % pet
regx = re.compile(pet)


for ss in ('Same\Same2\Foot\Ankle\Joint\Sensor\Value=4.123',
           'Same\Same2\Battery\Name=SomeString',
           'Same\Same2\Ocean\Atlantic\North=',
           'Same\Same2\Maths\Addition\\2+2=4\Simple=ohoh'):
    print ss + '\n' + len(ss)*'-'

    print 'rogx groups  '.rjust(32),rogx.match(ss).groups()

    a,b = ss.split(sep,1)[1].rsplit('=',1)
    print 'split split  '.rjust(32),(a,b)
    print 'split split join upper replace   %s=%s' % (a.replace(sep,'_').upper(),b)

    print 'regx split group  '.rjust(32),regx.match(ss.split(sep,1)[1]).group()
    print 'regx split sub  '.rjust(32),\
          regx.sub(lambda x: x.group(1).replace(sep,'_').upper(), ss)
    print

result, on a Windows platform

pot==^(?:.+?\\)(.+?)=([^=]*$)

pet==^(?:.+?\\)(.+?(?==[^=]*$))

Same\Same2\Foot\Ankle\Joint\Sensor\Value=4.123
----------------------------------------------
                   rogx groups   ('Same2\\Foot\\Ankle\\Joint\\Sensor\\Value', '4.123')
                   split split   ('Same2\\Foot\\Ankle\\Joint\\Sensor\\Value', '4.123')
split split join upper replace   SAME2_FOOT_ANKLE_JOINT_SENSOR_VALUE=4.123
              regx split group   Same2\Foot\Ankle\Joint\Sensor\Value
                regx split sub   SAME2_FOOT_ANKLE_JOINT_SENSOR_VALUE=4.123

Same\Same2\Battery\Name=SomeString
----------------------------------
                   rogx groups   ('Same2\\Battery\\Name', 'SomeString')
                   split split   ('Same2\\Battery\\Name', 'SomeString')
split split join upper replace   SAME2_BATTERY_NAME=SomeString
              regx split group   Same2\Battery\Name
                regx split sub   SAME2_BATTERY_NAME=SomeString

Same\Same2\Ocean\Atlantic\North=
--------------------------------
                   rogx groups   ('Same2\\Ocean\\Atlantic\\North', '')
                   split split   ('Same2\\Ocean\\Atlantic\\North', '')
split split join upper replace   SAME2_OCEAN_ATLANTIC_NORTH=
              regx split group   Same2\Ocean\Atlantic\North
                regx split sub   SAME2_OCEAN_ATLANTIC_NORTH=

Same\Same2\Maths\Addition\2+2=4\Simple=ohoh
-------------------------------------------
                   rogx groups   ('Same2\\Maths\\Addition\\2+2=4\\Simple', 'ohoh')
                   split split   ('Same2\\Maths\\Addition\\2+2=4\\Simple', 'ohoh')
split split join upper replace   SAME2_MATHS_ADDITION_2+2=4_SIMPLE=ohoh
              regx split group   Same2\Maths\Addition\2+2=4\Simple
                regx split sub   SAME2_MATHS_ADDITION_2+2=4_SIMPLE=ohoh

다른 팁

I probably misunderstood what exactly you want to do, but here is how you would do it without regex:

for entry in list_of_vars:
    key, value = entry.split('=')
    key_components = key.split('/')
    if 4 <= len(key_components) <= 8:
        # here the actual work is done
        print "%s=%s" % ('_'.join(key_components[1:]).upper(), value)

Just use split?

>>> p='Same/Same2/Foot/Ankle/Joint/Actuator/Sensor/Temperature/Value=4.123'
>>> p.split('/')
['Same', 'Same2', 'Foot', 'Ankle', 'Joint', 'Actuator', 'Sensor', 'Temperature', 'Value=4.123']

Also, if you want that key/val pair you can do something like this...

>>> s = p.split('/')
>>> s[-1].split('=')
['Value', '4.123']

A couple of variations on your theme. For one, I've always found regexen to be cryptic to the point of unmaintainable, so I wrote the pyparsing module. In my mind, I look at your code and think, "oh, it's a list of '/'-delimited strings, an '=' sign, and then some kind of rvalue." And that translates pretty directly into the pyparsing parser definition code. By adding a name here and there in the parser ("key" and "value", similar to named groups in regex), the output is pretty easily processed.

data="""\
Same/Same2/Foot/Ankle/Joint/Actuator/Sensor/Temperature/Value=4.123
Same/Same2/Battery/Name=SomeString
Same/Same2/Home/Land/Some/More/Stuff=0.34""".splitlines()

from pyparsing import Word, alphas, alphanums, Word, nums, QuotedString, delimitedList

wd = Word(alphas, alphanums)
number = Word(nums+'+-', nums+'.').setParseAction(lambda t:float(t[0]))
rvalue = wd | number | QuotedString('"')

defn = delimitedList(wd, '/')('key') + '=' + rvalue('value')

for d in data:
    result = defn.parseString(d)

Second, I question your approach at defining all of those variable names - creating variable names on the fly based on your data is a pretty well-recognized Code Smell (not necessarily bad, but you might really want to rethink this approach). I used a recursive defaultdict to create a navigable structure so that you can easily do operations like "find all the entries that are sub-elements of "Same2" (in this case, "Foot", "Battery", and "Home") - this kind of work is more difficult when trying to sift through some collection of variable names as found in locals(), it seems to me you will end up re-parsing these names to reconstruct the key hierarchy.

from collections import defaultdict

class recursivedefaultdict(defaultdict):
    def __init__(self, attrFactory=int):
        self.default_factory = lambda : type(self)(attrFactory)
        self._attrFactory = attrFactory
    def __getattr__(self, attr):
        newval = self._attrFactory()
        setattr(self, attr, newval)
        return newval

table = recursivedefaultdict()

# parse each entry, and accumulate into hierarchical dict
for d in data:
    # use pyparsing parser, gives us key (list of names) and value
    result = defn.parseString(d)
    t = table
    for k in result.key[:-1]:
        t = t[k]
    t[result.key[-1]] = result.value


# recursive method to iterate over hierarchical dict    
def showTable(t, indent=''):
    for k,v in t.items():
        print indent+k,
        if isinstance(v,dict):
            print
            showTable(v, indent+'  ')
        else:
            print v

showTable(table)

Prints:

Same
  Same2
    Foot
      Ankle
        Joint
          Actuator
            Sensor
              Temperature
                Value 4.123
    Battery
      Name SomeString
    Home
      Land
        Some
          More
            Stuff 0.34

If you are really set on defining those variable names, then adding some helpful parse actions to pyparsing will reformat the parsed data at parse time, so that it's directly processable afterwards:

wd = Word(alphas, alphanums)
number = Word(nums+'+-', nums+'.').setParseAction(lambda t:float(t[0]))
rvaluewd = wd.copy().setParseAction(lambda t: '"%s"' % t[0])
rvalue = rvaluewd | number | QuotedString('"')

defn = delimitedList(wd, '/')('key') + '=' + rvalue('value')

def joinNamesWithAllCaps(tokens):
    tokens["key"] = '_'.join(map(str.upper, tokens.key))
defn.setParseAction(joinNamesWithAllCaps)

for d in data:
    result = defn.parseString(d)
    print result.key,'=', result.value

Prints:

SAME_SAME2_FOOT_ANKLE_JOINT_ACTUATOR_SENSOR_TEMPERATURE_VALUE = 4.123
SAME_SAME2_BATTERY_NAME = "SomeString"
SAME_SAME2_HOME_LAND_SOME_MORE_STUFF = 0.34

(Note that this also encloses your SomeString value in quotes, so that the resulting assignment statement is valid Python.)

라이센스 : CC-BY-SA ~와 함께 속성

제휴하지 않습니다 StackOverflow