Python positive-lookbehind split variable-width

https://stackoverflow.com/questions/22748123

24-06-2023

Question

I thought I had set up the expression correctly, but the split is not working as intended.

c = re.compile(r'(?<=^\d\.\d{1,2})\s+');
for header in ['1.1 Introduction', '1.42 Appendix']:
    print re.split(c, header)

Expected result:

['1.1', 'Introduction']
['1.42', 'Appendix']

I am getting the following stacktrace:

Traceback (most recent call last):
  File "foo.py", line 1, in <module>
    c = re.compile(r'(?<=^\d\.\d{1,2})\s+');
  File "C:\Python27\lib\re.py", line 190, in compile
    return _compile(pattern, flags)
  File "C:\Python27\lib\re.py", line 242, in _compile
    raise error, v # invalid expression
sre_constants.error: look-behind requires fixed-width pattern
<<< Process finished. (Exit code 1)

Solution

Lookbehinds in Python's re module must be fixed-width. Because \d{1,2} can match either one or two digits, your lookbehind is variable-width and therefore not valid.
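To see the restriction in isolation, here is a minimal sketch. The helper name `compiles` is made up for illustration; it only reports whether `re.compile` accepts a pattern.

```python
import re

def compiles(pattern):
    """Hypothetical helper: True if the re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Variable width: \d{1,2} can match one or two characters, so this is rejected.
print(compiles(r'(?<=^\d\.\d{1,2})\s+'))  # False
# Fixed width: \d{2} always matches exactly two characters, so this is accepted.
print(compiles(r'(?<=^\d\.\d{2})\s+'))    # True
```

So a quantifier inside a lookbehind is fine as long as it pins down a single length.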

You can use a capture group as a workaround:

import re

c = re.compile(r'(^\d\.\d{1,2})\s+')
for header in ['1.1 Introduction', '1.42 Appendix']:
    # The match starts at position 0, so the first element is an empty string; drop it
    print(re.split(c, header)[1:])

Output:

['1.1', 'Introduction']
['1.42', 'Appendix']
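The empty first element appears because the pattern matches at the very start of the string, so there is nothing before the match. If the slicing feels fragile, one alternative is to match the whole header instead of splitting it. This is only a sketch: the `HEADER` pattern and the `split_header` helper are my own names, and returning the header unchanged when nothing matches is an assumption about what you would want.

```python
import re

# Assumed pattern: section number, whitespace, then the rest of the line.
HEADER = re.compile(r'(\d\.\d{1,2})\s+(.*)')

def split_header(header):
    """Hypothetical helper: return [number, title], or [header] if no match."""
    m = HEADER.match(header)
    return list(m.groups()) if m else [header]

for header in ['1.1 Introduction', '1.42 Appendix']:
    # Raw split with the capture group, before slicing off the empty string
    print(re.split(r'(^\d\.\d{1,2})\s+', header))
    # Matching the whole header avoids the empty element
    print(split_header(header))
```

For '1.1 Introduction' the raw split gives ['', '1.1', 'Introduction'], while split_header gives ['1.1', 'Introduction'].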

OTHER TIPS

The error in your regex is the {1,2} part: lookbehinds need to be fixed-width, so variable-width quantifiers are not allowed.

Try this website to test your regex before you put it in code.

But in your case you don't need a regex at all.

Simply try this:

for header in ['1.1 Introduction', '1.42 Appendix']:
    print(header.split(' '))

Result:

['1.1', 'Introduction']
['1.42', 'Appendix']
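One caveat: split(' ') cuts at every single space, so repeated spaces or a multi-word title give more than two items. A rough sketch of a variant that splits only once, on the first run of whitespace (the helper name `split_once` and the second sample header are made up for illustration):

```python
def split_once(header):
    # Hypothetical helper: split on the first run of whitespace only.
    return header.split(None, 1)

for header in ['1.1 Introduction', '1.42   Appendix A']:
    print(header.split(' '))   # splits at every single space
    print(split_once(header))  # at most two items
```

For '1.42   Appendix A' the first call gives ['1.42', '', '', 'Appendix', 'A'] and the second gives ['1.42', 'Appendix A'].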

Hope this helps.

My solution may look crude, but since you only allow one or two digits after the dot, you can use two fixed-width lookbehinds, one for each length:

c = re.compile(r'(?:(?<=^\d\.\d\d)|(?<=^\d\.\d))\s+')
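For completeness, a short sketch of that pattern run against the headers from the question. Each alternative has a single fixed width, which is why it compiles:

```python
import re

# Two fixed-width lookbehinds: one for two digits after the dot, one for one digit.
c = re.compile(r'(?:(?<=^\d\.\d\d)|(?<=^\d\.\d))\s+')

for header in ['1.1 Introduction', '1.42 Appendix']:
    print(c.split(header))
```

This prints ['1.1', 'Introduction'] and ['1.42', 'Appendix'], with no empty first element to strip. Because of the ^ inside the lookbehinds, whitespace later in the title is left alone.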

Licensed under: CC-BY-SA with attribution
