Using a regular expression to split a string on multiple spaces

https://stackoverflow.com/questions/18681828

28-06-2022

Question

I'm trying to split a string that is delimited by multiple spaces, for example:

    import re
    string1 = "abcd    efgh   a. abcd   b efgh"
    print(re.findall(r"[\w.]+", string1))

As expected, the result is:

    ['abcd', 'efgh', 'a.', 'abcd', 'b', 'efgh']

However, I would like to group 'a.' and 'abcd' into the same group, and 'b' and 'efgh' into the same group. So the result I want would look something like:

    ['abcd', 'efgh', 'a. abcd', 'b efgh']

My current approach is to write two expressions. The first handles tokens without a space, such as 'abcd' and 'efgh'. The second handles tokens that contain a single space, such as 'a. abcd'.

So r'[\w]+' can deal with the first type and r'[\w]+ [\w]+' can deal with the second type, but I don't know how to combine them into one expression using '|'.

As always, any other approaches are welcome. And thanks for your time!

Solution

    result = [s.strip() for s in string1.split('  ') if s.strip()]

That is, split on two consecutive spaces, then use strip to remove leftover spaces from each piece and to drop the pieces that end up empty.
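A minimal sketch of the same idea wrapped in a function, written for Python 3. The name `split_on_double_space` is hypothetical and not part of the answer. Printing the raw pieces shows why the filter is needed: runs of three or four spaces leave empty or space-padded pieces behind.

```python
def split_on_double_space(text):
    """Split text on two-space delimiters and drop empty pieces.

    Hypothetical helper name; the logic is the list comprehension
    from the answer.
    """
    pieces = text.split('  ')  # longer runs of spaces leave '' or ' x' pieces
    return [p.strip() for p in pieces if p.strip()]


string1 = "abcd    efgh   a. abcd   b efgh"

raw_pieces = string1.split('  ')
fields = split_on_double_space(string1)

print(raw_pieces)  # ['abcd', '', 'efgh', ' a. abcd', ' b efgh']
print(fields)      # ['abcd', 'efgh', 'a. abcd', 'b efgh']
```

Note that this only treats literal spaces as delimiters; tabs or other whitespace would stay inside the fields.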

OTHER TIPS

If you want to use re.findall, you can use this expression:

    >>> import re
    >>> string1 = "abcd    efgh   a. abcd   b efgh"
    >>> print(re.findall(r"\S+(?:\s\S+)*", string1))
    ['abcd', 'efgh', 'a. abcd', 'b efgh']

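\S+(?:\s\S+)* matches a run of non-whitespace characters, followed by zero or more repetitions of a single whitespace character and another run of non-whitespace characters. Because the inner group can repeat, it also works when a field contains more than two words: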

    >>> string1 = "abcd    efgh   a. abcd   b efgh ijkl"
    >>> print(re.findall(r"\S+(?:\s\S+)*", string1))
    ['abcd', 'efgh', 'a. abcd', 'b efgh ijkl']
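The question asked how to combine its two patterns with '|'. A sketch of that alternation, assuming tokens consist only of word characters and dots as in the question's [\w.]+ (the variable names are illustrative). The two-token alternative has to come first, because Python's re module tries alternatives left to right and takes the first one that matches.

```python
import re

string1 = "abcd    efgh   a. abcd   b efgh"

# Two-token alternative first, single-token alternative second.
pair_first = re.findall(r"[\w.]+ [\w.]+|[\w.]+", string1)

# With the order reversed, the single-token alternative always wins.
single_first = re.findall(r"[\w.]+|[\w.]+ [\w.]+", string1)

print(pair_first)    # ['abcd', 'efgh', 'a. abcd', 'b efgh']
print(single_first)  # ['abcd', 'efgh', 'a.', 'abcd', 'b', 'efgh']
```

This form groups at most two tokens per field, so the repeating pattern above is the better fit when a field can hold three or more words.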

Otherwise, it's much simpler to split on runs of two or more whitespace characters:

    >>> string1 = "abcd    efgh   a. abcd   b efgh ijkl"
    >>> print(re.split(r"\s{2,}", string1))
    ['abcd', 'efgh', 'a. abcd', 'b efgh ijkl']
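One thing to watch with re.split: if the string starts or ends with a run of whitespace, the result contains empty strings at the edges. A small sketch of one way to handle that, assuming outer padding should simply be ignored; `split_fields` is a hypothetical helper name, not something from the answer.

```python
import re


def split_fields(text):
    """Split on runs of 2+ whitespace characters, ignoring outer padding.

    Hypothetical helper; strips the input first and drops empty fields.
    """
    return [f for f in re.split(r"\s{2,}", text.strip()) if f]


padded = "  abcd    efgh   a. abcd   b efgh ijkl  "

naive = re.split(r"\s{2,}", padded)
cleaned = split_fields(padded)

print(naive)    # ['', 'abcd', 'efgh', 'a. abcd', 'b efgh ijkl', '']
print(cleaned)  # ['abcd', 'efgh', 'a. abcd', 'b efgh ijkl']
```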

Licensed under: CC-BY-SA with attribution
