Question

string = 'protein219 Info=Acidfast Name="Mycobacterium   smegmatis" pcp=36789'

I would like to split the string on whitespace, ignoring any whitespace inside double quotes. I am using the regex below to split the line:

mystring = [s for s in re.split("( |\\\".*?\\\"|'.*?')", string) if s.strip()]

which gives me this result:

['protein219', 'Info=Acidfast', 'Name=' , '"Mycobacterium  smegmatis"', 'pcp=','36789']

Expected Output:

['protein219', 'Info=Acidfast', 'Name="Mycobacterium   smegmatis"', 'pcp=36789']

How can I get the expected output?

No correct solution

OTHER TIPS

Don't use re.split() for this. Match the tokens themselves with re.findall() instead:

>>> re.findall(r'(?:"[^"]*"|[^\s"])+', string)
['protein219', 'Info=Acidfast', 'Name="Mycobacterium   smegmatis"', 'pcp=36789']

Explanation:

(?:       # Start of non-capturing group
 "[^"]*"  # Either match a quoted string
|         # or
 [^\s"]   # anything besides spaces or quotes
)+        # End of group, match at least once
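
As a minimal sketch, the same pattern can be compiled with re.VERBOSE so that the explanation above lives next to the regex. The names TOKEN and split_fields are made up for this illustration and are not part of the original answer.

```python
import re

# Same pattern as above, written in verbose mode so each part can be commented.
# TOKEN and split_fields are illustrative names, not from the original answer.
TOKEN = re.compile(r'''
    (?:            # start of non-capturing group
        "[^"]*"    # a complete double-quoted run, whitespace included
      |            # or
        [^\s"]     # one character that is neither whitespace nor a quote
    )+             # repeat, so adjacent pieces join into a single token
''', re.VERBOSE)


def split_fields(line):
    """Return whitespace-separated tokens, keeping quoted whitespace intact."""
    return TOKEN.findall(line)


string = 'protein219 Info=Acidfast Name="Mycobacterium   smegmatis" pcp=36789'
print(split_fields(string))
```

One thing to be aware of: a quote with no closing partner matches neither alternative, so it is silently skipped and the text around it is split on whitespace as usual.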

You need every token that either contains no whitespace, or contains whitespace only between quotes:

re.findall(r'[^\s]*".*"', string)

will match Name="Mycobacterium   smegmatis", and

re.findall(r'[^\s]+', string)

will match all the others. Combining the two:

re.findall(r'(?:[^\s]*".*")|(?:[^\s]+)', string)

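((?:...) is a non-capturing group, so findall returns a plain list of strings rather than tuples of groups. Note that ".*" is greedy: if a line contains more than one quoted value, the match runs from the first quote to the last one.)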
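
A short sketch of how the combined pattern behaves when a line has two quoted values, and one possible way to bound the match. The Host field and the names GREEDY and BOUNDED are invented for this illustration; the bounded pattern swaps ".*" for [^"]*, which is a change to the answer's regex, not part of it.

```python
import re

# Hypothetical input: the Host field is not in the question, it is added
# here only to show what happens with a second quoted value.
line = 'Name="Mycobacterium   smegmatis" Host="Homo sapiens" pcp=36789'

# Pattern from the answer: ".*" runs on to the last quote on the line.
GREEDY = r'(?:[^\s]*".*")|(?:[^\s]+)'

# Variant (an assumption, not the answer's regex): [^"]* stops at the
# next quote, so each quoted value stays in its own token.
BOUNDED = r'\S*"[^"]*"|\S+'

greedy_tokens = re.findall(GREEDY, line)
bounded_tokens = re.findall(BOUNDED, line)

print(greedy_tokens)
# ['Name="Mycobacterium   smegmatis" Host="Homo sapiens"', 'pcp=36789']
print(bounded_tokens)
# ['Name="Mycobacterium   smegmatis"', 'Host="Homo sapiens"', 'pcp=36789']
```

For the single-quoted-value string in the question, both patterns give the same four tokens.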

Licensed under: CC-BY-SA with attribution