You can simply split by index. You could either hardcode the indexes, or detect them:
# Fixed-width sample lines: every column is padded with spaces.
l = ["Blah blah, blah               bao     123456     ",
     "hello, hello, hello           miao    299292929  "]

def detect_column_indexes(list_of_lines):
    indexes = [0]
    # True for every character position that is a space in all lines
    transitions = [col.count(' ') == len(list_of_lines) for col in zip(*list_of_lines)]
    last = False
    for i, x in enumerate(transitions):
        # an all-blank position followed by a non-blank one starts a new column
        if not x and last:
            indexes.append(i)
        last = x
    # end marker; slicing past the end of a line is harmless
    indexes.append(len(list_of_lines[0]) + 1)
    return indexes

def split_line_by_indexes(indexes, line):
    tokens = []
    for i1, i2 in zip(indexes[:-1], indexes[1:]):  # pairs
        tokens.append(line[i1:i2].rstrip())
    return tokens

indexes = detect_column_indexes(l)
parsed = [split_line_by_indexes(indexes, line) for line in l]
print(indexes)
print(parsed)
output:
[0, 30, 38, 50]
[['Blah blah, blah', 'bao', '123456'], ['hello, hello, hello', 'miao', '299292929']]
Obviously, trailing whitespace in a column cannot be told apart from padding, but leading whitespace is preserved because the code uses rstrip instead of strip.
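To see what rstrip versus strip changes, here is a small sketch with hardcoded indexes. The sample line and the boundaries [0, 8, 14] are made up for illustration; the second field deliberately starts with a space.

```python
# Hypothetical fixed-width line: field 1 is 8 characters wide, field 2 is 6.
line = "abc".ljust(8) + " xy".ljust(6)
indexes = [0, 8, 14]  # hardcoded column boundaries (assumed, not detected)

# Slice the line into raw cells using consecutive index pairs.
cells = [line[i1:i2] for i1, i2 in zip(indexes[:-1], indexes[1:])]

with_rstrip = [c.rstrip() for c in cells]  # padding removed, leading space kept
with_strip = [c.strip() for c in cells]    # leading space removed as well

print(with_rstrip)  # ['abc', ' xy']
print(with_strip)   # ['abc', 'xy']
```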
This method is not foolproof (any position that happens to be blank in every line is treated as a column boundary), but it is more robust than splitting on two consecutive spaces.
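As a rough illustration of that last point, the sketch below compares the two approaches on made-up rows where one value fills its column, leaving only a single space before the next field. The detection here is a condensed variant of detect_column_indexes, written inline so the snippet runs on its own; the rows and variable names are assumptions for the example.

```python
import re

# Hypothetical fixed-width rows: "banana" leaves only one space of padding.
rows = ["apple".ljust(7) + "red".ljust(6),
        "banana".ljust(7) + "yellow"]

# Splitting on runs of two or more spaces misses the single-space gap.
by_spaces = [re.split(r" {2,}", row.rstrip()) for row in rows]

# Column detection: a new column starts where a position that is blank in
# every row is followed by one that is not.
blank = [all(ch == " " for ch in col) for col in zip(*rows)]
starts = [0] + [i for i in range(1, len(blank)) if blank[i - 1] and not blank[i]]
ends = starts[1:] + [len(rows[0])]
by_columns = [[row[a:b].rstrip() for a, b in zip(starts, ends)] for row in rows]

print(by_spaces)   # [['apple', 'red'], ['banana yellow']]
print(by_columns)  # [['apple', 'red'], ['banana', 'yellow']]
```

The second row is where the two-space split breaks down, while the column-based split still separates the fields.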