Efficiently partition a string at arbitrary index

https://stackoverflow.com/questions/20696084

20-09-2022

Question

Given an arbitrary string (i.e., not based on a pattern), say:

>>> string.ascii_letters
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

I am trying to partition a string based on a list of indexes.

Here is what I tried, which does work:

import string

def split_at_idx(txt, idx):
    new_li=[None]*2*len(idx)
    new_li[0::2]=idx
    new_li[1::2]=[e for e in idx]
    new_li=[0]+new_li+[len(txt)]
    new_li=[new_li[i:i+2] for i in range(0,len(new_li),2)]  
    print(new_li)
    return [txt[st:end] for st, end in new_li]

print(split_at_idx(string.ascii_letters, [3,10,12,40]))  
# ['abc', 'defghij', 'kl', 'mnopqrstuvwxyzABCDEFGHIJKLMN', 'OPQRSTUVWXYZ']

The split is based on a list of indexes [3,10,12,40]. This list then needs to be transformed into a list of start, end pairs like [[0, 3], [3, 10], [10, 12], [12, 40], [40, 52]]. I used a slice assignment to set the evens and odds, then a list comprehension to group into pairs and a second LC to return the partitions.

This seems a little complex for such a simple function. Is there a better / more efficient / more idiomatic way to do this?

Solution

I have a feeling someone asked this question very recently, but I can't find it now. Assuming that the dropped letters were an accident, couldn't you just do:

def split_at_idx(s, idx):
    return [s[i:j] for i,j in zip([0]+idx, idx+[None])]

after which we have

>>> split_at_idx(string.ascii_letters, [3, 10, 12, 40])
['abc', 'defghij', 'kl', 'mnopqrstuvwxyzABCDEFGHIJKLMN', 'OPQRSTUVWXYZ']
>>> split_at_idx(string.ascii_letters, [])
['abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ']
>>> split_at_idx(string.ascii_letters, [13, 26, 39])
['abcdefghijklm', 'nopqrstuvwxyz', 'ABCDEFGHIJKLM', 'NOPQRSTUVWXYZ']
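
To see why the zip works, it can help to look at the pairs it produces. This is only a minimal sketch; the helper name slice_bounds is made up for illustration and is not part of the answer. The trailing None makes the last slice run to the end of the string, since s[i:None] behaves like s[i:].

```python
import string

def slice_bounds(idx):
    # Pair each start with the following index; None means "to the end".
    return list(zip([0] + idx, idx + [None]))

def split_at_idx(s, idx):
    # Same slicing as the answer, just routed through the helper above.
    return [s[i:j] for i, j in slice_bounds(idx)]

bounds = slice_bounds([3, 10, 12, 40])
print(bounds)  # [(0, 3), (3, 10), (10, 12), (12, 40), (40, None)]

parts = split_at_idx(string.ascii_letters, [3, 10, 12, 40])
print(parts)
```

Because idx is concatenated with a list, this assumes idx is itself a list (a tuple would need converting first).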

OTHER TIPS

This seems like a job for itertools.groupby.

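from bisect import bisect_right
from itertools import groupby

def split_at_indices(text, indices):
    # Key each (position, char) pair by how many indices are <= position
    return [''.join(e[1] for e in g) for k, g in groupby(
        enumerate(text), key=lambda x: bisect_right(indices, x[0])
    )]

This needs groupby from itertools and bisect_right from the bisect module, and indices must be sorted for the binary search to work.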

This works the way you'd think an efficient implementation should: for each character in the string, a binary search over indices computes a number saying which string in the final list that character belongs to, and groupby then gathers consecutive characters that share the same number. In practice, though, it turns out to be slower than slicing in most cases, because array access is so quick.
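
If you want to check the speed difference yourself, something like the following rough sketch might do. The function names, the test string, and the choice of indices are all arbitrary assumptions made for this example, and the timings will vary by machine, so no numbers are shown.

```python
import string
import timeit
from bisect import bisect_right
from itertools import groupby

def split_by_slicing(s, idx):
    # Slice between consecutive indices; None closes the last slice.
    return [s[i:j] for i, j in zip([0] + idx, idx + [None])]

def split_by_groupby(text, indices):
    # Group characters by the number of indices <= their position.
    return [''.join(ch for _, ch in g)
            for _, g in groupby(enumerate(text),
                                key=lambda x: bisect_right(indices, x[0]))]

text = string.ascii_letters * 100          # arbitrary 5200-character input
indices = list(range(7, len(text), 7))     # arbitrary sorted, distinct indices

# Both approaches should agree on this input before timing them.
assert split_by_slicing(text, indices) == split_by_groupby(text, indices)

for fn in (split_by_slicing, split_by_groupby):
    elapsed = timeit.timeit(lambda: fn(text, indices), number=100)
    print(f"{fn.__name__}: {elapsed:.4f}s")
```

Note that the two only agree when the indices are sorted, distinct, and strictly inside the string: with a repeated index such as [3, 3], slicing yields an empty string between the pieces, while groupby never produces an empty group.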

Licensed under: CC-BY-SA with attribution
