Python tabstop-aware len() and padding functions
Question
Python's len() and padding functions like string.ljust() are not tabstop-aware, i.e. they treat '\t' like any other single-width character, and don't round len up to the nearest multiple of tabstop. Example:
len('Bear\tnecessities\t')
is 17 instead of 24 ( i.e. 4+(8-4)+11+(8-3) )
and say I also want a function pad_with_tabs(s)
such that
pad_with_tabs('Bear', 15) = 'Bear\t\t'
Looking for simple implementations of these - compactness and readability first, efficiency second. This is a basic but irritating question. @gnibbler - can you show a purely Pythonic solution, even if it's say 20x less efficient?
Sure you could convert back and forth using str.expandtabs(TABWIDTH), but that's clunky.
Importing math to get TABWIDTH * int( math.ceil(len(s)*1.0/TABWIDTH) )
also seems like massive overkill.
I couldn't manage anything more elegant than the following:
TABWIDTH = 8
def pad_with_tabs(s,maxlen):
s_len = len(s)
while s_len < maxlen:
s += '\t'
s_len += TABWIDTH - (s_len % TABWIDTH)
return s
and since Python strings are immutable and unless we want to monkey-patch our function into string module to add it as a method, we must also assign to the result of the function:
s = pad_with_tabs(s, ...)
In particular I couldn't get clean approaches using list-comprehension or string.join(...)
''.join([s, '\t' * ntabs])
without special-casing the cases where len(s) is < an integer multiple of TABWIDTH, or len(s)>=maxlen already.
Can anyone show better len() and pad_with_tabs() functions?
Solution
TABWIDTH=8
def my_len(s):
return len(s.expandtabs(TABWIDTH))
def pad_with_tabs(s,maxlen):
return s+"\t"*((maxlen-len(s)-1)/TABWIDTH+1)
Why did I use expandtabs()
?
Well it's fast
$ python -m timeit '"Bear\tnecessities\t".expandtabs()'
1000000 loops, best of 3: 0.602 usec per loop
$ python -m timeit 'for c in "Bear\tnecessities\t":pass'
100000 loops, best of 3: 2.32 usec per loop
$ python -m timeit '[c for c in "Bear\tnecessities\t"]'
100000 loops, best of 3: 4.17 usec per loop
$ python -m timeit 'map(None,"Bear\tnecessities\t")'
100000 loops, best of 3: 2.25 usec per loop
Anything that iterates over your string is going to be slower, because just the iteration is ~4 times slower than expandtabs
even when you do nothing in the loop.
$ python -m timeit '"Bear\tnecessities\t".split("\t")'
1000000 loops, best of 3: 0.868 usec per loop
Even just splitting on tabs takes longer. You'd still need to iterate over the split and pad each item to the tabstop
OTHER TIPS
I believe gnibbler's is the best for most prectical cases. But anyway, here is a naive (without accounting CR, LF etc) solution to compute the length of string without creating expanded copy:
def tab_aware_len(s, tabstop=8):
pos = -1
extra_length = 0
while True:
pos = s.find('\t', pos+1)
if pos<0:
return len(s) + extra_length
extra_length += tabstop - (pos+extra_length) % tabstop - 1
Probably it could be useful for some huge strings or even memory mapped files. And here is padding function a bit optimized:
def pad_with_tabs(s, max_len, tabstop=8):
length = tab_aware_len(s, tabstop)
if length<max_len:
s += '\t' * ((max_len-1)//tabstop + 1 - length//tabstop)
return s
TABWIDTH * int( math.ceil(len(s)*1.0/TABWIDTH) )
is indeed a massive over-kill; you can get the same result much more simply. For positive i
and n
, use:
def round_up_positive_int(i, n):
return ((i + n - 1) // n) * n
This procedure works in just about any language I've ever used, after appropriate translation.
Then you can do next_pos = round_up_positive_int(len(s), TABWIDTH)
For a slight increase in the elegance of your code, instead of
while(s_len < maxlen):
use this:
while s_len < maxlen: