Is len(text.split()) faster than text.count(" ")? And why so? - python
21-12-2019
Question
Given that I have 1,000,000,000 lines of ~20-100 tokens per line delimited by whitespace, counting the length of each line becomes somewhat non-trivial.
Assuming that there are never double whitespaces between two tokens, is len(text.split()) faster than text.count(" ") + 1? And why so?
Solution
Easy to check which is faster:
>python -m timeit -s "s='q w e r t y u i o p a s d f g h j k l'" "s.count(' ')+1"
1000000 loops, best of 3: 0.272 usec per loop
>python -m timeit -s "s='q w e r t y u i o p a s d f g h j k l'" "len(s.split())"
1000000 loops, best of 3: 0.653 usec per loop
split is slower, probably because it has to construct the resulting list.
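The same comparison can be reproduced from a script with the timeit module instead of the command line; the sample string below is just the one from the shell example above:

```python
import timeit

s = "q w e r t y u i o p a s d f g h j k l"

# Time each expression over a fixed number of iterations.
count_time = timeit.timeit(lambda: s.count(" ") + 1, number=1_000_000)
split_time = timeit.timeit(lambda: len(s.split()), number=1_000_000)

print(f"count: {count_time:.3f}s  split: {split_time:.3f}s")
```

Exact numbers vary by machine and Python version, but split is consistently the slower of the two.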
OTHER TIPS
text.count(" ")
is wrong, see below:
In [706]: t = 'a  b  c  '
In [707]: t.split()
Out[707]: ['a', 'b', 'c']
In [708]: t.count(' ')
Out[708]: 6
You don't want 6 in this case.
Your premise is incorrect: those two operations do not give the same results. Let's use your question as an example:
>>> text = "Given that I have 1,000,000,000 lines of ~20-100 tokens per line delimited by whitespace, counting the length of each line becomes sort of non-trival."
>>> len(text.split())
24
>>> text.count(" ")
23
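Note that the off-by-one disappears if the + 1 from the question is kept: under the stated assumption of single-space delimiters, N tokens are separated by N - 1 spaces. A quick check with the sentence from the question:

```python
# Under the question's assumption of single-space delimiters,
# count(" ") + 1 and len(split()) agree on the token count.
text = ("Given that I have 1,000,000,000 lines of ~20-100 tokens per "
        "line delimited by whitespace, counting the length of each "
        "line becomes sort of non-trival.")
print(len(text.split()))    # 24 tokens
print(text.count(" ") + 1)  # 23 spaces + 1
```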
Given your question of "counting the length of each line", neither of those operations even does that.
To count each line you need to do:
line_lengths = [len(line) for line in text.splitlines()]
But it would probably be better to also note the line number:
line_lengths = [(idx, len(line)) for idx, line in enumerate(text.splitlines())]
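With a billion lines, building these lists in memory is not practical. A minimal sketch of a streaming variant, assuming the input lives in a file (the path and function name here are hypothetical):

```python
def token_counts(path):
    """Yield (line_number, token_count) pairs, one line at a time."""
    with open(path) as fh:
        for idx, line in enumerate(fh):
            # len(line.split()) is robust to extra or trailing
            # whitespace; line.count(" ") + 1 is faster but relies
            # on strictly single-space delimiters.
            yield idx, len(line.split())
```

Iterating the file object directly keeps memory usage constant regardless of how many lines the file has.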