Question

Given that I have 1,000,000,000 lines of ~20-100 tokens per line delimited by whitespace, counting the length of each line becomes somewhat non-trivial.

Assuming that there are never double whitespaces between two tokens,

  1. Is len(text.split()) faster than text.count(" ")+1?

  2. And why so?


Solution

Easy to check which is faster:

>python -m timeit -s "s='q w e r t y u i o p a s d f g h j k l'" "s.count(' ')+1"
1000000 loops, best of 3: 0.272 usec per loop

>python -m timeit -s "s='q w e r t y u i o p a s d f g h j k l'" "len(s.split())"
1000000 loops, best of 3: 0.653 usec per loop

split is slower, probably because it has to construct the list of tokens (and a new string object for each token), whereas count only scans the string and returns an integer.
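The same comparison can be reproduced with the timeit module from a script rather than the command line. This is a minimal sketch; under the single-space assumption the two expressions agree, and the timings illustrate the allocation overhead of split:

```python
import timeit

line = "q w e r t y u i o p a s d f g h j k l"

# Under the "no double whitespace" assumption the two expressions agree.
assert line.count(" ") + 1 == len(line.split())

# str.count scans the string once and returns an int; str.split also
# scans once but additionally allocates a list plus one string object
# per token, which is where the extra time goes.
t_count = timeit.timeit("line.count(' ') + 1", globals={"line": line}, number=200_000)
t_split = timeit.timeit("len(line.split())", globals={"line": line}, number=200_000)
print(f"count: {t_count:.4f}s  split: {t_split:.4f}s")
```

The absolute numbers depend on your machine and Python version, but count should consistently come out ahead.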

OTHER TIPS

text.count(" ") is wrong, see below:

In [706]: t='a  b    c'

In [707]: t.split()
Out[707]: ['a', 'b', 'c']

In [708]: t.count(' ')
Out[708]: 6

You don't want 6 in this case.
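The reason split() stays correct is that, when called with no arguments, it treats any run of whitespace (spaces, tabs, newlines) as a single delimiter, while count(" ") counts every individual space. A short illustration:

```python
# split() with no arguments collapses runs of whitespace, so it
# remains correct even when the "no double spaces" assumption breaks.
messy = "a  b\tc   d"
print(len(messy.split()))    # → 4  (tokens: a, b, c, d)
print(messy.count(" ") + 1)  # → 6  (over-counts: every space adds one)
```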

Your premise is incorrect: those two operations do not give the same results. Let's use your question as an example:

>>> text = "Given that I have 1,000,000,000 lines of ~20-100 tokens per line delimited by whitespace, counting the length of each line becomes sort of non-trival."
>>> len(text.split())
24
>>> text.count(" ")
23

Given your stated goal of "counting the length of each line", neither of those operations even does that.

To count each line you need to do:

line_lengths = [len(line) for line in text.splitlines()]

But it would probably be better to also note the line number:

line_lengths = [(idx, len(line)) for idx, line in enumerate(text.splitlines())]
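At the scale in the question (~10^9 lines) you would not want to hold the whole text, let alone a list of per-line results, in memory. A hedged sketch of a streaming variant, assuming the data lives in a text file (the path and function name here are hypothetical):

```python
def token_counts(path):
    """Yield (line_number, token_count) pairs lazily, so memory use
    stays constant regardless of how many lines the file has."""
    with open(path, encoding="utf-8") as f:
        for idx, line in enumerate(f):
            # split() handles trailing newlines and repeated
            # whitespace, so no extra stripping is needed.
            yield idx, len(line.split())

# Usage: consume the generator incrementally instead of list()-ing it.
# for idx, n in token_counts("lines.txt"):
#     ...
```

Iterating over the file object reads it line by line, so this works even when the file is far larger than RAM.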
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow