Question

I wanted to test the timing difference between implementations of some simple code. I decided to count how many values out of a random sample of 10,000,000 numbers are greater than 0.5. The random sample is drawn uniformly from the range [0.0, 1.0).

Here is my code:

from numpy.random import random_sample
import time

n = 10000000

t1 = time.clock()

# Method 1: build the full array up front, then count in a Python loop
t = 0
z = random_sample(n)
for x in z:
    if x > 0.5: t += 1
print t

t2 = time.clock()

# Method 2: draw one sample per iteration, counting as we go
t = 0
for _ in xrange(n):
    if random_sample() > 0.5: t += 1
print t

t3 = time.clock()

# Method 3: vectorized comparison and sum, entirely in numpy
t = (random_sample(n) > 0.5).sum()
print t

t4 = time.clock()

print t2 - t1
print t3 - t2
print t4 - t3

This is the output:

4999445
4999511
5001498
7.0348236652
1.75569394301
0.202538106332

I get that the first implementation sucks because creating a massive array and then counting it element-wise is a bad idea, so I thought that the second implementation would be the most efficient.

But how is the third implementation 10 times faster than the second method? Doesn't the third method also create a massive array in the form of random_sample(n) and then go through it checking values against 0.5?

How is this third method different from the first, and why is it ~35 times faster?


EDIT: @merlin2011 suggested that Method 3 probably doesn't create the full array in memory. So, to test that theory, I tried the following:

z = random_sample(n)
t = (z > 0.5).sum()
print t

which runs in 0.197948451549 seconds, practically identical to Method 3. So this is probably not a factor.
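For a more stable measurement than back-to-back time.clock() calls, the same check can be written with timeit; a minimal sketch (the repeat count of 10 is arbitrary):

import timeit

setup = "from numpy.random import random_sample"

# Explicitly materialize the array, then reduce it...
print(timeit.timeit("z = random_sample(10000000); (z > 0.5).sum()",
                    setup=setup, number=10) / 10)

# ...versus generating and reducing in a single expression (Method 3).
print(timeit.timeit("(random_sample(10000000) > 0.5).sum()",
                    setup=setup, number=10) / 10)

If Method 3 really skipped the intermediate array, the second timing should come out clearly lower; the time.clock() numbers above suggest it won't.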

Was it helpful?

Solution

  1. Method 1 generates the full array in memory before using it. This is slow because the memory has to be allocated and then traversed, probably missing the cache multiple times.

  2. Method 2 uses a generator, which never creates the full array in memory but instead produces each element on demand.

  3. Method 3 is probably faster because sum() is implemented as a loop in C, though I am not 100% sure. My guess is that this is faster for the same reason that Matlab vectorization is faster than for loops in Matlab (see the sketch after this list).
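A minimal sketch of that C-loop-versus-Python-loop gap (not from the original post; absolute numbers will vary by machine), timing a pure-Python count against numpy's reduction over the same array:

import timeit

setup = "from numpy.random import random_sample; z = random_sample(1000000)"

# Pure Python: each element is boxed into a Python float and compared in bytecode.
print(timeit.timeit("sum(1 for x in z if x > 0.5)", setup=setup, number=3))

# NumPy: one C-level pass over the raw buffer.
print(timeit.timeit("(z > 0.5).sum()", setup=setup, number=3))

The two timings should show the same order-of-magnitude gap seen between Methods 1/2 and Method 3.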

Update: Separating out each of the three steps, I observe that Method 3 is still equally fast, so I have to agree with utdemir that each individual operator is executing instructions closer to machine code.

z = random_sample(n)   # step 1: generate the samples
z2 = z > 0.5           # step 2: vectorized comparison
t = z2.sum()           # step 3: C-level reduction

In each of the first two methods, you are invoking Python's standard functionality to do a loop, and this is much slower than a C-level loop that is baked into the implementation.
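As a side note, the boolean-sum idiom is not the only way to get this count at C speed; numpy's count_nonzero does the same job directly. A small sketch (equivalent in result, not part of the original answer):

import numpy as np
from numpy.random import random_sample

z = random_sample(10000000)

# Each of these runs as a single C-level pass; Python only dispatches the call.
t_sum = (z > 0.5).sum()            # True counts as 1 when summed
t_cnt = np.count_nonzero(z > 0.5)  # counts the True entries directly

assert t_sum == t_cnt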

Other tips

AFAIK

  • Function calls are expensive: in method two you're calling random_sample() 10,000,000 times, but in the third method you call it just once.
  • NumPy's > and .sum are optimized in C down to the last bit, most probably also using SIMD instructions to process several elements at a time.

So,

In method 2, you are comparing and looping in Python; in method 3, you're much closer to the processor, using optimized instructions to compare and sum, as the sketch below illustrates.
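A minimal sketch of the call-overhead point (numbers are machine-dependent): both versions produce the same ten million samples, but one pays Python's function-call cost per element while the other pays it once.

import timeit

setup = "from numpy.random import random_sample"

# One Python-level call per sample: interpreter dispatch dominates.
# (xrange as in the question; use range on Python 3.)
print(timeit.timeit("for _ in xrange(10000000): random_sample()",
                    setup=setup, number=1))

# A single call that generates every sample inside numpy's C code.
print(timeit.timeit("random_sample(10000000)", setup=setup, number=1))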

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow