Question

Once, after watching Mike Muller's performance optimization tutorial (I think this one), one thought stuck in my head: if performance matters, minimize accessing items by index inside a loop, e.g. if you need to access x[1] multiple times in a loop for x in l, assign x[1] to a variable and reuse that variable in the loop body.
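For context, a minimal sketch of that advice (the variable names here are illustrative, not from the tutorial):

```python
points = [(1, 2), (3, 4), (5, 6)]

# repeated indexing: p[1] is subscripted twice per iteration
doubled = [p[1] + p[1] for p in points]

# hoisted: one index operation per iteration, then reuse the local
hoisted = []
for p in points:
    y = p[1]          # bind once, reuse below
    hoisted.append(y + y)

print(doubled, hoisted)  # both are [4, 8, 12]
```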

Now I have this synthetic example:

import timeit

SEQUENCE = zip(range(1000), range(1, 1001))

def no_unpacking():
    return [item[0] + item[1] for item in SEQUENCE]


def unpacking():    
    return [a + b for a, b in SEQUENCE]


print timeit.Timer('no_unpacking()', 'from __main__ import no_unpacking').timeit(10000)
print timeit.Timer('unpacking()', 'from __main__ import unpacking').timeit(10000)

unpacking() and no_unpacking() return the same result, but their implementations differ: unpacking() unpacks each item into a and b inside the loop; no_unpacking() fetches the values by index.

For Python 2.7, this shows:

1.25280499458
0.946601867676

in other words, unpacking() outperforms no_unpacking() by ~25%.
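Note that the benchmark as written only works on Python 2, where zip() returns a list. A Python 3 port (a sketch, not part of the original post) has to materialize the zip once, since there it returns a one-shot iterator that the first timing run would exhaust:

```python
import timeit

# In Python 3, zip() is a lazy one-shot iterator, so materialize it
# once up front; otherwise every call after the first sees it empty.
SEQUENCE = list(zip(range(1000), range(1, 1001)))

def no_unpacking():
    return [item[0] + item[1] for item in SEQUENCE]

def unpacking():
    return [a + b for a, b in SEQUENCE]

print(timeit.timeit(no_unpacking, number=10000))
print(timeit.timeit(unpacking, number=10000))
```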

The question is:

  • why does accessing by index slow things down significantly (even in this simple case)?

Bonus question:

  • I've also tried this on pypy - there is almost no difference between these 2 functions, performance-wise. Why is that?

Thanks for any help.


Solution

To answer your questions we can check the bytecode generated by the two functions using the dis module:

In [5]: def no_unpacking():
   ...:     s = []
   ...:     for item in SEQUENCE:
   ...:         s.append(item[0] + item[1])
   ...:     return s
   ...: 
   ...: 
   ...: def unpacking():  
   ...:     s = []
   ...:     for a,b in SEQUENCE:
   ...:         s.append(a+b)  
   ...:     return s

I have expanded the list comprehension because in Python 3 it would be more cumbersome to inspect the interesting bytecodes (there, a comprehension is compiled into its own code object). The code is equivalent, so it doesn't matter for our purpose.

The bytecode for the first function is:

In [6]: dis.dis(no_unpacking)
  2           0 BUILD_LIST               0 
              3 STORE_FAST               0 (s) 

  3           6 SETUP_LOOP              39 (to 48) 
              9 LOAD_GLOBAL              0 (SEQUENCE) 
             12 GET_ITER             
        >>   13 FOR_ITER                31 (to 47) 
             16 STORE_FAST               1 (item) 

  4          19 LOAD_FAST                0 (s) 
             22 LOAD_ATTR                1 (append) 
             25 LOAD_FAST                1 (item) 
             28 LOAD_CONST               1 (0) 
             31 BINARY_SUBSCR        
             32 LOAD_FAST                1 (item) 
             35 LOAD_CONST               2 (1) 
             38 BINARY_SUBSCR        
             39 BINARY_ADD           
             40 CALL_FUNCTION            1 (1 positional, 0 keyword pair) 
             43 POP_TOP              
             44 JUMP_ABSOLUTE           13 
        >>   47 POP_BLOCK            

  5     >>   48 LOAD_FAST                0 (s) 
             51 RETURN_VALUE      

Note that the loop has to call BINARY_SUBSCR twice to access the two elements of the tuple.

The bytecode for the second function is:

In [7]: dis.dis(unpacking)
  9           0 BUILD_LIST               0 
              3 STORE_FAST               0 (s) 

 10           6 SETUP_LOOP              37 (to 46) 
              9 LOAD_GLOBAL              0 (SEQUENCE) 
             12 GET_ITER             
        >>   13 FOR_ITER                29 (to 45) 
             16 UNPACK_SEQUENCE          2 
             19 STORE_FAST               1 (a) 
             22 STORE_FAST               2 (b) 

 11          25 LOAD_FAST                0 (s) 
             28 LOAD_ATTR                1 (append) 
             31 LOAD_FAST                1 (a) 
             34 LOAD_FAST                2 (b) 
             37 BINARY_ADD           
             38 CALL_FUNCTION            1 (1 positional, 0 keyword pair) 
             41 POP_TOP              
             42 JUMP_ABSOLUTE           13 
        >>   45 POP_BLOCK            

 12     >>   46 LOAD_FAST                0 (s) 
             49 RETURN_VALUE 

Note how there is no BINARY_SUBSCR to perform.

So it seems that UNPACK_SEQUENCE plus one extra STORE_FAST (the additional operations introduced by unpacking) are faster than performing two BINARY_SUBSCRs. This is reasonable, since BINARY_SUBSCR is a full-blown method call, while UNPACK_SEQUENCE and STORE_FAST are simpler operations.
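You can isolate the two access patterns on a single tuple with timeit (a hypothetical micro-benchmark; absolute numbers will vary by machine and interpreter, but the ordering should match the result above):

```python
import timeit

# Same work either way: read both elements of a 2-tuple into locals.
setup = "t = (1, 2)"

by_index = timeit.timeit("x = t[0]; y = t[1]", setup=setup, number=1000000)
by_unpack = timeit.timeit("x, y = t", setup=setup, number=1000000)

print("indexing :", by_index)
print("unpacking:", by_unpack)
```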

You can see the difference even in simpler cases:

In [1]: def iter_with_index(s):
   ...:     for i in range(len(s)):
   ...:         s[i]
   ...:         

In [2]: def iter_without_index(s):
   ...:     for el in s:el
   ...:     

In [3]: %%timeit s = 'a' * 10000
   ...: iter_with_index(s)
   ...: 
1000 loops, best of 3: 583 us per loop

In [4]: %%timeit s = 'a' * 10000
   ...: iter_without_index(s)
   ...: 
1000 loops, best of 3: 206 us per loop

As you can see, iterating over a string with explicit indexing is about three times slower. That overhead comes entirely from the BINARY_SUBSCR calls.

Regarding your second question: PyPy has a JIT that analyzes the running code and produces an optimized version that avoids the overhead of the indexing operations. When it sees that the subscripting is performed on tuples, it can presumably emit code that accesses the elements directly instead of going through a generic dispatch, removing the BINARY_SUBSCR operations entirely.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow