Question

The context: my Python code passes arrays of 2D vertices to OpenGL.

I tested two approaches, one with ctypes and the other with struct; the latter is more than twice as fast.

from random import random
points = [(random(), random()) for _ in xrange(1000)]

from ctypes import c_float
def array_ctypes(points):
    n = len(points)
    return n, (c_float*(2*n))(*[u for point in points for u in point])

from struct import pack
def array_struct(points):
    n = len(points)
    return n, pack("f"*2*n, *[u for point in points for u in point])

Is there any other alternative? Any hints on how to speed up such code? (And yes, this is a bottleneck in my code.)

Solution

You could try Cython. For me, this gives:

function       usec per loop:
               Python  Cython
array_ctypes   1370    1220
array_struct    384     249
array_numpy     336     339

So NumPy gives only a 15% benefit on my hardware (an old laptop running Windows XP), whereas Cython gives about 35% (without adding any dependency to your distributed code).

If you can loosen your requirement that each point is a tuple of floats, and simply make 'points' a flattened list of floats:

def array_struct_flat(points):
    n = len(points)
    return pack(
        "f"*n,
        *[
            coord
            for coord in points
        ]
    )

points = [random() for _ in xrange(1000 * 2)]

then the resulting output is the same, but the timing goes down further:

function            usec per loop:
                    Python  Cython
array_struct_flat           157
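
The claim that the output is the same can be checked with a small sketch like the one below. It is plain Python (no Cython), uses range so that it also runs on Python 3, and the flat version returns only the packed bytes, as in the answer above. The names pairs and flat are just illustrative.

```python
from random import random
from struct import pack

def array_struct(points):
    # points: list of (x, y) tuples
    n = len(points)
    return n, pack("f" * 2 * n, *[u for point in points for u in point])

def array_struct_flat(points):
    # points: flat list [x0, y0, x1, y1, ...]
    return pack("f" * len(points), *points)

pairs = [(random(), random()) for _ in range(1000)]
# Flatten the same data so both functions see identical coordinates.
flat = [u for point in pairs for u in point]

n, packed_pairs = array_struct(pairs)
packed_flat = array_struct_flat(flat)
same = packed_pairs == packed_flat
```

If same is True, the two layouts are interchangeable as far as the packed bytes are concerned, so the only cost of switching is changing how the rest of the code stores its points.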

Cython may be capable of substantially better than this if someone smarter than me adds static type declarations to the code. (Running 'cython -a test.pyx' is invaluable for this: it produces an HTML file showing where the slowest plain Python is in your code (yellow) versus Python that has been converted to pure C (white). The coloring is done per line, which is why I spread the code above over so many lines.)

Full Cython instructions are here: http://docs.cython.org/src/quickstart/build.html

Cython might produce similar performance benefits across your whole codebase, and in ideal conditions, with proper static typing applied, can improve speed by factors of ten or a hundred.

OTHER TIPS

You can pass NumPy arrays to PyOpenGL without incurring any overhead. (The data attribute of a NumPy array is a buffer that points to the underlying C data structure, which contains the same information as the array you're building.)

import numpy as np  
def array_numpy(points):
    n = len(points)
    return n, np.array(points, dtype=np.float32)

On my computer, this is about 40% faster than the struct-based approach.

There's another idea I stumbled across. I don't have time to profile it right now, but in case someone else does:

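 from ctypes import c_float
 from random import random

 # untested, but I'm fairly confident it runs
 # using 'flattened points' list, i.e. a list of n*2 floats
 points = [random() for _ in range(1000 * 2)]
 c_array = (c_float * len(points))()  # allocate the array, zero-filled
 c_array[:] = points                  # populate it via slice assignment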

That is, first we create the ctypes array without populating it, then we populate it using slice notation. People smarter than I am tell me that assigning to a slice like this may help performance. It lets us pass a list (or another sequence of matching length) directly on the RHS of the assignment, without having to use the *iterable syntax, which would perform some intermediate wrangling of the iterable. I suspect that this is what happens in the depths of creating pyglet's Batches.

Presumably you could just create c_array once, then just reassign to it (the final line in the above code) every time the points list changes.
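
A minimal sketch of that "allocate once, refill on change" idea follows. PointBuffer and its update method are hypothetical names made up for illustration, not part of ctypes or pyglet, and I haven't profiled it.

```python
from ctypes import c_float, sizeof

class PointBuffer(object):
    """Hypothetical helper: a ctypes float array allocated once and refilled."""

    def __init__(self, n_points):
        self.n_points = n_points
        # Two floats (x, y) per point; a new ctypes array starts zero-filled.
        self.data = (c_float * (2 * n_points))()

    def update(self, flat_points):
        # Slice assignment needs a sequence of exactly the same length,
        # otherwise ctypes raises ValueError.
        self.data[:] = flat_points
        return self.n_points, self.data

buf = PointBuffer(3)
n, arr = buf.update([0.0, 0.5, 1.0, 1.5, 2.0, 2.5])
first = list(arr)
n, arr = buf.update([9.0, 8.0, 7.0, 6.0, 5.0, 4.0])
second = list(arr)
```

The same ctypes object is reused on every update, so this only fits if the number of points stays fixed between frames. If it changes, the array has to be reallocated.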

There is probably an alternative formulation which accepts the original definition of points (a list of (x,y) tuples.) Something like this:

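 from ctypes import c_float
 from itertools import chain
 from random import random
 # untested sketch
 # using a list of n tuples of two floats
 points = [(random(), random()) for _ in range(1000)]
 c_array = (c_float * (2 * len(points)))()
 # slice assignment needs a sequence with a length, so materialize the chain
 c_array[:] = list(chain.from_iterable(points))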

If performance is an issue, you do not want to use ctypes arrays with the star operation (e.g., (ctypes.c_float * size)(*t)).

In my test, pack is fastest, followed by the array module with a cast of the address (or using the from_buffer function).

import timeit
repeat = 100
setup="from struct import pack; from random import random; import numpy;  from array import array; import ctypes; t = [random() for _ in range(2* 1000)];"
print(timeit.timeit(stmt="v = array('f',t); addr, count = v.buffer_info();x = ctypes.cast(addr,ctypes.POINTER(ctypes.c_float))",setup=setup,number=repeat))
print(timeit.timeit(stmt="v = array('f',t);a = (ctypes.c_float * len(v)).from_buffer(v)",setup=setup,number=repeat))
print(timeit.timeit(stmt='x = (ctypes.c_float * len(t))(*t)',setup=setup,number=repeat))
print(timeit.timeit(stmt="x = pack('f'*len(t), *t);",setup=setup,number=repeat))
print(timeit.timeit(stmt='x = (ctypes.c_float * len(t))(); x[:] = t',setup=setup,number=repeat))
print(timeit.timeit(stmt='x = numpy.array(t,numpy.float32).data',setup=setup,number=repeat))

The array.array approach is slightly faster than Jonathan Hartley's approach in my test, while the numpy approach runs at about half the speed:

python3 convert.py
0.004665990360081196
0.004661010578274727
0.026358536444604397
0.0028003649786114693
0.005843495950102806
0.009067213162779808

The net winner is pack.
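
For readability, here is the array plus from_buffer variant from the timings written out as a function that accepts the question's list of (x, y) tuples. This is only a sketch: array_from_buffer is a name I made up, and the flattening step is extra work that the timings above (which start from a flat list) do not include.

```python
from array import array
import ctypes

def array_from_buffer(points):
    # points: list of (x, y) tuples; flatten into a typed array of C floats.
    flat = array("f", (u for point in points for u in point))
    # Wrap the array's memory as a ctypes float array, sharing the buffer
    # instead of copying it.
    c_arr = (ctypes.c_float * len(flat)).from_buffer(flat)
    return len(points), c_arr

n, c_arr = array_from_buffer([(0.0, 0.5), (1.0, 1.5)])
values = list(c_arr)
```

The returned ctypes array can then be handed to whatever call expects a pointer to floats.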

You can use array (notice also the generator expression instead of the list comprehension; on Python 3 use tobytes(), since tostring() was removed in Python 3.9):

array("f", (u for point in points for u in point)).tostring()

Another optimization would be to keep the points flattened from the beginning.
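
One way to keep the points flattened from the beginning, sketched under the assumption that the rest of the code can live with an [x0, y0, x1, y1, ...] layout. add_point is a hypothetical helper, not something from the question.

```python
from array import array

# Store coordinates flat from the start: [x0, y0, x1, y1, ...]
coords = array("f")

def add_point(coords, x, y):
    # Hypothetical helper: append one 2D point to the flat array.
    coords.extend((x, y))

add_point(coords, 0.0, 0.5)
add_point(coords, 1.0, 1.5)

n = len(coords) // 2          # number of points
raw = coords.tobytes()        # Python 3; on Python 2 use coords.tostring()
```

With this layout there is no per-call flattening at all: the bytes are produced straight from the array's own storage.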

Licensed under: CC-BY-SA with attribution