Question

I have a string of 20 bytes, and I would like to convert it to a ctypes.c_ubyte array for bit field manipulation purposes.

 import ctypes
 str_bytes = '01234567890123456789'
 byte_arr = bytearray(str_bytes)
 raw_bytes = (ctypes.c_ubyte*20)(*(byte_arr))

Is there a way to avoid a deep copy from str to bytearray for the sake of the cast?

Alternatively, is it possible to convert a string to a bytearray without a deep copy? (With techniques like memoryview?)

I am using Python 2.7.

Performance results:

Using eryksun's and Brian Larsen's suggestions, here are the benchmarks under a VirtualBox VM with Ubuntu 12.04 and Python 2.7.

  • method1 uses my original post
  • method2 uses ctype from_buffer_copy
  • method3 uses ctype cast/POINTER
  • method4 uses numpy

Results:

  • method1 takes 3.87sec
  • method2 takes 0.42sec
  • method3 takes 1.44sec
  • method4 takes 8.79sec

Code:

import ctypes
import time
import numpy

str_bytes = '01234567890123456789'

def method1():
    result = ''
    t0 = time.clock()
    for x in xrange(0,1000000):     
        byte_arr = bytearray(str_bytes)
        result = (ctypes.c_ubyte*20)(*(byte_arr))

    t1 = time.clock()
    print(t1-t0)

    return result

def method2():

    result = ''
    t0 = time.clock()
    for x in xrange(0,1000000):     
        result = (ctypes.c_ubyte * 20).from_buffer_copy(str_bytes)

    t1 = time.clock()
    print(t1-t0)

    return result

def method3():

    result = ''
    t0 = time.clock()
    for x in xrange(0,1000000):     
        result = ctypes.cast(str_bytes, ctypes.POINTER(ctypes.c_ubyte * 20))[0]

    t1 = time.clock()
    print(t1-t0)

    return result

def method4():

    result = ''
    t0 = time.clock()
    for x in xrange(0,1000000):     
        arr = numpy.asarray(str_bytes)
        result = arr.ctypes.data_as(ctypes.POINTER(ctypes.c_ubyte*len(str_bytes)))

    t1 = time.clock()
    print(t1-t0)

    return result

print(method1())
print(method2())
print(method3())
print(method4())

Solution

I don't think that's working the way you think it is. bytearray creates a copy of the string. Then the interpreter unpacks the bytearray sequence into a starargs tuple and merges this into another new tuple that has the other args (even though there are none in this case). Finally, the c_ubyte array initializer loops over the args tuple to set the elements of the c_ubyte array. That's a lot of work, and a lot of copying, just to initialize the array.

Instead you can use the from_buffer_copy method, assuming the string is a bytestring with the buffer interface (not unicode):

import ctypes    
str_bytes = '01234567890123456789'
raw_bytes = (ctypes.c_ubyte * 20).from_buffer_copy(str_bytes)

That still has to copy the string, but it's only done once, and much more efficiently. As was stated in the comments, a Python string is immutable and could be interned or used as a dict key. Its immutability should be respected, even if ctypes lets you violate this in practice:

>>> from ctypes import *
>>> s = '01234567890123456789'
>>> b = cast(s, POINTER(c_ubyte * 20))[0]
>>> b[0] = 97
>>> s
'a1234567890123456789'

Edit

I need to emphasize that I am not recommending using ctypes to modify an immutable CPython string. If you have to, then at the very least check sys.getrefcount beforehand to ensure that the reference count is 2 or less (the call adds 1). Otherwise, you will eventually be surprised by string interning for names (e.g. "sys") and code object constants. Python is free to reuse immutable objects as it sees fit. If you step outside of the language to mutate an 'immutable' object, you've broken the contract.
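A minimal sketch of that guard (assuming CPython, where sys.getrefcount reports one extra temporary reference for its own argument):

```python
import sys

# Built at runtime via join, so it is not interned the way a literal is.
s = ''.join(['01234567890', '123456789'])

# 2 means: the one reference held by `s`, plus getrefcount's temporary.
# Anything higher means other code may be sharing this exact object.
assert sys.getrefcount(s) == 2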

For example, if you modify an already-hashed string, the cached hash is no longer correct for the contents. That breaks it for use as a dict key. Neither another string with the new contents nor one with the original contents will match the key in the dict. The former has a different hash, and the latter has a different value. Then the only way to get at the dict item is by using the mutated string that has the incorrect hash. Continuing from the previous example:

>>> s
'a1234567890123456789'
>>> d = {s: 1}
>>> d[s]
1

>>> d['a1234567890123456789']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'a1234567890123456789'

>>> d['01234567890123456789']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: '01234567890123456789'

Now consider the mess if the key is an interned string that's reused in dozens of places.


For performance analysis it's typical to use the timeit module. Prior to 3.3, timeit.default_timer varies by platform. On POSIX systems it's time.time, and on Windows it's time.clock.

import timeit

setup = r'''
import ctypes, numpy
str_bytes = '01234567890123456789'
arr_t = ctypes.c_ubyte * 20
'''

methods = [
  'arr_t(*bytearray(str_bytes))',
  'arr_t.from_buffer_copy(str_bytes)',
  'ctypes.cast(str_bytes, ctypes.POINTER(arr_t))[0]',
  'numpy.asarray(str_bytes).ctypes.data_as('
      'ctypes.POINTER(arr_t))[0]',
]

test = lambda m: min(timeit.repeat(m, setup))

>>> tabs = [test(m) for m in methods]
>>> trel = [t / tabs[0] for t in tabs]
>>> trel
[1.0, 0.060573711879182784, 0.261847116395079, 1.5389279092185282]

OTHER TIPS

Here is another solution for you to benchmark (I would be very interested in the results).

Using numpy might add some simplicity depending on what the whole code looks like.

import numpy as np
import ctypes
str_bytes = '01234567890123456789'
arr = np.asarray(str_bytes)
aa = arr.ctypes.data_as(ctypes.POINTER(ctypes.c_ubyte*len(str_bytes)))
for v in aa.contents: print v
48
49
50
51
52
53
54
55
56
57
48
49
50
51
52
53
54
55
56
57
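On the numpy side, a variant worth noting (my own suggestion, not benchmarked above): numpy.frombuffer builds a read-only uint8 view over the string's existing buffer, rather than the 0-dimensional 'S20' array that asarray produces, so the byte data itself is never copied:

```python
import ctypes
import numpy as np

str_bytes = b'01234567890123456789'

# Zero-copy, read-only uint8 view sharing the string's buffer.
arr = np.frombuffer(str_bytes, dtype=np.uint8)
aa = arr.ctypes.data_as(ctypes.POINTER(ctypes.c_ubyte * len(str_bytes)))

print(list(aa.contents))    # ord values 48..57, repeated twice
```

Since the view aliases an immutable string, it should be treated as read-only, just like the cast approach in the accepted answer.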
Licensed under: CC-BY-SA with attribution