Your code seems very complicated to me, and some parts are not clear. For instance, I don't see why you create only one work item:
prg.try_this7(queue, (1,), None,...)
This forces you to loop over your strings inside the kernel instead of using the available parallelism. Anyhow, if I understand correctly, you want to send some strings to the GPU, copy them into another buffer, read them back on the host side and display them.
If that's the case, here is a version using only numpy and, of course, pyopencl:
import numpy as np
import pyopencl as cl
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
# The kernel uses one work item per char transferred
prog_str = """kernel void foo(global char *in, global char *out, int size){
    int idx = get_global_id(0);
    if (idx < size){
        out[idx] = in[idx];
    }
}"""
prog = cl.Program(ctx, prog_str).build()
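# Note that the dtype of the array of strings is '|S40' (Python 2 byte
# strings) because the longest (third) element has 40 characters;
# the shape is (3,) and nbytes is 120 (3 * 40)
original_str = np.array(('this is an average string',
                         'and another one',
                         "let's push even more with a third string"))
# Total number of bytes to copy; used for the buffer size and the global size
str_size = original_str.nbytes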
mf = cl.mem_flags
in_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=original_str)
out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, size=str_size)
copied_str = np.zeros_like(original_str)
# Launch the kernel with str_size work items, 120 in this case.
# This means that some of the work items only copy padding bytes
# (not all strings have a length of 40), but that's no big deal
prog.foo(queue, (str_size,), None, in_buf, out_buf, np.int32(str_size))
cl.enqueue_copy(queue, copied_str, out_buf).wait()
print(copied_str)
And the displayed result:
['this is an average string' 'and another one'
"let's push even more with a third string"]
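To make the "one work item per char" idea concrete without a GPU, here is a rough host-side emulation in plain Python. It is only a sketch: the names `work_item`, `flat_in` and `flat_out` are made up for illustration, the padding with NUL bytes mimics what a fixed-width 'S40' array does, and the loop stands in for work items that the device would run concurrently.

```python
# Plain-Python emulation of the copy kernel (illustrative names, no OpenCL).
strings = [b"this is an average string",
           b"and another one",
           b"let's push even more with a third string"]

# Fixed-width records, like the 'S40' dtype: pad every string to the longest.
width = max(len(s) for s in strings)                          # 40
flat_in = b"".join(s.ljust(width, b"\0") for s in strings)    # NUL padded
size = len(flat_in)                                           # 3 * 40 = 120

flat_out = bytearray(size)

def work_item(idx):
    """Emulates the kernel body for one global id: copy a single byte."""
    if idx < size:
        flat_out[idx] = flat_in[idx]

# On the device these iterations would run in parallel, one per work item.
for gid in range(size):
    work_item(gid)

# Split the flat buffer back into records and drop the padding bytes.
copied = [bytes(flat_out[i:i + width]).rstrip(b"\0")
          for i in range(0, size, width)]
print(copied)
```

The work items that land on padding bytes just copy zeros, which is why launching 120 of them is harmless even though only the third string fills its 40-byte slot.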