Question

When I use the script below to copy the data back to the CPU, I get an error. There is no error when I write values to "ref" unconditionally, like this:

ref[1] = 255; ref[0] = 255; ref[2] = 255;

but if I do something like this:

if (verschil.a[idx+idy*640]>5){
  ref[1] = 255; ref[0] = 255; ref[2] = 255;
}

the error message I get is:

Traceback (most recent call last):
File "./zwartwit.py", line 159, in <module>
verwerking(cuda.InOut(refe),cuda.InOut(frame), block=(640, 1, 1))
File "/usr/lib/python2.7/dist-packages/pycuda/driver.py", line 374, in function_call
func._launch_kernel(grid, block, arg_buf, shared, None)
pycuda._driver.LogicError: cuLaunchKernel failed: invalid value

Thanks for the help!

PS: this is a simplified version of the script I was talking about. To get the same error, the // comment markers must be removed.

import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy
import cv2
from time import time,sleep

mod = SourceModule("""
struct legear{ int a[307200];};
__global__ void totaal(int *ref){
  int idx = threadIdx.x + blockIdx.x * blockDim.x;
  legear test;
  for (int idy=0;idy<480;idy++){
    if (idy < 480){
      if (idx < 640){
        if (ref[idx*3+idy*640*3]>100){
          test.a[idx+idy*640] = 255;
        }
        //if (test.a[idx+idy*640] == 255){
          ref[idx*3+idy*640*3] = 255; ref[idx*3+idy*640*3+1] = 255; ref[idx*3+idy*640*3+2] = 255;
        //}
      }
    }
  }
}
""")

camera = cv2.VideoCapture(0)
im2 = numpy.zeros((768, 1024, 1 ),dtype=numpy.uint8)
cv2.imshow("projector", im2)
key = cv2.waitKey(100)
for i in range(0,8):
  refe = camera.read()[1]
im2[500:502] = [100]
cv2.imshow("projector", im2)
key = cv2.waitKey(100)
verwerking = mod.get_function("totaal")
refe = refe.astype(numpy.int32)
verwerking(cuda.InOut(refe), block=(640, 1, 1))
refe = refe.astype(numpy.uint8)
cv2.imshow("test", refe)
cv2.waitKey(200)
raw_input()

Solution

The basic problem here is the size of test inside your kernel. As you have written it, every thread requires 1228800 bytes of local memory (307200 ints of 4 bytes each). The runtime must reserve that memory for every thread, so your code would need about 750 MB of free device memory for local memory alone to support the 640 threads per block you are trying to launch. My guess is that your device doesn't have that much free memory.
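For reference, the arithmetic behind those two figures can be checked in a few lines. This is only a back-of-the-envelope sketch; the variable names are mine, and it assumes a 4-byte int on the device:

```python
# Per-thread and per-block local memory needed by `legear test`.
INT_BYTES = 4                   # assumption: sizeof(int) == 4 on the device
ELEMENTS = 640 * 480            # the struct holds int a[307200]
THREADS_PER_BLOCK = 640         # from block=(640, 1, 1) in the launch

bytes_per_thread = ELEMENTS * INT_BYTES
total_bytes = bytes_per_thread * THREADS_PER_BLOCK

print(bytes_per_thread)         # 1228800, matching the ptxas stack frame size
print(total_bytes / 2**20)      # 750.0 (in units of 2**20 bytes)
```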

The reason the code you have shown works without the if statement is compiler optimisation: in that case test isn't actually used for anything, so the compiler simply removes it from the code. That eliminates the huge local memory footprint of the kernel and allows it to run. When you uncomment the if statement, test determines whether a global memory write happens, so the compiler cannot optimise it away and the kernel requires a large amount of local memory to run.

This is the compiler output I see with the kernel code as you posted it:

> nvcc -arch=sm_21 -Xptxas="-v" -m32 -c wnkr_py.cu
wnkr_py.cu
wnkr_py.cu(7): warning: variable "test" was set but never used

tmpxft_00000394_00000000-5_wnkr_py.cudafe1.gpu
tmpxft_00000394_00000000-10_wnkr_py.cudafe2.gpu
wnkr_py.cu
wnkr_py.cu(7): warning: variable "test" was set but never used

ptxas : info : 0 bytes gmem
ptxas : info : Compiling entry function '_Z6totaalPi' for 'sm_21'
ptxas : info : Function properties for _Z6totaalPi
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas : info : Used 8 registers, 36 bytes cmem[0], 4 bytes cmem[16]
tmpxft_00000394_00000000-5_wnkr_py.cudafe1.cpp
tmpxft_00000394_00000000-15_wnkr_py.ii

Note the compiler warning and the stack frame size.

With the if statement active:

>nvcc -arch=sm_21 -Xptxas="-v" -m32 -c wnkr_py.cu
wnkr_py.cu
tmpxft_000017c8_00000000-5_wnkr_py.cudafe1.gpu
tmpxft_000017c8_00000000-10_wnkr_py.cudafe2.gpu
wnkr_py.cu
ptxas : info : 0 bytes gmem
ptxas : info : Compiling entry function '_Z6totaalPi' for 'sm_21'
ptxas : info : Function properties for _Z6totaalPi
    1228800 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas : info : Used 7 registers, 36 bytes cmem[0]
tmpxft_000017c8_00000000-5_wnkr_py.cudafe1.cpp
tmpxft_000017c8_00000000-15_wnkr_py.ii

Note the stack frame size changes to 1228800 bytes per thread.

My quick reading of the code suggests that test doesn't need to be anything like as large as you have defined it for the code to run, but I leave the required size as an exercise for the reader.
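One possible direction, offered as a sketch rather than the definitive fix: in the posted kernel each thread only reads back the element it wrote in the same loop iteration, so a single per-thread int would seem to be enough. The names flag, base, KERNEL_SOURCE, totaal_reference and run_on_gpu below are hypothetical, and the sketch assumes an unset flag should count as 0 (the original leaves that element uninitialised). The NumPy function mirrors the same per-pixel logic so the result can be checked on the CPU.

```python
import numpy as np

# Hypothetical rewrite of the kernel: the scalar `flag` replaces the
# 307200-element array, so no large per-thread stack frame is needed.
KERNEL_SOURCE = """
__global__ void totaal(int *ref)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= 640) return;
    for (int idy = 0; idy < 480; idy++) {
        int base = (idx + idy * 640) * 3;
        int flag = (ref[base] > 100) ? 255 : 0;  // assumption: unset means 0
        if (flag == 255) {
            ref[base] = 255; ref[base + 1] = 255; ref[base + 2] = 255;
        }
    }
}
"""

def totaal_reference(ref):
    """CPU reference: set a pixel to 255 where its first channel exceeds 100."""
    out = ref.copy()
    mask = out[:, :, 0] > 100
    out[mask] = 255
    return out

def run_on_gpu(ref):
    """Launch the kernel in place on a contiguous int32 array of shape (480, 640, 3).

    Needs a CUDA device; PyCUDA is imported here so the rest of the
    module can be used without one.
    """
    import pycuda.autoinit  # creates a context on the default device
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule

    func = SourceModule(KERNEL_SOURCE).get_function("totaal")
    func(cuda.InOut(ref), block=(640, 1, 1))
    return ref
```

Comparing run_on_gpu(frame.copy()) against totaal_reference(frame) on a captured frame is one way to confirm the rewritten kernel behaves as intended.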

Licensed under: CC-BY-SA with attribution