Users need GPU drivers installed, since the drivers provide the OpenCL runtime. For an Intel CPU, they may have to download and install the necessary binaries manually.
AMD's device compiler takes some time, while Nvidia's compiles quickly. Compile time is very low when you target the CPU. I converted a basic C++ fluid and raytracer simulation into an OpenCL version and it took 3 minutes to compile! (I mean the device-side OpenCL C compilation of the kernels.) If you want to give people an already-compiled project, you would need access to every single type of card and would have to compile and save binaries for all of them.
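If you go the saved-binaries route, each binary has to be stored under a name that ties it to the device and driver it was built for, because a binary from one card or driver is not guaranteed to load on another. A minimal sketch of such a name, assuming the strings were read with clGetDeviceInfo (CL_DEVICE_NAME and CL_DRIVER_VERSION); the helper name and the file naming scheme are made up for illustration:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: builds a file name for a saved program binary.
// deviceName and driverVersion are assumed to come from clGetDeviceInfo
// (CL_DEVICE_NAME, CL_DRIVER_VERSION); kernelTag identifies the kernel source.
std::string binaryCacheName(const std::string& deviceName,
                            const std::string& driverVersion,
                            const std::string& kernelTag) {
    assert(!kernelTag.empty());
    std::string key = kernelTag + "_" + deviceName + "_" + driverVersion;
    // Replace characters that are awkward in file names with '_'.
    for (char& c : key) {
        bool ok = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                  (c >= '0' && c <= '9') || c == '.' || c == '_' || c == '-';
        if (!ok) c = '_';
    }
    return key + ".clbin";
}
```

If no file with that name exists for the user's device, fall back to compiling from source.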
Some GL/CL/DX sharing (interop) operations can be incompatible between vendors.
Don't use platform-specific constants; they may not be fully mapped on other platforms.
Tell people which OpenCL version you target.
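You can also check the version at startup instead of relying on people reading it. The string returned for CL_DEVICE_VERSION starts with "OpenCL major.minor" followed by vendor-specific text, so a rough check could look like this (the function name is my own, not part of the OpenCL API):

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Returns true if a CL_DEVICE_VERSION-style string reports at least the
// wanted OpenCL version. Returns false if the string cannot be parsed.
bool supportsOpenCL(const std::string& versionString, int wantMajor, int wantMinor) {
    assert(wantMajor >= 1 && wantMinor >= 0);
    int major = 0, minor = 0;
    if (std::sscanf(versionString.c_str(), "OpenCL %d.%d", &major, &minor) != 2)
        return false;
    return major > wantMajor || (major == wantMajor && minor >= wantMinor);
}
```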
Don't use a local work group size larger than 256 for GPU computing. The maximum local work group size on AMD GPUs is 256, while on Nvidia GPUs it is 1024.
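A small sketch of how the host side could apply that cap, assuming the device limit was read with clGetDeviceInfo (CL_DEVICE_MAX_WORK_GROUP_SIZE). The constant and function names are mine; 256 is just the conservative value from the tip above, not an OpenCL constant:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Conservative cap taken from the advice above; not an OpenCL constant.
const std::size_t kPortableLocalSize = 256;

// Picks a local size that respects both the portable cap and the limit the
// device reported (assumed to come from CL_DEVICE_MAX_WORK_GROUP_SIZE).
std::size_t pickLocalSize(std::size_t deviceMaxWorkGroupSize) {
    assert(deviceMaxWorkGroupSize > 0);
    return std::min(kPortableLocalSize, deviceMaxWorkGroupSize);
}

// Rounds the number of work items up to the next multiple of the local size.
std::size_t roundUpGlobalSize(std::size_t workItems, std::size_t localSize) {
    assert(localSize > 0);
    return ((workItems + localSize - 1) / localSize) * localSize;
}
```

The rounding matters because OpenCL 1.x requires the global size to be a multiple of the local size; the kernel then needs an index check so the padded work items do nothing.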
Don't spill private registers; reduce the depth of pseudo-recursive functions if you badly need them. Sometimes the AMD compiler tries to optimize so hard that it blows up at native device compile time.
Use a "platform & device query wrapper" of your own that finds a proper GPU; don't just take platform[0] or device[0]. Users may have multiple platforms, such as Intel's for the CPU and AMD's for the GPU, maybe all of them. The GPU built into an APU may be reported as ACC (accelerator) instead of GPU (I'm not sure about this).
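The selection part of such a wrapper could look roughly like this. It assumes you have already walked every platform with clGetPlatformIDs and clGetDeviceIDs and copied what you need into a plain list; the struct, the ranking, and the compute-unit tiebreaker are my own illustration, and compute units are only a crude measure when comparing different vendors:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum class DeviceKind { Cpu, Gpu, Accelerator };

// Hypothetical record filled from clGetDeviceInfo while walking all platforms.
struct DeviceEntry {
    std::size_t platformIndex;
    std::size_t deviceIndex;
    DeviceKind kind;         // from CL_DEVICE_TYPE
    unsigned computeUnits;   // from CL_DEVICE_MAX_COMPUTE_UNITS
    std::string name;        // from CL_DEVICE_NAME
};

// GPUs first, accelerators next (in case an APU's GPU shows up as one), CPUs last.
int kindRank(DeviceKind k) {
    switch (k) {
        case DeviceKind::Gpu:         return 2;
        case DeviceKind::Accelerator: return 1;
        case DeviceKind::Cpu:         return 0;
    }
    assert(false && "unknown DeviceKind");
    return -1;
}

// Returns the index of the preferred device in 'all', or -1 if the list is empty.
int pickDevice(const std::vector<DeviceEntry>& all) {
    int best = -1;
    for (std::size_t i = 0; i < all.size(); ++i) {
        if (best < 0) { best = static_cast<int>(i); continue; }
        const DeviceEntry& a = all[i];
        const DeviceEntry& b = all[static_cast<std::size_t>(best)];
        int ra = kindRank(a.kind), rb = kindRank(b.kind);
        // Higher rank wins; among equal ranks, more compute units wins.
        if (ra > rb || (ra == rb && a.computeUnits > b.computeUnits))
            best = static_cast<int>(i);
    }
    return best;
}
```

Letting the user override the choice is still a good idea, since no ranking fits every machine.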
Code that relies on implicit synchronization between kernels and buffer transfers can run fine on your system and still fail on other systems.
Check that your DLLs and app have the same bitness as other people's machine and OS. If you target 64-bit and they have a 32-bit OS, it will not work.
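For bug reports it helps if the app can print its own bitness at startup. A tiny sketch, with function names of my own choosing; the compatibility rule encoded here is only the simple one from the tip above (a 64-bit build needs a 64-bit OS):

```cpp
#include <cassert>
#include <climits>
#include <cstddef>

// Pointer width of this build in bits (32 or 64 on common desktop targets).
constexpr std::size_t buildBitness() { return sizeof(void*) * CHAR_BIT; }

// Simple rule of thumb: a build can run only if the OS is at least as wide.
bool canRunOn(std::size_t buildBits, std::size_t osBits) {
    assert(buildBits == 32 || buildBits == 64);
    return buildBits <= osBits;
}
```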