SOLVED: Got it working by recompiling everything to 64bit. Quite stupid as it seems, but I thought that OpenCL was working similar to OpenGL, where you can easily allocate e.g. 8GB texture memory from a 32bit process and the driver handles the details for you (ofcourse you can't allocate 8GB in one sweep, but e.g. transfer multiple textures that add up to more that 4GB).
I still think that limiting the OpenCL memory abstraction to the adress space of the process (at least for intel/amd drivers) is irritating, but maybe there are some subtle details or performance tradeoff why this implementation was chosen.