You are probably compiling for a cc (compute capability) 1.x device. The documentation indicates that __global__ function parameters are passed via shared memory on cc 1.x devices.
So you have 16,384 bytes for your explicit __shared__ declarations.
The remainder would come from the 28 bytes (assuming a 64-bit target) required by your explicit kernel parameters, plus other overhead that is communicated via shared memory.
Try compiling for a cc 2.x device:
nvcc -arch=sm_20 ...