Thrust sorting operations require significant extra temporary storage.
nvidia-smi is effectively sampling memory usage at various times, and the amount of memory in use at a sampling point may not reflect the peak memory used (or required) by your application. As you've discovered, cudaMemGetInfo may be more useful.
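As a rough illustration of the point above, here is a minimal sketch that calls cudaMemGetInfo around a sort and also records the peak temporary storage Thrust requests. The `tracking_allocator` struct and the `report` helper are hypothetical names introduced for this example. The sketch assumes a Thrust version that accepts a custom allocator via `thrust::cuda::par(alloc)`, as in Thrust's custom temporary allocation example. The array size is arbitrary.

```cuda
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <map>
#include <new>
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <thrust/system/cuda/execution_policy.h>

// Hypothetical helper (not part of Thrust): forwards Thrust's temporary
// allocations to cudaMalloc and remembers the largest total outstanding.
struct tracking_allocator {
    typedef char value_type;
    std::map<char*, std::size_t> live;  // outstanding allocations
    std::size_t current = 0;            // bytes currently allocated
    std::size_t peak = 0;               // largest value of `current` seen

    char* allocate(std::ptrdiff_t num_bytes) {
        void* p = nullptr;
        if (cudaMalloc(&p, static_cast<std::size_t>(num_bytes)) != cudaSuccess) {
            cudaGetLastError();  // clear the error before reporting failure
            throw std::bad_alloc();
        }
        char* ptr = static_cast<char*>(p);
        live[ptr] = static_cast<std::size_t>(num_bytes);
        current += static_cast<std::size_t>(num_bytes);
        peak = std::max(peak, current);
        return ptr;
    }

    void deallocate(char* ptr, std::size_t) {
        std::map<char*, std::size_t>::iterator it = live.find(ptr);
        if (it != live.end()) {
            current -= it->second;
            live.erase(it);
        }
        cudaFree(ptr);
    }
};

// Hypothetical helper: print free and total device memory right now.
static void report(const char* label) {
    std::size_t free_b = 0, total_b = 0;
    cudaMemGetInfo(&free_b, &total_b);
    std::printf("%-12s free %zu MiB of %zu MiB\n",
                label, free_b >> 20, total_b >> 20);
}

int main() {
    const std::size_t n = std::size_t(1) << 24;  // arbitrary example size

    report("start");
    thrust::host_vector<int> h(n);
    for (std::size_t i = 0; i < n; ++i) h[i] = static_cast<int>(n - i);
    thrust::device_vector<int> d = h;
    report("after alloc");

    tracking_allocator alloc;
    thrust::sort(thrust::cuda::par(alloc), d.begin(), d.end());
    report("after sort");

    std::printf("data: %zu MiB, peak temporary storage: %zu MiB\n",
                (n * sizeof(int)) >> 20, alloc.peak >> 20);
    return 0;
}
```

The snapshot taken after the sort will usually look the same as the one taken before it, because the temporary buffers have already been released. The peak recorded by the allocator is what shows the extra storage the sort needed while it was running.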
I've generally found Thrust to be able to sort arrays whose size is up to about 40% of the memory on your GPU. However, there is no specified limit, and you may need to determine it by trial and error.
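One way to run that trial and error is sketched below. The helper `can_sort` is a hypothetical name, and the choice of `int` keys and 10% steps is an assumption made for illustration. It tries progressively larger arrays, expressed as a percentage of total device memory, and stops at the first size where either the allocation or the sort fails. On GPUs with very large memory the element count can exceed 2^31, which older Thrust versions may not handle, so treat the result as a rough estimate.

```cuda
#include <cstddef>
#include <cstdio>
#include <new>
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/system_error.h>

// Hypothetical helper: true if n ints can be allocated and sorted on the device.
static bool can_sort(std::size_t n) {
    try {
        thrust::device_vector<int> d(n);   // may throw on allocation failure
        thrust::sort(d.begin(), d.end());  // may throw if temporary storage fails
        return true;
    } catch (const std::bad_alloc&) {
        // out of memory for the data or for the sort's temporary storage
    } catch (const thrust::system_error&) {
        // other CUDA error reported by Thrust
    }
    cudaGetLastError();  // clear the pending (non-sticky) error state
    return false;
}

int main() {
    std::size_t free_b = 0, total_b = 0;
    if (cudaMemGetInfo(&free_b, &total_b) != cudaSuccess) {
        std::printf("cudaMemGetInfo failed\n");
        return 1;
    }
    std::printf("free %zu MiB of %zu MiB\n", free_b >> 20, total_b >> 20);

    int largest_ok = 0;
    for (int pct = 10; pct <= 90; pct += 10) {
        // Array size as a percentage of total device memory.
        std::size_t n = total_b / 100 * pct / sizeof(int);
        bool ok = can_sort(n);
        std::printf("%2d%% of memory (%zu elements): %s\n",
                    pct, n, ok ? "sorted" : "failed");
        if (!ok) break;
        largest_ok = pct;
    }
    std::printf("largest size that sorted: about %d%% of device memory\n",
                largest_ok);
    return 0;
}
```

The percentage found this way depends on the GPU, the Thrust version, the key type, and whatever else is using device memory at the time.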
Don't forget that CUDA uses some overhead memory, and if your GPU is hosting a display, that will consume additional memory as well.