It's pretty GPU-specific but if you are on NVIDIA hardware you can try using the CUDA Occupancy Calculator.
I know you are using DirectCompute, but it maps to the same underlying hardware. If you look at the output of FXC you can see the shared memory size and registers-per-thread in the generated assembly, and you can deduce the compute capability from which card you have. Compute capability is the CUDA equivalent of shader profiles like cs_4_0, cs_4_1, cs_5_0, etc.
The goal is to increase the "occupancy": the ratio of active warps per multiprocessor to the hardware maximum. Roughly, occupancy == 100% - %idle-due-to-HW-overhead; the more resident warps the scheduler has, the better it can hide memory and pipeline latency.
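As a rough sketch of what the Occupancy Calculator computes: each block's resource usage (warps, registers, shared memory) caps how many blocks can be resident on a multiprocessor at once, and occupancy is the resulting warp count over the maximum. The SM limits below are the ones for compute capability 2.0 (Fermi) parts; substitute your own GPU's values, and note the real calculator also accounts for allocation granularity, which this sketch ignores.

```python
# Rough theoretical-occupancy estimate, in the spirit of the CUDA
# Occupancy Calculator. Limits are for a compute-capability 2.0 SM;
# replace them with your hardware's numbers.
MAX_WARPS_PER_SM = 48
MAX_REGISTERS_PER_SM = 32768
MAX_SHARED_MEM_PER_SM = 49152   # bytes
MAX_BLOCKS_PER_SM = 8
WARP_SIZE = 32

def occupancy(threads_per_block, regs_per_thread, shared_mem_per_block):
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceiling division
    # Each resource independently limits the number of resident blocks.
    by_warps = MAX_WARPS_PER_SM // warps_per_block
    by_regs = MAX_REGISTERS_PER_SM // (regs_per_thread * threads_per_block)
    by_smem = (MAX_SHARED_MEM_PER_SM // shared_mem_per_block
               if shared_mem_per_block else MAX_BLOCKS_PER_SM)
    blocks = min(by_warps, by_regs, by_smem, MAX_BLOCKS_PER_SM)
    return blocks * warps_per_block / MAX_WARPS_PER_SM

# e.g. 256 threads/block, 20 registers/thread, 4 KB shared memory/block:
print(occupancy(256, 20, 4096))  # → 1.0 (fully occupied)
```

Plugging in the register and shared-memory numbers you read out of the FXC disassembly tells you which resource is the bottleneck: if `by_regs` or `by_smem` is the minimum, trimming that resource per thread/block is what will raise occupancy.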