Your comment is right about both the cause and the effect of thread divergence within a warp. However, the launch configuration you mention (1 thread per block) wastes most of the GPU's potential. A warp (or half-warp) is the unit of threads that a multiprocessor executes together in lockstep, and a block is always allocated in whole warps, so a one-thread block occupies a warp in which 31 of the 32 lanes sit idle. Launching 32 such blocks is therefore no better than launching one warp whose 32 threads all take different paths. In fact it is worse, because the number of resident blocks per multiprocessor is quite limited (for example 8 or 16, depending on compute capability).
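To make the warp accounting concrete, here is a minimal sketch that counts how many warps a launch occupies. The names `count_warps` and `warps_used` are made up for this illustration, and it assumes one-dimensional blocks and a working CUDA device.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Lane 0 of every warp bumps the counter, so *warps ends up holding the
// number of warps the launch occupied (1D blocks assumed).
__global__ void count_warps(unsigned int *warps)
{
    if (threadIdx.x % warpSize == 0)
        atomicAdd(warps, 1u);
}

// Hypothetical helper: launches count_warps with the given configuration and
// returns the number of warps it used, or -1 on any CUDA error.
int warps_used(int blocks, int threads_per_block)
{
    assert(blocks > 0 && threads_per_block > 0);

    unsigned int *d_warps = nullptr;
    unsigned int h_warps = 0;
    if (cudaMalloc(&d_warps, sizeof(unsigned int)) != cudaSuccess)
        return -1;

    bool ok = cudaMemset(d_warps, 0, sizeof(unsigned int)) == cudaSuccess;
    if (ok) {
        count_warps<<<blocks, threads_per_block>>>(d_warps);
        // cudaMemcpy on the default stream waits for the kernel to finish.
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(&h_warps, d_warps, sizeof(unsigned int),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_warps);
    return ok ? static_cast<int>(h_warps) : -1;
}
```

With this sketch, `warps_used(32, 1)` reports 32 warps while `warps_used(1, 32)` reports 1, even though both launches run the same 32 threads.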
Therefore, if you want to fully exploit the GPU's potential, keep Jack's comment in mind and try to reorganize the work so that all threads of a single warp follow the same execution path.
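One possible way to do such a reorganization is sketched below. It is only an illustration under assumptions, not your actual code: the two branches (`x * 2` for even elements, `x + 1` for odd ones), the kernel names, and the helper `kernels_agree` are all invented, and the block size is assumed to be a multiple of 64 so that each half of a block consists of whole warps.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Thread i handles element i, so even and odd lanes of every warp
// take different branches and each warp diverges.
__global__ void divergent(const int *in, int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0) out[i] = in[i] * 2;
    else            out[i] = in[i] + 1;
}

// Same work, remapped: the first half of each block handles the even
// elements, the second half the odd ones. With blockDim.x a multiple of 64,
// the condition below is uniform within every warp.
__global__ void warp_uniform(const int *in, int *out)
{
    int half = blockDim.x / 2;
    int base = blockIdx.x * blockDim.x;
    if (threadIdx.x < half) {
        int i = base + 2 * threadIdx.x;
        out[i] = in[i] * 2;
    } else {
        int i = base + 2 * (threadIdx.x - half) + 1;
        out[i] = in[i] + 1;
    }
}

// Hypothetical helper: runs both kernels on blocks * 64 elements and reports
// whether each output matches a host-side reference.
bool kernels_agree(int blocks)
{
    assert(blocks > 0);
    const int threads = 64;  // two full warps per block
    const int n = blocks * threads;
    const size_t bytes = n * sizeof(int);

    std::vector<int> h_in(n), h_a(n), h_b(n);
    for (int i = 0; i < n; ++i) h_in[i] = i;

    int *d_in = nullptr, *d_a = nullptr, *d_b = nullptr;
    bool ok = cudaMalloc(&d_in, bytes) == cudaSuccess
           && cudaMalloc(&d_a, bytes) == cudaSuccess
           && cudaMalloc(&d_b, bytes) == cudaSuccess
           && cudaMemcpy(d_in, h_in.data(), bytes,
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        divergent<<<blocks, threads>>>(d_in, d_a);
        warp_uniform<<<blocks, threads>>>(d_in, d_b);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(h_a.data(), d_a, bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess
          && cudaMemcpy(h_b.data(), d_b, bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    // cudaFree on a null pointer is a no-op, so this is safe after a failure.
    cudaFree(d_in); cudaFree(d_a); cudaFree(d_b);

    for (int i = 0; ok && i < n; ++i) {
        int expected = (i % 2 == 0) ? h_in[i] * 2 : h_in[i] + 1;
        ok = h_a[i] == expected && h_b[i] == expected;
    }
    return ok;
}
```

Both kernels compute the same output, but in `warp_uniform` the branch depends only on which warp a thread belongs to. The price in this particular remapping is a stride-2 memory access pattern, so whether it pays off depends on how expensive the two branches are compared with the loss in coalescing.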