TL;DR answer: in a GPU, a much larger fraction of the transistors are actually working on the computation than in a CPU.
The big power-efficiency killer in today's CPUs is a trade-off made to allow general computation on the chip. Whether it is a RISC, x86, or other CPU architecture, there is extra hardware dedicated to the general-purpose usage of the CPU. Those transistors consume power even though they are not doing any actual math.
Fast CPUs require advanced branch-prediction hardware and large caches to avoid lengthy processing that could be discarded later in the pipeline. For the most part, a CPU executes its instructions one at a time (per core; SIMD extensions help CPUs here as well), and it handles conditional branches extremely well. GPUs rely on performing the same operation on many pieces of data at once (SIMD/vector operation), and they suffer greatly with even the simple conditions found in 'if' and 'for' statements; the toy kernel below shows why.
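To make that concrete, here is a minimal CUDA sketch (the kernel and variable names are my own, purely for illustration): all 32 threads of a warp share one instruction stream, so when a data-dependent 'if' splits them, the hardware runs both paths one after the other with the non-participating lanes masked off.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Toy kernel: threads within one 32-lane warp pick different paths based
// on their data. The warp executes BOTH paths back to back with inactive
// lanes masked off, so a divergent branch wastes ALU cycles.
__global__ void divergent(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (in[i] > 0.0f)            // data-dependent branch -> divergence
        out[i] = in[i] * 2.0f;   // pass 1: lanes where the condition holds
    else
        out[i] = -in[i];         // pass 2: the remaining lanes
}

int main()
{
    const int n = 64;
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i)
        h_in[i] = (i % 2) ? 1.0f : -1.0f;  // alternating signs: every warp diverges

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    divergent<<<1, n>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("out[0]=%g out[1]=%g\n", h_out[0], h_out[1]);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

A CPU running the same loop would just predict the branch and keep its pipeline full; the GPU pays for its simplicity whenever the lanes disagree.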
There is also a lot of hardware used to fetch, decode, and schedule instructions -- this is true for CPUs and GPUs. The big difference is that the ratio of fetch+decode+schedule transistors to computing transistors tends to be much lower for a GPU: one fetched and decoded instruction is issued to an entire wide SIMD unit, so the front-end cost is amortized over many arithmetic operations.
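As a back-of-envelope sketch of that ratio (the per-issue counts below are rough numbers I picked for a generic design, not measured figures):

```cuda
#include <cstdio>

// Rough, assumed numbers (not measurements) to show the amortization:
// one fetch+decode+schedule on a GPU feeds an entire 32-lane warp, while
// a superscalar CPU core's front end feeds only a handful of ops.
int main()
{
    const double gpu_lanes_per_issue = 32.0; // one warp instruction -> 32 ALU lanes
    const double cpu_ops_per_issue   = 4.0;  // guess for a wide superscalar core

    printf("front-end work per arithmetic op: GPU does ~%.0fx less than CPU\n",
           gpu_lanes_per_issue / cpu_ops_per_issue);
    return 0;
}
```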
Here is an AMD presentation from 2011 about how their GPUs have changed over time, though it really applies to most GPUs in general: PDF link. It helped me understand the power advantage of GPUs by giving a bit of the history behind how GPUs got to be so good at certain computations.
I gave an answer to a similar question a while ago: SO link.