Question

Very simple question, probably difficult answer:

Does using SSE instructions, for example for parallel sum/min/max/average operations, consume more power than executing ordinary scalar instructions (e.g. a single scalar add)?

I couldn't find any information on this, for example on Wikipedia.

The only hint of an answer I could find is here, but it's a little generic and doesn't reference any published material.

Solution

I actually did a study on this a few years ago. The answer depends on what exactly your question is:

In today's processors, power consumption is not determined so much by the type of instruction (scalar vs. SIMD) as by everything else around it, such as:

  1. Memory/cache
  2. Instruction decoding
  3. Out-of-order execution, register file
  4. Lots of other things.

So if the question is:

All other things being equal, does a SIMD instruction consume more power than a scalar instruction?

For this, I dare to say yes.

One of my graduate school projects eventually became this answer: a side-by-side comparison of SSE2 (2-way SIMD on doubles) and AVX (4-way SIMD on doubles) did in fact show that AVX had noticeably higher power consumption and higher processor temperatures. (I don't remember the exact numbers, though.)

The comparison was fair because the code was identical between the SSE and AVX versions; only the width of the instructions differed, and the AVX version did double the work per instruction.
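
I no longer have the benchmark code from that project, but the shape of such a side-by-side test can be sketched roughly like this (a hypothetical elementwise-add kernel; the function names are mine, and n is assumed to be a multiple of the vector width):

    #include <stddef.h>
    #include <immintrin.h>  /* SSE2 and AVX intrinsics */

    /* SSE2 version: each add instruction processes 2 doubles.
       Assumes n is a multiple of 2. */
    void add_sse2(const double *a, const double *b, double *out, size_t n)
    {
        for (size_t i = 0; i < n; i += 2) {
            __m128d va = _mm_loadu_pd(a + i);
            __m128d vb = _mm_loadu_pd(b + i);
            _mm_storeu_pd(out + i, _mm_add_pd(va, vb));
        }
    }

    /* AVX version: identical structure, but each add instruction
       processes 4 doubles. Assumes n is a multiple of 4. */
    void add_avx(const double *a, const double *b, double *out, size_t n)
    {
        for (size_t i = 0; i < n; i += 4) {
            __m256d va = _mm256_loadu_pd(a + i);
            __m256d vb = _mm256_loadu_pd(b + i);
            _mm256_storeu_pd(out + i, _mm256_add_pd(va, vb));
        }
    }

Run back to back over the same data (the AVX version needs AVX enabled at compile time, e.g. -mavx on GCC/Clang), the AVX loop retires half as many arithmetic instructions per element, with each instruction exercising a wider datapath, which is where the extra per-instruction power shows up.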

But if the question is:

Will vectorizing my code to use SIMD consume more power than a scalar implementation?

There are numerous factors involved here, so I'll avoid a direct answer:

Factors that reduce power consumption:

  • We need to remember that the point of SIMD is to improve performance. If you can improve performance, your app takes less time to run, which saves power.

  • Depending on the application and the implementation, SIMD will reduce the number of instructions needed to do a given task, because each instruction does several operations at once (see the sketch just after this list).
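
To make the instruction-count point concrete, here is a rough sketch of my own (a hypothetical single-precision array sum, not code from the study; the function names are mine and n is assumed to be a multiple of 4):

    #include <stddef.h>
    #include <xmmintrin.h>  /* SSE intrinsics */

    /* Scalar sum: one add instruction per element. */
    float sum_scalar(const float *a, size_t n)
    {
        float s = 0.0f;
        for (size_t i = 0; i < n; ++i)
            s += a[i];
        return s;
    }

    /* SSE sum: one add instruction per 4 elements, leaving 4 partial
       sums in the accumulator register. Assumes n is a multiple of 4. */
    __m128 sum_sse_partial(const float *a, size_t n)
    {
        __m128 acc = _mm_setzero_ps();
        for (size_t i = 0; i < n; i += 4)
            acc = _mm_add_ps(acc, _mm_loadu_ps(a + i));
        return acc;  /* still needs a horizontal reduction; see below */
    }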

Factors that increase power consumption:

  • As mentioned earlier, SIMD instructions do more work and can use more power than scalar equivalents.
  • Use of SIMD introduces overhead not present in scalar code, such as shuffle and permute instructions, which also need to go through the instruction execution pipeline (a sketch of one such case follows this list).
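
For example, finishing the partial-sum sketch above requires a horizontal reduction: a few shuffle/add instructions that the scalar loop never needs. Again this is my own illustrative code, using a common SSE reduction idiom:

    /* Reduce the 4 partial sums from sum_sse_partial() to a single float.
       These shuffles and extra adds are SIMD-only overhead. */
    float hsum_sse(__m128 acc)
    {
        __m128 shuf = _mm_shuffle_ps(acc, acc, _MM_SHUFFLE(2, 3, 0, 1)); /* swap adjacent elements */
        __m128 sums = _mm_add_ps(acc, shuf);                             /* pairwise sums          */
        shuf        = _mm_movehl_ps(shuf, sums);                         /* move high pair down    */
        sums        = _mm_add_ss(sums, shuf);                            /* add the two pair-sums  */
        return _mm_cvtss_f32(sums);
    }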

Breaking it down:

  • Fewer instructions -> less overhead for issuing and executing them -> less power
  • Faster code -> run less time -> less power
  • SIMD takes more power to execute -> more power

So SIMD saves power by making your app take less time, but while it's running it consumes more power per unit time. Which effect wins depends on the situation.

From my experience, for applications that get a worthwhile speedup from SIMD (or any other method), the former usually wins and the power consumption goes down.

That's because run-time tends to be the dominant factor in power consumption for modern PCs (laptops, desktops, servers). The reason is that most of the power consumption is not in the CPU but in everything else: the motherboard, RAM, hard drives, monitors, idle video cards, etc., most of which have a relatively fixed power draw.

For my computer, just keeping it on (idle) already draws more than half of what it draws under an all-core SIMD load such as Prime95 or Linpack. So if I can make an app 2x faster by means of SIMD/parallelization, I've almost certainly saved power.
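
To put rough numbers on that (hypothetical figures for illustration, not measurements from my machine):

    Idle (fixed) system draw:              60 W
    Scalar run:  60 W + 30 W =  90 W for 100 s  ->  9000 J
    SIMD run:    60 W + 45 W = 105 W for  50 s  ->  5250 J

Even though the SIMD run draws more power while it executes, the 2x speedup cuts the total energy for the task by roughly 40%.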

OTHER TIPS

As Mysticial's answer suggests, SIMD code tends to draw slightly more power, but if the problem is amenable to vectorization, well-written SIMD code will run significantly faster. The speedup is almost always larger than the increase in power, which results in a decrease in the amount of energy (the integral of power over time) consumed.

This is broadly true not only for SIMD vectorization, but for nearly all optimization. Faster code is not just faster, but (almost universally) more energy efficient.

A nit about terminology: people frequently talk about "power" when they really mean "energy". Power consumption in computing is really only relevant if you are engineering power supplies (for obvious reasons) or engineering enclosures (because you need to know how much power must be dissipated as heat). 99.999% of people aren't engaged in either of those activities, so what they really want to keep in mind is energy (computation per unit of energy is the correct measure of how efficient a program is).

This really depends on what you actually want to know. Let me answer from the point of view of a processor designer who doesn't care about all the other power consumption (e.g. main memory) but only about the power consumed in his/her own piece of logic in a single core. From that perspective I have two answers.

1.) For a fixed frequency, a core with SIMD, which delivers the result faster, likely uses more energy than a scalar core due to the extra complexity (circuit logic) of implementing SIMD.

2.) If the frequency is allowed to vary so that the scalar core finishes in the same time as the SIMD core, I would argue that the SIMD core uses much less energy.

Edit: I changed the word "power" to "energy", since power is energy/time. I think the proper thing to compare is something like FLOPS/watt.

Let me explain. The power of a processor goes as C*V^2*f, where C is capacitance, V is voltage, and f is frequency. If you read the paper "Optimizing Power Using Transformations", you can show that using two cores at half the frequency uses only about 40% of the power of a single core at full frequency to do the same calculation in the same amount of time.
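
A back-of-the-envelope version of that result (my own sketch, under the idealized assumption that voltage can be scaled down roughly in proportion to frequency):

    P_single = C * V^2 * f
    P_dual  ~= 2 * C * (V/2)^2 * (f/2) = (1/4) * C * V^2 * f

So the idealized figure is about 25%; the paper's more careful accounting (extra routing capacitance, voltage that doesn't scale all the way down) lands closer to the 40% quoted above.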

I would argue that the same logic applies to other parallel methods such as SIMD and ILP (superscalar execution). So instead of increasing the frequency of a scalar core, if SIMD is implemented, the same computation can be done in the same amount of time using much less energy (on the other hand, it makes the programming a lot more difficult).

GPU developers have used the principle behind that paper to put themselves a few years ahead of Intel (in terms of Moore's law) in processing potential. They run at lower frequencies than CPUs and use far more "cores", so for the same amount of electrical energy they get more potential processing power.
