Expression Template implementation not being optimized

Question 1

The point of expression templates is that the evaluation of the subexpressions can lead to temporaries that would incur a cost and provide no benefit. In your code you are not really comparing apples to apples. The two alternative to compare would be:

// Traditional
vector operator+(vector const& lhs, vector const& rhs);
vector operator-(vector const& lhs, vector const& rhs);
vector operator*(vector const& lhs, vector const& rhs);

With those definitions for the operations, the expression that you want solved:

v4 = ((v0 - v1) + (v2 * v3)) + v4;

Becomes (providing names to all temporaries):

auto __tmp1 = v0 - v1;
auto __tmp2 = v2 * v3;
auto __tmp3 = __tmp1 + __tmp2;
auto __tmp4 = __tmp3 + v4;
// assignment is not really part of the expression
v4 = __tmp4;

As you see there are 4 temporary objects, which if you use expression templates get reduced to the bare minimum: a single temporary since any of those operations generates an out-of-place value.

In your hand rolled version of the code you are not performing the same operations, you are rather unrolling the whole loop and taking advantage of knowledge of the complete operation, and not really the same operation, since by knowing that you would assign at the end of the expression to one of the elements, you transformed the expression into:

v4 += ((v0 - v1) + (v2 * v3));

Now consider what would happen if instead of assigning to one of the vectors that takes part of the expression, you were creating a new vector v5. Try the expression:

auto v5 = ((v0 - v1) + (v2 * v3)) + v4;

The magic of expression templates is that you can provide an implementation for the operators that work on the template that is just as efficient as the manual implementation, and user code is much simpler and less error prone (no need to iterate over all of the elements of the vectors with the potential for errors, or cost of maintenance as the internal representation of the vectors need to be known at each place where an arithmetic operation is performed)

I essentially want to update v4 in place without the copy

With expression templates and your current interface for the vector, you are going to pay for the temporary and the copy. The reason is that during the (conceptual) evaluation of the expression a new vector is created, while it might seem obvious for you that v4 = ... + v4; is equivalent to v4 += ..., that transformation cannot be done by the compiler or the expression template. You could, on the other hand, provide an overload of vector::operator+= (maybe even operator=) that takes an expression template, and does the operation in place.

Providing the assignment operator that assigns from the expression template and building with g++4.7 -O2 this is the generated assembly for both loops:

    call    __ZNSt6chrono12system_clock3nowEv   |    call    __ZNSt6chrono12system_clock3nowEv  
    movl    $5000000, %ecx                      |    movl    $5000000, %ecx                     
    xorpd   %xmm0, %xmm0                        |    xorpd   %xmm0, %xmm0                       
    movsd   2064(%rsp), %xmm3                   |    movsd   2064(%rsp), %xmm3                  
    movq    %rax, %rbx                          |    movq    %rax, %rbx                         
    .align 4                                    |    .align 4                                   
L9:                                             |L15:                                           
    xorl    %edx, %edx                          |    xorl    %edx, %edx                         
    jmp L8                                      |    jmp L18                                    
    .align 4                                    |    .align 4                                   
L32:                                            |L16:                                           
    movsd   2064(%rsp,%rdx,8), %xmm3            |    movsd   2064(%rsp,%rdx,8), %xmm3           
L8:                                             |L18:                                           
    movsd   1552(%rsp,%rdx,8), %xmm1            |    movsd   1040(%rsp,%rdx,8), %xmm2           
    movsd   16(%rsp,%rdx,8), %xmm2              |    movsd   16(%rsp,%rdx,8), %xmm1             
    mulsd   1040(%rsp,%rdx,8), %xmm1            |    mulsd   1552(%rsp,%rdx,8), %xmm2           
    subsd   528(%rsp,%rdx,8), %xmm2             |    subsd   528(%rsp,%rdx,8), %xmm1            
    addsd   %xmm2, %xmm1                        |    addsd   %xmm2, %xmm1                       
    addsd   %xmm3, %xmm1                        |    addsd   %xmm3, %xmm1                       
    movsd   %xmm1, 2064(%rsp,%rdx,8)            |    movsd   %xmm1, 2064(%rsp,%rdx,8)           
    addq    $1, %rdx                            |    addq    $1, %rdx                           
    cmpq    $64, %rdx                           |    cmpq    $64, %rdx                          
    jne L32                                     |    jne L16                                    
    movsd   2064(%rsp), %xmm3                   |    movsd   2064(%rsp), %xmm3                  
    subq    $1, %rcx                            |    subq    $1, %rcx                           
    addsd   %xmm3, %xmm0                        |    addsd   %xmm3, %xmm0                       
    jne L9                                      |    jne L15                                    
    movsd   %xmm0, (%rsp)                       |    movsd   %xmm0, (%rsp)                      
    call    __ZNSt6chrono12system_clock3nowEv   |    call    __ZNSt6chrono12system_clock3nowEv

Question 2

C++11 introduced move semantics to reduce the number of needless copies.

Your code is fairly obfuscated, but I think this should do the trick

In your struct vec replace

value_type d[N];

with

std::vector<value_type> d;

and add d(N) to the constructor initialization list. std::array is the obvious choice, but that would mean moving each element (i.e. the copy you're trying to avoid).

then add a move constructor:

vec(vec&& from): d(std::move(from.d))
{
}

The move constructor lets the new object "steal" the contents of the old one. In other words, instead of copying the entire vector (array) only the pointer to the array is copied.