Question

I'm trying to manipulate a special struct and I need some sort of a swizzle operator. For this it makes sense to have an overloaded array [] operator, but I don't want to have any branching since the particular specification of the struct allows for a theoretical workaround.

Currently, the struct looks like this:

struct f32x4
{
    float fLow[2];
    float fHigh[2];

    f32x4(float a, float b, float c, float d)
    {
        fLow[0] = a; 
        fLow[1] = b;
        fHigh[0] = c;
        fHigh[1] = d;
    }

    // template with an int here?
    inline float& operator[] (int x) {
        if (x < 2)
            return fLow[x];
        else
            return fHigh[x - 2];
    }
};

What could or should I do to avoid the branch? My idea is to use a template with an integer parameter and define specializations, but I'm not sure whether that makes sense or what the syntax of that monster would look like.

I explicitly cannot, under any circumstances, use a float[4] array to merge the two (and no union tricks either). The reason is that the float[2] members actually stand in for platform-specific PowerPC paired singles. A normal Windows compiler doesn't support paired singles, which is why I replaced them with float[2]s in this code.

Using the GreenHills compiler I get this assembly output (which suggests branching does occur):

.LDW31:
00000050 80040000           89      lwz r0, 0(r4)
00000054 2c000000           90      cmpwi   r0, 0
00000058 41820000           91      beq .L69
                            92  #line32
                            93  
                            94  .LDWlin1:
0000005c 2c000001           95      cmpwi   r0, 1
00000060 40820000           96      bne .L74
                            97  #line32
                            98  
                            99  .LDWlin2:
00000064 38630004          100      addi    r3, r3, 4
00000068 38210018          101      addi    sp, sp, 24
0000006c 4e800020          102      blr
                           103  .L74:
00000070 2c000002          104      cmpwi   r0, 2
00000074 40820000          105      bne .L77
                           106  #line33
                           107  
                           108  .LDWlin3:
00000078 38630008          109      addi    r3, r3, 8
0000007c 38210018          110      addi    sp, sp, 24
00000080 4e800020          111      blr
                           112  .L77:
00000084 2c000003          113      cmpwi   r0, 3
00000088 40820000          114      bne .L80
                           115  #line34
                           116  
                           117  .LDWlin4:
0000008c 3863000c          118      addi    r3, r3, 12
00000090 38210018          119      addi    sp, sp, 24
00000094 4e800020          120      blr
                           121  .L80:
00000098 38610008          122      addi    r3, sp, 8
                           123  .L69:
                           124  #       .ef

The C++ code corresponding to that snippet should be this:

inline const float& operator[](const unsigned& idx) const
{
    if (idx == 0)  return xy[0];
    if (idx == 1)  return xy[1];
    if (idx == 2)  return zw[0];
    if (idx == 3)  return zw[1];
    return 0.f; // note: binds the returned reference to a temporary, so it dangles after the call
}

Solution 3

Since you said in a comment that your index is always a template parameter, you can indeed do the branching at compile time instead of at runtime. Here is a possible solution using std::enable_if:

#include <iostream>
#include <type_traits>

struct f32x4
{
    float fLow[2];
    float fHigh[2];

    f32x4(float a, float b, float c, float d)
    {
        fLow[0] = a; 
        fLow[1] = b;
        fHigh[0] = c;
        fHigh[1] = d;
    }

    template <int x>
    float& get(typename std::enable_if<(x >= 0 && x < 2)>::type* = 0)
    {
        return fLow[x];
    }

    template <int x>
    float& get(typename std::enable_if<(x >= 2 && x < 4)>::type* = 0)
    {
        return fHigh[x-2];
    }
};

int main()
{
    f32x4 f(0.f, 1.f, 2.f, 3.f);

    std::cout << f.get<0>() << " " << f.get<1>() << " "
              << f.get<2>() << " " << f.get<3>(); // prints 0 1 2 3
}

Regarding performance, I don't think there will be any difference since the optimizer should be able to easily propagate the constants and remove dead code subsequently, thereby removing the branch altogether. However, with this approach, you get the benefit that any attempts to invoke the function with an invalid index will result in a compiler error.
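
If <type_traits> is not available, a similar compile-time check can be sketched with a single function template and static_assert (C++11). This is a swapped-in variant, not the std::enable_if approach shown above. The condition I < 2 is a constant in every instantiation, so the optimizer is expected, though not guaranteed, to drop the unused arm of the conditional.

```cpp
// Sketch: compile-time index validated with static_assert instead of std::enable_if.
struct f32x4
{
    float fLow[2];
    float fHigh[2];

    f32x4(float a, float b, float c, float d)
    {
        fLow[0] = a;
        fLow[1] = b;
        fHigh[0] = c;
        fHigh[1] = d;
    }

    template <unsigned I>
    float& get()
    {
        static_assert(I < 4, "f32x4 index must be in [0, 3]");
        // I is a constant per instantiation; I & 1 keeps both subscripts in range.
        return (I < 2) ? fLow[I & 1] : fHigh[I & 1];
    }
};
```

With this variant, a call such as f.get<4>() is rejected at compile time with the message passed to static_assert.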

OTHER TIPS

Either the index x is a runtime variable, or a compile-time constant.

  • if it is a compile-time constant, there's a good chance the optimizer will be able to prune the dead branch when inlining operator[] anyway.

  • if it is a runtime variable, like

    for (int i=0; i<4; ++i) { dosomething(f[i]); }
    

    you need the branch anyway. Unless, of course, your optimizer unrolls the loop, in which case it can replace the variable with four constants, inline & prune as above.

Did you profile this to show there's a real problem, and compile it to show the branch really happens where it could be avoided?


Example code:

float foo(f32x4 &f)
{
    return f[0]+f[1]+f[2]+f[3];
}

Example output from g++ -O3 -S:

.globl _Z3fooR5f32x4
        .type       _Z3fooR5f32x4, @function
_Z3fooR5f32x4:
.LFB4:
        .cfi_startproc
        movss       (%rdi), %xmm0
        addss       4(%rdi), %xmm0
        addss       8(%rdi), %xmm0
        addss       12(%rdi), %xmm0
        ret
        .cfi_endproc
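
To reproduce a listing like the one above, a self-contained translation unit might look like the following. The answer does not show which definition of f32x4 it compiled, so pairing foo with the branching operator[] from the question is an assumption.

```cpp
// Sketch: the struct from the question plus foo, ready for `g++ -O3 -S`.
struct f32x4
{
    float fLow[2];
    float fHigh[2];

    f32x4(float a, float b, float c, float d)
    {
        fLow[0] = a;
        fLow[1] = b;
        fHigh[0] = c;
        fHigh[1] = d;
    }

    // Branching accessor exactly as posted in the question.
    inline float& operator[](int x)
    {
        if (x < 2)
            return fLow[x];
        else
            return fHigh[x - 2];
    }
};

// Every index is a literal, so after inlining each comparison has a known result.
float foo(f32x4& f)
{
    return f[0] + f[1] + f[2] + f[3];
}
```

Inspecting the assembly for foo is the way to confirm whether a given compiler and optimization level actually removes the comparisons.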

Seriously, don't do this!! Just combine the arrays. But since you asked the question, here's an answer:

#include <iostream>

float fLow [2] = {1.0,2.0};
float fHigh [2] = {50.0,51.0};

float * fArrays[2] = {fLow, fHigh};

float getFloat (int i)
{
    return fArrays[i>=2][i%2];
}

int main()
{
    for (int i = 0; i < 4; ++i)
        std::cout << getFloat(i) << '\n';
    return 0;
}

Output:

1
2
50
51
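
The same table lookup can be folded into the struct from the question, as a rough sketch. The name halves is made up for illustration. The source contains no if, but it builds a small pointer table on each call and adds an indirection, so whether it beats a well-predicted branch depends on the compiler and target.

```cpp
// Sketch: runtime indexing without an explicit branch, via a two-entry pointer table.
struct f32x4
{
    float fLow[2];
    float fHigh[2];

    f32x4(float a, float b, float c, float d)
    {
        fLow[0] = a;
        fLow[1] = b;
        fHigh[0] = c;
        fHigh[1] = d;
    }

    float& operator[](unsigned x)
    {
        // Local table of the two halves, so no global state is needed.
        float* const halves[2] = { fLow, fHigh };
        // Bit 1 selects the half, bit 0 the element; masking keeps any x in range.
        return halves[(x >> 1) & 1u][x & 1u];
    }
};
```

Because of the masking, an out-of-range index wraps around silently instead of being reported.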

Create one array (or vector) holding all 4 elements, with the fLow values in the first two positions and the fHigh values in the last two. Then just index into it.

inline float& operator[] (int x) {
    return newFancyArray[x]; //But do some bounds checking above.
}
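
A minimal sketch of that idea with the bounds check written out. The struct name f32x4_flat is hypothetical, and this only applies where merged storage is acceptable, which the question rules out for the paired-singles case.

```cpp
#include <cassert>

// Sketch: all four values in one array, indexed directly.
struct f32x4_flat  // hypothetical name, not from the question
{
    float newFancyArray[4];  // fLow values in [0], [1]; fHigh values in [2], [3]

    f32x4_flat(float a, float b, float c, float d)
    {
        newFancyArray[0] = a;
        newFancyArray[1] = b;
        newFancyArray[2] = c;
        newFancyArray[3] = d;
    }

    inline float& operator[](int x)
    {
        // Debug-only bounds check; it compiles away when NDEBUG is defined.
        assert(x >= 0 && x < 4);
        return newFancyArray[x];
    }
};
```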

Based on Luc Touraille's answer, but without type traits since compiler support for them is lacking, I found that the following achieves the purpose of the question. Since operator[] cannot be given an int template parameter and still be called with the usual subscript syntax, I introduced an at method instead. This is the result:

struct f32x4
{
    float fLow[2];
    float fHigh[2];

    f32x4(float a, float b, float c, float d)
    {
        fLow[0] = a; 
        fLow[1] = b;
        fHigh[0] = c;
        fHigh[1] = d;
    }


    template <unsigned T>
    const float& at() const;

};
template<>
const float& f32x4::at<0>() const { return fLow[0]; }
template<>
const float& f32x4::at<1>() const { return fLow[1]; }
template<>
const float& f32x4::at<2>() const { return fHigh[0]; }
template<>
const float& f32x4::at<3>() const { return fHigh[1]; }
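
Since the original goal was a swizzle, here is a sketch of how one could be layered on at<N>(). The swizzle helper is hypothetical and not part of the answer above; the specializations are also marked inline here on the assumption that they live in a header.

```cpp
// Sketch: a swizzle helper built on the at<N>() specializations.
struct f32x4
{
    float fLow[2];
    float fHigh[2];

    f32x4(float a, float b, float c, float d)
    {
        fLow[0] = a;
        fLow[1] = b;
        fHigh[0] = c;
        fHigh[1] = d;
    }

    // Declared but not defined in general: only indices 0..3 are specialized.
    template <unsigned T>
    const float& at() const;
};

// `inline` avoids multiple-definition errors if this is included from several files.
template <> inline const float& f32x4::at<0>() const { return fLow[0]; }
template <> inline const float& f32x4::at<1>() const { return fLow[1]; }
template <> inline const float& f32x4::at<2>() const { return fHigh[0]; }
template <> inline const float& f32x4::at<3>() const { return fHigh[1]; }

// Hypothetical helper: builds a new f32x4 whose lanes are picked from v
// by compile-time indices, so no runtime index comparison is involved.
template <unsigned A, unsigned B, unsigned C, unsigned D>
inline f32x4 swizzle(const f32x4& v)
{
    return f32x4(v.at<A>(), v.at<B>(), v.at<C>(), v.at<D>());
}
```

For example, swizzle<3, 2, 1, 0>(v) reverses the four lanes. An index outside 0..3 still compiles, but is expected to fail at link time because at<4>() has no definition.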
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow