SIMD or not SIMD - cross platform

https://stackoverflow.com/questions/2122573

22-09-2019

Question

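I need some ideas on how to write a cross-platform C++ implementation of a few parallelizable problems so that I can take advantage of SIMD (SSE, SPU, etc.) where it is available. I also want to be able to switch between SIMD and non-SIMD at run time.

How would you suggest I approach this? (Of course I don't want to implement the problem multiple times, once for every possible option.)

I can see that this might not be an easy task in C++, but I believe I'm missing something. So far my idea looks like this: a class cStream would be an array of a single field. Using multiple cStreams I can achieve SoA (structure of arrays). Then, using a few functors, I can fake the lambda function that needs to be executed over the whole cStream.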

// just for example I'm not expecting this code to compile
cStream a; // something like float[1024]
cStream b;
cStream c;

void Foo()
{
    for_each(
        AssignSIMD(c, MulSIMD(AddSIMD(a, b), a)));
}

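Here for_each would be responsible for incrementing the current pointer of the streams, and for inlining the functor's body both with and without SIMD.

Something like this: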

// just for example I'm not expecting this code to compile
for_each(functor<T> f)
{
#ifdef USE_SIMD
    if (simdEnabled)
        real_for_each(f<true>()); // true means use SIMD
    else
#endif
        real_for_each(f<false>());
}

Note that whether SIMD is enabled is checked only once, and that the loop wraps the main functor.
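A compilable version of that idea might look like the sketch below. It is only an illustration under a few assumptions: the names HAVE_SSE, mul_add_kernel and mul_add are made up for this example, the SIMD path is limited to SSE, and a plain float array stands in for cStream. The bool template parameter picks the loop body at compile time, while the run-time flag is tested once per call, outside the loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// HAVE_SSE is a hypothetical macro; it plays the role of USE_SIMD in the question.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

// Computes c[i] = (a[i] + b[i]) * a[i].
// UseSimd selects the loop body at compile time.
template <bool UseSimd>
void mul_add_kernel(const float* a, const float* b, float* c, std::size_t n)
{
    std::size_t i = 0;
#if HAVE_SSE
    if (UseSimd) {
        // Four floats per iteration; unaligned loads keep the sketch simple.
        for (; i + 4 <= n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(c + i, _mm_mul_ps(_mm_add_ps(va, vb), va));
        }
    }
#endif
    // Scalar path; it also handles the tail left over by the SIMD loop.
    for (; i < n; ++i)
        c[i] = (a[i] + b[i]) * a[i];
}

// The run-time flag is checked once per call, not once per element.
void mul_add(const std::vector<float>& a, const std::vector<float>& b,
             std::vector<float>& c, bool simdEnabled)
{
    assert(a.size() == b.size() && b.size() == c.size());
    if (simdEnabled)
        mul_add_kernel<true>(a.data(), b.data(), c.data(), a.size());
    else
        mul_add_kernel<false>(a.data(), b.data(), c.data(), a.size());
}
```

Calling mul_add(a, b, c, true) and mul_add(a, b, c, false) should fill c with the same values; on a build without SSE both calls simply take the scalar loop.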

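Solution 2

If anyone is interested, this is the dirty code I wrote to test a new idea that came to me while reading about the library Paul posted.

Thanks Paul!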

// This is just a conceptual test
// I haven't profiled the code and I haven't verified if the result is correct
#include <xmmintrin.h>


// This class is doing all the math
template <bool SIMD>
class cStreamF32
{
private:
    void*       m_data;
    void*       m_dataEnd;
    __m128*     m_current128;
    float*      m_current32;

public:
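    cStreamF32(int size)
    {
        // For the SIMD variant, size is expected to be a multiple of 4
        if (SIMD)
            m_data = _mm_malloc(sizeof(float) * size, 16);
        else
            m_data = new float[size];
        // Remember where the buffer ends so that Next() knows when to stop
        m_dataEnd = (char*)m_data + sizeof(float) * size;
    }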
    ~cStreamF32()
    {
        if (SIMD)
            _mm_free(m_data);
        else
            delete[] (float*)m_data;
    }

    inline void Begin()
    {
        if (SIMD)
            m_current128 = (__m128*)m_data;
        else
            m_current32 = (float*)m_data;
    }

    inline bool Next()
    {
        if (SIMD)
        {
            m_current128++;
            return m_current128 < m_dataEnd;
        }
        else
        {
            m_current32++;
            return m_current32 < m_dataEnd;
        }
    }

    inline void operator=(const __m128 x)
    {
        *m_current128 = x;
    }
    inline void operator=(const float x)
    {
        *m_current32 = x;
    }

    inline __m128 operator+(const cStreamF32<true>& x)
    {
        return _mm_add_ps(*m_current128, *x.m_current128);
    }
    inline float operator+(const cStreamF32<false>& x)
    {
        return *m_current32 + *x.m_current32;
    }

    inline __m128 operator+(const __m128 x)
    {
        return _mm_add_ps(*m_current128, x);
    }
    inline float operator+(const float x)
    {
        return *m_current32 + x;
    }

    inline __m128 operator*(const cStreamF32<true>& x)
    {
        return _mm_mul_ps(*m_current128, *x.m_current128);
    }
    inline float operator*(const cStreamF32<false>& x)
    {
        return *m_current32 * *x.m_current32;
    }

    inline __m128 operator*(const __m128 x)
    {
        return _mm_mul_ps(*m_current128, x);
    }
    inline float operator*(const float x)
    {
        return *m_current32 * x;
    }
};

// Executes both functors
template<class T1, class T2>
void Execute(T1& functor1, T2& functor2)
{
    functor1.Begin();
    do
    {
        functor1.Exec();
    }
    while (functor1.Next());

    functor2.Begin();
    do
    {
        functor2.Exec();
    }
    while (functor2.Next());
}

// This is the implementation of the problem
template <bool SIMD>
class cTestFunctor
{
private:
    cStreamF32<SIMD> a;
    cStreamF32<SIMD> b;
    cStreamF32<SIMD> c;

public:
    cTestFunctor() : a(1024), b(1024), c(1024) { }

    inline void Exec()
    {
        c = a + b * a;
    }

    inline void Begin()
    {
        a.Begin();
        b.Begin();
        c.Begin();
    }

    inline bool Next()
    {
        a.Next();
        b.Next();
        return c.Next();
    }
};


int main (int argc, char * const argv[]) 
{
    cTestFunctor<true> functor1;
    cTestFunctor<false> functor2;

    Execute(functor1, functor2);

    return 0;
}

Other tips

You might want to look at the source of the MacSTL library for some ideas in this area: www.pixelglow.com/macstl/

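You might want to glance at my attempt at SIMD/non-SIMD:

vrep, a templated base class with specializations for SIMD (note how it distinguishes between float-only SSE and SSE2, which introduced integer vectors).
The more useful v4f, v4i etc. classes (subclassed via an intermediate v4).

Of course it is geared far more towards 4-element vectors for rgba/xyz type calculations than towards SoA, so it will completely run out of steam once 8-way operations come along, but the general principles might be useful.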
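The general principle of a templated representation with a SIMD specialization could be sketched roughly as follows. This is not the code the answer links to: vec4_rep, v4f_sketch and VEC4_HAVE_SSE are hypothetical names chosen for this example, and only load, store and addition are shown.

```cpp
#include <cassert>

// VEC4_HAVE_SSE is a hypothetical macro used only in this sketch.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define VEC4_HAVE_SSE 1
#else
#define VEC4_HAVE_SSE 0
#endif

// Primary template: portable storage, four floats in a plain array.
template <bool Simd>
struct vec4_rep
{
    float v[4];

    static vec4_rep load(const float* p)
    {
        assert(p != 0);
        vec4_rep r;
        for (int i = 0; i < 4; ++i) r.v[i] = p[i];
        return r;
    }
    void store(float* p) const
    {
        for (int i = 0; i < 4; ++i) p[i] = v[i];
    }
    vec4_rep operator+(const vec4_rep& o) const
    {
        vec4_rep r;
        for (int i = 0; i < 4; ++i) r.v[i] = v[i] + o.v[i];
        return r;
    }
};

#if VEC4_HAVE_SSE
// Specialization: same interface, but the data lives in one SSE register.
template <>
struct vec4_rep<true>
{
    __m128 v;

    static vec4_rep load(const float* p)
    {
        assert(p != 0);
        vec4_rep r;
        r.v = _mm_loadu_ps(p);
        return r;
    }
    void store(float* p) const { _mm_storeu_ps(p, v); }
    vec4_rep operator+(const vec4_rep& o) const
    {
        vec4_rep r;
        r.v = _mm_add_ps(v, o.v);
        return r;
    }
};
#endif

// User-facing type: picks the SSE specialization when the build supports it.
typedef vec4_rep<VEC4_HAVE_SSE != 0> v4f_sketch;
```

Code written against v4f_sketch does not change when the build switches between the two representations, which is the part of the design that carries over to wider vectors.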

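The most impressive approach to SIMD scaling I've seen is the RTFact ray-tracing framework: slides, paper. Well worth a look. The researchers are closely associated with Intel (Saarbrücken now hosts the Intel Visual Computing Institute), so you can be sure that forward scaling to AVX and Larrabee was on their minds.

Intel's Ct "data parallel" template library looks quite promising too.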

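Note that the example given decides what to execute at compile time (since you're using the preprocessor). In that case you can use more sophisticated techniques to decide what you actually want to execute, for example tag dispatching: http://cplusplus.co.il/2010/01/03/tag-dispatching/ Following the example shown there, you could have the fast implementation use SIMD and the slow one do without.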
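A minimal sketch of tag dispatching in this setting is shown below. All names (simd_tag, scalar_tag, kernel_traits, sum_impl, sum_all) are invented for the example, and the "fast" overload uses four independent accumulators as a portable stand-in for a real SIMD register rather than actual intrinsics.

```cpp
#include <cassert>
#include <cstddef>

// Empty tag types; overload resolution picks the implementation at compile time.
struct simd_tag {};
struct scalar_tag {};

// Hypothetical trait: which implementation should an element type use?
template <typename T> struct kernel_traits { typedef scalar_tag tag; };
template <> struct kernel_traits<float> { typedef simd_tag tag; };

// Slow path: a plain sequential sum.
template <typename T>
T sum_impl(const T* p, std::size_t n, scalar_tag)
{
    T s = T();
    for (std::size_t i = 0; i < n; ++i) s += p[i];
    return s;
}

// Fast path: four independent lanes, standing in for a 4-wide SIMD register.
template <typename T>
T sum_impl(const T* p, std::size_t n, simd_tag)
{
    T lane[4] = { T(), T(), T(), T() };
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        for (int k = 0; k < 4; ++k) lane[k] += p[i + k];
    T s = (lane[0] + lane[1]) + (lane[2] + lane[3]);
    for (; i < n; ++i) s += p[i]; // leftover elements
    return s;
}

// Public entry point: forwards to the overload selected by the tag.
template <typename T>
T sum_all(const T* p, std::size_t n)
{
    assert(p != 0 || n == 0);
    return sum_impl(p, n, typename kernel_traits<T>::tag());
}
```

Because the tag is a type, the choice costs nothing at run time; it complements a run-time switch rather than replacing it.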

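Have you thought about using an existing solution such as liboil? It implements many common SIMD operations and can decide at run time whether to use SIMD or non-SIMD code (using function pointers assigned by an initialization function).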
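The function-pointer technique itself can be sketched in a few lines. This is not liboil's API: add_fn, init_dispatch, add_arrays, cpu_has_sse and DISPATCH_HAVE_SSE are hypothetical names for this example, and the CPU check assumes the GCC/Clang builtin __builtin_cpu_supports where it is available.

```cpp
#include <cassert>
#include <cstddef>

// DISPATCH_HAVE_SSE is a hypothetical macro used only in this sketch.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define DISPATCH_HAVE_SSE 1
#else
#define DISPATCH_HAVE_SSE 0
#endif

typedef void (*add_fn)(const float*, const float*, float*, std::size_t);

// Reference implementation, always available.
static void add_scalar(const float* a, const float* b, float* c, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) c[i] = a[i] + b[i];
}

#if DISPATCH_HAVE_SSE
// SSE implementation: four floats per step, scalar loop for the tail.
static void add_sse(const float* a, const float* b, float* c, std::size_t n)
{
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(c + i, _mm_add_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
    for (; i < n; ++i) c[i] = a[i] + b[i];
}

// Hypothetical CPU probe for this sketch.
static bool cpu_has_sse()
{
#if defined(__GNUC__) || defined(__clang__)
    return __builtin_cpu_supports("sse") != 0;
#else
    return true; // assumption: the build target already guarantees SSE
#endif
}
#endif

static add_fn g_add = 0; // assigned once by init_dispatch()

// Call once at start-up; picks the implementation for the rest of the run.
void init_dispatch()
{
#if DISPATCH_HAVE_SSE
    g_add = cpu_has_sse() ? add_sse : add_scalar;
#else
    g_add = add_scalar;
#endif
}

// Every later call goes through the pointer chosen at start-up.
void add_arrays(const float* a, const float* b, float* c, std::size_t n)
{
    assert(g_add != 0 && "call init_dispatch() first");
    g_add(a, b, c, n);
}
```

The cost is one indirect call per operation, so this works best when each call processes a whole array rather than a single element.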

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow