Question

The code below converts a row from an 8-bit palettized format to 32-bit RGBA.

Before I try to implement it, I would like to know whether this code is even suited to optimization with DirectXMath or, alternatively, ARM NEON intrinsics or inline assembly. My first look at the documentation did not reveal anything that would cover the table-lookup part.

void CopyPixels(BYTE *pDst, BYTE *pSrc, int width,
  const BYTE mask, Color* pColorTable)
{
  if (width)
  {
    do
    {
      BYTE b = *pSrc++;
      if (b != mask)
      {
        // Translate to 32-bit RGB value if not masked
        const Color* pColor = pColorTable + b;
        pDst[0] = pColor->Blue;
        pDst[1] = pColor->Green;
        pDst[2] = pColor->Red;
        pDst[3] = 0xFF;
      }
      // Skip to next pixel
      pDst += 4;
    }
    while (--width);
  }
}

Solution

You will need a LUT of 256 * 4 bytes = 1024 bytes. This kind of job is not suited to SIMD at all (except for the AVX2 gather instructions on Intel's new Haswell core).
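To make the 1024-byte figure concrete, here is a minimal sketch of how such a table could be built once, up front, from the palette. The three-member layout of `Color` is an assumption (the question does not show it), and `BuildPackedTable` is a hypothetical helper name, not part of any library.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint8_t BYTE;

/* Assumed layout of the question's Color struct; the real one is not shown. */
typedef struct { BYTE Red, Green, Blue; } Color;

/* Fill a 256-entry table (256 * 4 = 1024 bytes) with ready-to-store pixels.
   Each entry holds the bytes B, G, R, 0xFF in destination order. */
void BuildPackedTable(uint32_t packed[256], const Color *pColorTable)
{
    assert(packed != NULL && pColorTable != NULL);
    for (int i = 0; i < 256; ++i) {
        const BYTE px[4] = {
            pColorTable[i].Blue, pColorTable[i].Green, pColorTable[i].Red, 0xFF
        };
        /* memcpy keeps the byte order intact regardless of endianness. */
        memcpy(&packed[i], px, sizeof px);
    }
}
```

With a table like this, the inner loop needs only one 32-bit load and one 32-bit store per unmasked pixel, which is the part that plain SIMD without a gather instruction cannot vectorize.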

NEON can handle LUTs of at most 32 bytes with VTBL and VTBX, but those instructions are more or less meant to work in conjunction with CLZ, supplying starting values for Newton-Raphson iterations.

OTHER TIPS

I agree with Jake that this isn't a great problem for a vector processor, and it may be handled more efficiently by the main ARM pipeline. That doesn't mean you couldn't optimize it in assembly (just plain ARMv7) for drastically improved results.

In particular, a simple improvement would be to construct your lookup table so that it can be used with a word-sized copy. This means making sure the Color struct follows the 32-bit RGBA format, including the fourth byte, 0xFF, as part of each table entry, so that you can just do a single word copy. This could be a significant performance boost with no assembly required, since each pixel then takes a single memory fetch rather than three (plus a constant assignment).

void CopyPixels(RGBA32Color *pDst, BYTE const *pSrc, int width,
  const BYTE mask, RGBA32Color const *pColorTable)
{
  if (width)
  {
    do
    {
      BYTE b = *pSrc++;
      if (b != mask)
      {
        // Translate to 32-bit RGB value if not masked
        *pDst = pColorTable[b];
      }
      // Skip to next pixel
      pDst++;
    }
    while (--width);
  }
}
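The snippet above leaves `RGBA32Color` undefined. As a rough sketch, one possible definition and a self-contained variant of the loop might look like this; the member order and the name `CopyPixels32` are assumptions chosen to match the byte order the question writes (blue, green, red, then 0xFF).

```c
#include <assert.h>
#include <stdint.h>

typedef uint8_t BYTE;

/* Assumed definition: member order matches the destination bytes B, G, R, A. */
typedef struct { BYTE Blue, Green, Red, Alpha; } RGBA32Color;

/* The single-copy trick only works if the struct is exactly one 32-bit word. */
static_assert(sizeof(RGBA32Color) == 4, "RGBA32Color must be 4 bytes");

/* Hypothetical variant of CopyPixels: pixels equal to mask are left untouched. */
void CopyPixels32(RGBA32Color *pDst, const BYTE *pSrc, int width,
                  BYTE mask, const RGBA32Color *pColorTable)
{
    for (int i = 0; i < width; ++i) {
        BYTE b = pSrc[i];
        if (b != mask)
            pDst[i] = pColorTable[b];  /* whole-struct assignment: one 4-byte copy */
    }
}
```

Each table entry must already carry 0xFF in `Alpha`, since the loop no longer writes it separately. Whether the compiler turns the struct assignment into a single word load and store depends on the target and optimization level, so it is worth checking the generated assembly.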
Licensed under: CC-BY-SA with attribution