I agree with Jake that this isn't a great fit for the vector processor, and that it may be handled more efficiently by the ARM main pipeline. That doesn't mean you couldn't optimize it in assembly (just plain ARMv7) for drastically improved results.
In particular, a simple improvement would be to construct your lookup table so that it can be used with a word-sized copy. This means making sure the Color struct follows the 32-bit RGBA format, with the fourth byte (0xFF) stored as part of each lookup entry, so that each pixel becomes a single word copy. This could be a significant performance boost with no assembly required, since it takes a single memory fetch rather than three (plus a constant assignment).
void CopyPixels(RGBA32Color *pDst, BYTE const *pSrc, int width,
                const BYTE mask, RGBA32Color const *pColorTable)
{
    if (width)
    {
        do
        {
            BYTE b = *pSrc++;
            if (b != mask)
            {
                // Translate to 32-bit RGBA value if not masked
                *pDst = pColorTable[b];
            }
            // Skip to next pixel
            pDst++;
        }
        while (--width);
    }
}
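
For completeness, here is a rough sketch of how such a table might be built. The names `RGB24Color` and `BuildColorTable` are hypothetical, and the `r, g, b, a` byte order is an assumption; adjust it to whatever your destination surface expects. The point is only that the 0xFF is written once per table entry up front, so the inner loop never has to assign it. Whether the compiler actually emits a single word load/store for the struct copy depends on the compiler and on alignment, so it is worth checking the generated code.

```c
#include <stdint.h>

typedef uint8_t BYTE;

/* Assumed layout: four bytes, so one table entry is exactly one 32-bit word. */
typedef struct {
    BYTE r, g, b, a;
} RGBA32Color;

/* Fail at compile time if the entry is not word sized. */
_Static_assert(sizeof(RGBA32Color) == 4, "RGBA32Color must be 4 bytes");

/* Hypothetical 3-byte palette entry, standing in for the original Color struct. */
typedef struct {
    BYTE r, g, b;
} RGB24Color;

/* Expand a 256-entry RGB palette into word-sized RGBA entries,
   storing the constant 0xFF in each entry once, ahead of time. */
static void BuildColorTable(RGBA32Color table[256],
                            const RGB24Color palette[256])
{
    for (int i = 0; i < 256; ++i) {
        table[i].r = palette[i].r;
        table[i].g = palette[i].g;
        table[i].b = palette[i].b;
        table[i].a = 0xFF;
    }
}
```

Build the table once whenever the palette changes, then pass it as pColorTable to CopyPixels above.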