ARM NEON Optimization for image transformation

Question

You create FOUR lookup tables 8 times as large as the original image?

You put an unnecessary if statement in the innermost loop?

What about swapping i and j?

Honestly, your question should be tagged with [c] instead of arm, neon, or image-processing to start with.

Since you didn't show what funcY and funcX do, the best answer I can give is the following. (Performance skyrocketed, and the fix is something really, really fundamental.)

// Temporary table of source indices
pTemp = fromYUV;
for (j = 0; j < height; j += 2)
{
    // Even row: two Y indices and one UV index per pair of pixels
    for (i = 0; i < width; i += 2) {
        *pTemp++ = funcY(i, j) * width + funcX(i, j);
        *pTemp++ = funcY(i+1, j) * width + funcX(i+1, j);
        *pTemp++ = funcY(i, j) / 2 * width + ((int)(funcX(i, j) / 2)) * 2;
    }
    // Odd row: Y indices only
    for (i = 0; i < width; i += 2) {
        *pTemp++ = funcY(i, j+1) * width + funcX(i, j+1);
        *pTemp++ = funcY(i+1, j+1) * width + funcX(i+1, j+1);
    }
}

// Process done at each frame
pTemp = fromYUV;
pTempY = destY;
pTempUV = destUV;
for (j = 0; j < height; j+=2)
{
    for (i = 0; i < width; i+=2) {
        *pTempY++ = srcY[*pTemp++];
        *pTempY++ = srcY[*pTemp++];
        *pTempUV++ = srcUV[*pTemp++];
    }
    for (i = 0; i < width; i+=2) {
        *pTempY++ = srcY[*pTemp++];
        *pTempY++ = srcY[*pTemp++];
    }
}
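
To make that concrete, here is a minimal, self-contained sketch of the same idea. Assumptions of mine, not from the question: the frame is NV12-style (a Y plane plus a byte-interleaved UV plane), width and height are even, and mirrorX/mirrorY are hypothetical stand-ins for funcX/funcY that simply flip the image horizontally. Unlike the fragment above, it copies both bytes of each UV pair, which is what a byte-interleaved chroma plane would need.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical mapping for this sketch only: a horizontal mirror.
   The real funcX/funcY were not shown in the question. */
static int mirrorX(int x, int y, int width) { (void)y; return width - 1 - x; }
static int mirrorY(int x, int y) { (void)x; return y; }

/* Build the index table once. It holds width*height Y indices plus
   width*height/4 UV indices, in the order the frame loop reads them.
   Returns NULL if the allocation fails. */
static uint32_t *build_table(int width, int height)
{
    assert(width > 0 && height > 0 && width % 2 == 0 && height % 2 == 0);
    size_t count = (size_t)width * height * 5 / 4;
    uint32_t *table = malloc(count * sizeof *table);
    if (!table)
        return NULL;

    uint32_t *p = table;
    for (int j = 0; j < height; j += 2) {
        /* Even row: two Y indices, then the index of the UV pair. */
        for (int i = 0; i < width; i += 2) {
            *p++ = (uint32_t)(mirrorY(i, j) * width + mirrorX(i, j, width));
            *p++ = (uint32_t)(mirrorY(i + 1, j) * width + mirrorX(i + 1, j, width));
            *p++ = (uint32_t)(mirrorY(i, j) / 2 * width + mirrorX(i, j, width) / 2 * 2);
        }
        /* Odd row: Y indices only. */
        for (int i = 0; i < width; i += 2) {
            *p++ = (uint32_t)(mirrorY(i, j + 1) * width + mirrorX(i, j + 1, width));
            *p++ = (uint32_t)(mirrorY(i + 1, j + 1) * width + mirrorX(i + 1, j + 1, width));
        }
    }
    return table;
}

/* Per-frame work: one sequential pass over the table and the destination,
   with no branches and no coordinate arithmetic in the loop body. */
static void remap_frame(const uint32_t *table,
                        const uint8_t *srcY, const uint8_t *srcUV,
                        uint8_t *destY, uint8_t *destUV,
                        int width, int height)
{
    const uint32_t *p = table;
    for (int j = 0; j < height; j += 2) {
        for (int i = 0; i < width; i += 2) {
            *destY++ = srcY[*p++];
            *destY++ = srcY[*p++];
            uint32_t uv = *p++;          /* index of the first byte of the pair */
            *destUV++ = srcUV[uv];
            *destUV++ = srcUV[uv + 1];
        }
        for (int i = 0; i < width; i += 2) {
            *destY++ = srcY[*p++];
            *destY++ = srcY[*p++];
        }
    }
}
```

The intended use is to call build_table once at start-up, call remap_frame for every frame, and free the table at shutdown.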

You should avoid (when possible):

accessing multiple memory areas at once
random memory access
if statements within loops
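
As an illustration of the last point, here is a small sketch. It is not the loop from the question (which I can only guess at); it just shows one common shape of the problem: a parity test evaluated for every pixel, versus a loop that steps by two and needs no test at all. Both function names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Keeps the top-left pixel of every 2x2 block.
   The parity test runs width*height times. */
static void decimate_branchy(const uint8_t *src, uint8_t *dst,
                             int width, int height)
{
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            if (x % 2 == 0 && y % 2 == 0)
                *dst++ = src[(size_t)y * width + x];
}

/* Same result, but the loop bounds do the selecting: no branch in the
   loop body, and only a quarter of the iterations. */
static void decimate_strided(const uint8_t *src, uint8_t *dst,
                             int width, int height)
{
    assert(src != NULL && dst != NULL);
    for (int y = 0; y < height; y += 2) {
        const uint8_t *row = src + (size_t)y * width;
        for (int x = 0; x < width; x += 2)
            *dst++ = row[x];
    }
}
```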

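The worst crime you committed is the order of i and j (which you don't even need to begin with).

If you access a pixel at coordinates x and y, it's pixel = image[y][x] and NOT image[x][y]. C arrays are row-major, so the pixels of one row sit next to each other in memory, and x belongs in the inner loop.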
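
A minimal sketch of what that means for a flat buffer, assuming a tightly packed 8-bit image with no row padding (pixel_at and sum_pixels are names invented for this example):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Row-major layout: y selects the row, x the column inside that row. */
static uint8_t pixel_at(const uint8_t *image, int width, int x, int y)
{
    return image[(size_t)y * width + x];
}

/* y in the outer loop, x in the inner loop: each iteration touches the
   address right after the previous one. */
static uint32_t sum_pixels(const uint8_t *image, int width, int height)
{
    assert(image != NULL && width > 0 && height > 0);
    uint32_t sum = 0;
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            sum += pixel_at(image, width, x, y);
    return sum;
}
```

Swapping the two loops would give the same sum, but every step of the inner loop would then jump width bytes ahead instead of one, which is the access pattern to avoid.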