Question

Consider the following two pieces of code. The first is the C version:

void __attribute__((noinline)) proj(uint8_t * line, uint16_t length)
{
    uint16_t i;
    int16_t tmp;
    for(i=HPHD_MARGIN; i<length-HPHD_MARGIN; i++) {
        tmp = line[i-3] - 4*line[i-2] + 5*line[i-1] - 5*line[i+1] + 4*line[i+2] - line[i+3];
        hphd_temp[i]=ABS(tmp);
    }
}

The second is the same function (except for the border handling) using NEON intrinsics:

void __attribute__((noinline)) proj_neon(uint8_t * line, uint16_t length)
{
    int i;
    uint8x8_t b0b7, b8b15, p1p8,p2p9,p4p11,p5p12,p6p13, m4, m5;
    uint16x8_t result;

    m4 = vdup_n_u8(4);
    m5 = vdup_n_u8(5);
    b0b7 = vld1_u8(line);
    for(i = 0; i < length - 16; i+=8) {
        b8b15 = vld1_u8(line + i + 8);
        p1p8  = vext_u8(b0b7,b8b15, 1);
        p2p9  = vext_u8(b0b7,b8b15, 2);
        p4p11 = vext_u8(b0b7,b8b15, 4);
        p5p12 = vext_u8(b0b7,b8b15, 5);
        p6p13 = vext_u8(b0b7,b8b15, 6);

        result = vsubl_u8(b0b7, p6p13);        // p[-3] - p[+3]
        result = vmlal_u8(result, p2p9, m5);   // +5 * p[-1]
        result = vmlal_u8(result, p5p12, m4);  // +4 * p[+2]
        result = vmlsl_u8(result, p1p8, m4);   // -4 * p[-2]
        result = vmlsl_u8(result, p4p11, m5);  // -5 * p[+1]
        vst1q_s16(hphd_temp + i + 3, vabsq_s16(vreinterpretq_s16_u16(result)));
        b0b7 = b8b15;
    }
    /* todo : remaining pixels (see the scalar tail sketch below) */

}
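
The remaining pixels would be handled with the scalar expression from the C version, along these lines (only a sketch; last_vec stands for the first index the vector loop above did not write, and HPHD_MARGIN, ABS and hphd_temp are the same as in the C version):

void proj_neon_tail(uint8_t * line, uint16_t length, int last_vec)
{
    int i;
    int16_t tmp;
    /* Same per-pixel expression as the C version, applied only to the
       indices the vector loop did not cover. */
    for (i = last_vec; i < length - HPHD_MARGIN; i++) {
        tmp = line[i-3] - 4*line[i-2] + 5*line[i-1]
            - 5*line[i+1] + 4*line[i+2] - line[i+3];
        hphd_temp[i] = ABS(tmp);
    }
}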

I am disappointed by the performance gain: it is only around 10-15%. If I look at the generated assembly:

  • The C version compiles to a 108-instruction loop.
  • The NEON version compiles to a 72-instruction loop.

But one iteration of the NEON loop computes 8 times as much data as an iteration of the C loop, roughly 72/8 ≈ 9 instructions per output pixel versus 108, so a dramatic improvement should be seen.

Do you have any explanation for the small difference between the two versions?

Additional details: the test data is a 10 Mpix image, and the computation time is around 2 seconds for the C version.

CPU: ARM Cortex-A8


Solution

I'm going to take a wild guess and say that data caching is the reason you don't see the big performance gain you are expecting. This kernel does very little arithmetic per byte it loads, so once the data is no longer served from the cache the loop becomes limited by memory bandwidth rather than by instruction count, and cutting the per-pixel instruction count with NEON buys little. Whatever the exact cache configuration of your chip, data that spans cache lines, poor alignment, or other work running on the CPU at the same time (interrupts, threads, etc.) can also muddy your results.
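
If memory traffic is indeed the bottleneck, an explicit prefetch a little way ahead of the current loads is worth trying, so the data is already on its way into the cache by the time the NEON loads need it. Below is a minimal sketch of what that could look like. It assumes GCC or Clang (__builtin_prefetch maps to a PLD hint on ARM), it reuses the names from the question (hphd_temp is assumed to be an int16_t buffer), and proj_neon_pld and the 64-byte prefetch distance are my own inventions that would need tuning on the actual Cortex-A8:

#include <stdint.h>
#include <arm_neon.h>

extern int16_t hphd_temp[];   /* assumed: same output buffer as in the question */

void __attribute__((noinline)) proj_neon_pld(uint8_t * line, uint16_t length)
{
    int i;
    uint8x8_t b0b7, b8b15, p1p8, p2p9, p4p11, p5p12, p6p13;
    uint8x8_t m4 = vdup_n_u8(4);
    uint8x8_t m5 = vdup_n_u8(5);
    uint16x8_t result;

    b0b7 = vld1_u8(line);
    for (i = 0; i < length - 16; i += 8) {
        /* Start pulling in data we will need a few iterations from now.
           PLD is only a hint, so prefetching past the end of the buffer
           is harmless; the 64-byte distance is a guess to be tuned. */
        __builtin_prefetch(line + i + 64);

        b8b15 = vld1_u8(line + i + 8);
        p1p8  = vext_u8(b0b7, b8b15, 1);
        p2p9  = vext_u8(b0b7, b8b15, 2);
        p4p11 = vext_u8(b0b7, b8b15, 4);
        p5p12 = vext_u8(b0b7, b8b15, 5);
        p6p13 = vext_u8(b0b7, b8b15, 6);

        result = vsubl_u8(b0b7, p6p13);        /* p[-3] - p[+3] */
        result = vmlal_u8(result, p2p9, m5);   /* +5 * p[-1] */
        result = vmlal_u8(result, p5p12, m4);  /* +4 * p[+2] */
        result = vmlsl_u8(result, p1p8, m4);   /* -4 * p[-2] */
        result = vmlsl_u8(result, p4p11, m5);  /* -5 * p[+1] */
        vst1q_s16(hphd_temp + i + 3, vabsq_s16(vreinterpretq_s16_u16(result)));
        b0b7 = b8b15;
    }
}

If this helps, the prefetch distance is worth sweeping (one versus two cache lines ahead), since the optimum depends on the memory latency of the particular board.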

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow