Question

In my embedded systems class, we were asked to re-code the given C-function AbsVal into ARM Assembly. We were told that the best we could do was 3-lines. I was determined to find a 2-line solution and eventually did, but the question I have now is whether I actually decreased performance or increased it.

The C-code:

unsigned long absval(signed long x){
    unsigned long int signext;
    signext = (x >= 0) ? 0 : -1; //This can be done with an ASR instruction
    return (x + signet) ^ signext;
}

The TA/Professor's 3-line solution

ASR R1, R0, #31         ; R1 <- (x >= 0) ? 0 : -1
ADD R0, R0, R1          ; R0 <- R0 + R1
EOR R0, R0, R1          ; R0 <- R0 ^ R1

My 2-line solution

ADD R1, R0, R0, ASR #31 ; R1 <- x  + (x >= 0) ? 0 : -1
EOR R0, R1, R0, ASR #31 ; R0 <- R1 ^ (x >= 0) ? 0 : -1

There are a couple of places I can see potential performance differences:

  1. The addition of one extra Arithmetic Shift Right call
  2. The removal of one memory fetch

So, which one is actually faster? Does it depend upon the processor or memory access speed?

Was it helpful?

Solution 2

Dive over to ARM.com and grab the Cortex-M3 datasheet. Section 3.3.1 on page 3-4 has the instruction timings. Fortunately they're quite straightforward on the Cortex-M3.

We can see from those timings that in a perfect 'no wait state' system your professor's example takes 3 cycles:

ASR R1, R0, #31         ; 1 cycle
ADD R0, R0, R1          ; 1 cycle
EOR R0, R0, R1          ; 1 cycle
                        ; total: 3 cycles

and your version takes two cycles:

ADD R1, R0, R0, ASR #31 ; 1 cycle
EOR R0, R1, R0, ASR #31 ; 1 cycle
                        ; total: 2 cycles

So yours is, theoretically, faster.

You mention "The removal of one memory fetch", but is that true? How big are the respective routines? Since we're dealing with Thumb-2 we have a mix of 16-bit and 32-bit instructions available. Let's see how they assemble:

Their version (adjusted for UAL syntax):

    .syntax unified
    .text
    .thumb
abs:
    asrs r1, r0, #31
    adds r0, r0, r1
    eors r0, r0, r1

Assembles to:

00000000        17c1    asrs    r1, r0, #31
00000002        1840    adds    r0, r0, r1
00000004        4048    eors    r0, r1

That's 3x2 = 6 bytes.

Your version (again, adjusted for UAL syntax):

    .syntax unified
    .text
    .thumb
abs:
    add.w r1, r0, r0, asr #31
    eor.w r0, r1, r0, asr #31

Assembles to:

00000000    eb0071e0    add.w   r1, r0, r0, asr #31
00000004    ea8170e0    eor.w   r0, r1, r0, asr #31

That's 2x4 = 8 bytes.

So instead of removing a memory fetch you've actually increased the size of the code.

But does this affect performance? My advice would be to benchmark.

OTHER TIPS

Here is a nother two instruction version:

    cmp     r0, #0
    rsblt   r0, r0, #0

Which translate to the simple code:

  if (r0 < 0)
  {
    r0 = 0-r0;
  }

That code should be pretty fast, even on modern ARM-CPU cores like the Cortex-A8 and A9.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top