ARM Assembly: Absolute Value Function: Are two or three lines faster?

Question 1

Head over to ARM.com and grab the Cortex-M3 Technical Reference Manual. Section 3.3.1 on page 3-4 lists the instruction timings. Fortunately, they're quite straightforward on the Cortex-M3.

We can see from those timings that in a perfect 'no wait state' system your professor's example takes 3 cycles:

    ASR R1, R0, #31         ; 1 cycle
    ADD R0, R0, R1          ; 1 cycle
    EOR R0, R0, R1          ; 1 cycle
                            ; total: 3 cycles

and your version takes two cycles:

    ADD R1, R0, R0, ASR #31 ; 1 cycle
    EOR R0, R1, R0, ASR #31 ; 1 cycle
                            ; total: 2 cycles

So yours is, theoretically, faster.
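Both sequences compute the same identity: with mask = x >> 31 (an arithmetic shift, so mask is 0 for non-negative x and all ones for negative x), the absolute value is (x + mask) ^ mask. As a rough illustration only, here is a minimal C sketch of that identity. The function name abs_branchless is made up for this example, and the code assumes that >> on a negative int32_t is an arithmetic shift, which C leaves implementation-defined (GCC and Clang targeting ARM behave this way). The arithmetic is done on unsigned values so that INT32_MIN wraps instead of overflowing.

```c
#include <stdint.h>

/* Hypothetical helper mirroring the ASR / ADD / EOR sequence.
 * Assumption: x >> 31 is an arithmetic shift for negative x
 * (implementation-defined in C, true for GCC/Clang on ARM). */
static uint32_t abs_branchless(int32_t x)
{
    /* mask is 0x00000000 for x >= 0 and 0xFFFFFFFF for x < 0 */
    uint32_t mask = (uint32_t)(x >> 31);

    /* Adding all ones subtracts 1 (mod 2^32); XOR with all ones then
     * inverts the bits, giving the two's-complement negation. With a
     * zero mask both steps leave x unchanged. */
    return ((uint32_t)x + mask) ^ mask;
}
```

The result is returned as uint32_t because the magnitude of INT32_MIN (0x80000000) does not fit in an int32_t; the assembly versions leave that same bit pattern in r0.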

You mention "The removal of one memory fetch", but is that true? How big are the respective routines? Since we're dealing with Thumb-2, we have a mix of 16-bit and 32-bit instructions available. Let's see how they assemble:

Your professor's version (adjusted for UAL syntax):

    .syntax unified
    .text
    .thumb
abs:
    asrs r1, r0, #31
    adds r0, r0, r1
    eors r0, r0, r1

Assembles to:

    00000000        17c1    asrs    r1, r0, #31
    00000002        1840    adds    r0, r0, r1
    00000004        4048    eors    r0, r1

That's 3x2 = 6 bytes.

Your version (again, adjusted for UAL syntax):

    .syntax unified
    .text
    .thumb
abs:
    add.w r1, r0, r0, asr #31
    eor.w r0, r1, r0, asr #31

Assembles to:

    00000000    eb0071e0    add.w   r1, r0, r0, asr #31
    00000004    ea8170e0    eor.w   r0, r1, r0, asr #31

That's 2x4 = 8 bytes.

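So instead of removing a memory fetch, you've actually increased the size of the code from 6 bytes to 8 bytes.

But does this affect performance? That depends on the memory system (for example, flash wait states and any prefetch buffering), so my advice would be to benchmark on the actual target.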

Question 2

Here is another two-instruction version:

    cmp     r0, #0
    rsblt   r0, r0, #0

which translates to the simple code:

    if (r0 < 0)
    {
        r0 = 0 - r0;
    }

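That code should be pretty fast, even on modern ARM CPU cores like the Cortex-A8 and A9. Note that it is written for ARM mode: in Thumb-2 (the only instruction set on the Cortex-M3) the conditional rsblt has to sit inside an IT block, so an IT LT instruction must precede it.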
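For checking results on a host machine, the compare-and-negate version can be sketched in C as well. This is only an illustration under stated assumptions: the name abs_conditional is hypothetical, and the negation is done on an unsigned copy so that INT32_MIN wraps to 0x80000000 (matching what the reverse subtract leaves in r0) rather than triggering signed overflow.

```c
#include <stdint.h>

/* Hypothetical helper mirroring: cmp r0, #0 / rsblt r0, r0, #0 */
static uint32_t abs_conditional(int32_t x)
{
    uint32_t u = (uint32_t)x;

    if (x < 0) {
        /* Reverse subtract from zero; unsigned arithmetic wraps
         * modulo 2^32, so INT32_MIN yields 0x80000000. */
        u = 0u - u;
    }
    return u;
}
```

Whether a compiler turns this into a conditional instruction or a branch depends on the target and optimisation level, so the generated assembly is worth inspecting before drawing timing conclusions.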