Convert C-code to ARM Cortex M3 Assembler Code

Question 1

I see just 3 fairly simple problems there:

BE _next          ; if statement by "branch"-cmd
...
sub R0, R0, #1    ; loop counting
BLPL _for_loop    ; pl = if positive or zero

BEQ, not BE - condition codes are always 2 letters.
SUB alone won't update the flags - you need the suffix to say so i.e. SUBS.
BLPL would branch and link, thus overwriting your return address - you want BPL. Actually, BLPL wouldn't assemble here anyway, since in Thumb a conditional BL would need an IT to set it up (unless of course your assembler is clever enough to insert one automatically).

Edit: there's also of course a more general issue with the use of R4 in both the original code and my examples below - if you're interfacing with C code the original value must be preserved across the function call and restored afterwards (R0-R3 are designated argument/scratch registers and can be freely modified). If you're in pure assembly however you don't necessarily need to follow a standard calling convention so can be more flexible.

Now, that's a very literal representation of the C code, and doesn't make best use of the instruction set - in particular the indexed addressing modes. One of the attractions of assembly programming is having complete control of the instructions, so how can we make it worth our while?

First, let's make the C code look a little more like the assembly we want:

int main_compare (int nbytes, char *pmem1, char *pmem2){
    while(nbytes-- > 0) {    
        if(*pmem1++ != *pmem2++) {
            return 0;
        }
    }
    return 1;
}
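A quick sanity check for a rewrite like this is to run both forms side by side. The sketch below assumes the more literal version walked an index down from nbytes-1; that is a guess, and the names compare_indexed, compare_walk and forms_agree are made up purely for illustration. compare_walk is the rewritten function above under a different name so the two can coexist.

```c
/* Pointer-walking form, same logic as the rewritten C above. */
int compare_walk(int nbytes, const char *pmem1, const char *pmem2) {
    while (nbytes-- > 0) {
        if (*pmem1++ != *pmem2++) {
            return 0;
        }
    }
    return 1;
}

/* ASSUMPTION: a hypothetical index-based form, counting down from nbytes-1. */
int compare_indexed(int nbytes, const char *pmem1, const char *pmem2) {
    for (nbytes--; nbytes >= 0; nbytes--) {
        if (pmem1[nbytes] != pmem2[nbytes]) {
            return 0;
        }
    }
    return 1;
}

/* Returns 1 if both forms give the same answer for every length
   from -1 up to max_len (both buffers must hold max_len bytes). */
int forms_agree(const char *a, const char *b, int max_len) {
    for (int n = -1; n <= max_len; n++) {
        if (compare_walk(n, a, b) != compare_indexed(n, a, b)) {
            return 0;
        }
    }
    return 1;
}
```

The two forms visit the bytes in opposite order, but since the result only says whether any byte differs, they should agree for every length, including zero and negative counts.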

Now that that shows our intent more clearly, let's play compiler:

byte_cmp_loop PROC
; assuming: r0 = nbytes, r1=pmem1, r2 = pmem2

_loop:
    SUBS R0, R0, #1   ; Decrement nbytes and set flags based on the result
    BMI  _finished    ; If nbytes is now negative, it was 0, so we're done

    LDRB R3, [R1], #1 ; Load from the address in R1, then add 1 to R1
    LDRB R4, [R2], #1 ; ditto for R2
    CMP R3, R4        ; If they match...
    BEQ _loop         ; then continue round the loop

    MOV R0, #0        ; else give up and return zero
    BX LR

_finished:
    MOV R0, #1        ; Success!
    BX LR
ENDP

And that's nearly 25% fewer instructions! Now if we pull in another instruction set feature - conditional execution - and relax the requirements slightly, without breaking C semantics, it gets smaller still:

byte_cmp_loop PROC
; assuming: r0 = nbytes, r1=pmem1, r2 = pmem2

_loop:
    SUBS R0, R0, #1 ; In C zero is false and any nonzero value is true, so
                    ; when R0 becomes -1 to trigger this branch, we can just
                    ; return that to indicate success
    IT MI           ; Make the following instruction conditional on 'minus'
    BXMI LR

    LDRB R3, [R1], #1
    LDRB R4, [R2], #1
    CMP R3, R4
    BEQ _loop

    MOVS R0, #0     ; Using MOVS rather than MOV to get a 16-bit encoding,
                    ; since updating the flags won't matter at this point
    BX LR
ENDP

assembling to a meagre 22 bytes, that's nearly 40% less code than we started with :D
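To make the relaxed return value concrete, here is a rough C model of the shortened routine. The names byte_cmp_relaxed and buffers_match are hypothetical, and the model assumes nbytes is not INT_MIN so the decrement cannot overflow.

```c
/* C model of the shortened routine: on success it returns the decremented
   counter (-1 when called with nbytes >= 0) instead of exactly 1. */
int byte_cmp_relaxed(int nbytes, const char *pmem1, const char *pmem2) {
    for (;;) {
        nbytes -= 1;                    /* SUBS R0, R0, #1 */
        if (nbytes < 0) {
            return nbytes;              /* IT MI / BXMI LR: nonzero, so "true" */
        }
        if (*pmem1++ != *pmem2++) {     /* LDRB, LDRB, CMP */
            return 0;                   /* MOVS R0, #0 / BX LR */
        }
    }
}

/* A caller that only tests for zero/nonzero keeps working unchanged. */
int buffers_match(int nbytes, const char *a, const char *b) {
    return byte_cmp_relaxed(nbytes, a, b) != 0;
}
```

The catch is that any caller written as `if (result == 1)` would break, so this trick is only safe when every caller treats the result as a plain truth value.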

Question 2

Well, here is some compiler-generated code:

arm-none-eabi-gcc -O2 -mthumb -c test.c -o test.o
arm-none-eabi-objdump -D test.o

00000000 <main_compare>:
   0:   b510        push    {r4, lr}
   2:   3801        subs    r0, #1
   4:   d502        bpl.n   c <main_compare+0xc>
   6:   e007        b.n 18 <main_compare+0x18>
   8:   3801        subs    r0, #1
   a:   d305        bcc.n   18 <main_compare+0x18>
   c:   5c0c        ldrb    r4, [r1, r0]
   e:   5c13        ldrb    r3, [r2, r0]
  10:   429c        cmp r4, r3
  12:   d0f9        beq.n   8 <main_compare+0x8>
  14:   2000        movs    r0, #0
  16:   e000        b.n 1a <main_compare+0x1a>
  18:   2001        movs    r0, #1
  1a:   bc10        pop {r4}
  1c:   bc02        pop {r1}
  1e:   4708        bx  r1

arm-none-eabi-gcc -O2 -mthumb -mcpu=cortex-m3 -c test.c -o test.o
arm-none-eabi-objdump -D test.o

00000000 <main_compare>:
   0:   3801        subs    r0, #1
   2:   b410        push    {r4}
   4:   d503        bpl.n   e <main_compare+0xe>
   6:   e00a        b.n 1e <main_compare+0x1e>
   8:   f110 30ff   adds.w  r0, r0, #4294967295 ; 0xffffffff
   c:   d307        bcc.n   1e <main_compare+0x1e>
   e:   5c0c        ldrb    r4, [r1, r0]
  10:   5c13        ldrb    r3, [r2, r0]
  12:   429c        cmp r4, r3
  14:   d0f8        beq.n   8 <main_compare+0x8>
  16:   2000        movs    r0, #0
  18:   f85d 4b04   ldr.w   r4, [sp], #4
  1c:   4770        bx  lr
  1e:   2001        movs    r0, #1
  20:   f85d 4b04   ldr.w   r4, [sp], #4
  24:   4770        bx  lr
  26:   bf00        nop

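It is funny that the Thumb-2 extensions don't really seem to make this better, and possibly make it worse: the Cortex-M3 build comes to 38 bytes before the padding nop, against 32 bytes for the plain Thumb build.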

If you don't have a compiler, does that mean you don't have an assembler and linker either? Without an assembler and linker it is going to be a lot of work hand-compiling and hand-assembling to machine code. Then how are you going to load this into a processor, etc.?

If you don't have a cross compiler for ARM, do you have a compiler at all? You need to tell us more about what you do and don't have. If you have a web browser that you used to find stackoverflow and post questions, you can probably download the Code Sourcery tools or the https://launchpad.net/gcc-arm-embedded tools and have a compiler, assembler and linker (and not have to hand-convert from C to asm).

As far as your code goes, the subtract of 1 is correct for the nbytes--, but you failed to compare that nbytes value with zero to see whether there is anything to do at all.

in pseudo code

if nbytes <= 0 return 1
nbytes--;
add pmem1+nbytes
load [pmem1+nbytes]
add pmem2+nbytes
load [pmem2+nbytes]
subtract
compare with zero
and so on

You went straight to the nbytes-- without doing the initial check on nbytes first.

The mnemonic for branch-if-equal is BEQ, not BE, and you want BPL instead of BLPL. Fix those, add an unconditional branch to _next at the very beginning, and I think that is it; you have it coded.

byte_cmp_loop PROC
; assuming: r0 = nbytes, r1=pmem1, r2 = pmem2

    B _next

_for_loop: 
    ADD R3, R1, R0    ;
    ADD R4, R2, R0    ; calculate pmem + n
    LDRB R3, [R3]     ;
    LDRB R4, [R4]     ; look at this address

    CMP R3, R4        ; if cmp = 0, then jump over return

    BEQ _next          ; if statement by "branch"-cmd
        MOV R0, #0    ; return value is zero
        BX LR         ; always return 0 here
_next:

    SUBS R0, R0, #1   ; loop counting (SUBS, not SUB, so BPL sees updated flags)
    BPL _for_loop    ; pl = if positive or zero

    MOV R0, #1        ;
    BX LR             ; always return 1 here

ENDP
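If it helps to see why the entry branch to _next matters, the control flow of the routine above can be mirrored in C with gotos. This is only a sketch: compare_model is a made-up name, and the comments map each line back to the instructions it stands for.

```c
/* C model of the assembly routine; labels match the assembly labels. */
int compare_model(int nbytes, const char *pmem1, const char *pmem2) {
    goto next;                              /* B _next: decrement before the first load */
for_loop:
    if (pmem1[nbytes] != pmem2[nbytes]) {   /* ADD, ADD, LDRB, LDRB, CMP */
        return 0;                           /* MOV R0, #0 / BX LR */
    }
next:
    nbytes -= 1;                            /* flag-setting subtract of 1 */
    if (nbytes >= 0) {
        goto for_loop;                      /* BPL _for_loop */
    }
    return 1;                               /* MOV R0, #1 / BX LR */
}
```

Without the initial jump, the first pass would read pmem1[nbytes] and pmem2[nbytes], one byte past the end of each buffer, and a count of zero would still trigger a load.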