ARM assembly: auto-increment register on store

https://stackoverflow.com/questions/9102113

21-04-2021
Question

Is it possible to auto-increment the base register on a STR with [Rn]!? I've looked through the documentation but haven't found a definitive answer, mainly because the instruction syntax is presented for LDR and STR together. In theory it should work for both, but I couldn't find any examples of auto-incrementing on a store (loading works fine).

I've made a small program which stores two numbers in a vector. When it's done the contents of out should be {1, 2} but the store overwrites the first byte, as if the auto-increment isn't working.

#include <stdio.h>

int main()
{
        int out[]={0, 0};
        asm volatile (
        "mov    r0, #1          \n\t"
        "str    r0, [%0]!       \n\t"
        "add    r0, r0, #1      \n\t"
        "str    r0, [%0]        \n\t"
        :: "r"(out)
        : "r0" );
        printf("%d %d\n", out[0], out[1]);
        return 0;
}

EDIT: While the answer was right for regular loads and stores, I found that the optimizer messes up auto-increment on vector instructions such as vldm/vstm. For instance, the following program

#include <stdio.h>

int main()
{
        volatile int *in = new int[16];
        volatile int *out = new int[16];

        for (int i=0;i<16;i++) in[i] = i;

        asm volatile (
        "vldm   %0!, {d0-d3}            \n\t"
        "vldm   %0,  {d4-d7}            \n\t"
        "vstm   %1!, {d0-d3}            \n\t"
        "vstm   %1,  {d4-d7}            \n\t"
        :: "r"(in), "r"(out)
        : "memory" );

        for (int i=0;i<16;i++) printf("%d\n", out[i]);
        return 0;
}

compiled with

g++ -O2 -march=armv7-a -mfpu=neon main.cpp -o main

will produce gibberish on the output of the last 8 variables, because the optimizer is keeping the incremented variable and using it for the printf. In other words, out[i] is actually out[i+8], so the first 8 printed values are the last 8 from the vector and the rest are memory locations out of bounds.

I've tried with different combinations of the volatile keyword throughout the code, but the behavior changes only if I compile with the -O0 flag or if I use a volatile vector instead of a pointer and new, like

volatile int out[16];

Solution

For store and load you do this:

ldr r0,[r1],#4
str r0,[r2],#4

Whatever you put at the end, 4 in this case, is added to the base register (r1 in the ldr example, r2 in the str example) after the register has been used for the address but before the instruction completes. It is very much like:

unsigned int a,*b,*c;
...
a = *b++;
*c++ = a;

EDIT: You need to look at the disassembly to see what is going on, if anything. I am using the latest CodeSourcery toolchain, now called Sourcery CodeBench Lite, from Mentor Graphics.

arm-none-linux-gnueabi-gcc (Sourcery CodeBench Lite 2011.09-70) 4.6.1

#include <stdio.h>
int main ()
{
        int out[]={0, 0};
        asm volatile (
        "mov    r0, #1          \n\t"
        "str    r0, [%0], #4       \n\t"
        "add    r0, r0, #1      \n\t"
        "str    r0, [%0]        \n\t"
        :: "r"(out)
        : "r0" );
        printf("%d %d\n", out[0], out[1]);
        return 0;
}


arm-none-linux-gnueabi-gcc str.c -O2  -o str.elf

arm-none-linux-gnueabi-objdump -D str.elf > str.list


00008380 <main>:
    8380:   e92d4010    push    {r4, lr}
    8384:   e3a04000    mov r4, #0
    8388:   e24dd008    sub sp, sp, #8
    838c:   e58d4000    str r4, [sp]
    8390:   e58d4004    str r4, [sp, #4]
    8394:   e1a0300d    mov r3, sp
    8398:   e3a00001    mov r0, #1
    839c:   e4830004    str r0, [r3], #4
    83a0:   e2800001    add r0, r0, #1
    83a4:   e5830000    str r0, [r3]
    83a8:   e59f0014    ldr r0, [pc, #20]   ; 83c4 <main+0x44>
    83ac:   e1a01004    mov r1, r4
    83b0:   e1a02004    mov r2, r4
    83b4:   ebffffe5    bl  8350 <_init+0x20>
    83b8:   e1a00004    mov r0, r4
    83bc:   e28dd008    add sp, sp, #8
    83c0:   e8bd8010    pop {r4, pc}
    83c4:   0000854c    andeq   r8, r0, ip, asr #10

so the

sub sp, sp, #8

is to allocate the two local ints out[0] and out[1]

mov r4,#0
str r4,[sp]
str r4,[sp,#4]

is because they are initialized to zero, then comes the inline assembly

8398:   e3a00001    mov r0, #1
839c:   e4830004    str r0, [r3], #4
83a0:   e2800001    add r0, r0, #1
83a4:   e5830000    str r0, [r3]

and then the printf:

83a8:   e59f0014    ldr r0, [pc, #20]   ; 83c4 <main+0x44>
83ac:   e1a01004    mov r1, r4
83b0:   e1a02004    mov r2, r4
83b4:   ebffffe5    bl  8350 <_init+0x20>

And now it is clear why it didn't work: you didn't declare out as volatile. You gave the compiler no reason to go back to RAM to get the values of out[0] and out[1] for the printf. The compiler knows that r4 holds the value of both out[0] and out[1], and there is so little code in this function that it didn't have to evict r4 and reuse it, so it passed r4 to printf.

If you change it to be volatile

    volatile int out[]={0, 0};

Then you should get the desired result:

83a8:   e59f0014    ldr r0, [pc, #20]   ; 83c4 <main+0x44>
83ac:   e59d1000    ldr r1, [sp]
83b0:   e59d2004    ldr r2, [sp, #4]
83b4:   ebffffe5    bl  8350 <_init+0x20>

The preparation for printf now reads from RAM.
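
Declaring the array volatile is one fix. Another, assuming GCC extended asm, is to describe to the compiler what the asm actually changes: the pointer is a read/write operand, and memory is clobbered. The sketch below is a hedged variant of the program above. The names store_one_two, p and tmp are made up for this example, and the non-ARM branch exists only so the file builds on other hosts.

```c
/* Stores 1 into out[0] and 2 into out[1]. */
void store_one_two(int *out)
{
#if defined(__arm__)
    int *p = out;   /* scratch copy: the asm advances p, out is untouched */
    int tmp;        /* scratch value: the compiler picks the register */
    __asm__ volatile (
        "mov    %1, #1          \n\t"
        "str    %1, [%0], #4    \n\t"   /* store, then p += 4 bytes */
        "add    %1, %1, #1      \n\t"
        "str    %1, [%0]        \n\t"
        : "+r"(p), "=&r"(tmp)   /* p is read and written; tmp is written early */
        :
        : "memory");            /* the asm writes memory behind the compiler's back */
#else
    /* Portable stand-in with the same visible effect. */
    out[0] = 1;
    out[1] = 2;
#endif
}
```

With the "memory" clobber the compiler can no longer assume that out[0] and out[1] still hold the zeros it stored before the asm, so it has to reload them for printf.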

OTHER TIPS

GCC inline assembler requires that all modified registers and non-volatile variables are listed as outputs or clobbers. In the second example GCC may and does assume that the registers allocated to in and out do not change.

A correct approach would be:

out_temp = out;
asm volatile ("..." : "+r"(in), "+r"(out_temp) :: "memory" );
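
Filled in for the vldm/vstm case, that approach might look like the sketch below. It is an assumption-laden example rather than a tested drop-in: copy16, src and dst are names invented here, the d0-d7 clobbers are an addition that was not in the original code, and the memcpy branch is only there so the file builds without NEON.

```c
#include <string.h>

/* Copies 16 ints (64 bytes) from in to out. */
void copy16(const int *in, int *out)
{
#if defined(__arm__) && defined(__ARM_NEON)
    const int *src = in;   /* scratch copies: the asm advances these, */
    int *dst = out;        /* so in and out keep their original values */
    __asm__ volatile (
        "vldm   %0!, {d0-d3}    \n\t"   /* load 32 bytes, src += 32 */
        "vldm   %0,  {d4-d7}    \n\t"   /* load the next 32 bytes */
        "vstm   %1!, {d0-d3}    \n\t"   /* store 32 bytes, dst += 32 */
        "vstm   %1,  {d4-d7}    \n\t"   /* store the next 32 bytes */
        : "+r"(src), "+r"(dst)          /* both pointers are read and written */
        :
        : "memory", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7");
#else
    /* Portable stand-in with the same visible effect. */
    memcpy(out, in, 16 * sizeof(int));
#endif
}
```

Because only the scratch copies are modified, the caller can keep indexing out[0] through out[15] as usual after the call.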

I found this question while searching for the answer to a similar question: how to bind an input/output register. The GCC documentation on inline assembler constraints says that the + modifier marks an operand as both read and written; such operands are listed in the output operand list.

In the example, it seems to me that you would prefer to preserve the original value of the variable out. Nevertheless, if you want to use the post-increment (!) variant of the instructions, I think that you should declare the parameters as read/write. The following worked on my Raspberry Pi 2:

#include <stdio.h>

int main()
{
  int* in = new int[16];            // 16-element array
  volatile int* out = new int[16];  // 16-element array

  for (int i=0; i<16; i++) in[i]=i;

  asm volatile(
    "vldm %0!, {d0-d3}\n\t"
    "vldm %0, {d4-d7}\n\t"
    "vstm %1!, {d0-d3}\n\t"
    "vstm %1, {d4-d7}\n\t"
    :"+r"(in), "+r"(out) :: "memory");

  for (int i=0; i<16; i++) printf("%d\n", out[i-8]);
  return 0;
}

In this way, the semantics of the code is clear to the compiler: both the in and out pointers will be changed (incremented by 8 elements).

Disclaimer: I do not know if the ARM ABI allows a function to freely clobber the NEON registers d0 through d7. In this simple example it probably does not matter.

Licensed under: CC-BY-SA with attribution