Question

I am trying to write NEON code that fills word arrays as fast as possible on iPhone/iPad. The strange part is that execution appears to jump into a function named __ARCLite__load when a NEON instruction assigns a value to q3. Has anyone seen something like this before?

(test_time_asm.s compiled with Xcode 4.6 and the -no-integrated-as flag)

.section __TEXT,__text,regular
.section __TEXT,__textcoal_nt,coalesced
.section __TEXT,__const_coal,coalesced
.section __TEXT,__picsymbolstub4,symbol_stubs,none,16
.text   
.align 2
.globl _fill_neon_loop1
.private_extern _fill_neon_loop1
_fill_neon_loop1:
  push {r4, r5, r6, r7, lr}
  // r0 = wordPtr
  // r1 = inWord
  // r2 = numWordsToFill
  mov   r2, #1024
  // Load r1 (inWord) into NEON registers
  vdup.32 q0, r1
  vdup.32 q1, r1
  vdup.32 q2, r1
  vdup.32 q3, r1   // Stepping into this instruction jumps into __ARCLite__load

NEONFILL16_loop1:
  vstm r0!, {d0-d7}
  sub r2, r2, #16
  cmp r2, #15
  bgt NEONFILL16_loop1

  mov   r0, #0
  pop {r4, r5, r6, r7, pc}
  .subsections_via_symbols

Single-stepping through the instructions works up until the instruction that assigns to q3. When I step over that instruction, the code seems to jump here:

(gdb) bt
#0  0x0009a568 in __ARCLite__load () at /SourceCache/arclite_iOS/arclite-31/source/arclite.m:529
#1  0x0007b050 in test_time_run_cases () at test_time.h:147

This is really strange, and I am at a loss to understand why assigning to a NEON register would cause this. Does NEON use q3 for something special that I am unaware of?

I also tried loading the registers as dN (64-bit registers), with the same result on the assignment to d7.

  vdup.32 d0, r1
  vdup.32 d1, r1
  vdup.32 d2, r1
  vdup.32 d3, r1
  vdup.32 d4, r1
  vdup.32 d5, r1
  vdup.32 d6, r1
  vdup.32 d7, r1

(later) After messing around with the suggested changes, I found the root cause of the problem. It was this branch label:

NEONFILL16_loop1:
  vstm r0!, {d0-d7}
  sub r2, r2, #16
  cmp r2, #15
  bgt NEONFILL16_loop1

For some reason, the branch label was causing a jump to another location in the code. Replacing the label above with the following fixed the problem:

1:
  vstm r0!, {d0-d7}
  sub r2, r2, #16
  cmp r2, #15
  bgt 1b

This could be a quirk of the assembly parser in the clang delivered with Xcode 4.6, but in any case just changing the label fixed it.
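When comparing the two label variants, it can help to check the filled buffer against a plain C reference. The sketch below is not from the original post: fill_words_ref is a hypothetical name, and it assumes the word count is a multiple of 16, as the hard-coded 1024 in the listing is.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical portable reference for the fill routine.
 * Each outer pass writes 16 words (64 bytes), mirroring one
 * "vstm r0!, {d0-d7}" in the assembly loop.
 * Assumption: num_words is a multiple of 16; a tail shorter
 * than 16 words is left untouched. */
static uint32_t fill_words_ref(uint32_t *dst, uint32_t in_word, size_t num_words)
{
    for (size_t done = 0; done + 16 <= num_words; done += 16) {
        for (size_t i = 0; i < 16; i++) {
            dst[done + i] = in_word;
        }
    }
    return 0; /* the assembly routine also leaves 0 in r0 */
}
```

Filling one buffer with the assembly routine and another with this reference, then comparing them with memcmp, would show whether the loop really ran the expected number of times.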


Solution

q3 has no special role, and it does not need to be preserved across calls. That is not the cause.

I think auselen's guess is right. Looking at the disassembly should make it clear.

Try the version below, though:

.section __TEXT,__text,regular
.section __TEXT,__textcoal_nt,coalesced
.section __TEXT,__const_coal,coalesced
.section __TEXT,__picsymbolstub4,symbol_stubs,none,16
.text   
.align 2
.globl _fill_neon_loop1
.private_extern _fill_neon_loop1
_fill_neon_loop1:
  // r0 = wordPtr
  // r1 = inWord
  // r2 = numWordsToFill
  mov   r2, #1024
  // Load r1 (inWord) into NEON registers
  vdup.32 q0, r1
  vdup.32 q1, r1
  vdup.32 q2, r1
  vdup.32 q3, r1
  subs r2, r2, #16
  bxmi lr

NEONFILL16_loop1:
  vstm r0!, {d0-d7}
  subs r2, r2, #16
  bpl NEONFILL16_loop1

  mov   r0, #0
  bx lr
  .subsections_via_symbols

I removed the unnecessary register preservation (the push/pop) completely, as well as the cmp inside the loop: subs sets the flags itself, so bpl can test the result directly. (You know, I HAVE to optimize everything :))
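As a sanity check that dropping the cmp does not change how many blocks are stored, the two loop shapes can be modeled in C. This is only a sketch with hypothetical function names; it counts vstm executions rather than touching memory, and it models the mi/pl conditions as plain sign tests, which holds as long as the subtraction does not overflow.

```c
/* Models the question's loop: vstm, sub #16, cmp #15, bgt. */
static int stores_cmp_loop(int num_words)
{
    int r2 = num_words;
    int stores = 0;
    do {
        stores++;          /* vstm r0!, {d0-d7} */
        r2 -= 16;          /* sub r2, r2, #16   */
    } while (r2 > 15);     /* cmp r2, #15 ; bgt */
    return stores;
}

/* Models the rewritten loop: subs/bxmi before the loop, then vstm, subs, bpl. */
static int stores_subs_loop(int num_words)
{
    int r2 = num_words - 16;   /* subs r2, r2, #16 */
    int stores = 0;
    if (r2 < 0) {
        return 0;              /* bxmi lr: fewer than 16 words, nothing stored */
    }
    do {
        stores++;              /* vstm r0!, {d0-d7} */
        r2 -= 16;              /* subs r2, r2, #16  */
    } while (r2 >= 0);         /* bpl */
    return stores;
}
```

In this model both versions perform the same number of stores for any count of 16 or more. They differ only below 16 words, where the original loop still stores one block and the rewritten one returns early.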

If auselen's guess is right, this might have changed the tracing timing and stepping into ARClite will occur at a later point.

OTHER TIPS

Almost every time I've jumped somewhere strange in handwritten ARM code it's been because I've fumbled the Thumb interworking and the function has been executing in the wrong mode -- consequently the instruction stream looks like garbage to the CPU, and it jumps about randomly until it hurts itself and falls over.

For all labels which are function entrypoints, you should have this assembly directive:

.type _fill_neon_loop1, %function

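This tells the linker that when it fixes up BL instructions, or when it computes the address of the function, it should make appropriate adjustments to ensure it's executed in the correct mode. Note that .type is an ELF assembler directive; for Mach-O targets such as iOS, the corresponding way to mark a Thumb entry point is the .thumb_func directive.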
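The "appropriate adjustment" is mostly about bit 0 of the branch target: on ARM cores with Thumb support, an interworking branch such as BX or BLX to an address with bit 0 set enters Thumb state, and that bit is not part of the fetch address. The helpers below are a hypothetical illustration of that convention on plain integers, not an API from any toolchain.

```c
#include <stdint.h>

/* Returns 1 if an interworking branch to `target` would select
 * Thumb state (bit 0 set), 0 if it would select ARM state. */
static int targets_thumb(uintptr_t target)
{
    return (int)(target & 1u);
}

/* Returns the address instruction fetch starts from,
 * i.e. the target with the state bit cleared. */
static uintptr_t fetch_address(uintptr_t target)
{
    return target & ~(uintptr_t)1;
}
```

Printing a function pointer from the calling C code and inspecting its low bit this way is one quick check of whether the linker considers the routine ARM or Thumb.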

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow