C versus vDSP versus NEON - How could NEON be as slow as C?

Question 1

First, this is not "NEON" per-se. This is intrinsics. It is almost impossible to get good NEON performance using intrinsics under clang or gcc. If you think you need intrinsics, you should hand-write the assembler.

vDSP is not "better optimized" than NEON. vDSP on iOS uses the NEON processor. vDSP's use of the NEON is much better optimized than your use of the NEON.

I haven't dug through your intrinsics code yet, but the most likely (in fact almost certain) cause of trouble is that you're creating wait states. Writing in assembler (and intrinsics are just assembler written with welding gloves on), is nothing like writing in C. You don't loop the same. You don't compare the same. You need a new way of thinking. In assembly you can do more than one thing at a time (because you have different logic units), but you absolutely have to schedule things in such a way that all those things can run in parallel. Good assembly keeps all those pipelines full. If you can read your code and it makes perfect sense, it's probably crap assembly code. If you never repeat yourself, it's probably crap assembly code. You need to carefully consider what is going into what register and at how many cycles there are until you're allowed to read it.

If it were as easy as transliterating C, then the compiler would do that for you. The moment you say "I'm going to write this in NEON" you're saying "I think I can write better NEON than the compiler," because the compiler uses it too. That said, it often is possible to write better NEON than the compiler (particularly gcc and clang).

If you're ready to go diving into that world (and it's a pretty cool world), you have some reading ahead of you. Here's some places I recommend:

http://www.coranac.com/tonc/text/asm.htm (You really want to spend some time with this one)
http://hilbert-space.de/ (The whole site. He stopped writting way too soon.)
- To your specific question, he explains it here: http://hilbert-space.de/?p=22
Some more on your specific question: Arm Neon Intrinsics vs hand assembly
http://wanderingcoder.net/2010/07/19/ought-arm/
http://wanderingcoder.net/2010/06/02/intro-neon/
You may also be interested in:
- http://robnapier.net/blog/fast-bezier-intro-701
- http://robnapier.net/blog/faster-bezier-722

ALL THAT SAID... Always always always start by reconsidering your algorithm. Often the answer is not how to make your loop calculate quickly, it's how to not call the loop so often.

Question 2

ARM NEON has 32 registers, 64-bits wide (dual view as 16 registers, 128-bits wide). Your neon implementation already uses at least 18 128-bits wide, so compiler would generate code to move them back and forth from the stack, and that's not good - too much extra memory access.

If you plan to play with assembly, I found it best to use a tool to dump instructions in object files. One is called objdump in Linux, I believe it is called otool in Apple world. This way you can actually see how the resulting machine code looks like, and what did compiler do with your functions.

Below is some part of your neon implementation's dump from gcc (-O3) 4.7.1. You can notice loading a quad register via vldmia sp, {d8-d9}.

1a6:    ff24 cee8   vcgt.f32    q6, q10, q12
1aa:    ff64 4ec8   vcgt.f32    q10, q10, q4
1ae:    ff2e a1dc   vbit    q5, q15, q6
1b2:    ff22 ceea   vcgt.f32    q6, q9, q13
1b6:    ff5c 41da   vbsl    q10, q14, q5
1ba:    ff20 aeea   vcgt.f32    q5, q8, q13
1be:    f942 4a8d   vst1.32 {d20-d21}, [r2]!
1c2:    ec9d 8b04   vldmia  sp, {d8-d9}
1c6:    ff62 4ee8   vcgt.f32    q10, q9, q12
1ca:    f942 6a8f   vst1.32 {d22-d23}, [r2]

Of course this all depends on the compiler, a better compiler can avoid this situation by using available registers more clearly.

So at the end you are at the mercy of the compiler if you don't use assembly (inline, standalone) or should continuously check compiler output until you get what you want from it.

Question 3

As a complement to Rob's answer, that writing NEON is an art in and of itself (thanks for plugging my Wandering Coder posts, by the way), and auselen's answer (that you are indeed having too many registers live at any given time, leading to spilling), I should add that your intrinsics algorithm is more general than the other two: it allows arbitrary ranges, not just multiples, so you are attempting to compare things that are not comparable. Always compare oranges to oranges; with the exception however that it is fair game to compare a custom algorithm more specific than an off-the-shelf generic one if you only need the specific features of the custom one. So that is another way a NEON algorithm can be as slow as a C one: if they are not the same algorithm.

As for your histogramming needs, use what you've constructed with vDSP for the time being and only if the performance is not satisfying for your application, only then, investigate optimizing another way; avenues to do so, besides using NEON instructions, would include avoiding so much memory movement (likely the bottleneck in the vDSP implementation), and incrementing the counters for each bucket as you are browsing the pixels, instead of having this intermediate output made of coerced values. Efficient DSP code is not just about the calculations themselves, but also how to most efficiently use memory bandwidth and so forth. Even more so on mobile: memory I/O, even to the caches, is more power-intensive than in-processor-core operations, so both the memory I/O buses tend to be run at a lower fraction of the processor clock speed, so you don't have that much memory bandwidth to play with, and you should wisely use the memory bandwidth you do have, as any use of it takes power.