Question

I am using an ARM926EJ-S. In a memory-copy test I get about 20% more memory throughput without Linux (running the code as a bare "Getting Started" executable); under Linux the same code runs about 20% slower.

The code is:

 
/// Burst-mode memcpy test: each iteration copies eight words from b to a
/// using load/store-multiple instructions.
void asmcpy(void *a, void *b, int iSize)
{
  do
  {
    asm volatile (
             "ldmia %1!, {r3-r10} \n\t"   /* load 8 words from the source      */
             "stmia %0!, {r3-r10} \n\t"   /* store 8 words to the destination  */
             : "+r"(a), "+r"(b)
             :
             : "r3", "r4", "r5", "r6", "r7", "r8", "r9", "r10", "memory"
             );
  } while (--iSize);
}

I verified that no other process is taking CPU time on Linux (I checked this with the time command; it shows that real time is the same as user time).

Please tell me what the problem with Linux could be.

Thanks & Regards.

ADDED:

My test code is:

int main(void)
{
  int a[320 * 120], b[320 * 120];

  for (int i = 0; i != 10000; i++)
  {
    /// Size is divided by 8 because asmcpy performs 8 integer loads/stores per iteration
    asmcpy(a, b, (320 * 120) / 8);
  }

  return 0;
}

The Getting Started executable is a .bin file that is sent to RAM over the serial port and executed directly by jumping to that address in RAM (without the need for an OS).

ADDED:

I haven't seen such a performance difference on other processors. They were using SDRAM; this processor uses DDR RAM. Could that be the reason?

ADDED: The data cache is not enabled in the Getting Started code, but it is enabled under Linux, so ideally all the data should be cached and accessed without any RAM latency. Still, Linux is 20% slower.

ADDED: My microcontroller is the LPC3250. Both tests were run on the same external DDR RAM.


Solution

This chip has an MMU, so Linux is likely using it to manage memory. Just enabling it may introduce some performance hit. Linux also uses a lazy memory-allocation strategy, only assigning memory pages to a process when it first touches them. If you're copying a big chunk of memory, the MMU will generate page faults inside your loop to ask the kernel to allocate pages. On a low-end processor, all these context switches cause cache flushes and introduce a noticeable slowdown.
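If lazy allocation is what you are measuring, one way to rule it out is to touch every page of both buffers once before the timed loop, so the allocation faults happen up front. A minimal sketch of my own (not from the original post; it assumes POSIX sysconf() is available, and mlock() or a plain memset() of the buffers would work just as well):

#include <stddef.h>
#include <unistd.h>

/* Write one byte per page so the kernel assigns every page before timing. */
static void prefault(volatile char *p, size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    for (size_t off = 0; off < len; off += page)
        p[off] = 0;                /* write fault allocates this page */
    if (len)
        p[len - 1] = 0;            /* make sure the last page is touched too */
}

Calling prefault() on both arrays before the 10000-iteration loop should remove the allocation faults from the measurement.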

If your system is small enough, try an MMU-less version of Linux (like uClinux). Maybe it would let you use a cheaper chip with similar performance. On embedded systems, every penny counts.

Update: some extra details:

Every Linux process gets its own memory mappings. At first these include only the kernel and (maybe) the executable code. All the rest of the linear 4 GB address space (on 32-bit) appears available, but no RAM pages are assigned to it. As soon as you read or write an unallocated memory address, the MMU signals a page fault and switches to the kernel. The kernel sees that it still has lots of free RAM pages, so it picks one, assigns it to the faulting address and returns to your code, which finishes the interrupted instruction. The very next access won't fault because the whole page (typically 4 KB) is already assigned; but a few iterations later you will hit another unassigned address, and the MMU will invoke the kernel again.
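A quick way to observe this from user space (my own sketch, not from the answer; it assumes a Linux environment with getrusage()) is to count minor page faults around the copy:

#include <stdio.h>
#include <sys/resource.h>

/* Minor faults (ru_minflt) are page faults serviced without any I/O,
 * i.e. the kernel simply assigned a fresh RAM page to the process. */
static long minor_faults(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

/* usage around the test loop:
 *   long before = minor_faults();
 *   asmcpy(a, b, (320 * 120) / 8);
 *   printf("minor faults during copy: %ld\n", minor_faults() - before);
 */

A large fault count on the first pass and a near-zero count on later passes would confirm the lazy-allocation explanation.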

OTHER TIPS

How are you performing the timing? There is no timing code in your example.

Are you sure that you are not measuring process load/unload time?

Is the processor clock speed the same in both cases?

If using external SDRAM, are the RAM timings the same in both cases?

Is the data cache enabled in both cases?

Clifford
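As a follow-up to the timing question above, a minimal way to time only the copy loop under Linux (my own sketch, assuming POSIX clock_gettime() with CLOCK_MONOTONIC is available) would be:

#include <stdio.h>
#include <time.h>

/* Monotonic wall-clock time in seconds. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

/* usage:
 *   double t0 = now_seconds();
 *   for (int i = 0; i != 10000; i++)
 *       asmcpy(a, b, (320 * 120) / 8);
 *   printf("copy loop took %.3f s\n", now_seconds() - t0);
 */

Timing only the loop this way excludes process load/unload time from the measurement.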

"Getting Started" is not "just an executable"; there must be some code that sets up the DDR controller registers.

If the cache is also enabled, then so must be the MMU. I think on the ARM926EJ-S you can't have the data cache without the MMU.
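For illustration (my own hedged sketch, not from the answer): on the ARM926EJ-S the caches and MMU are controlled by bits in the CP15 control register, and the D-cache bit only has an effect for regions that the MMU's translation tables mark as cacheable, which is why a bare-metal test without page tables typically runs with the D-cache off.

/* Read-modify-write the CP15 c1 control register on an ARM926EJ-S.
 * Bit 12 = I-cache enable, bit 2 = D-cache enable, bit 0 = MMU enable.
 * Enabling the D-cache is only useful once the MMU is on and the page
 * tables mark the relevant memory as cacheable. */
static void enable_icache(void)
{
    unsigned long cr;
    asm volatile ("mrc p15, 0, %0, c1, c0, 0" : "=r"(cr));   /* read control register  */
    cr |= (1UL << 12);                                        /* set the I-cache bit    */
    asm volatile ("mcr p15, 0, %0, c1, c0, 0" : : "r"(cr));  /* write control register */
}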

I believe every context switch results in a cache flush, because the cache is virtually indexed and virtually tagged, and the kernel and userspace don't share the same address space, so you probably have a lot more unwanted cache flushes than without an OS.

Here is a paper with some analysis of the cost of VIVT cache flushes when running Linux.

What microcontroller (not just what ARM CPU) are you using?

Is it possible that in the non-Linux run the array you're testing is in RAM on the microcontroller itself, while in the Linux test it is in external RAM? Internal RAM is usually accessed much faster than external RAM; this might account for the Linux test being slower, even if data caching is enabled only for the Linux run.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow