Question

I want to capture every frame of a video and modify it before it is rendered on an Android device, such as a Nexus 10. As far as I know, Android uses hardware to decode and render the frame on such devices, so I should get the frame data from the GraphicBuffer, where it will be in YUV format before rendering.

To do this, I wrote a static method in AwesomePlayer.cpp that captures the frame data, modifies the frame, and writes it back into the GraphicBuffer for rendering.

Here is my demo code:

// NB: assumes the vendor's gralloc layout (NV12-style: Y plane reachable
// via lock(), interleaved UV plane behind the second fd of the private handle)
static void handleFrame(MediaBuffer *buffer) {

    sp<GraphicBuffer> buf = buffer->graphicBuffer();

    size_t width = buf->getWidth();
    size_t height = buf->getHeight();
    size_t ySize = buffer->range_length();
    size_t uvSize = width * height / 2;

    uint8_t *yBuffer = (uint8_t *)malloc(ySize);
    uint8_t *uvBuffer = (uint8_t *)malloc(uvSize);
    memset(yBuffer, 0, ySize);
    memset(uvBuffer, 0, uvSize);

    int const *private_handle = buf->handle->data;

    void *yAddr = NULL;
    void *uvAddr = NULL;

    buf->lock(GRALLOC_USAGE_SW_READ_OFTEN | GRALLOC_USAGE_SW_WRITE_OFTEN, &yAddr);
    // mmap takes six arguments; the final offset argument was missing
    uvAddr = mmap(NULL, uvSize, PROT_READ | PROT_WRITE, MAP_SHARED,
                  *(private_handle + 1), 0);

    // on failure mmap returns MAP_FAILED, not NULL
    if (yAddr != NULL && uvAddr != MAP_FAILED) {

        // copy the data out of the graphic buffer
        memcpy(yBuffer, yAddr, ySize);
        memcpy(uvBuffer, uvAddr, uvSize);

        // ... modify the YUV data here ...

        // copy the data back into the graphic buffer
        memcpy(yAddr, yBuffer, ySize);
        memcpy(uvAddr, uvBuffer, uvSize);
    }

    if (uvAddr != MAP_FAILED)
        munmap(uvAddr, uvSize);
    buf->unlock();

    free(yBuffer);
    free(uvBuffer);
}

I logged timestamps around the memcpy calls and found that copying from the GraphicBuffer takes much longer than copying data into it. For a 1920x1080 video, for example, the memcpy from the GraphicBuffer takes about 30 ms, which is unacceptable for normal video playback.

I have no idea why it takes so long. Perhaps it is copying data from a GPU buffer, but copying data into the GraphicBuffer performs normally.

Could anyone who is familiar with hardware decoding on Android take a look at this issue? Thanks very much.

Update: I found that I didn't have to use a GraphicBuffer to get the YUV data. I simply hardware-decoded the video source and stored the YUV data in ordinary memory, so I could read the YUV data from memory directly, which is very fast. You can find a similar solution in the AOSP source code or in open-source video player apps: I just allocate memory buffers rather than graphic buffers, and then use the hardware decoder. Sample code in AOSP: frameworks/av/cmds/stagefright/SimplePlayer.cpp
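On modern Android the same idea is exposed through the public NDK MediaCodec API rather than stagefright internals. A rough, Android-only sketch (not the SimplePlayer.cpp code itself; the function name and control flow are illustrative): configuring the decoder with a null surface makes it deliver frames into CPU-accessible output buffers.

```cpp
#include <media/NdkMediaCodec.h>

// Sketch: decode into client-side buffers instead of a Surface/GraphicBuffer.
// Passing a null ANativeWindow to AMediaCodec_configure() makes the codec
// return frames through AMediaCodec_getOutputBuffer(), i.e. plain memory.
void decodeToMemory(AMediaFormat *format, const char *mime) {
    AMediaCodec *codec = AMediaCodec_createDecoderByType(mime);
    AMediaCodec_configure(codec, format, /*surface=*/NULL, /*crypto=*/NULL, 0);
    AMediaCodec_start(codec);

    // ... feed encoded input buffers from an extractor here ...

    AMediaCodecBufferInfo info;
    ssize_t idx = AMediaCodec_dequeueOutputBuffer(codec, &info, /*timeoutUs=*/10000);
    if (idx >= 0) {
        size_t size;
        uint8_t *yuv = AMediaCodec_getOutputBuffer(codec, idx, &size);
        // `yuv` now points at the decoded YUV frame in ordinary memory;
        // read or modify it here, then release the buffer without rendering.
        AMediaCodec_releaseOutputBuffer(codec, idx, /*render=*/false);
    }

    AMediaCodec_stop(codec);
    AMediaCodec_delete(codec);
}
```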

link: https://github.com/xdtianyu/android-4.2_r1/tree/master/frameworks/av/cmds/stagefright


Solution

Most likely, the data path (a.k.a. data bus) from your CPU to the graphics memory is optimized, while the path from graphics memory back to the CPU is not. Optimizations may include a faster internal data bus, level 1 or 2 caching, and fewer wait states.

The electronics (hardware) set the maximum speed for transferring data from the graphics memory to your CPU. The CPU's memory is probably slower than your graphics memory, so wait states may be inserted so that the graphics memory can match the slower speed of the CPU memory.

Another issue is all the devices sharing the data bus. Imagine a shared highway between cities. To optimize traffic, traffic is only allowed in one direction at a time, and traffic signals or a director monitor it. To go from City A to City C, one has to wait until the signals or director clear the remaining traffic and give the City A to City C route priority. In hardware terms, this is called bus arbitration.

In most platforms, the CPU is transferring data between registers and the CPU memory. This is needed to read and write your variables in your program. The slow route of transferring data is for the CPU to read memory into a register, then write to the Graphics Memory. A more efficient method is to transfer the data without using the CPU. There may exist a device, DMA (Direct Memory Access), which can transfer data without using the CPU. You tell it the source and target memory locations, then start it. It will transfer the data without using the CPU.

Unfortunately, the DMA must share the data bus with the CPU. This means that your data transfer will be slowed by any requests for the data bus by the CPU. It will still be faster than using the CPU to transfer the data as the DMA can be transferring the data while the CPU is executing instructions that don't require the data bus.

Summary
Your memory transfers may be slow if you don't have a DMA device. With or without the DMA, the data bus is shared by multiple devices and traffic arbitrated. This sets the maximum overall speed for transferring data. Data transfer speeds of the memory chips may also contribute to the data transfer rate. Hardware-wise, there is a speed limit.

Optimizations
1. Use the DMA, if possible.
2. If only using the CPU, have it transfer the largest chunks possible.
This means using instructions designed specifically for copying memory.
3. If your CPU doesn't have specialized copy instructions, transfer data using the word size of the processor.
If the processor has 32-bit words, transfer 4 bytes at a time as one word rather than as four 8-bit copies.
4. Reduce CPU demands and interruptions during the transfer.
Pause any applications; disable interrupts if possible.
5. Divide the effort: Have one core transfer the data while another core is executing your program.
6. Threading on a single core may actually slow the transfer, as the OS gets involved because of scheduling. The thread switching takes time which adds to the transfer time.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow