Question

I want to capture every frame of a video and modify it before it is rendered on an Android device, such as a Nexus 10. As far as I know, Android uses hardware to decode and render the frame on such devices, so I should get the frame data from the GraphicBuffer, where it will be in YUV format before rendering.

To do this, I wrote a static method in AwesomePlayer.cpp that captures the frame data, modifies the frame, and writes it back into the GraphicBuffer for rendering.

Here is my demo code:

// NB: assumes the vendor's gralloc layout (NV12-style: Y plane reachable
// via lock(), interleaved UV plane behind the second fd of the private handle)
static void handleFrame(MediaBuffer *buffer) {

    sp<GraphicBuffer> buf = buffer->graphicBuffer();

    size_t width = buf->getWidth();
    size_t height = buf->getHeight();
    size_t ySize = buffer->range_length();
    size_t uvSize = width * height / 2;

    uint8_t *yBuffer = (uint8_t *)malloc(ySize);
    uint8_t *uvBuffer = (uint8_t *)malloc(uvSize);
    memset(yBuffer, 0, ySize);
    memset(uvBuffer, 0, uvSize);

    int const *private_handle = buf->handle->data;

    void *yAddr = NULL;
    void *uvAddr = NULL;

    buf->lock(GRALLOC_USAGE_SW_READ_OFTEN | GRALLOC_USAGE_SW_WRITE_OFTEN, &yAddr);
    // mmap takes six arguments; the final offset argument was missing
    uvAddr = mmap(NULL, uvSize, PROT_READ | PROT_WRITE, MAP_SHARED,
                  *(private_handle + 1), 0);

    // on failure mmap returns MAP_FAILED, not NULL
    if (yAddr != NULL && uvAddr != MAP_FAILED) {

        // copy the data out of the graphic buffer
        memcpy(yBuffer, yAddr, ySize);
        memcpy(uvBuffer, uvAddr, uvSize);

        // ... modify the YUV data here ...

        // copy the data back into the graphic buffer
        memcpy(yAddr, yBuffer, ySize);
        memcpy(uvAddr, uvBuffer, uvSize);
    }

    if (uvAddr != MAP_FAILED)
        munmap(uvAddr, uvSize);
    buf->unlock();

    free(yBuffer);
    free(uvBuffer);
}

I logged timestamps around the memcpy calls and found that copying from the GraphicBuffer takes much longer than copying data into it. For a 1920x1080 video, for example, the memcpy from the GraphicBuffer takes about 30 ms, which is unacceptable for normal video playback.

I have no idea why it takes so long. Perhaps it is copying data from a GPU buffer, but copying data into the GraphicBuffer performs normally.

Could anyone who is familiar with hardware decoding on Android take a look at this issue? Thanks very much.

Update: I found that I didn't have to use a GraphicBuffer to get the YUV data. I simply hardware-decoded the video source and stored the YUV data in ordinary memory, so I could read the YUV data from memory directly, which is very fast. You can find a similar solution in the AOSP source code or in open-source video player apps: I just allocate memory buffers rather than graphic buffers, and then use the hardware decoder. Sample code in AOSP: frameworks/av/cmds/stagefright/SimplePlayer.cpp
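On modern Android the same idea is exposed through the public NDK MediaCodec API rather than stagefright internals. A rough, Android-only sketch (not the SimplePlayer.cpp code itself; the function name and control flow are illustrative): configuring the decoder with a null surface makes it deliver frames into CPU-accessible output buffers.

```cpp
#include <media/NdkMediaCodec.h>

// Sketch: decode into client-side buffers instead of a Surface/GraphicBuffer.
// Passing a null ANativeWindow to AMediaCodec_configure() makes the codec
// return frames through AMediaCodec_getOutputBuffer(), i.e. plain memory.
void decodeToMemory(AMediaFormat *format, const char *mime) {
    AMediaCodec *codec = AMediaCodec_createDecoderByType(mime);
    AMediaCodec_configure(codec, format, /*surface=*/NULL, /*crypto=*/NULL, 0);
    AMediaCodec_start(codec);

    // ... feed encoded input buffers from an extractor here ...

    AMediaCodecBufferInfo info;
    ssize_t idx = AMediaCodec_dequeueOutputBuffer(codec, &info, /*timeoutUs=*/10000);
    if (idx >= 0) {
        size_t size;
        uint8_t *yuv = AMediaCodec_getOutputBuffer(codec, idx, &size);
        // `yuv` now points at the decoded YUV frame in ordinary memory;
        // read or modify it here, then release the buffer without rendering.
        AMediaCodec_releaseOutputBuffer(codec, idx, /*render=*/false);
    }

    AMediaCodec_stop(codec);
    AMediaCodec_delete(codec);
}
```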

link: https://github.com/xdtianyu/android-4.2_r1/tree/master/frameworks/av/cmds/stagefright


Solution

Most likely, the data path (a.k.a. data bus) from your CPU to the graphics memory is optimized, while the path from graphics memory back to the CPU is not. Optimizations may include a faster internal data bus, level 1 or 2 caching, and fewer wait states.

The electronics (hardware) set the maximum speed for transferring data from the graphics memory to your CPU. The CPU's memory is probably slower than your graphics memory, so wait states may be inserted so that the graphics memory can match the slower speed of the CPU memory.

Another issue is all the devices sharing the data bus. Imagine a shared highway between cities. To optimize traffic, traffic is only allowed in one direction at a time, and traffic signals or a director monitor it. To go from City A to City C, one has to wait until the signals or director clear the remaining traffic and give the City A to City C route priority. In hardware terms, this is called bus arbitration.

In most platforms, the CPU is transferring data between registers and the CPU memory. This is needed to read and write your variables in your program. The slow route of transferring data is for the CPU to read memory into a register, then write to the Graphics Memory. A more efficient method is to transfer the data without using the CPU. There may exist a device, DMA (Direct Memory Access), which can transfer data without using the CPU. You tell it the source and target memory locations, then start it. It will transfer the data without using the CPU.

Unfortunately, the DMA must share the data bus with the CPU. This means that your data transfer will be slowed by any requests for the data bus by the CPU. It will still be faster than using the CPU to transfer the data as the DMA can be transferring the data while the CPU is executing instructions that don't require the data bus.

Summary
Your memory transfers may be slow if you don't have a DMA device. With or without the DMA, the data bus is shared by multiple devices and traffic arbitrated. This sets the maximum overall speed for transferring data. Data transfer speeds of the memory chips may also contribute to the data transfer rate. Hardware-wise, there is a speed limit.

Optimizations
1. Use the DMA, if possible.
2. If only using the CPU, have it transfer the largest chunks possible.
This means using instructions designed specifically for copying memory.
3. If your CPU doesn't have specialized copy instructions, transfer data using the word size of the processor.
If the processor has 32-bit words, transfer 4 bytes at a time as one word rather than as four 8-bit copies.
4. Reduce CPU demands and interruptions during the transfer.
Pause any applications; disable interrupts if possible.
5. Divide the effort: Have one core transfer the data while another core is executing your program.
6. Threading on a single core may actually slow the transfer, as the OS gets involved because of scheduling. The thread switching takes time which adds to the transfer time.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow