How to measure program execution time in ARM Cortex-A8 processor?

https://stackoverflow.com/questions/3247373

15-09-2020
|

Question

I'm using an ARM Cortex-A8 based processor called as i.MX515. There is linux Ubuntu 9.10 distribution. I'm running a very big application written in C and I'm making use of gettimeofday(); functions to measure the time my application takes.

main()

{

gettimeofday(start);
....
....
....
gettimeofday(end);

}

This method was sufficient to look at what blocks of my application was taking what amount of time. But, now that, I'm trying to optimize my code very throughly, with the gettimeofday() method of calculating time, I see a lot of fluctuation between successive runs (Run before and after my optimizations), so I'm not able to determine the actual execution times, hence the impact of my improvements.

Can anyone suggest me what I should do?

If by accessing the cycle counter (Idea suggested on ARM website for Cortex-M3) can anyone point me to some code which gives me the steps I have to follow to access the timer registers on Cortex-A8?

If this method is not very accurate then please suggest some alternatives.

Thanks

Follow ups

Follow up 1: Wrote the following program on Code Sorcery, the executable was generated which when I tried running on the board, I got - Illegal instruction message :(

static inline unsigned int get_cyclecount (void)
{
    unsigned int value;
    // Read CCNT Register
    asm volatile ("MRC p15, 0, %0, c9, c13, 0\t\n": "=r"(value));
    return value;
}

static inline void init_perfcounters (int32_t do_reset, int32_t enable_divider)
{
    // in general enable all counters (including cycle counter)
    int32_t value = 1;

    // peform reset:
    if (do_reset)
    {
    value |= 2;     // reset all counters to zero.
    value |= 4;     // reset cycle counter to zero.
    }

    if (enable_divider)
    value |= 8;     // enable "by 64" divider for CCNT.

    value |= 16;

    // program the performance-counter control-register:
    asm volatile ("MCR p15, 0, %0, c9, c12, 0\t\n" :: "r"(value));

    // enable all counters:
    asm volatile ("MCR p15, 0, %0, c9, c12, 1\t\n" :: "r"(0x8000000f));

    // clear overflows:
    asm volatile ("MCR p15, 0, %0, c9, c12, 3\t\n" :: "r"(0x8000000f));
}



int main()
{

    /* enable user-mode access to the performance counter*/
asm ("MCR p15, 0, %0, C9, C14, 0\n\t" :: "r"(1));

/* disable counter overflow interrupts (just in case)*/
asm ("MCR p15, 0, %0, C9, C14, 2\n\t" :: "r"(0x8000000f));

    init_perfcounters (1, 0);

    // measure the counting overhead:
    unsigned int overhead = get_cyclecount();
    overhead = get_cyclecount() - overhead;

    unsigned int t = get_cyclecount();

    // do some stuff here..
    printf("\nHello World!!");

    t = get_cyclecount() - t;

    printf ("function took exactly %d cycles (including function call) ", t - overhead);

    get_cyclecount();

    return 0;
}

Follow up 2: I had written to Freescale for support and they have sent me back the following reply and a program (I did not quite understand much from it)

Here is what we can help you with right now: I am sending you attach an example of code, that sends an stream using the UART, from what your code, it seems that you are not init correctly the MPU.

(hash)include <stdio.h>
(hash)include <stdlib.h>

(hash)define BIT13 0x02000

(hash)define R32   volatile unsigned long *
(hash)define R16   volatile unsigned short *
(hash)define R8   volatile unsigned char *

(hash)define reg32_UART1_USR1     (*(R32)(0x73FBC094))
(hash)define reg32_UART1_UTXD     (*(R32)(0x73FBC040))

(hash)define reg16_WMCR         (*(R16)(0x73F98008))
(hash)define reg16_WSR              (*(R16)(0x73F98002))

(hash)define AIPS_TZ1_BASE_ADDR             0x70000000
(hash)define IOMUXC_BASE_ADDR               AIPS_TZ1_BASE_ADDR+0x03FA8000

typedef unsigned long  U32;
typedef unsigned short U16;
typedef unsigned char  U8;


void serv_WDOG()
{
    reg16_WSR = 0x5555;
    reg16_WSR = 0xAAAA;
}


void outbyte(char ch)
{
    while( !(reg32_UART1_USR1 & BIT13)  );

    reg32_UART1_UTXD = ch ;
}


void _init()
{

}



void pause(int time) 
{
    int i;

    for ( i=0 ; i < time ;  i++);

} 


void led()
{

//Write to Data register [DR]

    *(R32)(0x73F88000) = 0x00000040;  // 1 --> GPIO 2_6 
    pause(500000);

    *(R32)(0x73F88000) = 0x00000000;  // 0 --> GPIO 2_6 
    pause(500000);


}

void init_port_for_led()
{


//GPIO 2_6   [73F8_8000] EIM_D22  (AC11)    DIAG_LED_GPIO
//ALT1 mode
//IOMUXC_SW_MUX_CTL_PAD_EIM_D22  [+0x0074]
//MUX_MODE [2:0]  = 001: Select mux mode: ALT1 mux port: GPIO[6] of instance: gpio2.

 // IOMUXC control for GPIO2_6

*(R32)(IOMUXC_BASE_ADDR + 0x74) = 0x00000001; 

//Write to DIR register [DIR]

*(R32)(0x73F88004) = 0x00000040;  // 1 : GPIO 2_6  - output

*(R32)(0x83FDA090) = 0x00003001;
*(R32)(0x83FDA090) = 0x00000007;


}

int main ()
{
  int k = 0x12345678 ;

    reg16_WMCR = 0 ;                        // disable watchdog
    init_port_for_led() ;

    while(1)
    {
        printf("Hello word %x\n\r", k ) ;
        serv_WDOG() ;
        led() ;

    }

    return(1) ;
}

Solution

Accessing the performance counters isn't difficult, but you have to enable them from kernel-mode. By default the counters are disabled.

In a nutshell you have to execute the following two lines inside the kernel. Either as a loadable module or just adding the two lines somewhere in the board-init will do:

  /* enable user-mode access to the performance counter*/
  asm ("MCR p15, 0, %0, C9, C14, 0\n\t" :: "r"(1));

  /* disable counter overflow interrupts (just in case)*/
  asm ("MCR p15, 0, %0, C9, C14, 2\n\t" :: "r"(0x8000000f));

Once you did this the cycle counter will start incrementing for each cycle. Overflows of the register will go unnoticed and don't cause any problems (except they might mess up your measurements).

Now you want to access the cycle-counter from the user-mode:

We start with a function that reads the register:

static inline unsigned int get_cyclecount (void)
{
  unsigned int value;
  // Read CCNT Register
  asm volatile ("MRC p15, 0, %0, c9, c13, 0\t\n": "=r"(value));  
  return value;
}

And you most likely want to reset and set the divider as well:

static inline void init_perfcounters (int32_t do_reset, int32_t enable_divider)
{
  // in general enable all counters (including cycle counter)
  int32_t value = 1;

  // peform reset:  
  if (do_reset)
  {
    value |= 2;     // reset all counters to zero.
    value |= 4;     // reset cycle counter to zero.
  } 

  if (enable_divider)
    value |= 8;     // enable "by 64" divider for CCNT.

  value |= 16;

  // program the performance-counter control-register:
  asm volatile ("MCR p15, 0, %0, c9, c12, 0\t\n" :: "r"(value));  

  // enable all counters:  
  asm volatile ("MCR p15, 0, %0, c9, c12, 1\t\n" :: "r"(0x8000000f));  

  // clear overflows:
  asm volatile ("MCR p15, 0, %0, c9, c12, 3\t\n" :: "r"(0x8000000f));
}

do_reset will set the cycle-counter to zero. Easy as that.

enable_diver will enable the 1/64 cycle divider. Without this flag set you'll be measuring each cycle. With it enabled the counter gets increased for every 64 cycles. This is useful if you want to measure long times that would otherwise cause the counter to overflow.

How to use it:

  // init counters:
  init_perfcounters (1, 0); 

  // measure the counting overhead:
  unsigned int overhead = get_cyclecount();
  overhead = get_cyclecount() - overhead;    

  unsigned int t = get_cyclecount();

  // do some stuff here..
  call_my_function();

  t = get_cyclecount() - t;

  printf ("function took exactly %d cycles (including function call) ", t - overhead);

Should work on all Cortex-A8 CPUs..

Oh - and some notes:

Using these counters you'll measure the exact time between the two calls to get_cyclecount() including everything spent in other processes or in the kernel. There is no way to restrict the measurement to your process or a single thread.

Also calling get_cyclecount() isn't free. It will compile to a single asm-instruction, but moves from the co-processor will stall the entire ARM pipeline. The overhead is quite high and can skew your measurement. Fortunately the overhead is also fixed, so you can measure it and subtract it from your timings.

In my example I did that for every measurement. Don't do this in practice. An interrupt will sooner or later occur between the two calls and skew your measurements even further. I suggest that you measure the overhead a couple of times on an idle system, ignore all outsiders and use a fixed constant instead.

OTHER TIPS

You need to profile your code with performance analysis tools before and after your optimizations.

Acct is a command line and a function which you can use to monitor your resources. You can google more on the usage and viewing of the dat file hence generated by acct.

I will update this post with other opensource performance analysis tools.

Gprof is another such tool. Please check the documentation for the same.

To expand on the answer by Nils now that a couple of years have elapsed! - an easy way to access these counters is to build the kernel with gator. This then reports counter values for use with Streamline, which is ARM's performance analysis tool.

It will display each function on a timeline (giving you a high-level overview of how your system is performing), showing you exactly how long it took to execute, along with % CPU that it has taken up. You can compare this with charts of each counter that you've set it up to collect and follow CPU intensive tasks down to source code level.

Streamline works with all the Cortex-A series processors.

I've worked in an toolchain for ARM7 which had an instruction level simulator. Running apps in that could give timings for individual lines and/or asm instruction. That was great for a micro optimization of a given routine. That approach probably isn't appropriate for a whole app/whole system optimization though.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow