CPU usage measurement on an ARM bare metal system

Question 1

The question almost answers itself. What is your bare metal application doing when it is not in that process/algorithm? Measure one or the other, or both. If your bare metal application is not completely consuming the CPU in this algorithm, then you already have an operating system, to the extent that you are managing this application/function's time. You can use a number of methods. One is a simple counter in a loop, measured against a timer, to see how many counts you get per timer period when the algorithm is getting time slices vs. when it is not. Another is to simply time the algorithm itself.
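To make the counter-in-a-loop idea concrete, here is a minimal sketch of the arithmetic. It assumes you have already collected two numbers yourself: how many times the idle loop spins in one timer period with the algorithm disabled, and how many times with it enabled. The function name `cpu_load_percent` and its parameters are made up for illustration.

```cpp
#include <cstdint>

// Hypothetical helper: turns two idle-loop counts into a load estimate.
//   baseline_count: idle-loop iterations per timer period with the
//                   algorithm disabled (the calibration run).
//   loaded_count:   idle-loop iterations per timer period with the
//                   algorithm running.
// Returns the estimated share of the period spent outside the idle
// loop, in percent (0..100).
std::uint32_t cpu_load_percent(std::uint32_t baseline_count,
                               std::uint32_t loaded_count)
{
    if (baseline_count == 0) {
        return 0;  // no calibration data, nothing meaningful to report
    }
    if (loaded_count >= baseline_count) {
        return 0;  // idle loop ran at least as often as in calibration
    }
    // 64-bit intermediate so that busy * 100 cannot overflow.
    const std::uint64_t busy = baseline_count - loaded_count;
    return static_cast<std::uint32_t>((busy * 100u) / baseline_count);
}
```

For example, if the idle loop manages 1000 iterations per period on its own and only 250 once the algorithm is running, the estimate is 75 percent. The figure is only as good as the calibration: the idle loop has to be compiled the same way and run from the same memory in both measurements.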

I assume when you say CPU you mean the whole system, as your performance depends heavily both on your code and on what it is talking to. If running from flash on a Cortex-M4, depending on the clock rate, you may be burning processor cycles just waiting for instructions or data (and can very easily get the wrong notion of processor performance for an algorithm when it isn't the algorithm burning clocks). Caches mask/manipulate that performance and can greatly affect it if you are not careful and aware of what they are doing. This being a C++ question, your compiler plays a large role in performance, as does your code of course; minimal changes to the command line or code can very easily make the code run several times faster or slower.

If the algorithm is part of an ISR and the processor goes to sleep otherwise, you can use the GPIO pin and scope technique to get a feel for the running vs. sleeping ratio.

Question 2

Implementing an OS to measure the idle time of a CPU seems a bit overengineered to me. As far as I know, the Cortex-M4 includes a Data Watchpoint and Trace unit (DWT) that allows you to snapshot a cycle counter. But the easiest thing would be to hook a pin to an oscilloscope and toggle it on entry to and on exit from your algorithm.
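A rough sketch of the cycle counter approach follows. The register addresses and bit positions (DEMCR at 0xE000EDFC with TRCENA in bit 24, DWT_CTRL at 0xE0001000 with CYCCNTENA in bit 0, DWT_CYCCNT at 0xE0001004) are the architectural ARMv7-M ones, but this assumes your particular chip actually implements the cycle counter, so check the vendor documentation. The struct and function names are made up for illustration; passing the registers in as pointers is just a choice that lets the logic be exercised off-target.

```cpp
#include <cstdint>

// Architectural ARMv7-M addresses and bits (verify for your device).
constexpr std::uintptr_t DEMCR_ADDR        = 0xE000EDFCu;
constexpr std::uintptr_t DWT_CTRL_ADDR     = 0xE0001000u;
constexpr std::uintptr_t DWT_CYCCNT_ADDR   = 0xE0001004u;
constexpr std::uint32_t DEMCR_TRCENA       = 1u << 24;  // enables the DWT
constexpr std::uint32_t DWT_CTRL_CYCCNTENA = 1u << 0;   // starts CYCCNT

// Hypothetical bundle of the three registers involved.
struct CycleCounterRegs {
    volatile std::uint32_t* demcr;
    volatile std::uint32_t* dwt_ctrl;
    volatile std::uint32_t* dwt_cyccnt;
};

// Register view for real hardware; only dereference this on the target.
inline CycleCounterRegs hardware_regs()
{
    return { reinterpret_cast<volatile std::uint32_t*>(DEMCR_ADDR),
             reinterpret_cast<volatile std::uint32_t*>(DWT_CTRL_ADDR),
             reinterpret_cast<volatile std::uint32_t*>(DWT_CYCCNT_ADDR) };
}

// Enable tracing, zero the counter, then start it counting.
inline void cycle_counter_enable(const CycleCounterRegs& r)
{
    *r.demcr |= DEMCR_TRCENA;
    *r.dwt_cyccnt = 0;
    *r.dwt_ctrl |= DWT_CTRL_CYCCNTENA;
}

// Snapshot the free-running 32-bit cycle counter.
inline std::uint32_t cycle_counter_read(const CycleCounterRegs& r)
{
    return *r.dwt_cyccnt;
}

// Unsigned subtraction gives the right answer across one wrap-around.
inline std::uint32_t cycles_elapsed(std::uint32_t start, std::uint32_t end)
{
    return end - start;
}
```

On the target you would call `cycle_counter_enable(hardware_regs())` once at startup, take a `cycle_counter_read` snapshot before and after the algorithm, and pass both to `cycles_elapsed`. Since the counter is 32 bits wide, it wraps after 2^32 cycles (a little under a minute at 72 MHz), so a single measurement has to be shorter than that.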

Question 3

First of all, implementing an operating system will not be practical, or even possible, for the purpose of only measuring performance. So one possible approach is to keep a count variable that records the number of ticks that have occurred so far, and to increment that variable in a timer interrupt.
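A minimal sketch of that tick counter, assuming a periodic timer interrupt is already configured elsewhere. The names `g_tick_count`, `timer_tick_isr`, `ticks_now` and `ticks_since` are made up for illustration; the real handler name depends on your vector table (under CMSIS the SysTick handler is usually called `SysTick_Handler`).

```cpp
#include <cstdint>

// Written by the interrupt and read by the main code, hence volatile.
// An aligned 32-bit load or store is a single access on a Cortex-M4,
// so no lock is needed just to read the value.
volatile std::uint32_t g_tick_count = 0;

// Hypothetical timer interrupt handler: one increment per tick.
extern "C" void timer_tick_isr()
{
    g_tick_count = g_tick_count + 1;
}

// Snapshot of the current tick count.
std::uint32_t ticks_now()
{
    return g_tick_count;
}

// Ticks elapsed since an earlier snapshot; unsigned subtraction keeps
// the result correct across one counter wrap-around.
std::uint32_t ticks_since(std::uint32_t start)
{
    return ticks_now() - start;
}
```

Take `ticks_now()` before the algorithm and `ticks_since()` after it; multiplying by the tick period gives the elapsed time. The resolution is one tick, so this only tells you much if the algorithm runs for many ticks.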