How to obtain reliable Cortex M4 short delays

Question 1

If you need very short but deterministic "at least" delays, consider using instructions other than nop that have a deterministic, nonzero latency.

As documented, the Cortex-M4 NOP is not necessarily time consuming: the processor may remove it from the pipeline before it reaches the execution stage.

You could replace it with, say, "and reg, reg", or something roughly equivalent to a nop in context. Alternatively, when toggling GPIO, you could repeat the I/O instructions themselves to enforce the minimum length of a state (for example, if your GPIO write instruction takes at least 5 ns, repeat it five times to get at least 25 ns). This can work well in C too, if you were inserting nops in a C program: just repeat the writes to the port. If the port is declared volatile, as it should be, the compiler will not remove the repeated accesses.
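A minimal C sketch of the repeated-write idea. The helper name gpio_pulse_stretched and its out_reg parameter are hypothetical; out_reg stands in for whatever memory-mapped output register your part provides. How long each store actually takes depends on the bus and the part, so check the result on a scope:

```c
#include <stdint.h>

/* Hypothetical helper: 'out_reg' stands in for a memory-mapped GPIO
   output register; 'pin_mask' selects the pin(s) to pulse. */
static inline void gpio_pulse_stretched(volatile uint32_t *out_reg, uint32_t pin_mask)
{
    uint32_t low  = *out_reg & ~pin_mask;   /* current state with the pin cleared */
    uint32_t high = low | pin_mask;         /* same state with the pin set */

    /* Because the register is volatile, the compiler must emit every store.
       Repeating the write keeps the pin high for at least three store times. */
    *out_reg = high;
    *out_reg = high;
    *out_reg = high;
    *out_reg = low;
}
```

Add or remove repeated stores to lengthen or shorten the high state.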

Of course, this only applies to very short delays. For delays that are merely short, as others have mentioned, a busy loop that waits on some timing source works much better. Such a loop takes at least the clocks required to sample the timing source, set up the target, and go through the wait loop once.
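A busy loop of that kind might look like the sketch below. The tick_source_fn type and both function names are hypothetical, and fake_tick_source is only a host-side stand-in for a real hardware counter so the loop can be exercised off-target:

```c
#include <stdint.h>

/* Hypothetical abstraction: any free-running 32-bit up-counter
   (a hardware timer, the core cycle counter, ...). */
typedef uint32_t (*tick_source_fn)(void);

/* Waits until at least 'ticks' counts have elapsed and returns the
   number of counts actually observed. */
static uint32_t busy_wait_ticks(tick_source_fn now, uint32_t ticks)
{
    uint32_t start = now();          /* sample the timing source */
    uint32_t elapsed;
    do {
        elapsed = now() - start;     /* unsigned subtraction survives counter wrap */
    } while (elapsed < ticks);
    return elapsed;
}

/* Host-side stand-in for a hardware counter: advances 3 counts per read
   and starts just below the 32-bit wrap point. */
static uint32_t fake_tick_source(void)
{
    static uint32_t t = 0xFFFFFFF0u;
    t += 3u;
    return t;
}
```

The loop always reads the source at least twice, which is the fixed minimum cost mentioned above.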

Question 2

Use the cycle-counting register (DWT_CYCCNT) to get high-precision timing!

Note: I have also tested this using digital pins and an oscilloscope, and it is extremely accurate.

See stopwatch_delay(ticks) and supporting code below, which uses the DWT_CYCCNT register (part of the Cortex-M core's Data Watchpoint and Trace unit, as found on the STM32), specifically designed to count actual clock ticks, located at address 0xE0001004.

See main for an example which uses STOPWATCH_START/STOPWATCH_STOP to measure how long the stopwatch_delay(ticks) actually took, using CalcNanosecondsFromStopwatch(m_nStart, m_nStop).

Modify the ticks input to adjust the delay.

#include <stdint.h>              // uint32_t
#include <stdio.h>               // printf

uint32_t m_nStart;               //DEBUG Stopwatch start cycle counter value
uint32_t m_nStop;                //DEBUG Stopwatch stop cycle counter value

#define DEMCR_TRCENA    0x01000000

/* Core Debug registers */
#define DEMCR           (*((volatile uint32_t *)0xE000EDFC))
#define DWT_CTRL        (*(volatile uint32_t *)0xe0001000)
#define CYCCNTENA       (1<<0)
#define DWT_CYCCNT      ((volatile uint32_t *)0xE0001004)
#define CPU_CYCLES      (*DWT_CYCCNT)
#define CLK_SPEED         168000000 // EXAMPLE for CortexM4, EDIT as needed

#define STOPWATCH_START { m_nStart = *((volatile unsigned int *)0xE0001004);}
#define STOPWATCH_STOP  { m_nStop = *((volatile unsigned int *)0xE0001004);}


static inline void stopwatch_reset(void)
{
    /* Enable DWT */
    DEMCR |= DEMCR_TRCENA; 
    *DWT_CYCCNT = 0;             
    /* Enable CPU cycle counter */
    DWT_CTRL |= CYCCNTENA;
}

static inline uint32_t stopwatch_getticks(void)
{
    return CPU_CYCLES;
}

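static inline void stopwatch_delay(uint32_t ticks)
{
    /* Compare elapsed ticks rather than an absolute end value, so the
       delay stays correct when the 32-bit counter wraps around. */
    uint32_t start_ticks = stopwatch_getticks();
    while ((stopwatch_getticks() - start_ticks) < ticks)
    {
        /* busy-wait */
    }
}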

uint32_t CalcNanosecondsFromStopwatch(uint32_t nStart, uint32_t nStop)
{
    uint32_t nDiffTicks;
    uint32_t nSystemCoreTicksPerMicrosec;

    // Convert (clk speed per sec) to (clk speed per microsec)
    nSystemCoreTicksPerMicrosec = CLK_SPEED / 1000000;

    // Elapsed ticks
    nDiffTicks = nStop - nStart;

    // Elapsed nanosec = 1000 * ticks-elapsed / (clock-ticks in a microsec).
    // Multiply in 64 bits: 1000 * nDiffTicks overflows 32 bits past ~4.29M ticks.
    return (uint32_t)((1000ULL * nDiffTicks) / nSystemCoreTicksPerMicrosec);
} 

int main(void)
{
    int timeDiff = 0;
    stopwatch_reset();

    // =============================================
    // Example: use a delay, and measure how long it took
    STOPWATCH_START;
    stopwatch_delay(168000); // 168k ticks is 1ms for 168MHz core
    STOPWATCH_STOP;

    timeDiff = CalcNanosecondsFromStopwatch(m_nStart, m_nStop);
    printf("My delay measured to be %d nanoseconds\n", timeDiff);

    // =============================================
    // Example: measure function duration in nanosec
    STOPWATCH_START;
    // run_my_function() => do something here
    STOPWATCH_STOP;

    timeDiff = CalcNanosecondsFromStopwatch(m_nStart, m_nStop);
    printf("My function took %d nanoseconds\n", timeDiff);
}
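
Since stopwatch_delay() takes ticks rather than time, a small conversion helper can be handy. This is a sketch only: ns_to_ticks() is a hypothetical name, not part of the code above. It rounds up so that the resulting delay is never shorter than requested:

```c
#include <stdint.h>

/* Hypothetical helper: converts nanoseconds to clock ticks, rounding up.
   The caller is expected to keep the result within 32 bits. */
static inline uint32_t ns_to_ticks(uint32_t ns, uint32_t clk_hz)
{
    /* 64-bit intermediate: ns * clk_hz does not fit in 32 bits. */
    uint64_t ticks = ((uint64_t)ns * clk_hz + 999999999ULL) / 1000000000ULL;
    return (uint32_t)ticks;
}
```

For example, at 168 MHz a 25 ns request is 4.2 ticks, which rounds up to 5, so stopwatch_delay(ns_to_ticks(25, CLK_SPEED)) would ask for 5 ticks.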

Question 3

For any reliable timing, I always suggest using a general-purpose timer. Your part may have a timer that can be clocked fast enough to give you the timing you need. For serial, is there a reason you can't use a corresponding serial peripheral? Most of the Cortex-M3/M4 parts that I'm aware of offer USARTs, I2C, and SPI, with many also offering SDIO, which should cover most needs.

If that is not possible, this Stack Overflow question/answer details using the cycle counter, if available, on a Cortex-M3/M4. You could grab the cycle counter, add a few cycles to it, and poll it, but I don't think you would achieve a minimum delay reasonably below ~8 cycles with this method.
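To find the actual floor on your own part, one option is to take back-to-back readings of the counter and keep the smallest difference. A sketch, with hypothetical names throughout; fake_cycle_counter is a host-side stand-in whose step of 8 only mirrors the estimate above and is not a measurement:

```c
#include <stdint.h>

/* Hypothetical abstraction over a free-running 32-bit counter. */
typedef uint32_t (*tick_source_fn)(void);

/* Returns the smallest difference seen between two consecutive reads,
   i.e. the cost of polling that any delay built on it cannot go below. */
static uint32_t min_read_overhead(tick_source_fn now, unsigned samples)
{
    uint32_t best = UINT32_MAX;
    for (unsigned i = 0; i < samples; ++i) {
        uint32_t first  = now();
        uint32_t second = now();
        uint32_t delta  = second - first;   /* wrap-safe unsigned difference */
        if (delta < best) {
            best = delta;
        }
    }
    return best;
}

/* Host-side stand-in: pretends each read costs 8 counts. */
static uint32_t fake_cycle_counter(void)
{
    static uint32_t c;
    c += 8u;
    return c;
}
```

On target, pass a function that returns the real cycle counter instead of the fake one.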

Question 4

Well, first you have to run from RAM, not flash: flash timing is going to be slow, and one nop can take many cycles. The GPIO accesses should take at least a few clocks as well, so you probably won't need or want nops; just pound on the GPIO. The branch at the end of the loop will be noticeable as well. You should write a few instructions to RAM, branch to them, and see how fast you can wiggle the GPIO.
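One way to try that experiment from C, as a sketch under loud assumptions: a GCC-style ARM toolchain, and a linker script and startup code that place a section named ".ramfunc" in SRAM. That section name, the RAMFUNC macro, and gpio_wiggle are all hypothetical, so check what your own linker script provides:

```c
#include <stdint.h>

/* Assumption: the linker script maps a ".ramfunc" section into SRAM and the
   startup code copies it there. On any other build this is a plain function. */
#if defined(__GNUC__) && defined(__arm__)
#define RAMFUNC __attribute__((section(".ramfunc"), noinline))
#else
#define RAMFUNC
#endif

/* Toggles the selected pin(s) 'toggles' times as fast as the loop allows.
   'out_reg' stands in for a memory-mapped GPIO output register. */
RAMFUNC void gpio_wiggle(volatile uint32_t *out_reg, uint32_t pin_mask, uint32_t toggles)
{
    while (toggles--) {
        *out_reg ^= pin_mask;   /* volatile read-modify-write on every pass */
    }
}
```

Watching the pin on a scope then shows the per-toggle cost, loop branch included.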

The bottom line, though, is that if you are on such a tight budget that your serial clock is that close to your processor clock in speed, it is very likely you are not going to get this to work with this processor. Upping the PLL in the processor won't change the flash speed; it can make it worse relative to the processor clock. The SRAM should scale, though, so if you have headroom left on your processor clock and the power budget to support it, repeat the experiment in SRAM with a faster processor clock speed.