Fastest way to generate (machine wide) unique 64 bit timestamps on Windows

https://softwareengineering.stackexchange.com/questions/400100

03-03-2021
|

Question

Here are some of the characteristics of my problem:

I must generate unique 64 bit timestamps across processes on a single machine. I am very much limited to those 64 bits. This seems to rule out using GUIDs as part of the solutioning.
Uniqueness is critical, monotonicity (across processes) is not as important.
It is crucial that this code is as fast as possible (most likely ruling out e.g. a central timestamp service through some IPC mechanism). This has been proven to be in the hot path.
multiple instances of my process will run on same system. One thread on each will be tasked to generate these ids.

The way I have currently implemented it:

(if a process requests a timestamp) we

Grab an initial timestamp value (from the fastest available underlying clock source, e.g. GetSystemTimePreciseAsFileTime from kernel32)
Acquire a Global Mutex
In a while(true) { ... } loop, keep grabbing values from our underlying clock source until the value "rolls over" (e.g. changes from the value grabbed in (1)
Release the global mutex.

That should satisfy the uniqueness constraints as long as the time source is monotonic. As expected, there is still a real cost associated with global Mutex acquisition and release (kernel context switches etc.).

Is there any other approach or algorithm that I could use for the problem as stated above?

Solution

You mentioned in the comments you're OK to compress the timestamp to 52 bits. Few other people mentioned you can negotiate smaller process Ids.

I'm expanding on these ideas with one way to generate lock-free very high speed timestamps, not just machine-wide, but world-wide. That's because the processes don't share state apart from initial setup and thus can easily live in different machines.

Encode each timestamp like this:

 8 bits -> shortPid, a generated "process id" during startup
52 bits -> shortTime, the compressed timestamp
 4 bits -> localCounter, to resolve timestamp overlaps, and avoid waiting for the system clock

The algorithm:

Have a central "service" to generate shortPid-s, that each process will call on startup. Example pseudocode:

byte shortPid = 0;
byte getNewPid() {
    lock {
        shortPid++;
        return shortPid;
    }
}

On each process start, get their shortPid from the central service:

ulong processPidMasked = getNewPid() << (52 + 4); // get pid and move into first 8 bits of 64-bit ulong

To generate times efficiently, we'll use the 4-bit localCounter, to enable generating up to 16 (2^4) unique values per time tick. Assuming there's a function ulong getCompressedTimestamp() that returns a 52-bit timestamp, here's how to get a unique timestamp per process:

ulong localCounter = 0;
ulong lastCompressedTimestamp = 0;
ulong getUniqueTimestamp() {
    while (1) {
        ulong now = getCompressedTimestamp(); // assuming 52-bit timestamp
        if (now != lastCompressedTimestamp) { // if we haven't generated any timestamp this tick
            lastCompressedTimestamp = now;
            localCounter = 0; // reset to time slot 0
            return processPidMasked | (now << 4) | localCounter;
        }

        // we generated one or more unique stamps on this tick, see if there's any "free slots" left
        if (localCounter < 15) { // if we have free slots for this tick
            localCounter++;
            return processPidMasked | (now << 4) | localCounter;
        }

        // worst case scenario: we exhausted our local counter for this tick:
        // loop until we get to the next time slot
    }
}

Notes:

Critical code is lock-free and without explicit shared state. This will make generation very fast. The only lock is during startup.
You can modify how many bits you allocate for the compresed timestamp, localCounter, and shortProcessId to make it fit your use-case. With the current code, you'll be able to generate 16 timestamps per compressed-time-tick, which may fit your usage nicely.
Calling the OS time function may end up being the implicit shared state. I'm not sure how it will perform for your scenario.
Twitter uses (used?) similar algorithm for generating unique time-sortable Ids accross machines called Snowflake. They store some bits for machine id, some for time, and some for "sequence number" (our localCounter.) Similar example implementation I found by googling here: https://github.com/sony/sonyflake
With the above algorithm, you'll be able to theoretically generate maximum 1,600,000 unique timestamps per process per second. Hopefully enough for your case :)

OTHER TIPS

Well, first get rid of the lock.
Doing it lockless is probably much more efficient.

If that doesn't help enough, use batching.
Stockpile up to n consecutive timestamps for each local thread. Whenever the stockpile runs dry, refill from the shared global source.

The disadvantage is that the timestamps can lag too much behind the actual time for your comfort if the demand is irregular or low.

Potentially add a process-level stockpile between the shared global source and each thread.
Potentially refill the stockpile asynchronously when it falls below a threshold to avoid holdups.

Yes. Avoid the while loop and simply add a millisecond if multiple values are requested in the same tick.

aquire lock
read time
is <= lastTimeRead ? use lastTimeRead + 1ms
update lastTimeRead
release lock
return

As you want this across processes you need some interprocess communication.

The simplest way would be to run it as a seperate time service on the box and have it listen on a socket. if that's too slow, how about Com+?

Alternatively you could add a random number of milliseconds. Obviously this only gives you a chance of uniqueness. But perhaps you can pick up on the dupes at a later stage of processing?

Licensed under: CC-BY-SA with attribution

Not affiliated with softwareengineering.stackexchange