WaitForSingleObject does not have to be faster. It covers a much wider range of synchronization scenarios; in particular, you can wait on handles that do not "belong" to your process, which makes interprocess synchronization possible. Taking all this into consideration, it is only 38% slower according to your test.
If everything lives inside your process and every nanosecond counts, InterlockedXxx might be a better option, but it is definitely not an absolutely superior one.
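To make the in-process case concrete, here is a minimal sketch of a user-mode spin lock. It is an illustration under assumptions, not your code: SpinLock and count_with_spinlock are hypothetical names, and std::atomic is used in place of the Win32 InterlockedXxx calls so the sketch builds without windows.h. On Windows, the exchange() call here plays the role InterlockedExchange would.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical minimal spin lock. std::atomic<long>::exchange stands in for
// InterlockedExchange so the sketch is not tied to <windows.h>.
class SpinLock {
public:
    void lock() {
        // Atomically write 1 and inspect the previous value:
        // 0 means the lock was free and we now own it.
        while (state_.exchange(1, std::memory_order_acquire) != 0) {
            std::this_thread::yield();  // back off instead of burning the CPU
        }
    }

    void unlock() {
        // Publish our writes and mark the lock as free.
        state_.store(0, std::memory_order_release);
    }

private:
    std::atomic<long> state_{0};
};

// Starts `threads` workers that each increment a shared counter
// `iterations` times under the lock, then returns the final value.
long count_with_spinlock(int threads, int iterations) {
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& w : workers) w.join();
    return counter;
}
```

Calling count_with_spinlock(4, 10000) should return 40000 if the lock works. Note what this sketch cannot do: it has no timeout, it cannot be shared with another process, and a waiter never sleeps in the kernel. Those are the features WaitForSingleObject pays for.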
Additionally, you might want to look at the Slim Reader/Writer (SRW) Locks API. You could perhaps build a similar class or set of functions based purely on InterlockedXxx with slightly better performance; the point, however, is that SRW locks are ready to use out of the box, with documented, stable behavior and decent performance anyway.