Question

In the internals of snappy, there is a conditionally compiled section that selects dereferencing a reinterpret_cast'ed pointer as the best implementation for reads and writes of potentially unaligned 16-, 32-, and 64-bit integers on architectures that are known to support such operations (like x86). The fallback for other architectures is a memcpy-based implementation.

My understanding is that the reinterpret_cast implementation exhibits undefined behavior, and clang's undefined behavior sanitizer does flag it.

What is puzzling me though is: why not just use the memcpy based implementation? I would expect all but the most broken of compilers to use intrinsics to implement these memcpy calls, since the size is known at compile time. In fact I would expect identical codegen from both implementations on any modern toolchain.
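To make the comparison concrete, here is a minimal sketch of what a memcpy-based accessor could look like. The names LoadU32 and StoreU32 are hypothetical and are not taken from snappy's source. Because the copy size is a compile-time constant, optimizing compilers commonly lower each call to a single load or store on targets that allow unaligned access, though the language standard does not promise this.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper names for illustration; not snappy's identifiers.

// Read a 32-bit value in host byte order from a possibly unaligned address.
inline std::uint32_t LoadU32(const void* p) {
    std::uint32_t v;
    std::memcpy(&v, p, sizeof v);  // size is known at compile time
    return v;
}

// Write a 32-bit value in host byte order to a possibly unaligned address.
inline void StoreU32(void* p, std::uint32_t v) {
    std::memcpy(p, &v, sizeof v);
}

// The variant under discussion would instead be written as
//   *reinterpret_cast<const std::uint32_t*>(p)
// which is undefined behavior when p is not suitably aligned.
```

Comparing the generated assembly for both variants on the toolchain in question is the only way to confirm whether they actually compile to the same instructions.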

However, I also recognize that snappy was written by folks who know what they are about. So this leaves me wondering whether there is still some advantage to using the reinterpret_cast mechanism that outweighs its being undefined behavior. Not wanting performance to depend on compiler quality of implementation? Something else I haven't considered?


Solution

Without knowing the programmer(s) who wrote that code in the first place, I doubt you can get a truly authoritative answer.

Here's my best guess: the authors didn't want to rely on a possible memcpy optimization (which is in no way guaranteed by the spec, even if it is implemented by many compilers). On the flip side, writing a reinterpret_cast is very, very likely to produce simply the unaligned access instruction that the authors were expecting, on practically any compiler.

While smart, modern compilers will optimize the memcpy, older ones may not. Consistent performance can be quite critical to this library, so they seem to have sacrificed some correctness (since the reinterpret_cast appears to be potentially UB) in favour of obtaining more consistent results across a wider set of compilers.

OTHER TIPS

The reason is that it's faster (on x86) to load an int from an unaligned address than to copy it and then load it.

The overhead of an unaligned load is about a factor of 2. The memcpy boils down to four byte reads and four byte writes (or one 32-bit write, depending on the compiler), and then you still need the load. In the best case, the optimizer may spot that the write-after-read is redundant.

Personally, I'd implement the safe method as 4 byte loads with shifts.
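A rough sketch of that byte-loads-with-shifts approach, assuming the stored value is little-endian; the name LoadLE32 is hypothetical. Unlike the memcpy and reinterpret_cast variants, which read in host byte order, this version yields the same result on any host.

```cpp
#include <cstdint>

// Hypothetical helper for illustration: assemble a 32-bit little-endian
// value from four individual byte loads. Alignment of p does not matter.
inline std::uint32_t LoadLE32(const unsigned char* p) {
    return static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}
```

Each byte is widened to an unsigned 32-bit type before shifting, so the shift by 24 cannot overflow a signed int.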

Licensed under: CC-BY-SA with attribution