How about something like this:
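    typedef unsigned int word;  // assumed to be 32 bits wide

    // Replace the trailing (low-order) zero bytes of a word with spaces.
    word spacePad(word input) {
        static const word spaces = 0x20202020;
        // mask covers exactly the bytes that get replaced by spaces
        word mask =
            !input                ? 0xffffffff :
            !(input & 0x00ffffff) ? 0x00ffffff :
            !(input & 0x0000ffff) ? 0x0000ffff :
            !(input & 0x000000ff) ? 0x000000ff :
                                    0;
        // or without branches: count the non-zero bytes (assuming all zero
        // bytes are trailing) and shift an all-ones word right by that many
        // bytes; the 64-bit constant keeps a shift by 32 well-defined
        word branchless_mask = (word)(
            0xffffffffull >> (8 * (
                bool(input & 0xff000000) +
                bool(input & 0x00ff0000) +
                bool(input & 0x0000ff00) +
                bool(input & 0x000000ff)
            )));
        (void)branchless_mask;  // same value as mask; use whichever you prefer
        return (spaces & mask) | (input & ~mask);
    }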
And if I didn't screw up, spacePad(0xaabb0000) is 0xaabb2020.
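One way to check bit tricks like this is to compare them against a slow but obvious reference. The sketch below is only an illustration: the name `spacePadRef` is made up here, and it assumes that "padding" means replacing the trailing (low-order) zero bytes with 0x20, so that an all-zero word becomes four spaces.

```cpp
#include <cstdint>

// Hypothetical reference implementation (not part of the code above):
// walk from the low byte upwards and fill zero bytes with a space
// until the first non-zero byte is reached.
std::uint32_t spacePadRef(std::uint32_t input) {
    std::uint32_t result = input;
    for (int shift = 0; shift < 32; shift += 8) {
        if ((input >> shift) & 0xffu) {
            break;  // first non-zero byte: everything above it is kept as-is
        }
        result |= std::uint32_t{0x20} << shift;  // zero byte: fill with a space
    }
    return result;
}
```

Running a fast version and this reference over inputs with zero to four trailing zero bytes (for example 0xaabbccdd, 0xaabbcc00, 0xaabb0000, 0xaa000000 and 0) and comparing the results should catch an inverted or misplaced mask.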
Instead of computing AND-masks, you could use SSE intrinsics, which would probably be faster since you'd get the mask in a couple of instructions and a masked move would do the rest. However, the compiler would probably move your variables around between SSE and general-purpose registers, which could outweigh the slight gain. It all depends on how much data you need to process, how it's packed in memory, etc.
If the input is a char* and not an int, additional code would normally be necessary, since a cast could read into unallocated memory. But since you mention that all strings are word-aligned, a cast is enough: even if there are a few unallocated bytes, they are in the same word as at least one allocated byte. Since you are only reading, there's no risk of memory corruption, and on all architectures I know of, hardware memory protection has a granularity larger than a word. For instance, on x86 a memory page is typically 4 KiB and aligned to a 4 KiB boundary.
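For the char* case, a possible sketch is shown below. It copies the four bytes out with `memcpy` instead of casting the pointer, which sidesteps aliasing questions, and it assembles the word explicitly with the first character in the high byte so that the result does not depend on the machine's byte order (a plain cast on a little-endian machine would put the first character in the low byte instead). The name `spacePadField` is hypothetical, and the code assumes a 4-byte, zero-padded field whose four bytes are all readable and writable.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: pad a 4-byte, zero-padded character field with spaces in place.
void spacePadField(char* field) {
    unsigned char b[4];
    std::memcpy(b, field, 4);  // read the whole word, including padding bytes

    // First character goes into the high byte, so trailing NULs end up
    // in the low-order bytes, as in the 0xaabb0000 example.
    std::uint32_t w = (std::uint32_t{b[0]} << 24) | (std::uint32_t{b[1]} << 16) |
                      (std::uint32_t{b[2]} << 8)  |  std::uint32_t{b[3]};

    // Count the non-zero bytes (assumes all zero bytes are trailing) and
    // build a mask that covers the remaining low-order bytes.
    int used = (b[0] != 0) + (b[1] != 0) + (b[2] != 0) + (b[3] != 0);
    std::uint32_t mask = static_cast<std::uint32_t>(0xffffffffull >> (8 * used));

    w = (0x20202020u & mask) | (w & ~mask);

    // Write the bytes back in the same order they were read.
    b[0] = static_cast<unsigned char>(w >> 24);
    b[1] = static_cast<unsigned char>(w >> 16);
    b[2] = static_cast<unsigned char>(w >> 8);
    b[3] = static_cast<unsigned char>(w);
    std::memcpy(field, b, 4);
}
```

Compilers usually turn a fixed-size `memcpy` like this into a single load or store, but that is worth confirming in the generated code before relying on it.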
Now that's all nice and hacky, but before selecting a solution, benchmark it: that's the only way to know which is best for you (apart, of course, from the warm fuzzy feeling of writing code like this ^^).