Question

From "Signed Types" on Encoding - Protocol Buffers - Google Code:

ZigZag encoding maps signed integers to unsigned integers so that numbers with a small absolute value (for instance, -1) have a small varint encoded value too. It does this in a way that "zig-zags" back and forth through the positive and negative integers, so that -1 is encoded as 1, 1 is encoded as 2, -2 is encoded as 3, and so on, as you can see in the following table:

Signed Original  Encoded As
0                0
-1               1
1                2
-2               3
2147483647       4294967294
-2147483648      4294967295

In other words, each value n is encoded using

(n << 1) ^ (n >> 31)

for sint32s, or

(n << 1) ^ (n >> 63)

for the 64-bit version.

How does (n << 1) ^ (n >> 31) equal what's in the table? I understand that it would work for positives, but how does it work for, say, -1? Wouldn't -1 be 1111 1111, and (n << 1) be 1111 1110? (Is bit-shifting on negatives well defined in any language?)

Nonetheless, using the formula and doing (-1 << 1) ^ (-1 >> 31), assuming a 32-bit int, I get 1111 1111, which is 4 billion, whereas the table says I should get 1.


Solution

Shifting a negative signed integer to the right copies the sign bit, so that

(-1 >> 31) == -1

Then,

(-1 << 1) ^ (-1 >> 31) = -2 ^ -1
                       = 1

This might be easier to visualise in binary (8 bit here):

(-1 << 1) ^ (-1 >> 7) = 11111110 ^ 11111111
                      = 00000001
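
For a concrete check, here is a small sketch (my own illustration, not part of the original answer) that evaluates the expression for every row of the table, assuming a two's-complement compiler such as GCC or Clang where >> on a negative int is an arithmetic shift:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main( void )
{
  int32_t values[] = { 0, -1, 1, -2, INT32_MAX, INT32_MIN };
  for ( int i = 0; i < 6; i++ )
  {
    int32_t n = values[i];
    /* Formally, shifting a negative signed value is not portable C,
       but typical two's-complement compilers match the table. */
    uint32_t encoded = ( uint32_t ) ( ( n << 1 ) ^ ( n >> 31 ) );
    printf( "%11" PRId32 " -> %10" PRIu32 "\n", n, encoded );
  }
  return 0;
}

On such a compiler, this prints exactly the six rows of the table from the question.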

Other tips

Another way to think about zig zag mapping is that it is a slight twist on a sign and magnitude representation.

In zig zag mapping, the least significant bit (lsb) of the mapping indicates the sign of the value: if it's 0, then the original value is non-negative; if it's 1, then the original value is negative.

Non-negative values are simply left shifted one bit to make room for the sign bit in the lsb.

For negative values, you could do the same one bit left shift for the absolute value (magnitude) of the number and simply have the lsb indicate the sign. For example, -1 could map to 0x03 or 0b00000011, where the lsb indicates that it is negative and the magnitude of 1 is left shifted by 1 bit.

The ugly thing about this sign and magnitude representation is "negative zero," mapped as 0x01 or 0b00000001. This variant of zero "uses up" one of our values and shifts the range of integers we can represent by one. We would probably want to special-case negative zero, mapping it to -2^63, so that we can represent the full 64b 2's complement range of [-2^63, 2^63). That means we've used one of our valuable single-byte encodings to represent a value that will very, very, very rarely be used in an encoding optimized for small-magnitude numbers, and we've introduced a special case, which is bad.

This is where zig zag's twist on this sign and magnitude representation happens. The sign bit is still in the lsb, but for negative numbers, we subtract one from the magnitude rather than special casing negative zero. Now, -1 maps to 0x01 and -2^63 has a non-special case representation too (i.e. - magnitude 2^63 - 1, left shifted one bit, with lsb / sign bit set, which is all bits set to 1s).

So, another way to think about zig zag encoding is that it is a smarter sign and magnitude representation: the sign bit is stored in the lsb, 1 is subtracted from the magnitude of negative numbers, and the magnitude is left shifted one bit.

It is faster to implement these transformations using the unconditional bitwise operators that you posted rather than explicitly testing the sign, special-casing negative values (e.g. - negate and subtract 1, or bitwise NOT), shifting the magnitude, and then explicitly setting the lsb sign bit. They are equivalent in effect, however, and this more explicit sign and magnitude series of steps may make it easier to understand what we are doing and why (see the sketch below).
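
For example, here is what that explicit series of steps might look like in C (a sketch of my own; the names zz_map_explicit and zz_unmap_explicit are made up for illustration). The magnitude arithmetic is done on the unsigned type to avoid shifting signed values:

#include <stdint.h>

/* Explicit sign-and-magnitude version: test the sign, subtract one
   from the magnitude of negatives, shift left, and set the lsb. */
uint64_t zz_map_explicit( int64_t x )
{
  if ( x >= 0 )
    return ( uint64_t ) x << 1;                 /* magnitude shifted, sign bit 0 */
  return ( ( -( uint64_t ) x - 1 ) << 1 ) | 1;  /* ( |x| - 1 ) shifted, sign bit 1 */
}

int64_t zz_unmap_explicit( uint64_t y )
{
  if ( ( y & 1 ) == 0 )
    return ( int64_t ) ( y >> 1 );              /* even: just undo the shift */
  return -( int64_t ) ( y >> 1 ) - 1;           /* odd: restore magnitude - 1, negate */
}

With this version, -1 maps to 0x01 and -2^63 maps to all bits set to 1s, as described above.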

I will warn you though that bit shifting negative values in C / C++ is not portable and should be avoided. Left shifting a negative value has undefined behavior and right shifting a negative value has implementation defined behavior. Even left shifting a positive integer can have undefined behavior (e.g. - if you shift into the sign bit it might cause a trap or something worse). So, in general, don't bit shift signed types in C / C++. "Just say no."

Cast first to the unsigned version of the type to have safe, well-defined results according to the standard. This does mean that you then won't have arithmetic shift of negative values -- only logical shift, so you need to adjust the logic to account for that.

Here are the safe and portable versions of the zig zag mappings for 64b integers in C (note the arithmetic negation):

#include <stdint.h>

/* Map a signed value to its zig zag encoding. The left shift makes
   room for the sign bit in the lsb; negating the logical right shift
   produces an all-ones mask for negative inputs (and all-zeros for
   non-negative ones), reproducing the effect of an arithmetic shift
   without ever shifting a signed type. */
uint64_t zz_map( int64_t x )
{
  return ( ( uint64_t ) x << 1 ) ^ -( ( uint64_t ) x >> 63 );
}

/* Invert the mapping: the right shift drops the sign bit, and the
   XOR with -( y & 0x1 ) flips all the bits back when the sign bit
   was set. */
int64_t zz_unmap( uint64_t y )
{
  return ( int64_t ) ( ( y >> 1 ) ^ -( y & 0x1 ) );
}
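
As a quick round-trip sanity check (my own sketch, assuming the zz_map and zz_unmap definitions above are in scope):

#include <assert.h>
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main( void )
{
  int64_t samples[] = { 0, -1, 1, -2, INT64_MAX, INT64_MIN };
  for ( int i = 0; i < 6; i++ )
  {
    uint64_t e = zz_map( samples[i] );
    assert( zz_unmap( e ) == samples[i] );  /* the mapping must round-trip */
    printf( "%20" PRId64 " -> %20" PRIu64 "\n", samples[i], e );
  }
  return 0;
}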

Let me add my two cents to the discussion. As other answers noted, the zig-zag encoding can be thought of as a sign-magnitude twist. This fact can be used to implement conversion functions which work for arbitrary-sized integers. For example, I use the following code in one of my Python projects:

def zigzag(x: int) -> int:
    # Non-negative: shift left one bit; negative: shift |x| - 1 and set the lsb.
    return x << 1 if x >= 0 else (-x - 1) << 1 | 1

def zagzig(x: int) -> int:
    assert x >= 0
    sign = x & 1
    # Odd values decode to negatives: undo the shift, then the magnitude twist.
    return -(x >> 1) - 1 if sign else x >> 1

These functions work even though Python's int has no fixed bit width; it simply extends dynamically. However, this approach may be inefficient in compiled languages, since it requires conditional branching.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow