Convert a 64bit integer to an array of 7bit-characters

https://stackoverflow.com/questions/22884546

28-06-2023

Question

Say I have a function vector<unsigned char> byteVector(long long UID) that returns a byte representation of the UID, a 64bit integer, as a vector. This vector is later used to write the data to a file.

Now, because I decided I want to read that file with Python, I have to comply with the UTF-8 standard, which means I can only use the first 7 bits of each char. If the most significant bit is 1, I can't decode the data to a string anymore, because those only support ASCII characters. Also, I'll have to pass those strings to other processes via a command line interface, which also supports only the ASCII character set.

Before that problem arose, my approach to splitting the 64bit integer into 8 separate bytes was the following, which worked really well:

vector<unsigned char> outputVector = vector<unsigned char>();
unsigned char * uidBytes = (unsigned char*) &UID_;
for (int i = 0; i < 8; i++){
    outputVector.push_back(uidBytes[i]);
}

Of course that doesn't work anymore, as the constraint "the highest bit may not be 1" limits the maximum value of each unsigned char to 127. My easiest option now would of course be to replace the one push_back call with this:

outputVector.push_back(uidBytes[i] / 128);
outputVector.push_back(uidBytes[i] % 128);

But this seems kind of wasteful, as the first of each unsigned char pair can only be 0 or 1 and I would be wasting some space (6 bytes) I could otherwise use.

As I need to save 64 bits, and can use 7 bits per byte, I'll need ceil(64 / 7) = 10 bytes (nine bytes hold only 63 bits).
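The count of ten is a ceiling division: nine groups of 7 bits cover 63 bits, and the one remaining bit needs a tenth group. A minimal sketch of that arithmetic, where the helper name groupsNeeded is hypothetical and not part of the original code:

```cpp
// Number of groups of bitsPerGroup bits needed to hold totalBits bits.
// This is integer ceiling division; both arguments are assumed positive.
constexpr int groupsNeeded(int totalBits, int bitsPerGroup)
{
    return (totalBits + bitsPerGroup - 1) / bitsPerGroup;
}

// 64 bits in 7-bit groups need ten groups, as stated in the question.
static_assert(groupsNeeded(64, 7) == 10, "64 bits need ten 7-bit groups");
```

The same helper gives 8 for ordinary 8-bit bytes and 11 for 6-bit groups.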

It isn't really much (none of the files I write ever even reached the 1kB mark), but I was using 8 bytes before and it seems a bit wasteful to use 16 now when ten (not 9, I'm sorry) would suffice. So:

How do I convert a 64bit integer to a vector of ten 7-bit integers?

This is probably too much optimization, but there could be some very cool solution for this problem (probably using shift operators) and I would be really interested in seeing it.

Solution

You can use bit shifts to take 7-bit pieces of the 64-bit integer. You need ten 7-bit integers, though; nine are not enough, because 9 * 7 = 63 is one bit short.

std::uint64_t uid = 42; // Your 64-bit input here.
std::vector<std::uint8_t> outputVector;

for (int i = 0; i < 10; i++)
{
  // Shift the i-th 7-bit group down to the low bits, then mask it out.
  outputVector.push_back(static_cast<std::uint8_t>((uid >> (i * 7)) & 0x7f));
}

In every iteration, we shift the input bits by a multiple of 7, and mask out a 7-bit part. The most significant bit of the 8-bit numbers will be zero. Note that the numbers in the vector are “reversed”: the least significant bits have the lowest index. This is irrelevant though, if you decode the parts in the correct way. Decoding can be done as follows:

std::uint64_t decoded = 0;
for (int i = 0; i < 10; i++)
{
  decoded |= static_cast<std::uint64_t>(outputVector[i]) << (i * 7);
}
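Put together, the two loops above can be wrapped into a pair of functions to check that the round trip is lossless. This is a sketch under the same assumptions as the snippets (ten groups, least significant group first); the names encode7 and decode7 are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Split a 64-bit value into ten 7-bit groups, least significant group first.
// Every returned element is in the range 0..127, so its top bit is zero.
std::vector<std::uint8_t> encode7(std::uint64_t uid)
{
    std::vector<std::uint8_t> out;
    out.reserve(10);
    for (int i = 0; i < 10; ++i)
    {
        out.push_back(static_cast<std::uint8_t>((uid >> (i * 7)) & 0x7f));
    }
    return out;
}

// Reassemble the value from exactly ten groups, in the order encode7 wrote them.
std::uint64_t decode7(const std::vector<std::uint8_t>& groups)
{
    assert(groups.size() == 10);
    std::uint64_t value = 0;
    for (int i = 0; i < 10; ++i)
    {
        // Widen to 64 bits before shifting so the high groups are not lost.
        value |= static_cast<std::uint64_t>(groups[i] & 0x7f) << (i * 7);
    }
    return value;
}
```

For any input, decode7(encode7(uid)) should give back uid. The tenth group only ever carries the single remaining bit (bit 63), so it is either 0 or 1.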

Please note that it seems like a bad idea to interpret the resulting vector as UTF-8 encoded text: the sequence can still contain control characters and \0. If you want to encode your 64-bit integer in printable characters, take a look at base64. Since each base64 character carries only 6 bits, you will need one more character (eleven in total, not counting padding) to encode 64 bits.
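To illustrate the printable variant, here is a sketch that applies the same shift-and-mask idea with 6-bit groups and maps each group to a character from the URL-safe base64 alphabet. It is an assumption-laden illustration, not standard base64: a real base64 encoder works on the byte sequence and would produce a different character order and possibly padding. The names kAlphabet, encodePrintable and decodePrintable are hypothetical.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// 64 printable symbols (the URL-safe base64 alphabet), indexed by 6-bit value.
const char kAlphabet[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

// Encode a 64-bit value as eleven printable characters,
// least significant 6-bit group first.
std::string encodePrintable(std::uint64_t uid)
{
    std::string out;
    for (int i = 0; i < 11; ++i)
    {
        out.push_back(kAlphabet[(uid >> (i * 6)) & 0x3f]);
    }
    return out;
}

// Decode a string produced by encodePrintable; throws on malformed input.
std::uint64_t decodePrintable(const std::string& text)
{
    if (text.size() != 11)
    {
        throw std::invalid_argument("expected exactly 11 characters");
    }
    const std::string alphabet(kAlphabet);
    std::uint64_t value = 0;
    for (int i = 0; i < 11; ++i)
    {
        const std::string::size_type digit = alphabet.find(text[i]);
        if (digit == std::string::npos)
        {
            throw std::invalid_argument("character outside the alphabet");
        }
        value |= static_cast<std::uint64_t>(digit) << (i * 6);
    }
    return value;
}
```

The output contains no control characters and no \0. Whether '-' and '_' are safe depends on how the receiving command line parses its arguments, so the alphabet may need adjusting.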

OTHER TIPS

I suggest using assembly language.

Many assembly languages have instructions for shifting a bit into a "spare" carry bit and shifting the carry bit into a register. The C language has no convenient or efficient method to do this.

The algorithm:

for i = 0; i < 7; ++i
{
  right shift 64-bit word into carry.
  right shift carry into character.
}

You should also look into using std::bitset.
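As a rough idea of what the std::bitset route could look like, the sketch below extracts the same ten 7-bit groups (least significant first) using bitset shifts and a mask. The function name encodeWithBitset is hypothetical, and this is only one possible way to use std::bitset here; it is not claimed to be faster than plain integer shifts.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <vector>

// Split a 64-bit value into ten 7-bit groups using std::bitset operations.
std::vector<std::uint8_t> encodeWithBitset(std::uint64_t uid)
{
    const std::bitset<64> bits(uid);   // all 64 input bits
    const std::bitset<64> mask(0x7f);  // the low seven bits set
    std::vector<std::uint8_t> out;
    out.reserve(10);
    for (std::size_t i = 0; i < 10; ++i)
    {
        // Shift group i down to the low bits, keep seven bits, convert back.
        const std::bitset<64> group = (bits >> (i * 7)) & mask;
        out.push_back(static_cast<std::uint8_t>(group.to_ullong()));
    }
    return out;
}
```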

Licensed under: CC-BY-SA with attribution
