The buffer corruption problem occurs when binary data is encoded using any text encoding except base64 and hex, which don't seem to be picked up by node-msgpack. It automatically falls back to 'utf-8', which irreversibly corrupts the buffer. They had to do something like that so we don't end up with a bunch of Buffer objects instead of ordinary strings, which is what most of our msgpack objects are made of.
EDIT:
The three bytes that were shown above to be problematic represent the UTF-8 replacement character. A quick test shows that this character was used to replace the unrecognizable 0x89 byte at the start:
new Buffer((new Buffer('89', 'hex')).toString('utf-8'), 'utf-8')
//> <Buffer ef bf bd>
This line of C++ code from node-msgpack is responsible for this behavior. When it encounters a Buffer instance in a data structure given to the encoder, it blindly converts it to a String, equivalent to executing buffer.toString(), which by default assumes UTF-8 encoding and replaces every unrecognizable byte sequence with the replacement character shown above.
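The corruption is easy to reproduce with Node's built-in Buffer alone, without involving msgpack at all. Here, a buffer starting with the invalid UTF-8 byte 0x89 (the first byte of a PNG header, used purely as an illustration) loses that byte permanently after a string round-trip:

```javascript
// Converting a binary buffer to a UTF-8 string and back does not round-trip:
// byte sequences that are invalid UTF-8 are replaced with U+FFFD (EF BF BD).
const original = Buffer.from([0x89, 0x50, 0x4e, 0x47]); // 0x89 + "PNG"
const corrupted = Buffer.from(original.toString('utf-8'), 'utf-8');

console.log(original);  // <Buffer 89 50 4e 47>
console.log(corrupted); // <Buffer ef bf bd 50 4e 47>
```

The replacement is lossy: once 0x89 has become EF BF BD, there is no way to recover the original byte.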
The alternative module suggested below works around this by leaving the buffer as raw bytes instead of trying to convert it to a string, but by doing so it becomes incompatible with other MessagePack implementations. If compatibility is a concern, a workaround is to encode non-UTF-8 buffers ahead of time with a binary-safe encoding like binary, base64 or hex. base64 and hex will inevitably grow the data by a significant amount, but will keep it consistent and are safest when transporting data over HTTP. If size is a concern as well, piping the MessagePack result through a streaming compression algorithm like Snappy can be a good option.
Turns out another module, msgpack-js (a MessagePack encoder/decoder written entirely in JavaScript), leaves raw binary data untouched, solving the above problem. Here's how its author describes it:
I've extended the format a little to allow for encoding and decoding of undefined and Buffer instances.
This required three new type codes that were previously marked as "reserved". This change means that using these new types will render your serialized data incompatible with other messagepack implementations that don't have the same extension.
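The idea behind repurposing a reserved type code can be sketched in a few lines. This is not msgpack-js's actual wire format, just a conceptual illustration: 0xc1 is a byte the classic MessagePack spec leaves unused, and here it hypothetically means "raw buffer follows", with a 4-byte big-endian length prefix:

```javascript
// Conceptual sketch only -- not msgpack-js's real encoding.
function encodeBuffer(buf) {
  const out = Buffer.alloc(5 + buf.length);
  out[0] = 0xc1;                    // hypothetical "raw buffer" type code
  out.writeUInt32BE(buf.length, 1); // 4-byte big-endian length
  buf.copy(out, 5);                 // raw bytes, no string conversion
  return out;
}

function decodeBuffer(data) {
  if (data[0] !== 0xc1) throw new Error('not a buffer extension');
  const len = data.readUInt32BE(1);
  return data.slice(5, 5 + len);
}
```

Because the bytes are copied verbatim rather than run through a text decoder, invalid UTF-8 sequences like 0x89 survive the round trip. The trade-off, as the quote says, is that a decoder unaware of the extension will reject or misread the 0xc1 byte.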
As a bonus, it's also more performant than the C++ extension-based module mentioned earlier. It's also much younger, so perhaps not as thoroughly tested. Time will tell. Here is the result of a quick benchmark I did, based on the one included in node-msgpack, comparing the two libraries (as well as the native JSON functions):
node-msgpack pack: 3793 ms
node-msgpack unpack: 1340 ms
msgpack-js pack: 3132 ms
msgpack-js unpack: 983 ms
json pack: 1223 ms
json unpack: 483 ms
So while the pure-JavaScript msgpack implementation outperforms the C++ one, JSON is still significantly faster than either.