The buffer corruption problem occurs when binary data is encoded using any text encoding except base64 and hex, which don't seem to be picked up by node-msgpack. It automatically falls back to 'utf-8', which irreversibly corrupts the buffer. They had to do something like that so we don't end up with a bunch of Buffer objects instead of ordinary strings, which is what most of our msgpack objects are made of.
EDIT:
The three bytes that were shown above to be problematic represent the UTF-8 replacement character. A quick test shows that this character was used to replace the unrecognizable 0x89 byte at the start:
new Buffer((new Buffer('89', 'hex')).toString('utf-8'), 'utf-8')
//> <Buffer ef bf bd>
This line of C++ code from node-msgpack is responsible for this behavior. When it encounters a Buffer instance in a data structure given to the encoder, it blindly converts it to a String, equivalent to executing buffer.toString(), which by default assumes UTF-8 encoding and replaces every unrecognizable byte sequence with the replacement character shown above.
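The corruption is easy to reproduce with Node's built-in Buffer alone, without involving msgpack at all. Here, a buffer starting with the invalid UTF-8 byte 0x89 (the first byte of a PNG header, used purely as an illustration) loses that byte permanently after a string round-trip:

```javascript
// Converting a binary buffer to a UTF-8 string and back does not round-trip:
// byte sequences that are invalid UTF-8 are replaced with U+FFFD (EF BF BD).
const original = Buffer.from([0x89, 0x50, 0x4e, 0x47]); // 0x89 + "PNG"
const corrupted = Buffer.from(original.toString('utf-8'), 'utf-8');

console.log(original);  // <Buffer 89 50 4e 47>
console.log(corrupted); // <Buffer ef bf bd 50 4e 47>
```

The replacement is lossy: once 0x89 has become EF BF BD, there is no way to recover the original byte.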
The alternative module suggested below works around this by leaving the buffer as raw bytes instead of trying to convert it to a string, but by doing so it becomes incompatible with other MessagePack implementations. If compatibility is a concern, a workaround is to encode non-UTF-8 buffers ahead of time with a binary-safe encoding like binary, base64 or hex. base64 and hex will inevitably grow the data by a significant amount, but will keep it consistent and are safest when transporting data over HTTP. If size is a concern as well, piping the MessagePack result through a streaming compression algorithm like Snappy can be a good option.
Turns out another module, msgpack-js (a MessagePack encoder/decoder written entirely in JavaScript), leaves raw binary data untouched, solving the above problem. Here's how its author describes it:
I've extended the format a little to allow for encoding and decoding of undefined and Buffer instances.
This required three new type codes that were previously marked as "reserved". This change means that using these new types will render your serialized data incompatible with other messagepack implementations that don't have the same extension.
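The idea behind repurposing a reserved type code can be sketched in a few lines. This is not msgpack-js's actual wire format, just a conceptual illustration: 0xc1 is a byte the classic MessagePack spec leaves unused, and here it hypothetically means "raw buffer follows", with a 4-byte big-endian length prefix:

```javascript
// Conceptual sketch only -- not msgpack-js's real encoding.
function encodeBuffer(buf) {
  const out = Buffer.alloc(5 + buf.length);
  out[0] = 0xc1;                    // hypothetical "raw buffer" type code
  out.writeUInt32BE(buf.length, 1); // 4-byte big-endian length
  buf.copy(out, 5);                 // raw bytes, no string conversion
  return out;
}

function decodeBuffer(data) {
  if (data[0] !== 0xc1) throw new Error('not a buffer extension');
  const len = data.readUInt32BE(1);
  return data.slice(5, 5 + len);
}
```

Because the bytes are copied verbatim rather than run through a text decoder, invalid UTF-8 sequences like 0x89 survive the round trip. The trade-off, as the quote says, is that a decoder unaware of the extension will reject or misread the 0xc1 byte.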
As a bonus, it's also more performant than the C++ extension-based module mentioned earlier. It's also much younger, so perhaps not as thoroughly tested. Time will tell. Here is the result of a quick benchmark I did, based on the one included in node-msgpack, comparing the two libraries (as well as the native JSON functions):
node-msgpack pack: 3793 ms
node-msgpack unpack: 1340 ms
msgpack-js pack: 3132 ms
msgpack-js unpack: 983 ms
json pack: 1223 ms
json unpack: 483 ms
So while the pure-JavaScript msgpack implementation outperforms the C++ one, JSON is still significantly faster than either.