Question

I am going to store some large objects in a database as BLOBs, and protobuf looks to me like one of the best candidates for serializing and deserializing them. Although the format is binary, its content (strings, integers, etc.) is still easy to read and to change. So I need some kind of validation that tells me whether a BLOB is the original or has been modified (by a hacker? by an overly clever user?).

One possibility would be a dedicated column in the table, call it crc, that holds a checksum of the BLOB. But in many scenarios it would be much better if the CRC were part of the BLOB itself.

I can add extra bytes to the end of the protobuf stream, but then I have to strip them before deserializing (or the deserializer will throw an exception, "invalid field blablabla").

I can put the protobuf stream into a wrapper, but wrapping and unwrapping is overhead again.

Is there an easy and cheap way to add something to the end of a protobuf stream without needing extra operations during deserialization? In XML I could add a comment. I don't think protobuf has comments, so how can I embed a CRC of, for example, 1 or 2 bytes?


Solution

Protobuf streams are appendable. If you know a field number that doesn't exist in the data, you can simply append data against that field. If you intend to add 1 or 2 bytes of CRC data, a "varint" is probably your best bet (note that "varint" is a 7-bit encoding in which the 8th bit of each byte is a continuation marker, so you probably want to use 7, 14 or 21 bits of actual CRC data). Then you can just append:

  • the chosen field number, left-shifted 3 bits (the low 3 bits hold the wire type, which is 0 for a varint), then varint encoded
  • the CRC data, varint encoded
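As a rough illustration of those two appended items, here is a minimal byte-level sketch. It is written in Python only to stay dependency-free; the bytes are the same from any language. The field number 1000, the names `encode_varint` and `append_crc_field`, and the use of `binascii.crc_hqx` as the checksum are all assumptions made for the example, not something protobuf prescribes.

```python
import binascii

CRC_FIELD = 1000      # hypothetical field number, assumed unused by the message
WIRETYPE_VARINT = 0   # protobuf wire type for varint values


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch handles non-negative values only")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # continuation bit set: more bytes follow
        else:
            out.append(low7)         # last byte: continuation bit clear
            return bytes(out)


def append_crc_field(payload: bytes, crc: int, field_number: int = CRC_FIELD) -> bytes:
    """Append `crc` to an already-serialized message as one extra varint field."""
    tag = (field_number << 3) | WIRETYPE_VARINT
    return payload + encode_varint(tag) + encode_varint(crc)


if __name__ == "__main__":
    payload = b"\x08\x96\x01"               # field 1 = 150, a tiny valid message
    crc = binascii.crc_hqx(payload, 0)      # 16-bit CRC, so up to 3 varint bytes
    print(append_crc_field(payload, crc).hex())
```

A 16-bit CRC does not fit in 14 bits, so it can take three varint bytes; truncating the checksum to 14 bits would keep it at two.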

However! The wrinkle in this is that the decoder will often still interpret and store this data, meaning that if you later re-serialize the object, the output will include the old CRC field.
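One way to sidestep that, sketched below under assumptions, is to split the trailing CRC field off at the byte level and hand only the remaining bytes to the decoder. The sketch walks the top-level fields of the stream; field number 1000, the 16-bit `binascii.crc_hqx` checksum, and the function names are hypothetical choices for the example, and only wire types 0, 1, 2 and 5 are handled.

```python
import binascii

CRC_FIELD = 1000  # hypothetical field number reserved for the checksum


def read_varint(buf: bytes, pos: int):
    """Decode a varint starting at `pos`; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, pos
        shift += 7
        if shift >= 70:
            raise ValueError("varint too long")


def skip_field(buf: bytes, pos: int, wire_type: int) -> int:
    """Return the position just past a field value of the given wire type."""
    if wire_type == 0:                      # varint
        _, pos = read_varint(buf, pos)
    elif wire_type == 1:                    # fixed 64-bit
        pos += 8
    elif wire_type == 2:                    # length-delimited
        length, pos = read_varint(buf, pos)
        pos += length
    elif wire_type == 5:                    # fixed 32-bit
        pos += 4
    else:
        raise ValueError("unsupported wire type %d" % wire_type)
    if pos > len(buf):
        raise ValueError("truncated field")
    return pos


def split_trailing_crc(blob: bytes):
    """Return (payload, crc); the CRC field must be the last top-level field."""
    pos = 0
    while pos < len(blob):
        start = pos
        tag, pos = read_varint(blob, pos)
        if tag >> 3 == CRC_FIELD and tag & 7 == 0:
            crc, end = read_varint(blob, pos)
            if end != len(blob):
                raise ValueError("CRC field is not the last field")
            return blob[:start], crc
        pos = skip_field(blob, pos, tag & 7)
    raise ValueError("no CRC field found")


def checksum16(payload: bytes) -> int:
    return binascii.crc_hqx(payload, 0)


def verify(blob: bytes) -> bytes:
    """Return the payload without the CRC field, or raise on mismatch."""
    payload, crc = split_trailing_crc(blob)
    if checksum16(payload) != crc:
        raise ValueError("CRC mismatch")
    return payload
```

The bytes returned by `verify` no longer contain the checksum, so a decoder never sees it and cannot write a stale copy back out. The cost is the extra pass over the stream, which is the kind of overhead the question wanted to avoid.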

The other approach, which avoids this, would be to encapsulate the protobuf data in some framing mechanism of your own devising. For example, you could choose to do:

  • 4 bytes to represent the protobuf payload length, "n"
  • "n" bytes of the protobuf payload
  • 2 bytes of CRC data calculated over the "n" bytes
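That layout could look roughly like the following sketch. Big-endian byte order, the CRC-16 variant provided by Python's `binascii.crc_hqx`, and the names `frame` and `unframe` are assumptions for the example; any fixed convention works as long as the writer and the reader agree on it.

```python
import binascii
import struct


def frame(payload: bytes) -> bytes:
    """Wrap a serialized message as: 4-byte length, payload, 2-byte CRC."""
    crc = binascii.crc_hqx(payload, 0)       # 16-bit CRC over the payload only
    return struct.pack(">I", len(payload)) + payload + struct.pack(">H", crc)


def unframe(blob: bytes) -> bytes:
    """Check the frame and return the payload, or raise ValueError."""
    if len(blob) < 6:
        raise ValueError("frame too short")
    (n,) = struct.unpack_from(">I", blob, 0)
    if len(blob) != 4 + n + 2:
        raise ValueError("length prefix does not match frame size")
    payload = blob[4:4 + n]
    (stored,) = struct.unpack_from(">H", blob, 4 + n)
    if binascii.crc_hqx(payload, 0) != stored:
        raise ValueError("CRC mismatch")
    return payload


if __name__ == "__main__":
    message = b"\x08\x96\x01"                # stand-in for real protobuf output
    stored = frame(message)
    assert unframe(stored) == message        # pass this on to the deserializer
```

Because the payload comes back untouched, the deserializer never sees the length or the CRC, and re-serializing cannot carry a stale checksum along.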

I'd probably go with the second option. Note that you could choose "varint" encoding rather than fixed length encoding for the length prefix if you want. Probably not worth it for the CRC, though, since that will be fixed length.
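For comparison, here is a hedged sketch of the same framing with a varint length prefix instead of 4 fixed bytes. The helper names and the choice of `binascii.crc_hqx` with a big-endian 2-byte CRC are assumptions for the example. Payloads under 128 bytes get a 1-byte prefix, which saves little on large BLOBs.

```python
import binascii
import struct


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)    # continuation bit set
        value >>= 7
    out.append(value)
    return bytes(out)


def decode_varint(buf: bytes, pos: int = 0):
    """Decode a varint at `pos`; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, pos
        shift += 7


def frame_varint(payload: bytes) -> bytes:
    """Wrap as: varint length, payload, fixed 2-byte CRC."""
    crc = binascii.crc_hqx(payload, 0)
    return encode_varint(len(payload)) + payload + struct.pack(">H", crc)


def unframe_varint(blob: bytes) -> bytes:
    """Check the frame and return the payload, or raise ValueError."""
    n, start = decode_varint(blob, 0)
    if len(blob) != start + n + 2:
        raise ValueError("length prefix does not match frame size")
    payload = blob[start:start + n]
    (stored,) = struct.unpack_from(">H", blob, start + n)
    if binascii.crc_hqx(payload, 0) != stored:
        raise ValueError("CRC mismatch")
    return payload
```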

OTHER TIPS

The CRC should be stored before the payload, as a header. This makes deserializing from the stream trivial: use Seek to skip the header.

Here is the simplest implementation:

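// Assumes protobuf-net (using System.IO; using ProtoBuf;) and a 1-byte checksum.

// serialize
using (var file = File.Create("test.bin"))
using (var mem = new MemoryStream())
{
    Serializer.Serialize(mem, obj); // serialize obj into memory first
    // ... calculate crc (a single byte) over the contents of mem
    file.Write(new byte[] { crc }, 0, 1); // 1-byte header goes first
    mem.WriteTo(file);                    // payload follows the header
}

// deserialize
using (var file = File.OpenRead("test.bin"))
{
    var crc = file.ReadByte(); // stored checksum (the 1-byte header)
    // ... calculate the checksum over the remaining bytes and compare it with crc
    file.Seek(1, SeekOrigin.Begin); // rewind to just past the header
    var obj = Serializer.Deserialize<ObjType>(file);
}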
Licensed under: CC-BY-SA with attribution