Question

I have a large dataset (100k+ items) that I want to serialize using Boost.Serialization. This works satisfactorily.

Now, when working with even larger datasets, the entire set no longer fits into memory (I currently store a std::map with all the data in the archive). Since I need neither random reads nor random writes, and only need to access one item at a time, I thought about streaming the dataset by saving instances directly to the archive (archive << item1 << item2 ...) and unpacking them one by one.

The other option would be to develop a new file format from scratch (something simple like <length><block> where each <block> corresponds to one Boost.Serialization archive), because I noticed that it doesn't seem possible to detect the end of an archive in Boost.Serialization without catching exceptions (input_stream_error should be thrown on a read past the end of the archive, I think).

Which option is preferable to the other? Abusing Serialization archives for streaming seems odd and hacky but has the big advantage of not re-inventing the wheel, while the file format wrapping archives feels cleaner but more error-prone.

Solution

Using Boost Serialization for streaming is neither an abuse nor odd.

In fact, Boost Serialization has nothing but the streaming archive interface. So yes, the applicable approach would be to do as you said:

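archive << number_of_items;                         // count first, so the reader knows how many items follow
for(auto it = input_iterator(); it != end(); ++it)  // input_iterator()/end() stand for your own item source
    archive << *it;                                 // each item is written on its own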

In fact, very little stops you from doing the same in your serialize method. You could possibly even make it "automatic" by wrapping your stream into something (like an iterator_range?) and extending Boost Serialization to 'understand' these, like it 'understands' containers, arrays etc.

The file format approach is definitely not cleaner (from the library perspective) since it ruins the archive format isolation. The serialization library has been carefully designed to avoid knowledge about the archive representation, and it would be a breach of abstraction to circumvent this. Also see

Licensed under: CC-BY-SA with attribution