Question

I have some data files written as tag = value, where the tag is a string and the value may be numeric, a string, an array, etc. I use this format because it is readable and can be edited easily. Currently, every class that is instantiated from this format has a load method that reads the tags it needs and uses the values found within them. I want to make the data binary to increase loading speed. One way would be to have a ToBinary (the name does not matter) method in every class that reads the old data and writes it to a file, and the new file is then used to instantiate the object. This can be done offline, only once per application. Do you have other suggestions for this? I use C++ for this.

Edit: I think the most expensive part now is parsing the file when I first read it, and after that searching for the tag I need, not reading the file from disk. I could use a custom file system to pack multiple small files into one big file.

Solution

I have a serialization base class for this, with To/From functions and a small header in which version handling can be embedded. I think it's a good system for simpler data that needs to be stored locally and is in most cases read-only.

Something like this:

class Archive;  // stream abstraction, defined elsewhere

class SerializeMe
{
public:

 virtual ~SerializeMe() = default;

 virtual bool To(Archive &file) = 0;    // write this object to the archive
 virtual bool From(Archive &file) = 0;  // read this object back

 virtual bool NeedsSave(void) = 0;      // dirty flag: skip unchanged objects

};
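A concrete subclass might look like the sketch below. The answer leaves Archive's interface unspecified, so here it is assumed to be a thin wrapper over a binary std::fstream; the base class is repeated (spelled SerializeMe) so the sketch is self-contained, and PlayerState with its fields is a made-up example:

```cpp
#include <cstdint>
#include <fstream>
#include <string>

// Assumed Archive: a thin wrapper over a binary fstream.
class Archive {
public:
    Archive(const std::string &path, bool writing)
        : stream_(path, std::ios::binary | (writing ? std::ios::out : std::ios::in)) {}

    bool Write(const void *data, std::size_t size) {
        return static_cast<bool>(stream_.write(static_cast<const char *>(data), size));
    }
    bool Read(void *data, std::size_t size) {
        return static_cast<bool>(stream_.read(static_cast<char *>(data), size));
    }
private:
    std::fstream stream_;
};

class SerializeMe {
public:
    virtual ~SerializeMe() = default;
    virtual bool To(Archive &file) = 0;
    virtual bool From(Archive &file) = 0;
    virtual bool NeedsSave(void) = 0;
};

// Example subclass: writes a small version header, then its fields.
class PlayerState : public SerializeMe {
public:
    std::int32_t score = 0;

    bool To(Archive &file) override {
        std::int32_t version = 1;  // the "small header" mentioned above
        return file.Write(&version, sizeof version)
            && file.Write(&score, sizeof score);
    }
    bool From(Archive &file) override {
        std::int32_t version = 0;
        if (!file.Read(&version, sizeof version) || version != 1)
            return false;  // unknown version: refuse to load
        return file.Read(&score, sizeof score);
    }
    bool NeedsSave(void) override { return dirty_; }
private:
    bool dirty_ = false;
};
```

The version header is what makes later format changes survivable: From can branch on it instead of misreading old files.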

However, do not use this system if you:

  • Need to change the format often.
  • Need to select which data to load and which to store.
  • Use large files, which are particularly sensitive to power outages while saving.

If any of the above apply, use a database instead; embedded FirebirdSQL is a suitable contender.

OTHER TIPS

I haven't used it before, but I'm sure Boost's Serialization module is a good place to start.

If you are using a file, then using binary data will probably not improve your performance significantly, unless you have very large chunks of data to store in the file (images, videos, ...).

But in any case you can use a binary serialization library, such as the one from Boost.

Another one is protobuf from Google. It is not the fastest, but it supports evolving data types, is very efficient over the network, and supports other languages.


If you want to improve performance, you will have to use fixed-length fields. Parsing or loading variable-length fields does not provide a significant increase in performance. Reading by text line involves scanning for the end-of-line token, and scanning wastes time.

Before using any of the following suggestions, profile your code to establish a baseline performance measurement. Profile again after each suggestion, as this will let you calculate the performance delta of each optimization. My prediction is that the delta will become smaller with each optimization.

I suggest first converting the file to fixed length records, still using text. Pad fields with spaces as necessary. Thus, knowing the size of a record, you can block read into memory and treat the memory as an array. This should provide a significant improvement.
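As a sketch of this step (the field widths are illustrative, not from the answer): each record is a fixed 32 bytes of space-padded text, so the whole file can be block-read in one call and viewed as an array of structs.

```cpp
#include <cstring>
#include <fstream>
#include <string>
#include <vector>

// Assumed layout: a 16-byte tag padded with spaces, then a 16-byte value
// padded with spaces. Every record is therefore exactly 32 bytes.
constexpr std::size_t kTagWidth = 16;
constexpr std::size_t kValueWidth = 16;
constexpr std::size_t kRecordSize = kTagWidth + kValueWidth;

struct Record {
    char tag[kTagWidth];
    char value[kValueWidth];
};
static_assert(sizeof(Record) == kRecordSize, "no padding expected");

// Block-read the whole file into memory and treat it as an array of records.
std::vector<Record> LoadRecords(const std::string &path) {
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    const std::streamsize size = in.tellg();  // file size, since we opened at end
    in.seekg(0);

    std::vector<Record> records(static_cast<std::size_t>(size) / kRecordSize);
    in.read(reinterpret_cast<char *>(records.data()), records.size() * kRecordSize);
    return records;
}
```

Lookup then becomes indexing into `records` instead of re-scanning text, which is where the improvement comes from.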

At this point, your bottlenecks are still file I/O speed, which you can't significantly improve (file I/O is controlled by the OS), and scanning / converting text. Some further optimizations are converting the text to numbers and, finally, converting to binary. At all costs, prefer to keep the data file in human-readable form.

Before making the data file any less readable, try splitting your application into threads. One thread handles the GUI, another the input, and another the processing. The idea is to have the processor always executing some of your code rather than waiting. On modern platforms, file I/O can be performed while the CPU is processing your code.
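As an illustrative sketch of overlapping I/O with processing (std::async stands in for a dedicated loader thread here; the file names and the parse callback are invented for the example):

```cpp
#include <fstream>
#include <functional>
#include <future>
#include <sstream>
#include <string>
#include <vector>

// Read a whole file into a string (the I/O-bound part).
std::string SlurpFile(const std::string &path) {
    std::ifstream in(path, std::ios::binary);
    std::ostringstream buffer;
    buffer << in.rdbuf();
    return buffer.str();
}

// While the main thread parses file i, a background task is already
// reading file i+1, so the CPU is not idle during disk access.
void LoadAll(const std::vector<std::string> &paths,
             const std::function<void(const std::string &)> &parse) {
    if (paths.empty()) return;
    std::future<std::string> next =
        std::async(std::launch::async, SlurpFile, paths[0]);
    for (std::size_t i = 0; i < paths.size(); ++i) {
        std::string data = next.get();            // wait for the prefetched read
        if (i + 1 < paths.size())                 // kick off the next read early
            next = std::async(std::launch::async, SlurpFile, paths[i + 1]);
        parse(data);                              // CPU work overlaps the next read
    }
}
```

The pattern only pays off when parsing takes comparable time to reading; profile first, as suggested above.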

If you don't care about portability, see if your platform has DMA capability (DMA, or Direct Memory Access, allows data transfers without using the processor, or while minimizing its use). Something to watch out for: many platforms share the address and data bus between the processor and the DMA controller, so one component is blocked or suspended while the other uses the buses. Whether it helps depends on how the platform is wired up.

Convert the key field to numbers, a.k.a. tokens. Since tokens are numeric, they can be used as indices into jump tables (switch statements) or plain arrays.
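A sketch of what tokenized keys buy you (the Tag values and the Object fields are invented for illustration):

```cpp
#include <cstdint>
#include <string>

// Numeric tokens replacing string tags. In the data file these would be
// written as integers instead of "Name", "Score", etc.
enum class Tag : std::uint16_t { Name = 1, Score = 2, Level = 3 };

struct Object {
    std::string name;
    int score = 0;
    int level = 0;
};

// With numeric tokens the dispatch compiles to a jump table instead of
// repeated string comparisons.
bool ApplyField(Object &obj, Tag tag, const std::string &value) {
    switch (tag) {
    case Tag::Name:  obj.name = value;             return true;
    case Tag::Score: obj.score = std::stoi(value); return true;
    case Tag::Level: obj.level = std::stoi(value); return true;
    default:         return false;  // unknown token: skip or report
    }
}
```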

As a last resort, convert the file to binary. The binary version should have two fields: the key as a token, and the value. Haul the data into memory in large chunks.

Summary

  1. Haul large blocks of data into memory.
  2. Profile before making changes to establish a baseline performance measurement.
  3. Optimize one step at a time, profiling after each optimization.
  4. Prefer to keep data file in human readable form.
  5. Minimize changes to the data file.
  6. Convert file to use fixed length fields.
  7. Try using threads or multi-tasking so application is not waiting.
  8. Convert text to numeric tokens (reduces human readability).
  9. Convert data to binary as a last resort (very difficult for humans to read & debug).

I have two ideas for you:

1) If the list of tags is constant and known in advance, you could convert each one into a BYTE (or WORD), followed by the length of the value in bytes, followed by the raw bytes of the value's payload.

For instance, given the following:

Tag1 = "hello World!"  // 12 bytes in length (strlen(value) * sizeof(char))
Tag2 = "hello canada!" // 13 bytes in length

You could turn this into the byte stream:

0x0001 0x000C // followed by the 12 bytes of Tag1's value
0x0002 0x000D // followed by the 13 bytes of Tag2's value

Your program would just need to know that the WORD header "0x0001" represents Tag1, and the header "0x0002" represents Tag2.

You could even abstract the tag names further if you don't know them in advance, by storing each name with a similar length/value structure.
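Putting idea 1) together, an encoder and decoder for this layout might look like the sketch below (host byte order is assumed for brevity; a real format should fix the endianness explicitly):

```cpp
#include <cstdint>
#include <cstring>
#include <map>
#include <string>
#include <vector>

using Word = std::uint16_t;

// Append one WORD tag, WORD length, payload entry to the byte stream.
void AppendEntry(std::vector<std::uint8_t> &out, Word tag, const std::string &value) {
    const Word length = static_cast<Word>(value.size());
    out.insert(out.end(), reinterpret_cast<const std::uint8_t *>(&tag),
               reinterpret_cast<const std::uint8_t *>(&tag) + sizeof tag);
    out.insert(out.end(), reinterpret_cast<const std::uint8_t *>(&length),
               reinterpret_cast<const std::uint8_t *>(&length) + sizeof length);
    out.insert(out.end(), value.begin(), value.end());
}

// Decode the stream back into a tag -> value map.
std::map<Word, std::string> ParseEntries(const std::vector<std::uint8_t> &in) {
    std::map<Word, std::string> result;
    std::size_t pos = 0;
    while (pos + 2 * sizeof(Word) <= in.size()) {
        Word tag, length;
        std::memcpy(&tag, &in[pos], sizeof tag);
        std::memcpy(&length, &in[pos + sizeof tag], sizeof length);
        pos += 2 * sizeof(Word);
        if (pos + length > in.size()) break;  // truncated entry: stop
        result[tag] = std::string(in.begin() + pos, in.begin() + pos + length);
        pos += length;
    }
    return result;
}
```

Because every entry carries its own length, the decoder can skip unknown tags without understanding them, which is the same trick protobuf relies on.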

2) Perhaps the slow part is just your implementation of the text parsing? Consider using a dedicated open-source library for what you are trying to do, for example boost::property_tree.

Property tree is specifically designed to store and retrieve key/value pairs (it was designed for configuration files). But whether it is economical depends on how many such pairs you are trying to store.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow