On a c++ efficient storage, flushing into file(s) strategy

https://stackoverflow.com/questions/22385872

14-06-2023

Question

Here is the situation: a C++ program endlessly generates data at a regular rate. The data needs to be written to persistent storage quickly enough that it does not slow down the computation. The amount of data to be stored cannot be known in advance. After reading two related posts, I ended up with the following naive strategy:

Create one std::ofstream ofs
Open a file with ofs.open("path/file", std::ofstream::out | std::ofstream::app)
Add each std::string using operator<<
Close the file with ofs.close() once writing has terminated
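
The four steps above can be sketched roughly as follows. This is only an illustration of the strategy as described: the function name write_records and the choice to end each record with a newline are assumptions, not part of the original question.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: opens the file once in append mode, streams every
// record with operator<<, then closes the file. Returns false on failure.
bool write_records(const std::string& path,
                   const std::vector<std::string>& records) {
    std::ofstream ofs;
    ofs.open(path, std::ofstream::out | std::ofstream::app);
    if (!ofs.is_open()) {
        return false;
    }
    for (const std::string& record : records) {
        // Formatted insertion; the data goes to the stream's internal
        // buffer first. The newline separator is an assumption.
        ofs << record << '\n';
    }
    ofs.close();  // close() flushes whatever is still buffered
    return !ofs.fail();
}
```

Because the file is opened with std::ofstream::app, calling write_records twice on the same path appends to the existing content rather than overwriting it.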

Nevertheless, I am still confused about the following:

Since the data will only be read afterwards, is it possible to use binary (ios::binary) file storage? Would that be faster?
I have understood that flushing is done automatically by std::ofstream; am I safe to use it as such? Is there any impact on memory I should be aware of? Do I have to tune the std::ofstream in some way (for example by changing its buffer size)?
Should I be concerned about the file getting bigger and bigger? Should I close it at some point and open a new one?
Does using std::string have any drawbacks? Are there hidden conversions that could be avoided?
Is using std::ofstream::write() more advantageous?

Thanks for your help.

Solution

1. Since the data will only be read afterwards, is it possible to use a binary (ios::binary) file storage? Would that be faster?

All data on any storage device is ultimately binary, so opening the stream in binary mode mainly tells the library to write your bytes as they are, without text-mode translation such as newline conversion. Whether that is faster depends on many things, including how you are going to use and read the data afterwards. Some of these factors are listed in "Writing a binary file in C++ very fast". When it comes to storing on a hard disk, the performance of your code is always limited by the speed of the particular drive.
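
As a rough illustration of binary storage combined with std::ofstream::write(), the sketch below dumps the raw bytes of a vector of doubles and reads them back. The helper names and the choice of double as the sample type are assumptions made for the example. The resulting file is only portable between machines that share the same floating-point representation and byte order.

```cpp
#include <fstream>
#include <ios>
#include <string>
#include <vector>

// Hypothetical helper: appends the raw bytes of `samples` to `path`.
bool write_samples_binary(const std::string& path,
                          const std::vector<double>& samples) {
    std::ofstream ofs(path, std::ios::out | std::ios::binary | std::ios::app);
    if (!ofs) {
        return false;
    }
    // Unformatted output: no number-to-text conversion takes place.
    ofs.write(reinterpret_cast<const char*>(samples.data()),
              static_cast<std::streamsize>(samples.size() * sizeof(double)));
    ofs.flush();
    return !ofs.fail();
}

// Hypothetical helper: reads back every complete double stored in `path`.
std::vector<double> read_samples_binary(const std::string& path) {
    std::ifstream ifs(path, std::ios::in | std::ios::binary);
    std::vector<double> samples;
    double value = 0.0;
    while (ifs.read(reinterpret_cast<char*>(&value), sizeof(value))) {
        samples.push_back(value);
    }
    return samples;
}
```

Compared with operator<<, this avoids formatting each value as text, at the cost of a file that is not human-readable.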

Try to make your questions more specific and bounded; as they stand, they are too general to be stated as a single "problem".

OTHER TIPS

I'm probably not answering your questions directly, so please excuse me if I take a step back.

If I understand the issue correctly, the concern is that spending too long writing to disk would delay the endless data generation.

Perhaps you can allocate a thread just for writing, while processing continues on the main thread.

The writer thread could wake up at periodic intervals and write to disk whatever has been generated so far.

Communication between the two threads can be either:

two buffers (one active where the generation happens, one frozen, ready to be written to disk on the next batch)
or a queue of data, inserted by the producer and removed by the consumer/writer.
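
A minimal sketch of the two-buffer variant follows, assuming C++11 threads. The class name BackgroundWriter, its interface, and the use of std::string as the buffer type are all hypothetical choices made for illustration. The producer only appends to an in-memory buffer, while a dedicated thread wakes up periodically, swaps the buffers under a lock, and performs the slow disk write outside the lock.

```cpp
#include <chrono>
#include <condition_variable>
#include <fstream>
#include <ios>
#include <mutex>
#include <string>
#include <thread>

class BackgroundWriter {
public:
    BackgroundWriter(const std::string& path, std::chrono::milliseconds period)
        : ofs_(path, std::ios::out | std::ios::binary | std::ios::app),
          period_(period),
          worker_(&BackgroundWriter::run, this) {}

    ~BackgroundWriter() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        cv_.notify_one();
        worker_.join();  // the worker writes any remaining data before exiting
    }

    // Called by the producer thread: a cheap in-memory append.
    void append(const std::string& data) {
        std::lock_guard<std::mutex> lock(mutex_);
        active_ += data;
    }

private:
    void run() {
        std::string frozen;  // buffer being written; only this thread uses it
        std::unique_lock<std::mutex> lock(mutex_);
        while (true) {
            // Sleep for one period, or wake early when asked to stop.
            cv_.wait_for(lock, period_, [this] { return stop_; });
            frozen.swap(active_);  // constant-time exchange under the lock
            const bool stopping = stop_;
            lock.unlock();         // disk I/O happens without holding the lock
            if (!frozen.empty()) {
                ofs_.write(frozen.data(),
                           static_cast<std::streamsize>(frozen.size()));
                ofs_.flush();
                frozen.clear();
            }
            if (stopping) {
                return;
            }
            lock.lock();
        }
    }

    std::ofstream ofs_;
    std::chrono::milliseconds period_;
    std::mutex mutex_;
    std::condition_variable cv_;
    std::string active_;  // buffer the producer appends to
    bool stop_ = false;
    std::thread worker_;  // declared last so it starts after the other members
};
```

A producer would construct one BackgroundWriter, call append() from its main loop, and let the destructor flush the remainder. The queue-based alternative has the same overall shape, with the string buffer replaced by a container of records.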

Licensed under: CC-BY-SA with attribution
