Question

I have to deal with very large text files (2 GB); it is mandatory to read/write them line by line. Writing 23 million lines using ofstream is really slow, so at first I tried to speed up the process by collecting large chunks of lines in a memory buffer (for example 256 MB or 512 MB) and then writing the buffer to the file. This did not work; the performance is more or less the same. I have the same problem reading the files. I know that the I/O operations are buffered by the STL I/O system, and that this also depends on the disk scheduler policy (managed by the OS, in my case Linux).

Any idea about how to improve the performance?

PS: I have been thinking about using a background child process (or a thread) to read/write the data chunks while the program is processing the data, but I do not know (mainly in the case of the subprocess) whether it would be worthwhile.
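
A minimal sketch of that idea, assuming a single background std::thread draining a queue of chunks while the main loop keeps producing (the file name, chunk size and generated content are placeholders):

#include <condition_variable>
#include <cstddef>
#include <deque>
#include <fstream>
#include <mutex>
#include <string>
#include <thread>
#include <utility>

int main()
{
    std::ofstream out("big_output.txt", std::ios::binary);  // placeholder file name
    std::deque<std::string> chunks;                         // chunks waiting for the disk
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    // Writer thread: takes chunks off the queue and writes them while main keeps working.
    std::thread writer([&] {
        std::unique_lock<std::mutex> lock(m);
        while (!done || !chunks.empty()) {
            cv.wait(lock, [&] { return done || !chunks.empty(); });
            while (!chunks.empty()) {
                std::string chunk = std::move(chunks.front());
                chunks.pop_front();
                lock.unlock();
                out.write(chunk.data(), chunk.size());       // disk I/O outside the lock
                lock.lock();
            }
        }
    });

    const std::size_t chunk_size = 8 * 1024 * 1024;          // 8 MB per chunk (arbitrary)
    std::string chunk;
    for (long i = 0; i < 23000000; ++i) {                    // stand-in for the real processing loop
        chunk += "a generated line of output\n";
        if (chunk.size() >= chunk_size) {
            { std::lock_guard<std::mutex> lg(m); chunks.push_back(std::move(chunk)); }
            cv.notify_one();
            chunk.clear();
        }
    }
    { std::lock_guard<std::mutex> lg(m); chunks.push_back(std::move(chunk)); done = true; }
    cv.notify_one();
    writer.join();
}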

Solution

A 2GB file is pretty big, and you need to be aware of all the possible areas that can act as bottlenecks:

  • The HDD itself
  • The HDD interface (IDE/SATA/RAID/USB?)
  • Operating system/filesystem
  • C/C++ Library
  • Your code

I'd start by doing some measurements:

  • How long does your code take to read/write a 2 GB file?
  • How fast can the 'dd' command read and write to disk? Example...

    dd if=/dev/zero bs=1024 count=2000000 of=file_2GB

  • How long does it take to write/read using just big fwrite()/fread() calls? (See the timing sketch below.)
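
For the last measurement, a minimal timing sketch using plain fwrite() in large blocks (the file name and block size are placeholders):

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main()
{
    const std::size_t block = 4 * 1024 * 1024;                   // 4 MB per fwrite() (arbitrary)
    const unsigned long long total = 2ull * 1024 * 1024 * 1024;  // ~2 GB in total
    std::vector<char> buf(block, 'x');

    std::FILE* fp = std::fopen("test_2GB.bin", "wb");            // placeholder name
    if (!fp) return 1;

    const auto start = std::chrono::steady_clock::now();
    for (unsigned long long written = 0; written < total; written += block)
        std::fwrite(buf.data(), 1, block, fp);
    std::fclose(fp);                                             // data may still sit in the OS page cache
    const double secs = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - start).count();

    std::printf("%.1f s, %.1f MB/s\n", secs, total / (1024.0 * 1024.0) / secs);
}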

Assuming your disk is capable of reading/writing at about 40 MB/s (which is probably a realistic figure to start from), your 2 GB file can't be read or written in much less than about 50 seconds (2048 MB / 40 MB/s ≈ 51 s).

How long is it actually taking?

Hi Roddy, using the fstream read method with 1.1 GB files and large buffers (128, 255 or 512 MB) it takes about 43-48 seconds, and it is the same using fstream getline (line by line). cp takes almost 2 minutes to copy the file.

In which case, you're hardware-bound. cp has to read and write, and will be seeking back and forth across the disk surface like mad while it does it. So it will (as you see) be more than twice as bad as the simple 'read' case.

To improve the speed, the first thing I'd try is a faster hard drive, or an SSD.

You haven't said what the disk interface is. SATA is pretty much the easiest/fastest option. Also (obvious point, this...) make sure the disk is physically in the same machine your code is running on, otherwise you're network-bound...

OTHER TIPS

Maybe you should look into memory-mapped files.

Check them in this library: Boost.Interprocess

I would also suggest memory-mapped files, but if you're going to use Boost, I think boost::iostreams::mapped_file is a better match than boost::interprocess.
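
A minimal sketch of the mapped-file approach, assuming boost::iostreams::mapped_file_source (link with -lboost_iostreams; the file name is a placeholder):

#include <boost/iostreams/device/mapped_file.hpp>
#include <algorithm>
#include <cstdio>

int main()
{
    // Map the whole file into memory; lines are then found with pointer arithmetic,
    // with no read() calls and no per-line copies.
    boost::iostreams::mapped_file_source file("big_input.txt");  // placeholder name
    const char* cur = file.data();
    const char* end = cur + file.size();

    long lines = 0;
    while (cur < end) {
        const char* eol = std::find(cur, end, '\n');  // [cur, eol) is one line
        // ... process the line here ...
        ++lines;
        cur = (eol == end) ? end : eol + 1;
    }
    std::printf("%ld lines\n", lines);
}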

Just a thought, but avoid using std::endl as this will force a flush before the buffer is full. Use '\n' instead for a newline.
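
For example (a hypothetical output loop; the file name and line content are placeholders):

#include <fstream>
#include <string>

int main()
{
    std::ofstream out("out.txt");               // placeholder file name
    const std::string line = "some text";
    for (int i = 0; i < 1000000; ++i) {
        out << line << '\n';                    // '\n' just goes into the stream buffer
        // out << line << std::endl;            // std::endl is '\n' plus a flush on every line
    }
}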

Don't use new to allocate the buffer like that:

Try: std::vector<>

unsigned int      buffer_size = 64 * 1024 * 1024; // 64 MB for instance.
std::vector<char> data_buffer(buffer_size);
_file->read(&data_buffer[0], buffer_size);

Also read up on the rules for leading underscores in identifier names. Your code (_file) is OK, but names beginning with an underscore followed by a capital letter, or containing a double underscore, are reserved.

Using getline() may be inefficient because the string buffer may need to be resized several times as data is appended to it from the stream buffer. You can make this more efficient by pre-sizing the string (see the reserve() call below):

You can also set the iostream's buffer to either something very large or to NULL (for unbuffered access):

// Unbuffered Accesses:
fstream file;
file.rdbuf()->pubsetbuf(NULL,0);
file.open("PLOP");

// Larger Buffer
std::vector<char>  buffer(64 * 1024 * 1024);
fstream            file;
file.rdbuf()->pubsetbuf(&buffer[0],buffer.size());
file.open("PLOP");

std::string   line;
line.reserve(64 * 1024 * 1024);

while(getline(file,line))
{
    // Do Stuff.
}

If you are going to buffer the file yourself, then I'd advise some testing using unbuffered I/O (setvbuf on a file that you've fopened can turn off the library buffering).

Basically, if you are going to buffer yourself, you want to disable the library's buffering, as it's only going to cause you pain. I don't know if there is any way to do that for STL I/O, so I recommend going down to the C-level I/O.
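
A minimal sketch of that approach, assuming C stdio with its buffering disabled and reads done in large, application-managed blocks (the file name and block size are placeholders):

#include <cstddef>
#include <cstdio>
#include <vector>

int main()
{
    std::FILE* fp = std::fopen("big_input.txt", "rb");  // placeholder name
    if (!fp) return 1;
    std::setvbuf(fp, NULL, _IONBF, 0);                  // turn off stdio's own buffering

    std::vector<char> buf(64 * 1024 * 1024);            // we do the buffering: 64 MB per fread (arbitrary)
    std::size_t n;
    while ((n = std::fread(buf.data(), 1, buf.size(), fp)) > 0) {
        // ... split buf[0..n) into lines and process them ...
    }
    std::fclose(fp);
}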

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow