Threaded reading of files in C++

Question 1

The files to be read are approximately 40MB each (just a few lines more than 5120000, each containing only one value, between 0 and 1 (in most cases == (0||1) ), and I have 16GB of RAM, so copying all the files to memory would certainly be possible,

Yes. But loading them there will still count towards your process' wall clock time unless they were already read by another process short before.

since only 8 (1 per thread) should be open at once.

Since any files that were not loaded in memory before the process started will have to be loaded and thus the loading will count towards the process wall clock time, it does not matter how many are open at once. Any that are not cache will slow down the process.

I am unsure if mmap would do this better?

No, it wouldn't. mmap is faster, but because it saves the copy from kernel buffer to application buffer and some system call overhead (with read you do a kernel entry for each page while with mmap pages that are read with read-ahead won't cause further page faults). But it will not save you the time to read the files from disk if they are not already cached.

mmap does not load anything in memory. The kernel loads data from disk to internal buffers, the page cache. read copies the data from there to your application buffer while mmap exposes parts of the page cache directly in your address space. But in either case the data are fetched on first access and remain there until the memory manager drops them to reuse the memory. The page cache is global, so if one process causes some data to be cached, next process will get them faster. But if it's first access after longer time, the data will have to be read and this will affect read and mmap exactly the same way.

Since parallelizing the process didn't improve the time much, it seems majority of the time is the actual I/O. So you can optimize a bit more and mmap can help, but don't expect much. The only way to improve I/O time is to get a faster disk.

You should be able to ask the system to tell you how much time was spent on the CPU and how much was spent waiting for data (I/O) using getrusage(2) (call it at end of each thread to get data for that thread). So you can confirm how much time was spent by I/O.

Question 2

mmap is certainly the most efficient way to get large amounts of data into memory. The main benefit here is that there is no extra copying involved.

It does however make the code slightly more complex, since you can't directly use the file I/O functions to use mmap (and the main benefit is sort of lost if you use "m" mode of stdio functions, as you are now getting at least one copy). From past experiments that I've made, mmap beats all other file reading variants by some amount. How much depends on what proportion of the overall time is spent on waiting for the disk, and how much time is spent actually processing the file content.