The files to be read are approximately 40MB each (just a few lines more than 5120000, each containing only one value between 0 and 1, in most cases exactly 0 or 1), and I have 16GB of RAM, so copying all the files to memory would certainly be possible,
Yes. But loading them there will still count towards your process's wall-clock time unless another process already read them shortly before.
since only 8 (1 per thread) should be open at once.
Any file that was not in memory before the process started has to be loaded, and that loading counts towards the process's wall-clock time, so it does not matter how many are open at once. Any that are not cached will slow down the process.
I am unsure if mmap would do this better?
No, it wouldn't. mmap() is faster, but only because it saves the copy from the kernel buffer to the application buffer and some system call overhead (with read() you enter the kernel for every call, while with mmap() pages that were brought in by read-ahead won't cause further page faults). It will not save you the time to read the files from disk if they are not already cached.
mmap() does not load anything into memory. The kernel loads data from disk into its internal buffers, the page cache. read() copies the data from there to your application buffer, while mmap() exposes parts of the page cache directly in your address space. In either case the data are fetched on first access and remain there until the memory manager drops them to reuse the memory. The page cache is global, so if one process causes some data to be cached, the next process will get them faster. But if it is the first access after a long time, the data will have to be read from disk, and this affects read() and mmap() in exactly the same way.
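To make the difference concrete, here is a minimal sketch of the two access paths in C. The function names count_lines_read and count_lines_mmap are hypothetical, and counting newlines is only a stand-in for whatever per-line parsing the real program does. Both versions pay the same disk cost on a cold cache; the mmap() version only skips the copy into a private buffer.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: count newline bytes using read().
 * Every read() call copies data from the page cache into buf. */
long count_lines_read(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    char buf[1 << 16];
    long lines = 0;
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0) {
        for (ssize_t i = 0; i < n; i++)
            if (buf[i] == '\n')
                lines++;
    }
    close(fd);
    return n < 0 ? -1 : lines;
}

/* Hypothetical helper: count newline bytes using mmap().
 * The page-cache pages are mapped into our address space, so there is
 * no copy, but untouched pages still have to come from disk on first use. */
long count_lines_mmap(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {          /* mmap() rejects a zero length */
        close(fd);
        return 0;
    }

    const char *data = mmap(NULL, (size_t)st.st_size, PROT_READ,
                            MAP_PRIVATE, fd, 0);
    close(fd);                      /* the mapping stays valid after close */
    if (data == MAP_FAILED)
        return -1;

    long lines = 0;
    for (off_t i = 0; i < st.st_size; i++)
        if (data[i] == '\n')
            lines++;

    munmap((void *)data, (size_t)st.st_size);
    return lines;
}
```

Timing both functions twice on the same file should show the effect described above: the first run of either is dominated by disk reads, while the second run is served from the page cache.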
Since parallelizing the process didn't improve the time much, it seems the majority of the time is the actual I/O. You can optimize a bit more, and mmap() can help, but don't expect much. The only way to improve I/O time is to get a faster disk.
You should be able to ask the system how much time was spent on the CPU using getrusage(2) (call it at the end of each thread to get data for that thread; on Linux this needs RUSAGE_THREAD). Comparing that CPU time with the elapsed wall-clock time tells you how much was spent waiting for data (I/O), so you can confirm how much time the I/O took.
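A minimal sketch of such a measurement in C follows. The names wall_now and report_thread_usage are hypothetical. It assumes Linux, where RUSAGE_THREAD is available when _GNU_SOURCE is defined before any include; elsewhere it falls back to whole-process figures. Note that getrusage() itself only reports CPU time and block I/O counts, so the waiting time is estimated as wall-clock time minus CPU time.

```c
#define _GNU_SOURCE             /* exposes RUSAGE_THREAD on Linux/glibc */
#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <time.h>

#ifdef RUSAGE_THREAD
#define USAGE_WHO RUSAGE_THREAD /* calling thread only (Linux) */
#else
#define USAGE_WHO RUSAGE_SELF   /* fallback: whole process */
#endif

static double tv_seconds(struct timeval tv)
{
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* Monotonic wall-clock time in seconds; call once when the thread starts. */
double wall_now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Call at the end of a worker thread. wall_start is the value of
 * wall_now() taken when the thread began. Returns 0 on success. */
int report_thread_usage(double wall_start, double *cpu_out, double *wait_out)
{
    struct rusage ru;
    if (getrusage(USAGE_WHO, &ru) != 0)
        return -1;

    double cpu  = tv_seconds(ru.ru_utime) + tv_seconds(ru.ru_stime);
    double wall = wall_now() - wall_start;

    *cpu_out  = cpu;
    *wait_out = wall - cpu;     /* time not on the CPU, mostly I/O wait */

    fprintf(stderr,
            "wall %.3fs, cpu %.3fs, off-cpu %.3fs, block reads %ld\n",
            wall, cpu, wall - cpu, ru.ru_inblock);
    return 0;
}
```

If the off-CPU figure dominates and ru_inblock is large, the threads are mostly waiting for the disk, and neither more threads nor mmap() will change that much.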