Question

I'm currently writing a program that needs to handle out-of-core data. It processes files ranging in size from 1 MB up to 50 GB (and possibly larger in the future).

I have read several tutorials on memory-mapped files and am now using them to manage data I/O, i.e. reading and writing data from/to the hard drive.

I also process the data and need some temporary arrays of the same size as the data itself. My question is whether I should use memory-mapped files for those as well, or whether I should let the OS manage the memory without explicitly defining memory-mapped files. The problem is as follows:

I'm working on multiple platforms, but always on 64-bit systems. In theory, the 64-bit virtual address space is definitely sufficient for my needs. However, on Windows the maximum virtual address space seems to be limited by the operating system: a user can set whether paging is allowed and how large the virtual memory may grow. I also read somewhere that the maximum virtual address space on 64-bit Windows isn't 2^64 bytes but somewhere around 2^40, which would still be sufficient for me but seems like a rather odd limitation. Furthermore, Windows has some strange restrictions, such as arrays with a maximum of 2^31 elements, independent of the array type. I don't know how all of this is handled on Linux, but I assume it is treated similarly; presumably the maximum allowed virtual memory is something like RAM + swap partition size? So there are a lot of things to struggle with if I want the system to handle data that exceeds the RAM size. I don't even know whether I can use the entire 64-bit virtual address space from C++. In a short test, I got a compiler error when trying to initialize more than 2^31 elements, but I think it is easy to go beyond that by using std::vector and the like.

On the other hand, with a memory-mapped file every one of my memory writes will eventually be written to the HDD. Especially for data sets smaller than my physical RAM, this could be a fairly big bottleneck. Or does the OS avoid writing until it has to, because the RAM is exhausted? The advantages of memory-mapped files show up in inter-process communication via shared memory, or in persistence across runs, e.g. I start the application, write something, quit, later restart it and efficiently read only the data I need into RAM. Since I process all the data in a single run with a single process, neither advantage applies in my case.

Note: A streaming approach as an alternative solution is not really feasible, as I heavily depend on random access to the data.

Ideally, I would like a way to process all models independent of their size and of limits set by the operating system: keep everything that fits in RAM, and only when the physical limit is exceeded fall back to memory-mapped files or other mechanisms (if there are any) to page out the excess data, ideally managed by the operating system.

To conclude: what is the best approach to handle this temporary data? If it can be done without memory-mapped files and in a platform-independent way, could you give me a code snippet or similar and explain how it works around these OS limitations?


Solution 2

Maybe a bit late, but it's an interesting question.

On the other hand, with a memory-mapped file every one of my memory writes will eventually be written to the HDD. Especially for data sets smaller than my physical RAM, this could be a fairly big bottleneck. Or does the OS avoid writing until it has to, because the RAM is exhausted?

To avoid writing to disk while there's enough memory, you should open the file as 'temporary' (FILE_ATTRIBUTE_TEMPORARY) combined with FILE_FLAG_DELETE_ON_CLOSE. This hints to the OS that it should delay writing to disk for as long as possible.
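For reference, a minimal Win32 sketch of that idea (error handling omitted; the file name "scratch.tmp" and the 1 GiB size are placeholder choices): the scratch file is created with both flags and then mapped, so dirty pages stay in the file cache as long as the system can afford it, and the file disappears once the last handle is closed.

```
#include <windows.h>
#include <cstdint>

int main() {
    const uint64_t size = 1ull << 30; // 1 GiB of temporary working space (placeholder)

    // Temporary + delete-on-close: hints the cache manager to keep pages in RAM
    // and removes the file automatically when all handles are gone.
    HANDLE file = CreateFileW(L"scratch.tmp",
                              GENERIC_READ | GENERIC_WRITE,
                              0, nullptr, CREATE_ALWAYS,
                              FILE_ATTRIBUTE_TEMPORARY | FILE_FLAG_DELETE_ON_CLOSE,
                              nullptr);

    HANDLE mapping = CreateFileMappingW(file, nullptr, PAGE_READWRITE,
                                        static_cast<DWORD>(size >> 32),
                                        static_cast<DWORD>(size & 0xFFFFFFFFu),
                                        nullptr);

    void* view = MapViewOfFile(mapping, FILE_MAP_READ | FILE_MAP_WRITE, 0, 0, 0);

    auto* data = static_cast<double*>(view);
    data[0] = 42.0; // dirty pages are written back only when the OS decides to

    UnmapViewOfFile(view);
    CloseHandle(mapping);
    CloseHandle(file); // file is deleted once the last handle to it is closed
    return 0;
}
```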

As for limitations on array size: it's probably best to provide your own data structures and your own access layer on top of the mapped views. For big data sets you may want to use several different (smaller) mapped views, which you can map and unmap as needed, as in the sketch below.
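One possible shape for such a layer, using Boost.Interprocess so it stays portable (the `WindowedFile` class name and the 256 MiB chunk size are illustrative choices, and the sketch assumes the file size is a multiple of the chunk size): only one window is mapped at a time and it is remapped whenever an access falls outside the current chunk.

```
#include <boost/interprocess/file_mapping.hpp>
#include <boost/interprocess/mapped_region.hpp>
#include <cstddef>
#include <cstdint>

namespace bip = boost::interprocess;

class WindowedFile {
    static const std::size_t kChunkBytes = 256 * 1024 * 1024; // multiple of the page size

    bip::file_mapping  mapping_;
    bip::mapped_region region_;
    std::uint64_t      region_start_ = static_cast<std::uint64_t>(-1);

public:
    explicit WindowedFile(const char* path)
        : mapping_(path, bip::read_write) {}

    // Returns a pointer to the byte at 'offset', remapping the window if needed.
    // Assumes 'offset' lies within the file and the file is chunk-aligned.
    unsigned char* at(std::uint64_t offset) {
        std::uint64_t start = (offset / kChunkBytes) * kChunkBytes;
        if (start != region_start_) {
            // Assigning a new region unmaps the old view and maps the new window.
            region_ = bip::mapped_region(mapping_, bip::read_write, start, kChunkBytes);
            region_start_ = start;
        }
        return static_cast<unsigned char*>(region_.get_address()) + (offset - start);
    }
};
```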

OTHER TIPS

As nobody answered, I will update the status of the question myself.

Luckily, I came across the Boost.Interprocess library today and found managed_mapped_file, which even allows me to allocate vectors in the mapped range, making them nearly as easy to use as if I weren't programming with mapped files at all.
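A minimal sketch of what this looks like, following the Boost.Interprocess documentation (the file name "temp_data.bin", the 1 GiB segment size and the object name "Values" are placeholders I chose):

```
#include <boost/interprocess/managed_mapped_file.hpp>
#include <boost/interprocess/containers/vector.hpp>
#include <boost/interprocess/allocators/allocator.hpp>

namespace bip = boost::interprocess;

typedef bip::allocator<double, bip::managed_mapped_file::segment_manager> DoubleAllocator;
typedef bip::vector<double, DoubleAllocator> MappedVector;

int main() {
    // Create (or open) the backing file with 1 GiB of addressable space.
    bip::managed_mapped_file file(bip::open_or_create, "temp_data.bin", 1ull << 30);

    // Construct a named vector inside the mapped file; its allocator draws
    // memory from the file's segment manager, not from the process heap.
    MappedVector* v = file.find_or_construct<MappedVector>("Values")(
        DoubleAllocator(file.get_segment_manager()));

    v->resize(100 * 1000 * 1000); // ~800 MB of doubles, paged in and out by the OS
    (*v)[0] = 3.14;

    return 0;
}
```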

Additionally, I found that:

If several processes map the same file, and a process modifies a memory range from a mapped region that is also mapped by another process, the changes are immediately visible to the other processes. However, the file contents on disk are not updated immediately, since that would hurt performance (writing to disk is several times slower than writing to memory). If the user wants to make sure that the file's contents have been updated, it can flush a range from the view to disk.

http://www.boost.org/doc/libs/1_54_0/doc/html/interprocess/sharedmemorybetweenprocesses.html
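If I ever do need to force dirty pages out at a specific point, a mapped_region can be flushed explicitly; a minimal sketch (the `checkpoint` helper name is just illustrative):

```
#include <boost/interprocess/file_mapping.hpp>
#include <boost/interprocess/mapped_region.hpp>

namespace bip = boost::interprocess;

void checkpoint(bip::mapped_region& region) {
    // flush(offset, size, async): 0/0 means the whole region,
    // async = false blocks until the data has reached the disk.
    region.flush(0, 0, false);
}
```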

So hopefully it starts writing only once I exceed the system's physical RAM. I haven't done any speed measurements yet and probably won't do any.

I can live with this solution quite well for now. However, I will leave this question unanswered and open. At some point somebody might find it and give more hints, such as how to prevent flushing of the data until it is actually necessary, or other ideas/tips on how to handle out-of-core data.

Licensed under: CC-BY-SA with attribution