Question

Is it worth caching data read from a memory-mapped file in the application's own buffers? Or would it be faster to re-read that data from the mapped memory each time, since the OS maintains its own page cache?

The nature of the data is not known in advance; it is assumed that file reads are random.

Solution

I wanted to mention a few things I've read on the subject. The answer is no: you don't want to second-guess the operating system's memory manager.
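
To make the question's two options concrete, here is a minimal POSIX sketch (the file name "data.bin" and the probed offset are made up for illustration): option 1 re-reads through the mapping each time, option 2 copies into a private buffer. The rest of this answer is about why option 2 usually loses.

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("data.bin", O_RDONLY);    /* hypothetical file */
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return 1; }
        size_t len = (size_t)st.st_size;

        const unsigned char *map = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
        if (map == MAP_FAILED) { perror("mmap"); return 1; }

        /* Option 1: re-read through the mapping every time.
         * The OS page cache keeps the hot pages in RAM for you. */
        unsigned char direct = map[len / 2];

        /* Option 2: copy into a private "cache". This duplicates pages the
         * page cache already holds, and the copy itself can be paged out. */
        unsigned char *cache = malloc(len);
        if (cache != NULL) {
            memcpy(cache, map, len);
            printf("direct=%u cached=%u\n", direct, cache[len / 2]);
            free(cache);
        }

        munmap((void *)map, len);
        close(fd);
        return 0;
    }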

The first comes from Raymond Chen, on the idea of designing your program (e.g. MongoDB, SQL Server) to limit its memory use based on a percentage of free RAM:

Don't try to allocate memory until there is only x% free

Occasionally, a customer will ask for a way to design their program so it continues consuming RAM until there is only x% free. The idea is that their program should use RAM aggressively, while still leaving enough RAM available (x%) for other use. Unless you are designing a system where you are the only program running on the computer, this is a bad idea.

(read the article for the explanation of why it's bad, including pictures)
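
To make the anti-pattern concrete, here is a deliberately bad sketch on Windows (GlobalMemoryStatusEx is the real API; the 90% threshold and 64 MiB chunk size are arbitrary, and the leaked allocations are the point):

    #include <windows.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void) {
        MEMORYSTATUSEX ms = { sizeof(ms) };   /* dwLength must be set */

        /* Keep grabbing RAM until "only 10% is free" - the design being
         * warned against, because available RAM is a moving target that
         * every program doing this fights over. */
        for (;;) {
            if (!GlobalMemoryStatusEx(&ms) || ms.dwMemoryLoad >= 90)
                break;                         /* dwMemoryLoad = % in use */
            void *p = malloc(64u * 1024 * 1024);       /* another 64 MiB */
            if (p == NULL)
                break;
            memset(p, 1, 64u * 1024 * 1024);  /* touch it so it is committed */
            /* deliberately never freed: the "cache" keeps growing */
        }
        printf("stopped at %lu%% memory load\n", ms.dwMemoryLoad);
        return 0;
    }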

Next come some notes from the author of Varnish, a caching reverse proxy:

Varnish Cache - Notes from the architect

So what happens with Squid's elaborate memory management is that it gets into fights with the kernel's elaborate memory management, and like any civil war, that never gets anything done.

What happens is this: Squid creates an HTTP object in "RAM" and it gets used some times rapidly after creation. Then after some time it gets no more hits and the kernel notices this. Then somebody tries to get memory from the kernel for something and the kernel decides to push those unused pages of memory out to swap space and use the (cache-RAM) more sensibly for some data which is actually used by a program. This, however, is done without Squid knowing about it. Squid still thinks that these HTTP objects are in RAM, and they will be, the very second it tries to access them, but until then, the RAM is used for something productive.

Imagine you do cache something from a memory-mapped file. At some point in the future, the memory holding that "cache" will be swapped out to disk.

  • the OS has written to the hard drive something which already exists on the hard drive

Next comes a time when you want to perform a lookup from your "cache" memory, rather than the "real" memory. You attempt to access the "cache", and since it has been swapped out of RAM the hardware raises a PAGE FAULT, and the cache is swapped back into RAM.

  • your cache memory is just as slow as the "real" memory, since both are no longer in RAM

Finally, you want to free your cache (perhaps your program is shutting down). If the "cache" has been swapped out, the OS must first swap it back in so that it can be freed. If instead you just unmapped your memory-mapped file, everything is gone (nothing needs to be swapped in).

  • in this case your cache makes things slower
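
A minimal sketch of that shutdown difference (POSIX; the function and parameter names are hypothetical):

    #include <stdlib.h>
    #include <sys/mman.h>

    void shutdown_caches(unsigned char *heap_cache, void *map, size_t map_len) {
        /* Freeing a heap cache that was swapped out can force the OS to
         * page it all back in, just so the allocator can walk it. */
        free(heap_cache);

        /* Unmapping a read-only file mapping needs no I/O at all:
         * clean file-backed pages are simply discarded. */
        munmap(map, map_len);
    }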

Again from Raymond Chen: If your application is closing - close already:

When DLL_PROCESS_DETACH tells you that the process is exiting, your best bet is just to return without doing anything

I regularly use a program that doesn't follow this rule. The program allocates a lot of memory during the course of its life, and when I exit the program, it just sits there for several minutes, sometimes spinning at 100% CPU, sometimes churning the hard drive (sometimes both). When I break in with the debugger to see what's going on, I discover that the program isn't doing anything productive. It's just methodically freeing every last byte of memory it had allocated during its lifetime.

If my computer wasn't under a lot of memory pressure, then most of the memory the program had allocated during its lifetime hasn't yet been paged out, so freeing every last drop of memory is a CPU-bound operation. On the other hand, if I had kicked off a build or done something else memory-intensive, then most of the memory the program had allocated during its lifetime has been paged out, which means that the program pages all that memory back in from the hard drive, just so it could call free on it. Sounds kind of spiteful, actually. "Come here so I can tell you to go away."

All this anal-retentive memory management is pointless. The process is exiting. All that memory will be freed when the address space is destroyed. Stop wasting time and just exit already.
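
In code, that advice amounts to something like this sketch of a Windows DllMain (relying on the documented meaning of the reserved parameter during DLL_PROCESS_DETACH):

    #include <windows.h>

    BOOL WINAPI DllMain(HINSTANCE hinst, DWORD reason, LPVOID reserved) {
        (void)hinst;
        if (reason == DLL_PROCESS_DETACH) {
            /* reserved != NULL means the process is terminating: free
             * nothing and let the OS tear down the address space at once. */
            if (reserved != NULL)
                return TRUE;
            /* reserved == NULL means FreeLibrary: clean up for real here. */
        }
        return TRUE;
    }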


The reality is that programs no longer run in "RAM"; they run in memory - virtual memory.

You can make use of a cache, but you have to work with the operating system's virtual memory manager:

  • you want to keep your cache within as few pages as possible
  • you want to ensure those pages stay in RAM, by virtue of being accessed a lot (i.e. actually being a useful cache)

Accessing:

  • a thousand 1-byte locations around a 400GB file

is much more expensive than accessing

  • a single 1000-byte location in a 400GB file
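
As a sketch of those two access patterns (assuming map and len describe an already-mapped file well over 1000 bytes long):

    #include <stddef.h>
    #include <stdlib.h>

    unsigned sum_scattered(const unsigned char *map, size_t len) {
        unsigned sum = 0;
        for (int i = 0; i < 1000; i++)
            sum += map[(size_t)rand() % len];   /* up to 1000 distinct pages */
        return sum;
    }

    unsigned sum_contiguous(const unsigned char *map, size_t len) {
        size_t off = (size_t)rand() % (len - 1000);
        unsigned sum = 0;
        for (size_t i = 0; i < 1000; i++)
            sum += map[off + i];                /* touches at most two pages */
        return sum;
    }

Every distinct 4 KiB page the first version touches is a potential page fault and disk seek; the second version faults at most twice.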

In other words: you don't really need to cache data, you need a more localized data structure.

If you keep your important data confined to a single 4k page, you will play much nicer with the VMM; Windows is your cache.
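
For example, a hypothetical hot-path index packed so the whole structure fits in one page (C11; the field names and sizes are made up):

    #include <stdint.h>
    #include <stdlib.h>

    #define PAGE_SIZE 4096u

    /* 4 + 2044 + 2044 = 4092 bytes: the entire index lives in one page. */
    typedef struct {
        uint32_t count;
        uint32_t keys[511];
        uint32_t offsets[511];
    } HotIndex;

    HotIndex *make_hot_index(void) {
        /* C11 aligned_alloc: the size must be a multiple of the alignment. */
        return aligned_alloc(PAGE_SIZE, PAGE_SIZE);
    }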

When you factor in 64-byte CPU cache lines, there's even more incentive to adjust your data structure layout. But you don't want it too compact either, or you'll start paying the performance penalty of cache-line invalidations from false sharing.
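
A common mitigation, sketched here with a hypothetical per-thread counter array, is to pad each hot item out to its own 64-byte line:

    #include <stdalign.h>
    #include <stdint.h>

    #define CACHE_LINE 64

    /* alignas(64) forces sizeof(struct padded_counter) up to 64, so each
     * counter owns a full cache line and threads never invalidate each
     * other's lines when they write. */
    struct padded_counter {
        alignas(CACHE_LINE) uint64_t value;
    };

    static struct padded_counter counters[8];   /* e.g. one per thread */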

OTHER TIPS

The answer is highly OS-specific. Generally speaking, there is no point in caching this data: both the "cached" copy and the memory-mapped pages can be paged out at any time.

If there is any difference, it will be OS-specific; unless you need that level of control, there is no point in caching the data.

Licensed under: CC-BY-SA with attribution