Question

In my previous question, "Data structure for storing huge number of indices, each pointing to a set", I got an answer on a suitable data structure for my inverted index implementation. However, we may soon hit the 128 GB RAM limit on our Linux server, so I would like to be prepared in case we run out of memory again.

Right now, the inverted index holds as many as 3.9 billion indices, which takes about 50 GB of RAM. Some people may suggest a database system, but this is experimental research: we would like to manage our own data, and we will NOT be using any sort of database system.

I have also been pointed to "When should I use mmap for file access?" While this looks promising, from what I found by searching, I would need to allocate a fixed amount of space for mmap first and then start putting data in. My first problem (1) is that as our data grows, the inverted index will grow too, but I do not know its exact size until I have built it (some data needs to be processed before it is pushed into the inverted index). I could allocate a lot of space for it, but the current inverted index alone already takes 50 GB of RAM. That leads to the second problem (2): our server is shared by many users, and with 50 GB of space or more, our data will become fragmented across the hard disks.

Alternatively, what if I use file I/O to manage this and build a B-tree-like hierarchical directory? Things might become ugly...

So, as in my previous question, I would like to ask for suggestions, but this time I will need to swap data between RAM and hard disk, since our 128 GB of RAM might not hold it all.


Solution

I would add more swap space to the system and let the kernel take care of swapping, if that is possible.

If that is not possible, I would think about clustering the data in blocks by index key, and then compressing/decompressing blocks in memory on access, or swapping them out to disk.

Licensed under: CC-BY-SA with attribution