Question

I have a very big table:

id1 id2 count
1   234   4
1    5    123
1   432   5
23  234   7

id1 and id2 have many different values. count has a limited range of values (1-30000 or so), and I know that most of them equal 1.

This table takes about 10 GB of memory when I store it in a .NET dictionary. I want to find a memory-efficient data structure to store this data.

A perfect hash might be the ideal solution, but the problem is collisions: I can be asked for values for ids which aren't present in the table. Maybe a DAWG can help? Or something else?


The main purpose of the data structure is to look up count by id1 and id2.


Solution

If almost all of the counts are 1, then you could use two data structures: a HashSet containing all the id-pairs whose count is 1, and a Dictionary for the id-pairs whose count is greater than 1. This makes incrementing and checking counts a bit slower, but it should save some space. (I don't know how .NET data structures are laid out internally, so I hesitate to guess, but if it were C++, I'd say it would reduce space consumption by something like 25-30%, depending on the value of "almost all".)
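A minimal sketch of this two-structure idea, written in Python rather than C# (set and dict standing in for HashSet and Dictionary); the class and method names are illustrative, not from the question:

```python
# Split counter: pairs with count == 1 live in a set (cheap),
# pairs with count > 1 live in a dict. Since the question says
# most counts are 1, the dict should stay small.

class SplitCounter:
    def __init__(self):
        self.ones = set()   # (id1, id2) pairs whose count is exactly 1
        self.more = {}      # (id1, id2) -> count, only for counts > 1

    def increment(self, id1, id2):
        key = (id1, id2)
        if key in self.more:
            self.more[key] += 1
        elif key in self.ones:
            self.ones.discard(key)   # promote from the "ones" set
            self.more[key] = 2
        else:
            self.ones.add(key)       # first occurrence: count is 1

    def count(self, id1, id2):
        key = (id1, id2)
        if key in self.more:
            return self.more[key]
        return 1 if key in self.ones else 0
```

Lookups check the dict first, then the set, so a pair never lives in both structures at once.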

If that's not enough space savings, here's an outline of some possibilities, although they can be a lot of work for uncertain gain:

In general, the cost of a container data structure is composed of the size of each element's data, plus some per-element overhead, plus some per-container overhead. Hash tables have a medium amount of per-element overhead (one link to the next element in the bucket, plus allocation/alignment overhead); binary trees have a lot of per-element overhead (two, or more commonly three, links, plus allocation/alignment overhead). Vectors technically have no per-element overhead, but they are usually overallocated to reduce insertion time, so you should think of them as having 50-100% per-element overhead.

One consequence is that if you can figure out a way to reduce the number of elements, you can often save space. For example, you could use a HashSet of id-pairs, as I suggested above. But if there are far fewer distinct id1 values than pairs -- i.e., if the ids repeat -- then you could replace that with a dictionary mapping id1 to a vector of id2 values, which might reduce overhead. There's a big downside to this: it makes lookup and insertion much more expensive; furthermore, it only helps if the hash table's per-element overhead is more than the expected vector overallocation overhead.
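The grouping idea can be sketched like this (again in Python, with a list standing in for the vector; helper names are made up for illustration):

```python
# Group pairs by id1: one dict entry per distinct id1, holding a
# plain list of id2 values. This trades the hash table's per-pair
# overhead for one list per id1, but membership tests become a
# linear scan of that list.

from collections import defaultdict

def build_grouped(pairs):
    grouped = defaultdict(list)
    for id1, id2 in pairs:
        grouped[id1].append(id2)
    return grouped

def contains(grouped, id1, id2):
    # Linear scan; sorting each list and bisecting would be faster.
    return id2 in grouped.get(id1, [])
```

This only pays off when id1 values repeat heavily, exactly as the paragraph above cautions.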

OTHER TIPS

Do you have a reasonably small upper bound on id1 and id2? If so, then you could store them as a single number; for example, if you had an upper bound of 255 on both numbers, you could store them as id = id1 + id2 * 256. If need be, you can then extract id1 = id % 256 and id2 = id / 256 (using integer division).
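In Python, the packing and unpacking from the example above (assuming the bound of 255 on each id) looks like:

```python
# Pack two bounded ids into one integer key, and recover them.
# base = 256 matches the example bound of 255 on each id.

def pack(id1, id2, base=256):
    assert 0 <= id1 < base and 0 <= id2 < base
    return id1 + id2 * base

def unpack(key, base=256):
    return key % base, key // base   # (id1, id2); // is integer division
```

For larger bounds you would just raise `base` (or shift by a bit width instead of multiplying).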

Now that you have a single index for each id pair, and because most of the counts are 1, you can store this as a sparse array (usually the "empty" values of a sparse array are 0 or null; in your case they're 1).
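A sketch of such a sparse array with 1 as the implicit default, keyed by the packed id (a dict stands in for the sparse array; note that with this scheme, an id pair that was never stored also reads back as 1):

```python
# Sparse array over packed ids: only counts != 1 are stored
# explicitly; everything else is implicitly 1.

class SparseCounts:
    def __init__(self):
        self.explicit = {}   # packed id -> count, only where count != 1

    def set(self, key, count):
        if count == 1:
            self.explicit.pop(key, None)  # default value: store nothing
        else:
            self.explicit[key] = count

    def get(self, key):
        return self.explicit.get(key, 1)
```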

If there isn't a good way to combine the two ids into a single index, then you can store this as a sparse matrix, with id1 as the x value and id2 as the y value (or vice versa).
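One simple way to sketch that sparse matrix is a dict of rows (id1 → {id2: count}), again with 1 as the implicit default; as with the sparse array, unseen pairs read back as 1:

```python
# Sparse matrix as a dict of rows: id1 -> {id2: count}, storing
# only counts != 1. Empty rows are dropped to keep it sparse.

class SparseMatrix:
    def __init__(self):
        self.rows = {}

    def set(self, id1, id2, count):
        if count == 1:
            row = self.rows.get(id1)
            if row is not None:
                row.pop(id2, None)
                if not row:
                    del self.rows[id1]   # drop now-empty row
        else:
            self.rows.setdefault(id1, {})[id2] = count

    def get(self, id1, id2):
        return self.rows.get(id1, {}).get(id2, 1)
```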

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow