Question

Let's say I'm storing 1000 objects in a hashmap. This hashmap is extended to let me map three-dimensional coordinates to the objects stored in it; the objects inside have a fixed size. The hash key is a long integer.

How would I go about figuring out (mathematically) the probable overhead for this structure?

  1. Is it significant enough that, for instance, if the data inside is around 256 MB, the overhead will matter?
  2. Is there a reliable way (aside from a profiler, which I've found unreliable in some cases) to mathematically calculate what its overhead should be?

I'm not interested in the total size of the hashmap - only the overhead that using the hashmap will incur. For instance, if I have 10 ints, they're 4 bytes apiece, so that's 40 bytes. If I stick them in an array, I get a constant overhead of 12 bytes - 8 for the object header, 4 for the length. If I put them in another structure (a TreeSet, for instance), my overhead is no longer constant, because a tree needs nodes - so I might get an overhead expressed in terms of n, where n is the number of items in the set.

A few things are obvious to me, which I'll give as my starting point here.

  1. I will need to store at least 1000 longs. These are nullable, so they're actually boxed objects. I will therefore assume that the 8-byte long value also carries an 8-byte object header, and add a factor of 16n.
  2. I will also need a reference to every object, which must exist whether or not the object has been retrieved from the map and is in use; so that's an additional 8 bytes per object. We could count these against the data size instead, but since the references live in the hashmap itself, I feel it's best to treat them as part of the overhead. My logic is as follows: if I took all of the data out of the hashmap and held it in variables, those n references would still exist in the hashmap as long as I didn't remove the entries, which I won't be doing. The set of objects is constant, though they may be recycled under a different key.
  3. The hashmap itself has an overhead of 8 bytes.
  4. The hashmap must store the number of items inside (or so I think!) so that's 4 bytes.
  5. I will suppose ignorantly that the hash keys are in an array, sorted by hash key order. That's 12 bytes for the array.
  6. I will assume ignorantly as well that the objects are in a matching array, which it dereferences when it finds the key. I will guess another 12 bytes.

This gives me a simple expression for the overhead: 36 + 24n

Thus I have a guess of 24036 bytes of overhead for 1000 data objects with long keys. That's a fairly insignificant overhead, but my question for you is: what is the real overhead, just sitting there?
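
Collected in one place, that model looks like this (the coefficients are my own guesses from the points above, not measurements):

    // My guessed model, gathered from the points above - a guess, not a measurement.
    //   per item: 16 (boxed long key: 8-byte header + 8-byte value)
    //           +  8 (reference held by the map)            = 24 bytes
    //   constant:  8 (map header) + 4 (size field)
    //           + 12 + 12 (the two guessed arrays)          = 36 bytes
    static long guessedOverheadBytes(long n) {
        return 24 * n + 36;   // 24 * 1000 + 36 = 24036 bytes
    }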


A secondary, related question: how much does this vary from JVM to JVM? Is there any JVM-independent way to figure it out? To illustrate what I mean, consider a JVM that only has 32-bit object headers - looking at arrays, you might say that even though the size varies from JVM to JVM, it's a fair estimate that the overhead on an array would become 8 bytes instead of 12 in that case.

I'm assuming a fixed implementation of HashMap across the same version of Java.


I could try to read the source code or run a profiler; however, that may produce misleading results specific to my JVM. I'm asking for your help - perhaps from someone who knows - for some piece of information that neither of us already has about the situation. Thanks!


See the answer below; the actual estimate can be expressed as follows:

8 words per entry, plus 8 bytes for each long, plus 8 bytes for the hashmap object header.

In my present environment (32-bit OS) that makes 1 word = 4 bytes.

  • 40n + 8 in a 32-bit environment: ~40 KB for 1000 entries.
  • 72n + 8 in a 64-bit environment: ~72 KB for 1000 entries.

So either way it comes in under 100 KB.
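
Spelled out, the arithmetic behind those two figures (the 8-words-per-entry cost comes from the answer below; the word size is the only thing that changes between the two cases):

    // Back-of-the-envelope estimate only, not a measurement.
    static long estimatedOverheadBytes(long entries, int wordSize) {
        long perEntry = 8L * wordSize   // 8 words of HashMap bookkeeping per entry
                      + 8;              // plus the 8-byte long key itself
        return perEntry * entries + 8;  // plus the HashMap object header
    }
    // estimatedOverheadBytes(1000, 4) == 40008   (~40 KB, 32-bit)
    // estimatedOverheadBytes(1000, 8) == 72008   (~72 KB, 64-bit)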


Solution

The following blog post provides some loose mathematics on the topic.
This Google Code site provides a look at how these things are done.

Quoting the links in case of link rot:

This is the cheat-sheet I compiled.

To compute the cost of a single (key, value) entry:

    If you use HashMap or ConcurrentHashMap, the cost is 8 words (32 bytes)


 So, consider this example from the javadoc:

   LoadingCache<Key, Graph> graphs = CacheBuilder.newBuilder()
       .maximumSize(10000)
       .expireAfterWrite(10, TimeUnit.MINUTES)
       .removalListener(MY_LISTENER)
       .build(
           new CacheLoader<Key, Graph>() {
             public Graph load(Key key) throws AnyException {
               return createExpensiveGraph(key);
             }
           });


The cost of an entry in this structure is computed as follows:

    It's a Cache: +12 words
    It uses maximumSize(): +4 words
    It uses expiration: +4 words

Thus, each (key, value) entry would have a footprint of 20 words (that is, 80 bytes in a 32-bit VM, or 160 bytes in a 64-bit one).

To estimate the overhead imposed on the garbage collector, one can count how many references (pointers) each entry introduces, which the garbage collector has to traverse to compute object reachability. The same list again, this time counting only references:

    If you use HashMap or ConcurrentHashMap, the cost is 5 references
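
As a rough idea of where the 8 words quoted above come from, here is a hedged sketch based on the classic (pre-Java 8) HashMap entry layout - not the exact JDK source, and the precise count varies with JDK version and table load factor:

    // Simplified sketch of the classic HashMap entry, with approximate
    // 32-bit word counts; the real JDK source differs in detail.
    static class Entry<K, V> {        // object header              ~2 words
        final K key;                  // reference to the key        1 word
        V value;                      // reference to the value      1 word
        Entry<K, V> next;             // collision-chain link        1 word
        final int hash;               // cached hash code            1 word
    }
    // That is roughly 6 words per entry object, plus the slot in the internal
    // Entry[] table that points to it; since the table holds more slots than
    // entries at the default load factor, the quoted figure of 8 words is a
    // reasonable rounded-up total.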

OTHER TIPS

Create a program that creates all your objects and stores them in a simple array. Measure the used memory (see Runtime).

Then store them in a HashMap. Measure the used memory.

Subtract the first measurement from the second, and you have the overhead of the HashMap.
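
A minimal sketch of that measurement, assuming a placeholder Payload class standing in for the real fixed-size objects (the readings are approximate at best, since System.gc() is only a hint):

    import java.util.HashMap;
    import java.util.Map;

    public class HashMapOverhead {
        // Placeholder for the real fixed-size data object.
        static class Payload { long a, b, c; }

        static long usedMemory() {
            Runtime rt = Runtime.getRuntime();
            System.gc();                          // best-effort; readings stay approximate
            return rt.totalMemory() - rt.freeMemory();
        }

        public static void main(String[] args) {
            int n = 1000;

            long before = usedMemory();
            Payload[] plain = new Payload[n];
            for (int i = 0; i < n; i++) plain[i] = new Payload();
            long withArray = usedMemory();

            Map<Long, Payload> map = new HashMap<Long, Payload>();
            for (int i = 0; i < n; i++) map.put((long) i, plain[i]);
            long withMap = usedMemory();

            System.out.println("objects alone:    " + (withArray - before) + " bytes");
            System.out.println("HashMap overhead: " + (withMap - withArray) + " bytes");
        }
    }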

  1. Is it significant enough that, for instance, if the data inside is around 256 MB, the overhead will matter?

Definitely not. The overhead of 1000 objects in a HashMap isn't even worth worrying about in any event; if the data they hold totals around 256 MB, even less so. If the overhead were 256 KB, which it isn't, it would amount to only about 0.1%. Not significant.

  1. Is there a reliable way (aside from a profiler, which I've found unreliable in some cases) to mathematically calculate what its overhead should be?

Given my answer to (1), the question is moot.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow