When you read a large data structure (larger than 32 KB) how you read that data structure impact performance.
These are the typical sizes and speeds of you caches.
L1: 32 KB, 4 clock cycles.
L2: 256 KB, 11 clock cycles.
L3: 3-30 MB, 40-75 clock cycles.
Main memory: up to 2TB, 200-500 clock cycles.
This means cache locality is very important. i.e. if you are reading something from L1 it can be 20x faster than reading from L3.
In your case you are using a Hash data structure. This is designed for random access and random arrangement which unfortunately mean it has very poor cacheability. Access memory randomly and it is likely to be in the slower areas of memory.
However, there is an exception to this. If you access the same data multiple times, e.g. get a key you just got from an iterator, or you are scanning through a collection in order e.g. this is what the iterator does for HashMap (not so much a TreeMap) it is much more likely that the next piece of data you will access is on the same cache line (each cache line is 64-bytes long) or the next one. These type of accesses perform much better as CPU are designed to perform vector operations very fast.
BTW Your working set is the set of keys, if your values are different objects I would expect this to be much slower if you actually look at those objects (as this increases the size of your working set and how much memory would be required to cache it)