Preventing double hash operation when trying to update value in a of type Dictionary<IComparable, int>

https://stackoverflow.com/questions/11728546

23-06-2021
|

Question

I am working on software for scientific research that deals heavily with chemical formulas. I keep track of the contents of a chemical formula using an internal Dictionary<Isotope, int> where Isotope is an object like "Carbon-13", "Nitrogen-14", and the int represents the number of those isotopes in the chemical formula. So the formula C2H3NO would exist like this:

{"C12", 2
"H1", 3
"N14", 1
"O16", 1}

This is all fine and dandy, but when I want to add two chemical formulas together, I end up having to calculate the hash function of Isotope twice to update a value, see follow code example.

public class ChemicalFormula {
    internal Dictionary<Isotope, int> _isotopes = new Dictionary<Isotope, int>();

    public void Add(Isotope isotope, int count)
    {
        if (count != 0)
        {
            int curValue = 0;
            if (_isotopes.TryGetValue(isotope, out curValue))
            {
                int newValue = curValue + count;
                if (newValue == 0)
                {
                    _isotopes.Remove(isotope);
                }
                else
                {
                    _isotopes[isotope] = newValue;
                }
            }
            else
            {
                _isotopes.Add(isotope, count);
            }
            _isDirty = true;
        }
    }
}

While this may not seem like it would be a slow down, it is when we are adding billions of chemical formulas together, this method is consistently the slowest part of the program (>45% of the running time). I am dealing with large chemical formulas like "H5921C3759N1023O1201S21" that are consistently being added to by smaller chemical formulas.

My question is, is there a better data structure for storing data like this? I have tried creating a simple IsotopeCount object that contains a int so I can access the value in a reference-type (as opposed to value-type) to avoid the double hash function. However, this didn't seem beneficial.

EDIT Isotope is immutable and shouldn't change during the lifetime of the program so I should be able to cache the hashcode.

I have linked to the source code so you can see the classes more in depth rather than me copy and paste them here.

Solution

I second the opinion that Isotope should be made immutable with precalculated hash. That would make everything much simpler.

(in fact, functionally-oriented programming is better suited for calculations of such sort, and it deals with immutable objects)

OTHER TIPS

I have tried creating a simple IsotopeCount object that contains a int so I can access the value in a reference-type (as opposed to value-type) to avoid the double hash function. However, this didn't seem beneficial.

Well it would stop the double hashing... but obviously it's then worse in terms of space. What performance difference did you notice?

Another option you should strongly consider if you're doing this a lot is caching the hash within the Isotope class, assuming it's immutable. (If it's not, then using it as a dictionary key is at least somewhat worrying.)

If you're likely to use most Isotope values as dictionary keys (or candidates) then it's probably worth computing the hash during initialization. Otherwise, pick a particularly unlikely hash value (in an ideal world, that would be any value) and use that as the "uncached" value, and compute it lazily.

If you've got 45% of the running time in GetHashCode, have you looked at optimizing that? Is it actually GetHashCode, or Equals which is the problem? (You talk about "hashing" but I suspect you mean "hash lookup in general".)

If you could post the relevant bits of the Isotope type, we may be able to help more.

EDIT: Another option to consider if you're using .NET 4 would be ConcurrentDictionary, with its AddOrUpdate method. You'd use it like this:

public void Add(Isotope isotope, int count)
{
    // I prefer early exit to lots of nesting :)
    if (count == 0)
    {
        return 0;
    }

    int newCount = _isotopes.AddOrUpdate(isotope, count, 
                                         (key, oldCount) => oldCount + count);
    if (newCount == 0)
    {
        _isotopes.Remove(isotope);
    }
    _isDirty = true;
}

Do you actually require random access to Isotope count by type or are you using the dictionary as a means for associating a key with a value?

I would guess the latter.

My suggestion to you is not to work with a dictionary but with a sorted array (or List) of IsotopeTuples, something like:

class IsotopeTuple{
   Isotope i;
   int count;
}

sorted by the name of the isotope.

Why the sorting?

Because then, when you want to "add" two isotopes together, you can do this in linear time by traversing both arrays (hope this is clear, I can elaborate if needed). No hash computation required, just super fast comparisons of order.

This is a classic approach when dealing with vector multiplications where the dimensions are words. Used widely in text mining.

the tradeoff is of course that the construction of the initial vector is (n)log(n), but I doubt if you will feel the impact.

Another solution that you could think of if you had a limited number of Isotopes and no memory problems:

public struct Formula
{
   public int C12;
   public int H1;
   public int N14;
   public int O16;
}

I am guessing you're looking at organic chemistry, so you may not have to deal with that many isotopes, and if the lookup is the issue, this one will be pretty fast...

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow