Quickly creating 32 bit hash code uniquely identifying a struct composed of (mostly) primitive values

StackOverflow https://stackoverflow.com/questions/10340628

Question

EDIT: 64 or 128 bit would also work. My brain just jumped to 32 bit for some reason, thinking it would be sufficient.

I have a struct that is composed of mostly numeric values (int, decimal), and 3 strings that are never more than 12 alpha-characters each. I'm trying to create an integer value that will work as a hash code, and trying to create it quickly. Some of the numeric values are also nullable.

It seems like BitVector32 or BitArray would be useful entities in this endeavor, but I'm just not sure how to bend them to my will in this task. My struct contains 3 strings, 12 decimals (7 of which are nullable), and 4 ints.

To simplify my use case, let's say you have the following struct:

public struct Foo
{
    public decimal MyDecimal;
    public int? MyInt;
    public string Text;
}

I know I can get numeric identifiers for each value. MyDecimal and MyInt are of course unique, from a numerical standpoint. And the string has a GetHashCode() function which will return a usually-unique value.

So, with a numeric identifier for each, is it possible to generate a hash code that uniquely identifies this structure? E.g., I can compare two different Foos containing the same values and get the same hash code every time (regardless of app domain, restarting the app, time of day, alignment of Jupiter's moons, etc.).

The hash would be sparse, so I don't anticipate collisions from my use cases.

Any ideas? My first run at it, I converted everything to a string representation, concatenated it, and used the built-in GetHashCode(), but that seems terribly... inefficient.

EDIT: A bit more background information. The structure data is being delivered to a web client, and the client does a lot of computation on the included values, string construction, etc. to re-render the page. The aforementioned 19-field structure represents a single unit of information, and each page could have many units. I'd like to do some client-side caching of the rendered result, so I can quickly re-render a unit without recomputing on the client side if I see the same hash identifier from the server. JavaScript numeric values are all 64 bit, so I suppose my 32 bit constraint is artificial and limiting. 64 bit would work, or I suppose even 128 bit if I can break it into two 64 bit values on the server.


Solution

Well, even with a sparse table, one had better be prepared for collisions, depending on what "sparse" means.

[Graph: probability of a hash collision vs. number of hashed values, assuming a uniform distribution]

You would need to be able to make very specific assumptions about the data you will be hashing at any one time to beat this graph with 32 bits.

Go with SHA256. Your hashes will not depend on the CLR version and you will have no collisions. Well, you will still have some, but less often than meteorite impacts, so you can afford not to anticipate any.
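As an illustration only, here is a minimal sketch of that approach for the Foo struct from the question. The serialization details (field order, the null marker, UTF-8 encoding) are arbitrary choices of this sketch; all that matters is that they are fixed and culture-independent, so the digest stays stable across runs and machines.

using System.IO;
using System.Security.Cryptography;
using System.Text;

public static class FooHasher
{
    // Writes the fields in a fixed order and hashes the bytes, producing
    // a stable 32-byte digest that does not depend on the CLR version.
    public static byte[] Sha256Of(Foo foo)
    {
        using (var ms = new MemoryStream())
        using (var writer = new BinaryWriter(ms, Encoding.UTF8))
        {
            writer.Write(foo.MyDecimal);
            writer.Write(foo.MyInt.HasValue);             // null marker for the nullable int
            writer.Write(foo.MyInt.GetValueOrDefault());
            writer.Write(foo.Text ?? string.Empty);       // BinaryWriter length-prefixes strings
            writer.Flush();

            using (var sha = SHA256.Create())
                return sha.ComputeHash(ms.ToArray());
        }
    }
}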

OTHER TIPS

Two things I suggest you take a look at are here and here. I don't think you'll be able to GUARANTEE no collisions with just 32 bits.

Hash codes, by definition of a hash function, are not meant to be unique. They are only meant to be as evenly distributed across all result values as possible. Getting a hash code for an object is meant to be a quick way to check whether two objects are different. If the hash codes of two objects differ, the objects are different. But if the hash codes are the same, you have to deeply compare the objects to be sure. Hash codes' main use is in hash-based collections, where they make nearly O(1) retrieval possible.

So in this light, your GetHashCode does not have to be complex, and in fact it shouldn't be. It must balance being very quick with producing evenly distributed values. If it takes too long to compute, the hash code is pointless, because its advantage over a deep compare is gone. If, at the other extreme, the hash code were always 1, for example (lightning fast), it would lead to a deep compare in every case, which makes the hash code pointless too.

So get the balance right and don't try to come up with a perfect hash code. Call GetHashCode on all (or most) of your members and combine the results using the XOR operator, perhaps together with a bitwise shift (<< or >>). The framework types have quite optimized GetHashCode implementations, although they are not guaranteed to return the same value in each application run. There is no such guarantee, but they also do not have to change, and many of them don't. Use a decompiler such as Reflector to make sure, or create your own versions based on the reflected code.
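As a sketch of that kind of combination, applied to the question's Foo (the shift amounts and field order here are arbitrary choices, not a prescription):

public struct Foo
{
    public decimal MyDecimal;
    public int? MyInt;
    public string Text;

    public override int GetHashCode()
    {
        // Mix each member's hash into the running value; the shifts keep
        // equal values in different fields from simply cancelling out.
        int hash = MyDecimal.GetHashCode();
        hash = (hash << 5) ^ (hash >> 27) ^ MyInt.GetHashCode();               // Nullable<int>.GetHashCode() returns 0 for null
        hash = (hash << 5) ^ (hash >> 27) ^ (Text ?? string.Empty).GetHashCode();
        return hash;
    }
}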

In your particular case, deciding whether you have already processed a structure by looking only at its hash code is a bit risky. The better the hash, the smaller the risk, but it never disappears. The ultimate, and only truly unique, hash code is... the data itself. When working with hash codes you must also override Object.Equals for your code to be truly reliable.
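If you go that route, a matching Equals override for the Foo sketched above could be a simple field-by-field value comparison, for example:

public override bool Equals(object obj)
{
    if (!(obj is Foo)) return false;
    Foo other = (Foo)obj;

    // Field-by-field comparison; == on int? and string compares values, not references.
    return MyDecimal == other.MyDecimal
        && MyInt == other.MyInt
        && Text == other.Text;
}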

I believe the usual method in .NET is to call GetHashCode on each member of the structure and xor the results.

However, I don't think GetHashCode claims to produce the same hash for the same value in different app domains.

Could you give a bit more information in your question about why you want this hash value and why it needs to be stable over time, across different app domains, etc.?

What goal are you after? If it is performance, then you should use a class, since a struct will be copied by value whenever you pass it as a function parameter.

3 strings, 12 decimals (7 of which are nullable), and 4 ints.

On a 64 bit machine a pointer is 8 bytes, a decimal takes 16 bytes, and an int 4 bytes. Ignoring padding, your struct will use 3*8 + 12*16 + 4*4 = 232 bytes per instance. This is much bigger than the recommended maximum of 16 bytes, a limit that makes sense perf-wise (classes take up at least 16 bytes due to their object header, ...).

If you need a fingerprint of the value, you can use a cryptographic-grade hash algorithm like SHA256, which produces a 32 byte fingerprint. This is still not unique, but at least unique enough. It will cost quite some performance as well, though.

Edit1: After you made clear that you need the hash code to identify the object in a JavaScript web client cache, I am confused. Why does the server send the same data again? Would it not be simpler to make the server smart enough to send only data the client has not yet received?

A SHA hash algorithm could be OK in your case to create an object instance tag.
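If the client only needs an opaque cache key, one option (a sketch building on the hypothetical Sha256Of helper shown in the accepted answer's example above) is to hex-encode a truncation of the digest. Truncating to 64 bits raises the collision odds, so keep as many bytes as the client can comfortably handle.

using System;

public static class CacheKeys
{
    // Hypothetical helper: hex-encode the first 8 bytes (64 bits) of the
    // SHA256 digest so the JavaScript client can use it as a cache key.
    public static string For(Foo foo)
    {
        byte[] digest = FooHasher.Sha256Of(foo);
        return BitConverter.ToString(digest, 0, 8).Replace("-", "");
    }
}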


Why do you need a hash code at all? If your goal is to store the values in a memory-efficient manner, you can create a FooList which uses dictionaries to store identical values only once and uses an int as the lookup key.

using System;
using System.Collections.Generic;

namespace MemoryEfficientFoo
{
    class Foo // This is our data structure 
    {
        public int A;
        public string B;
        public Decimal C;
    }

    /// <summary>
    /// List which stores Foos with much less memory when many values repeat. You can cut memory consumption
    /// by a factor of 3; but if all values are different, you consume 5 times as much memory as a plain list!
    /// So beware: this trick only helps when many values are repeated.
    /// </summary>
    class FooList : IEnumerable<Foo> 
    {
        Dictionary<int, string> Index2B = new Dictionary<int, string>();
        Dictionary<string, int> B2Index = new Dictionary<string, int>();

        Dictionary<int, Decimal> Index2C = new Dictionary<int, decimal>();
        Dictionary<Decimal,int> C2Index = new Dictionary<decimal,int>();

        struct FooIndex
        {
            public int A;
            public int BIndex;
            public int CIndex;
        }

        // List of foos which do contain only the index values to the dictionaries to lookup the data later.
        List<FooIndex> FooValues = new List<FooIndex>();

        public void Add(Foo foo)
        {
            int bIndex;
            if(!B2Index.TryGetValue(foo.B, out bIndex))
            {
                bIndex = B2Index.Count;
                B2Index[foo.B] = bIndex;
                Index2B[bIndex] = foo.B;
            }

            int cIndex;
            if (!C2Index.TryGetValue(foo.C, out cIndex))
            {
                cIndex = C2Index.Count;
                C2Index[foo.C] = cIndex;
                Index2C[cIndex] = foo.C; // store the decimal itself, not the index
            }

            FooIndex idx = new FooIndex
            {
                A = foo.A,
                BIndex = bIndex,
                CIndex = cIndex
            };

            FooValues.Add(idx);
        }

        public Foo GetAt(int pos)
        {
            var idx = FooValues[pos];
            return new Foo
            {
                A = idx.A,
                B = Index2B[idx.BIndex],
                C = Index2C[idx.CIndex]
            };
        }

        public IEnumerator<Foo> GetEnumerator()
        {
            for (int i = 0; i < FooValues.Count; i++)
            {
                yield return GetAt(i);
            }
        }
        System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
        {
            return GetEnumerator();
        }
    }


    class Program
    {
        static void Main(string[] args)
        {
            FooList list = new FooList();
            List<Foo> fooList = new List<Foo>();
            long before = GC.GetTotalMemory(true);
            for (int i = 0; i < 1000 * 1000; i++)
            {
                list        // swap with fooList to measure a plain List<Foo> instead
                //fooList
                    .Add(new Foo
                    {
                        A = i,
                        B = "Hi",
                        C = i
                    });

            }
            long after = GC.GetTotalMemory(true);
            Console.WriteLine("Did consume {0:N0}bytes", after - before);
        }
    }
}

A similar memory-conserving list can be found here.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow