Why can't I retrieve an item from a HashSet without enumeration?

https://stackoverflow.com/questions/1494812

18-09-2019
|

Question

I'm looking for insight into the heads of HashSet designers. As far as I am aware, my question applies to both Java and C# HashSets, making me think there must be some good reason for it, though I can't think of any myself.

After I have inserted an item into a HashSet, why is it impossible to retrieve that item without enumeration, hardly an efficient operation? Especially since a HashSet is explicitly built in a way which supports efficient retrieval.

It would often be useful to me to have Remove(x) and Contains(x) return the actual item that is being removed or contained. This is not necessarily the item I pass into the Remove(x) or Contains(x) function. Sure, I guess I could achieve the same effect through a HashMap but why waste all that space and effort when it should be perfectly possible to do this with a set?

I can appreciate that there may be some design concerns that adding this functionality would allows uses of HashSet which are not consistent with their role or future role in the framework, but if this is so, what are these design concerns?

Edit

To answer some more questions, here are more details:

I am using an immutable reference type with overridden hashcode, equals, etc to emulate a value type in C#. Let's say the type has members A, B, and C. Hashcode, equals, etc depend only on A and B. Given some A and B I want to be able to retrieve that equivalent item from a hashset and get it's C. I won't be able to use HashSet for this it appears, but I would at least like to know if there is any good reason for this. Pseudo code follows:

public sealed class X{
 object A;
 object B;
 object extra;

 public int HashCode(){
  return A.hashCode() + B.hashCode();
 }

 public bool Equals(X obj){
  return obj.A == A && obj.B == B;
 }
}

hashset.insert(new X(1,2, extra1));
hashset.contains(new X(1,2)); //returns true, but I can't retrieve extra

Solution

How were you proposing to retrieve the item from the hash set? A set is by definition not ordered in any way and therefore, there is no index with which to use to retrieve the object in question.

Sets, as a concept, are used to test inclusion, i.e. whether or not the element in question is in the hash data set. If you're looking to retrieve a value from a data source using a key value or index, I would suggest looking into either a Map or a List.

EDIT: Additional answer based on the Edit to the original question

Soonil, based on your new information, it looks like you might be interested in implementing your data as a Java Enum, something similar to this:

 public enum SoonilsDataType {
      A, B, C;

      // Just an example of what's possible
      public static SoonilsDataType getCompositeValue(SoonilsDataType item1,
           SoonilsDataType item2) {
           if (item1.equals(A) && 
                     item2.equals(B)) {
                return C;
           }
      }
 }

Enum's automatically inherit values() which returns the list of all values in the enum's "set", which you can use to test inclusion against in the same way as the Set. Also, because its a full class, you can define new static methods to do the composite logic (like I was trying to allude to in the example code). The only thing about the Enum is that you can't add new instances at runtime, which may not be what you want (though if the set's data size isn't going to grow at runtime, the Enum is what you want).

OTHER TIPS

In .Net, what you are probably looking for is KeyedCollection http://msdn.microsoft.com/en-us/library/ms132438.aspx

You can get around the nastiness of re-implementing this abstract class each time with some "generic" cleverness. (See IKeyedObject`1.)

Note: Any data transfer object which implements IKeyedObject`1 should have an overridden GetHashCode method simply returning this.Key.GetHashCode(); and same goes for equals...

My Base Class Library usually ends up with something like this in it:

public class KeyedCollection<TItem> : System.Collections.ObjectModel.KeyedCollection<TItem, TItem>
    where TItem : class
{
    public KeyedCollection() : base()
    {
    }

    public KeyedCollection(IEqualityComparer<TItem> comparer) : base(comparer)
    {
    }

    protected override TItem GetKeyForItem(TItem item)
    {
        return item;
    }
}

public class KeyedObjectCollection<TKey, TItem> : System.Collections.ObjectModel.KeyedCollection<TKey, TItem>
    where TItem : class, IKeyedObject<TKey>
    where TKey : struct
{
    public KeyedCollection() : base()
    {
    }

    protected override TItem GetKeyForItem(TItem item)
    {
        return item.Key;
    }
}

///<summary>
/// I almost always implement this explicitly so the only
/// classes that have access without some rigmarole
/// are generic collections built to be aware that an object
/// is keyed.
///</summary>
public interface IKeyedObject<TKey>
{
    TKey Key { get; }
}

If you change an object after it has been inserted, it's hash may have changed (this is especially likely if hashCode() has been overridden). If the hash changes, a lookup of it in the set will fail, as you will be attempting to lookup an object that is hashed at a different location than it is stored in.

Also, you need to make sure you have overridden hashCode and equals in your object if you want to lookup equal objects that are different instances.

Note that this is all for Java - I am assuming C# has something similar, but as it has been several years since I used C#, I will let others speak to it's capabilities.

I imagine the designers of the Set interface and HashSet class wanted to ensure that the remove(Object) method defined on the Collection interface was also applicable to Set; this method returns a boolean denoting whether the object was successfully removed. If the designers wanted to provide functionality whereby remove(Object) returned the "equal" object already in the Set this would mean a different method signature.

Also, given that the object being removed is logically equal to the object passed to remove(Object) it is arguable about the value added in returning the contained object. However, I have had this problem myself before and have used a Map to solve the problem.

Note that in Java, a HashSet uses a HashMap internally and so there isn't additional storage overhead in using a HashMap instead.

Why not just use a HashMap<X,X>? This does exactly what you want. Just do .put(x,x) every time and then you can just get the stored element equal to x with .get(x).

This was an oversight from the library designers. As I mentioned under another answer, this method has been added to .NET Framework 4.7.2 (and .NET Core 2.0 before it); see HashSet<T>.TryGetValue. Citing the source:

/// <summary>
/// Searches the set for a given value and returns the equal value it finds, if any.
/// </summary>
/// <param name="equalValue">The value to search for.
/// </param>
/// <param name="actualValue">
/// The value from the set that the search found, or the default value
/// of <typeparamref name="T"/> when the search yielded no match.</param>
/// <returns>A value indicating whether the search was successful.</returns>
/// <remarks>
/// This can be useful when you want to reuse a previously stored reference instead of 
/// a newly constructed one (so that more sharing of references can occur) or to look up
/// a value that has more complete data than the value you currently have, although their
/// comparer functions indicate they are equal.
/// </remarks>
public bool TryGetValue(T equalValue, out T actualValue)

Looks to me like you're actually looking for a Map<X,Y>, where Y is the type of extra1.

(rant below)

The equals and hashCode methods define meaningful object equality. The HashSet class assumes that if two objects are equal as defined by Object.equals(Object) there is no difference between these two objects.

I'd go as far as to say that if the object extra is meaningful, your design is not ideal.

SOLVED. Wishing to find an element seems perfectly valid to me, because the representative used for the search may differ from the found element. This is especially true if elements contain key and value information, and a custom equality comparer compares the key part only. See the code example. The code contains a comparer that implements a custom search and that captures the element found. This requires an instance of the comparer. Clear the reference to the found element. Perform a search by means of Contains. Access the found element. Be aware of multithread issues when sharing the comparer instance.

using System;
using System.Collections.Generic;

namespace ConsoleApplication1 {

class Box
{
    public int Id;
    public string Name;
    public Box(int id, string name)
    {
        Id = id;
        Name = name;
    }
}

class BoxEq: IEqualityComparer<Box>
{
    public Box Element;

    public bool Equals(Box element, Box representative)
    {
        bool found = element.Id == representative.Id;
        if (found)
        {
            Element = element;
        }
        return found;
    }

    public int GetHashCode(Box box)
    {
        return box.Id.GetHashCode();
    }
}

class Program
{
    static void Main()
    {
        var boxEq = new BoxEq();
        var hashSet = new HashSet<Box>(boxEq);
        hashSet.Add(new Box(3, "Element 3"));
        var box5 = new Box(5, "Element 5");
        hashSet.Add(box5);
        var representative = new Box(5, "Representative 5");
        boxEq.Element = null;
        Console.WriteLine("Contains {0}: {1}", representative.Id, hashSet.Contains(representative));
        Console.WriteLine("Found id: {0}, name: {1}", boxEq.Element.Id, boxEq.Element.Name);
        Console.WriteLine("Press enter");
        Console.ReadLine();
    }
}

} // namespace

Set objects in those languages were mostly designed as set of value, not for mutable objects. They check that object put in them are unique by using equals. That is why contains and remove returns boolean, not the object: they check for or remove the value you pass to them.

And actually, if you do a contains(X) on a set, and expect to get a different object Y, that would means X and Y are equals (ie X.equals(Y) => true), but somewhat different, which seems wrong.

I was given an interesting suggestion as to a way to use a Map, by having my own objects define themselves as KeyValuePairs. While a good concept, unfortunately KeyValuePair is not an interface (why not?) and is a struct, which shoots that plan out of the air. In the end I will roll my own Set, as my constraints allow me this option.

Short answer; because the items cannot be guaranteed to be immutable.

I've hit the exact problem you describe, where the HashCode is based on fixed fields within the member class, but the class holds additional information that can be updated without changing the hash.

My solution was to implement a generic MyHashSet<T> based on ICollection<T> but wrapped round a Dictionary<int, List<T>> to provide the required lookup efficiency, where the int key is the HashCode of T. However, this shows that if the HashCode of the member objects can change then the dictionary lookup followed by equality comparison of items in the list will never find the changed items. There is no mechanism for forcing the members to be immutable so the only solution is to enumerate the lot.

After wondering the same thing, and finely being able to look at the source code:

source: http://referencesource.microsoft.com/#System.Core/System/Collections/Generic/HashSet.cs

A set is a collection of unique items (objects or values). In the .net implementation an item is the same as another item (not unique) if the Equals method of the comparer returns true for the two items. Not if the two items have the same hash code. so a check of the existence of an item is a two step process. first using the hashset to minimize the number of items to compere, then the compression itself.

If you wish to retrieve an item, you must be able to supply the retrieving function with a unique identifier. you might know the hash code of the item you want. but that is not enough. as more than one item can have that same hash. you will also need to supply the item itself so that the Equal method can be called. and clearly if you have the item there is no reason to get it.

One could create a data structure that demands that no two unique items ever return the same hash code. and than you could get an item from it. it will be faster of adding*, and retrieving will be possible if you know the hash. if two items that are not equal but return the same hash are put into it the first will be overwritten. as far as I know this Type doesn't exist in .net , and no this is not the same as a dictionary.

*given that the GetHash method is the same.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow