Question

In some library code, I have a List that can contain 50,000 items or more.

Callers of the library can invoke methods that result in strings being added to the list. How do I efficiently check for uniqueness of the strings being added?

Currently, just before adding a string, I scan the entire list and compare each string to the to-be-added string. This starts showing scale problems above 10,000 items.

I will benchmark this, but interested in insight.

  • if I replace the List<> with a Dictionary<>, will ContainsKey() be appreciably faster as the list grows to 10,000 items and beyond? (see the sketch below)
  • if I defer the uniqueness check until after all items have been added, will it be faster? At that point I would need to check every element against every other element, still an O(n²) operation.
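
To make the first bullet concrete, here is a minimal sketch of both approaches; the class and method names are made up for illustration, and a HashSet<string> would work just as well in place of the Dictionary:

using System.Collections.Generic;

public class UniqueStringStore
{
    // Current approach: scan the list before every add, O(n) per insertion.
    private readonly List<string> _items = new List<string>();

    public bool AddIfUniqueViaList(string s)
    {
        if (_items.Contains(s))      // walks up to the whole list
            return false;
        _items.Add(s);
        return true;
    }

    // Proposed approach: key a Dictionary by the string, O(1) expected per lookup.
    private readonly Dictionary<string, bool> _index = new Dictionary<string, bool>();

    public bool AddIfUniqueViaDictionary(string s)
    {
        if (_index.ContainsKey(s))
            return false;
        _index[s] = true;
        return true;
    }
}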

EDIT

Some basic benchmark results. I created an abstract class that exposes 2 methods: Fill and Scan. Fill just fills the collection with n items (I used 50,000). Scan scans the list m times (I used 5000) to see if a given value is present. Then I built an implementation of that class for List, and another for HashSet.

The strings used were uniformly 11 characters in length, and randomly generated via a method in the abstract class.
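
For reference, here is a rough sketch of what such a harness could look like; the class names and exact output format are approximations of mine, not the original test code:

using System;
using System.Collections.Generic;
using System.Diagnostics;

public abstract class CollectionTester
{
    private static readonly Random _rng = new Random();

    // Generate a random 11-character string, as described above.
    protected static string RandomString()
    {
        const string alphabet = "abcdefghijklmnopqrstuvwxyz0123456789";
        var chars = new char[11];
        for (int i = 0; i < chars.Length; i++)
            chars[i] = alphabet[_rng.Next(alphabet.Length)];
        return new string(chars);
    }

    public abstract void Fill(int count);       // add count random strings to the collection
    public abstract void Scan(int iterations);  // probe the collection `iterations` times

    public void Run(int fillCount, int scanCount)
    {
        Console.WriteLine("Hello from " + GetType().Name);
        var sw = Stopwatch.StartNew();
        Fill(fillCount);
        Console.WriteLine("Time to fill: " + sw.Elapsed);
        sw.Restart();
        Scan(scanCount);
        Console.WriteLine("Time to scan: " + sw.Elapsed);
    }
}

public class ListTester : CollectionTester
{
    private readonly List<string> _list = new List<string>();
    public override void Fill(int count) { for (int i = 0; i < count; i++) _list.Add(RandomString()); }
    public override void Scan(int iterations) { for (int i = 0; i < iterations; i++) _list.Contains(RandomString()); }  // O(n) per probe
}

public class HashSetTester : CollectionTester
{
    private readonly HashSet<string> _set = new HashSet<string>();
    public override void Fill(int count) { for (int i = 0; i < count; i++) _set.Add(RandomString()); }
    public override void Scan(int iterations) { for (int i = 0; i < iterations; i++) _set.Contains(RandomString()); }   // O(1) expected per probe
}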

A very basic micro-benchmark.

Hello from Cheeso.Tests.ListTester
filling 50000 items...
scanning 5000 items...
Time to fill: 00:00:00.4428266
Time to scan: 00:00:13.0291180

Hello from Cheeso.Tests.HashSetTester
filling 50000 items...
scanning 5000 items...
Time to fill: 00:00:00.3797751
Time to scan: 00:00:00.4364431

So, for strings of that length, HashSet is roughly 30x faster than List when scanning for uniqueness. Also, for this size of collection, HashSet has essentially no penalty over List when adding items to the collection.

The results are interesting but not strictly valid. To get valid results, I'd need warmup intervals, multiple trials, and random selection of the implementation under test. But I feel confident that doing so would move the bar only slightly.

Thanks everyone.

EDIT2

After adding randomization and multiple trials, HashSet consistently outperforms List in this case, by about 20x.

These results don't necessarily hold for strings of variable length, more complex objects, or different collection sizes.


Solution

You should use the HashSet<T> class, which is specifically designed for what you're doing.
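
For example, HashSet<T>.Add both checks and inserts in one call, returning false when the string is already present, so no separate scan is needed; a minimal sketch:

using System;
using System.Collections.Generic;

class Demo
{
    static void Main()
    {
        var seen = new HashSet<string>();

        // Add returns true only when the item was not already present.
        Console.WriteLine(seen.Add("example-string"));   // True
        Console.WriteLine(seen.Add("example-string"));   // False: duplicate rejected
    }
}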

OTHER TIPS

Use HashSet<string> instead of List<string>; it should scale very well.

From my tests, HashSet<string> takes no time compared to List<string> :)

Possibly off-topic, but if you want to scale very large unique sets of strings (millions+) in a language-independent way, you might check out Bloom Filters.
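
For illustration, a toy Bloom filter in C# (the class and the hash mixing below are stand-ins of my own, not a library API; a real implementation would also size the bit array and hash count from the expected item count and target false-positive rate):

using System.Collections;
using System.Collections.Generic;
using System.Text;

// Toy Bloom filter: may report false positives, never false negatives.
public class BloomFilter
{
    private readonly BitArray _bits;
    private readonly int _hashCount;

    public BloomFilter(int bitCount, int hashCount)
    {
        _bits = new BitArray(bitCount);
        _hashCount = hashCount;
    }

    public void Add(string item)
    {
        foreach (int i in Indexes(item)) _bits[i] = true;
    }

    public bool MightContain(string item)
    {
        foreach (int i in Indexes(item))
            if (!_bits[i]) return false;
        return true;
    }

    // Derive several bit positions from two base hashes (double hashing).
    private IEnumerable<int> Indexes(string item)
    {
        uint h1 = Fnv1a(item, 2166136261u);
        uint h2 = Fnv1a(item, 2166136261u ^ 0x5bd1e995u);
        for (int i = 0; i < _hashCount; i++)
            yield return (int)((h1 + (uint)i * h2) % (uint)_bits.Count);
    }

    private static uint Fnv1a(string s, uint seed)
    {
        uint hash = seed;
        foreach (byte b in Encoding.UTF8.GetBytes(s))
        {
            hash ^= b;
            hash *= 16777619u;
        }
        return hash;
    }
}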

Does the Contains(T) function not work for you?

I have read that Dictionary<> is implemented as an associative array. In some languages (not necessarily anything related to .NET), string keys are stored in a tree structure that branches at each node based upon the character at that position in the key. Please see http://en.wikipedia.org/wiki/Associative_arrays.

A similar data structure was devised by Aho and Corasick in 1973 (I think). If you store 50,000 strings in such a structure, it matters little how many strings you are storing; what matters more is the length of the strings. If they are all about the same length, you will likely never see a slow-down in lookups, because the search algorithm is linear in run-time with respect to the length of the string you are searching for. Even for a red-black tree or AVL tree, the search run-time depends more upon the length of the string you are searching for than upon the number of elements in the index. However, if you choose to implement your index keys with a hash function, you now incur the cost of hashing the string (O(m), where m is the string length) plus the lookup in the hash table, which is O(1) on average with a good hash function.
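
To illustrate the tree idea, here is a bare-bones trie sketch of my own (not Aho-Corasick itself, which additionally builds failure links for multi-pattern matching); lookup cost depends on the key's length, not on how many keys are stored:

using System.Collections.Generic;

public class Trie
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public bool IsTerminal;   // a stored string ends at this node
    }

    private readonly Node _root = new Node();

    public void Add(string s)
    {
        Node current = _root;
        foreach (char c in s)
        {
            if (!current.Children.TryGetValue(c, out Node next))
            {
                next = new Node();
                current.Children[c] = next;
            }
            current = next;
        }
        current.IsTerminal = true;
    }

    // O(m) where m is the length of s, independent of the number of stored strings.
    public bool Contains(string s)
    {
        Node current = _root;
        foreach (char c in s)
        {
            if (!current.Children.TryGetValue(c, out current))
                return false;
        }
        return current.IsTerminal;
    }
}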

edit: I'm not a .NET guru. Other more experienced people suggest another structure. I would take their word over mine.

edit2: your analysis of the uniqueness comparison is a little off. If you use a hashing structure or dictionary, it will not be an O(n²) operation, for the reasons I gave above. If you continue to use a list, then you are correct that it is O(n² · m), where m is the maximum string length in your set, because you must examine each element in the list each time.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow