map or hashmap for efficient lookup of generic strings?

https://stackoverflow.com/questions/11958065

26-06-2021
|

Вопрос

I need to be able to store and lookup generic strings. I don't know much about the content of the strings, a little more then 2/3 are human language words with the rest being something closer to a UUID or number/letter combo. I know that any particular grouping will be constant (ie if it has some human words it will be all human words, if it has some UUID all the contents will be UUIDs etc).

I need to decide if I should place this data in a map or a hashmap to get the best average lookup rate. I'm inclined to say map with the O(log n) runtime because I don't believe I can make a proper efficient hash for strings when I know so little about their input format. Any thoughts as to which would be better?

EDIT: I forgot one key aspect. I don't know the length of the strings and so am concerned memory usage may grow too lage for long strings. If I used the hash method I would do something where after X characters the hash doesn't hash on a per-character basis to avoid the memory consumption being too huge.

What I would really like is a hash map implementation that keeps multuple values in the 'bucket' sorted in an ordered manaer so it can offer a (log N) search of the buckets; but I don't think that exists in stardrd C++ and it's not worth writeing from scratch.

pps. the data is near-static. which I'll occasionally have to add to the list it's rare and I'm willing to accept a slow write time. I only care about the lookup time.

Решение

It is difficult to make a single recommendation. It depends on several tradeoffs (type of iteration, memory vs lookup). Throughout I assume that you can use a C++11 compiler (or the equivalent Boost or TR1 libraries).

If the insertion/lookup times are the most important to you, I would definitely use std::unordered_set (see reference) with std::hash<std::string> (see reference). Both insertion and lookup are O(1) on average (amortized constant). If

Note that the unordered hash containers do not allow you to do iteration in sorted order. So if you want sorted iteration, then you can use the ordered container std::set<std::string>, but the price you pay is O(log N) lookup/insertion.

Memory constraints are more difficult to analyze. First, the ordered containers std::set and std::map need roughly 3 words per element overhead to maintain a tree structure that allows the ordered iteration. The unordered hash containers, however, have some spare capacity as hash containers operate very poorly on a full load factor.

#include <iostream>
#include <functional>
#include <string>
#include <unordered_set> // or <set> for ordered lookup

int main()
{
    // or std::set<std::string> for ordered lookup
    std::unordered_set<std::string> dictionary; 

    std::string str = "Meet the new boss...";
    dictionary.insert(str);
    auto it = dictionary.find(str);

    std::cout << *it << '\n';
}

Output on Ideone. If you also want to store Value alongside the std::string, then you can use a std::unordered_map<std::string, Value>, or std::map<std::string, Value> with the same hash function.

Conclusion: it is best to measure what works best for your application, depending on the tradeoffs indicated above.

Другие советы

Apart from std::set, std::map, std::unordered_set and std::unordered_map - I would also consider studying Tries to see if they would be a better fit:

http://en.wikipedia.org/wiki/Trie

Cedar, HAT-Trie, and JudyArray is quite awesome, you can find the benchmark here.

benchmark result

You might want to have a look at the benchmark: http://www.dotnetperls.com/sorteddictionary It appears in real application inspite of collisions Dictionary is better than SortedDictionary.

Лицензировано под: CC-BY-SA с атрибуция

Не связан с StackOverflow