Question

I'd like to improve the performance of the box-counting method I use in fractal analysis.

About the task

I have a stream of ints (about n=2^24 long) and I have to calculate how many different values are in the stream. There is no upper bound, and negative values are allowed (but the number of negative values is probably less than sqrt(n)). There is a small correlation in the stream, i.e. each element is likely to be equal to or not too far from the previous one. In many cases I have a lot of equal values across the whole range.

Methods I've already tried

vector, sort, unique

My first implementation was to put all the elements into a vector, then apply std::sort followed by std::unique.
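A minimal sketch of that baseline (the function name is mine, just for illustration):

#include <algorithm>
#include <cstddef>
#include <vector>

// Baseline sketch: sort a copy of the data, then count the distinct values.
std::size_t count_unique_sorted(std::vector<int> data) {  // by value: keeps the caller's data intact
    std::sort(data.begin(), data.end());
    return std::unique(data.begin(), data.end()) - data.begin();
}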

The complexity of this method is O(n*log(n)), and I don't think any other algorithm can scale better in general. But I'm sure code must exist that is faster than this while keeping the same scaling properties, i.e. faster only by a constant factor. The reasons are:

  1. I have a lot of equal values stored in the vector, so sorting is not as effective and the vector is unduly large
  2. this method doesn't use the information that the current element and the previous one are close to each other
  3. I don't need to know what the unique values are, I only need the number of different elements

set, insert, size

To eliminate the first inefficiency, I put every element into a set with set::insert, and at the end I counted the number of elements with set::size.
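In code, roughly (again a sketch, with an illustrative name):

#include <cstddef>
#include <set>
#include <vector>

// Sketch of this variant: the tree rejects duplicates by itself.
std::size_t count_unique_set(const std::vector<int>& data) {
    std::set<int> s;
    for (int v : data)
        s.insert(v);
    return s.size();
}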

My expectation was that this code would be faster, because only unique values are stored in the set, so new elements don't have to be compared with lots of equal values. But unfortunately this method was 1.5 times slower than the previous one.

set, emplace_hint, size

To eliminate the second inefficiency as well, I not only put every element into a set, but did so with set::emplace_hint, each time hinting to place the new element next to the previous one. At the end, I asked for the size of the set with set::size.
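A sketch of that, assuming the stream is already in a vector:

#include <cstddef>
#include <set>
#include <vector>

// Sketch: reuse the iterator returned by the previous insertion as the hint,
// since the next value is likely to belong right next to it.
std::size_t count_unique_hinted(const std::vector<int>& data) {
    std::set<int> s;
    auto hint = s.begin();
    for (int v : data)
        hint = s.emplace_hint(hint, v);
    return s.size();
}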

My expectation was that this code would be faster than the previous one, because I can guess the position of the new element and that's better than nothing. But unfortunately this method was 5 times slower than the previous one.

The question

Can you suggest an effective method to calculate the number of different elements (ints) in a stream? Can the code be optimized given that

  1. there's a measurable correlation between consecutive numbers
  2. some numbers appear repeatedly

The target architecture is a modern x86 or x86-64 PC processor (with SSE, SSE2) and only single-threaded code is suitable. I'd prefer to avoid Boost and stick to C++11.

The solution

First, thanks for the many suggestions, patience and understanding, and I'm sorry that I cannot test all of the methods; I'm also sure that effectiveness depends on the details of the stream of ints, which I haven't provided. However, I share the results I got with the VS2013 compiler. (The code was tested under gcc 4.7 but not measured.) This topic is worth a lot more time to investigate, but I've got a solution that fits my needs.

[Time statistics for different methods]

About the methods:

  • vector of bool: the BitVector solution from Dieter Lücking
  • binary lookup: the method suggested by Tony D
  • unordered set: simply putting all elements into an std::unordered_set, then asking for the number of its elements, as suggested by Ixanezis
  • vector insert sorted: using Dieter Lücking's Sorted Vector approach
  • set insert: the method I described in the question
  • radix sort: Ixanezis' suggestion, sorting the vector with a popular radix sort implementation
  • set emplace hint: using set::emplace_hint as described in the question

Solution 2

Just comparing different approaches (not considering radix sort):

#include <algorithm>
#include <chrono>
#include <cstdlib>   // std::rand, RAND_MAX
#include <deque>
#include <iostream>
#include <set>
#include <unordered_set>
#include <vector>

template <template <typename ...> class Container, typename T, typename ... A, typename Comp>
inline bool insert_sorted(Container<T, A...>& container, T const& e, Comp const& comp) {
    auto const it = std::lower_bound(container.begin(), container.end(), e, comp);
    if (it != container.end() and not comp(e, *it)) { return false; }
    container.insert(it, e);
    return true;
}

template <template <typename ...> class Container, typename T, typename ... A>
inline bool insert_sorted(Container<T, A...>& container, T const& e) {
    return insert_sorted(container, e, std::less<T>{});
}

int main() {
    using namespace std::chrono;
    typedef std::vector<int> data_type;

    const unsigned Size = unsigned(1) << 24;
    const unsigned Limit = 1000;
    data_type data;
    data.reserve(Size);
    for(unsigned i = 0; i < Size; ++i) {
        // Random value in [0, Limit), followed by a run of repeats with
        // probability 0.25 per step, mimicking the correlated stream.
        int value = double(Limit) * std::rand() / RAND_MAX - 0.1;
        data.push_back(value);
        while(i < Size - 1 && std::rand() < RAND_MAX * 0.25) {
            data.push_back(value);
            ++i;
        }
    }

    std::cout
        << "Data\n"
        << "====\n"
        << "                Size of data: " << Size << '\n';

    std::cout
        << "Unorderd Set\n"
        << "============\n";
    {
        auto start = system_clock::now();

        typedef std::unordered_set<int> set_type;
        set_type set;
        unsigned i = 0;
        for( ; i < Size - 1; ++i) {
            // Skip a run of equal values (bounds check keeps data[i+1] valid)
            while(i < Size - 1 && data[i] == data[i+1]) ++i;
            set.insert(data[i]);
        }
        if(i < Size)
            set.insert(data[i]);

        auto stop = system_clock::now();

        std::cout
            << "Number of different elements: "
            << set.size() << '\n';
        std::cout
            << "                      Timing: "
            << duration_cast<duration<double>>(stop - start).count()
            << '\n';
    }

    std::cout
        << "Set\n"
        << "===\n";
    {
        auto start = system_clock::now();

        typedef std::set<int> set_type;
        set_type set;
        unsigned i = 0;
        for( ; i < Size - 1; ++i) {
            // Skip a run of equal values (bounds check keeps data[i+1] valid)
            while(i < Size - 1 && data[i] == data[i+1]) ++i;
            set.insert(data[i]);
        }
        if(i < Size)
            set.insert(data[i]);

        auto stop = system_clock::now();

        std::cout
            << "Number of different elements: "
            << set.size() << '\n';
        std::cout
            << "                      Timing: "
            << duration_cast<duration<double>>(stop - start).count()
            << '\n';
    }

    std::cout
        << "Sorted Vector\n"
        << "=============\n";
    {
        auto start = system_clock::now();

        typedef std::vector<int> set_type;
        set_type set;
        unsigned i = 0;
        for( ; i < Size - 1; ++i) {
            // Skip a run of equal values (bounds check keeps data[i+1] valid)
            while(i < Size - 1 && data[i] == data[i+1]) ++i;
            insert_sorted(set, data[i]);
        }
        if(i < Size)
            insert_sorted(set, data[i]);

        auto stop = system_clock::now();

        std::cout
            << "Number of different elements: "
            << set.size() << '\n';
        std::cout
            << "                      Timing: "
            << duration_cast<duration<double>>(stop - start).count()
            << '\n';
    }

    std::cout
        << "BitVector\n"
        << "=========\n";
    {
        auto start = system_clock::now();

        typedef std::vector<bool> set_type;
        set_type set(Limit);
        unsigned i = 0;
        unsigned elements = 0;
        for( ; i < Size; ++i) {
            if( ! set[data[i]]) {
                set[data[i]] = true;
                ++elements;
            }
        }

        auto stop = system_clock::now();

        std::cout
            << "Number of different elements: "
            << elements << '\n';
        std::cout
            << "                      Timing: "
            << duration_cast<duration<double>>(stop - start).count()
            << '\n';
    }

    std::cout
        << "Sorted Data\n"
        << "===========\n";
    {
        auto start = system_clock::now();

        std::sort(data.begin(), data.end());
        auto last = std::unique(data.begin(), data.end());

        auto stop = system_clock::now();

        std::cout
            << "Number of different elements: "
            << last - data.begin() << '\n';
        std::cout
            << "                      Timing: "
            << duration_cast<duration<double>>(stop - start).count()
            << '\n';
    }

    return 0;
}

Compiled with g++ -std=c++11 -O3 gives:

Data
====
                Size of data: 16777216
Unordered Set
============
Number of different elements: 1000
                      Timing: 0.269752
Set
===
Number of different elements: 1000
                      Timing: 1.23478
Sorted Vector
=============
Number of different elements: 1000
                      Timing: 1.13783
BitVector
=========
Number of different elements: 1000
                      Timing: 0.038408
Sorted Data
===========
Number of different elements: 1000
                      Timing: 1.32827

Hence, if memory is no issue or the range of numbers is limited, setting a bit is the best choice. Otherwise, unordered_set is a good one.

Other tips

Since you only deal with a limited range of integer numbers, the radix sort algorithm can be employed effectively here, removing the log(N) factor from the complexity. You can pick a really fast implementation somewhere on the internet. Some of them require SSE support, others are multithreaded or even coded to run on a GPU.
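For illustration, a minimal scalar LSD radix sort for 32-bit ints (no SSE tricks), plus the adjacent-compare pass that then counts the distinct values; the function names are mine:

#include <cstddef>
#include <cstdint>
#include <vector>

// LSD radix sort sketch: one counting-sort pass per byte, four passes total.
void radix_sort(std::vector<int>& v) {
    std::vector<int> buf(v.size());
    for (int shift = 0; shift < 32; shift += 8) {
        std::size_t count[257] = {};
        for (int x : v) {
            // Flip the sign bit so negative values order correctly as unsigned.
            std::uint32_t key = static_cast<std::uint32_t>(x) ^ 0x80000000u;
            ++count[((key >> shift) & 0xFF) + 1];
        }
        for (int i = 0; i < 256; ++i) count[i + 1] += count[i];  // prefix sums
        for (int x : v) {
            std::uint32_t key = static_cast<std::uint32_t>(x) ^ 0x80000000u;
            buf[count[(key >> shift) & 0xFF]++] = x;
        }
        v.swap(buf);  // even number of passes, so the result ends up in v
    }
}

// Distinct count on the sorted data: a single adjacent-compare pass.
std::size_t count_distinct_sorted(const std::vector<int>& v) {
    std::size_t n = v.empty() ? 0 : 1;
    for (std::size_t i = 1; i < v.size(); ++i)
        if (v[i] != v[i - 1]) ++n;
    return n;
}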

If you can use boost::unordered_set or C++11 std::unordered_set, your second approach can easily be modified to use it, also resulting in a linear-complexity algorithm. However, if you have at least a few million numbers in the stream, I believe the first method would be faster.

Assuming 32-bit ints, the worst-case scenario is that you need 2^32 bits to track the seen/not-seen status of each number you might see. That's 4 billion bits, or 512 million bytes, i.e. 512 megabytes, which is not prohibitive for a modern desktop computer. You basically index to byte [n/8] in the array, then bitwise-AND or -OR with 1 << (n % 8) to test or set the number's seen state. Because you say numbers close together in the input tend to be close in value, cache utilisation should be pretty good. You could check for numbers just seen and bypass the bit-array processing.

If you happen to know that you have fewer than 2^32 distinct numbers to track in the input, you should of course reduce the size of the bit set accordingly. (Just read your comment "Negative numbers are allowed but it's very rare (possibility is less than 1/n)." - in that case, you could use a set for the negative numbers and half the memory for the positives.)

(If you're worried about a final iteration over many memory pages that may have no bits set at all, you could create an additional "dirty page" index with one bit per page to guide that iteration. But given the quantity of input, if the input is spread wildly across int's numeric range, that may be insignificant or even counter-productive.)

EDIT: further explanation, as requested in the comments. First, the implementation:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <limits>

template <size_t N>
class Tracker
{
  public:
    Tracker() { std::fill_n(&data_[0], words, 0); }
    void set(int n) { data_[n / 32] |= (1u << (n % 32)); }
    bool test(int n) const { return (data_[n / 32] & (1u << (n % 32))) != 0; }

    template <typename Visitor>
    void visit(Visitor&& visitor) const
    {
        for (size_t word = 0, word_n = 0; word < words; ++word, word_n += 32)
             if (data_[word])
                  for (uint32_t n = 0, value = 1; n < 32; ++n, value *= 2)
                      if (data_[word] & value)
                          visitor(word_n + n);
    }
  private:
    static const int words = N / 32 + (N % 32 ? 1 : 0);
    uint32_t data_[words];
};

Usage:

// At this size the Tracker holds ~256 MB of bits, so avoid placing it on the stack.
static Tracker<size_t(std::numeric_limits<int>::max()) + 1> my_tracker;
int n;
while (std::cin >> n)
    my_tracker.set(n);
my_tracker.visit([](unsigned n) { std::cout << n << '\n'; });

(not tested... probably a few little issues)

Can you explain your answer in more detail, please?

All this does is create what's conceptually just a bool have_seen[] array that can be directly indexed by any integer you're interested in: you just go through the input, setting the boolean elements at the indices you see to true. If you set something to true two or more times, who cares? Purely to save memory and to speed up searching for set bits (and e.g. fill/clear), it manually packs the bool values into bits of a larger integral data type.

I think I can deal with negative values, because I can calculate the largest and smallest values at a total cost of O(n).

Well, maybe, but doing two passes might be faster or slower. With the approach I've documented, you don't need to go over the data twice... you can prepare the answer during the first iteration. Of course, if an initial iteration is fast (e.g. from SSD media) and you're tight enough on memory that you want to size the analysis to the actual data range, then go for it.

It also makes it possible to shrink the width of the ints to the correct range, so that more than half of the pages will be non-empty.

Not sure what you mean there.

The standard data structure for this task is a hash set, a.k.a. std::unordered_set in the STL (by the way, Google's dense_hash_set usually performs slightly better).

You do not need the unique values to be sorted, so std::set is unnecessarily slow for your use case.

Like others suggest, you can also use a bit vector if your universe (the set of possible values) is not too big. If you have negative values, you can just cast them to unsigned and treat them as really big numbers.
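A sketch of that cast trick over a full-range bit vector (this assumes a 64-bit build, since the table is 2^32 bits, i.e. the 512 MB discussed above; the function name is mine):

#include <cstddef>
#include <cstdint>
#include <vector>

// Reinterpret the int's bit pattern as unsigned, so negatives simply become
// large indices in one contiguous 2^32-bit universe.
std::size_t count_distinct(const std::vector<int>& data) {
    std::vector<bool> seen(std::size_t(1) << 32);  // 512 MB of bits
    std::size_t distinct = 0;
    for (int v : data) {
        std::uint32_t u = static_cast<std::uint32_t>(v);  // e.g. -1 -> 4294967295
        if (!seen[u]) { seen[u] = true; ++distinct; }
    }
    return distinct;
}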

Would it help to just compare the current element to the previous one before passing it to the counting method, whatever that is?

Or to keep a small, fast cache of, say, the last 10 elements to discard short-range duplicates?

Or to do the counting in batches (count a sequence of 100 with temporary counters, then merge with the previous counts)?
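Along those lines, a hypothetical pre-filter (the names and the K=10 choice are mine) that drops short-range duplicates before they reach whatever counting structure sits behind it:

#include <array>
#include <cstddef>

// Pre-filter sketch: only values not seen among the last K pushes are
// forwarded to the counting structure `sink` (anything with insert()).
template <std::size_t K, typename Sink>
class RecentFilter {
public:
    explicit RecentFilter(Sink& sink) : sink_(sink) {}
    void push(int v) {
        for (std::size_t i = 0; i < fill_; ++i)
            if (recent_[i] == v) return;   // short-range duplicate: skip
        recent_[pos_] = v;                 // remember it in the ring buffer
        pos_ = (pos_ + 1) % K;
        if (fill_ < K) ++fill_;
        sink_.insert(v);                   // only "new-ish" values go through
    }
private:
    Sink& sink_;
    std::array<int, K> recent_{};
    std::size_t pos_ = 0, fill_ = 0;
};

Since a value is only skipped when an equal value was already forwarded, wrapping e.g. an std::unordered_set<int> this way keeps the count exact.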

I approve of trying to use STL containers (set, unordered_set, ...), but unfortunately you pay a penalty for them: their memory-stability and lightweight-iterator requirements mean they are implemented as node-based containers, with a huge (relatively speaking) overhead for each and every element.

I would propose three approaches instead:

  1. Stick to vector (only efficient with a low proportion of unique items)
  2. Implement Robin-Hood hashing
  3. Use a probabilistic approach

Sorted Vector

For the vector approach: nothing prevents you from keeping the vector sorted as you insert, and thus avoiding the insertion of duplicate elements. An example here:

#include <iostream>

#include <algorithm>
#include <vector>

template <typename T, typename Comp>
void insert_sorted(std::vector<T>& vec, T const& e, Comp const& comp) {
    auto const it = std::lower_bound(vec.begin(), vec.end(), e, comp);

    if (it != vec.end() and not comp(e, *it)) { return; }

    vec.insert(it, e);
}

template <typename T>
void insert_sorted(std::vector<T>& vec, T const& e) {
    insert_sorted(vec, e, std::less<T>{});
}

int main() {
    int const array[] = { 4, 3, 6, 2, 3, 6, 8, 4 };

    std::vector<int> vec;
    for (int a: array) {
        insert_sorted(vec, a);
    }

    std::cout << vec.size() << ":";
    for (int a: vec) { std::cout << " " << a; }
    std::cout << "\n";

    return 0;
}

Displays: 5: 2 3 4 6 8.

This is still O(n log n), obviously; however, it requires less memory:

  • with less memory, more of the vector stays in cache
  • the previous element being close to its successor is exploited: the binary search in lower_bound walks through nearly the same cache lines

It should already be a great improvement.

Note: it has been pointed out that inserting in the middle of a vector is not efficient. Of course, since it involves shifting half the already present elements (on average). Still, benchmarks suggest it can beat the current vector solution when the proportion of unique elements is small (0.1% in my benchmark).


Robin Hood Hashing

More involved, but Robin Hood hashing has very good characteristics, and thus performance. Most notably, it's implemented on top of a single dynamic array (like a vector), so it exhibits good memory locality.

Rust switched to Robin Hood hashing for its default hash table implementation and is very happy with it.

Note: from quick benchmarks, even unordered_set beats the pants off store & sort, and a naive open-addressed hash table is 25% faster still.
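For reference, a sketch of such a naive open-addressed (linear probing) hash set for ints; the names and constants are mine, and this is illustrative rather than production code:

#include <climits>
#include <cstddef>
#include <cstdint>
#include <vector>

class IntSet {
public:
    IntSet() : mask_(kInitialSlots - 1), slots_(kInitialSlots, int(kEmpty)),
               has_sentinel_(false), size_(0) {}

    void insert(int key) {
        if (key == kEmpty) {                 // the slot sentinel is tracked via a flag
            if (!has_sentinel_) { has_sentinel_ = true; ++size_; }
            return;
        }
        if ((size_ + 1) * 10 > slots_.size() * 7)   // keep load factor under ~0.7
            grow();
        std::size_t i = hash(key) & mask_;
        while (slots_[i] != kEmpty) {
            if (slots_[i] == key) return;    // already counted
            i = (i + 1) & mask_;             // linear probing
        }
        slots_[i] = key;
        ++size_;
    }

    std::size_t size() const { return size_; }

private:
    enum : int { kEmpty = INT_MIN };                    // marks an unused slot
    enum : std::size_t { kInitialSlots = 1024 };        // must stay a power of two

    static std::size_t hash(int key) {                  // cheap multiplicative hash
        return static_cast<std::uint32_t>(key) * 2654435761u;
    }

    void grow() {                                       // double and re-place live keys
        std::vector<int> old;
        old.swap(slots_);
        slots_.assign(old.size() * 2, int(kEmpty));
        mask_ = slots_.size() - 1;
        for (int v : old)
            if (v != kEmpty) {
                std::size_t i = hash(v) & mask_;
                while (slots_[i] != kEmpty) i = (i + 1) & mask_;
                slots_[i] = v;
            }
    }

    std::size_t mask_;
    std::vector<int> slots_;
    bool has_sentinel_;
    std::size_t size_;
};

Counting is then just: IntSet s; insert every stream element; read s.size().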


Probabilistic Approach

For very large problems, a very well-known algorithm is HyperLogLog. It has been implemented in Redis recently.

It has a very good ratio of memory used vs. error rate, and it is relatively simple to implement (especially following Antirez' code).
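A bare-bones sketch of the idea, with no small- or large-range bias corrections (for real use, follow the paper or the Redis code; the hash mixer and parameter choices here are mine):

#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

class HyperLogLog {
public:
    explicit HyperLogLog(unsigned p = 14)          // 2^14 registers = 16 KB, ~0.8% error
        : p_(p), regs_(std::size_t(1) << p, 0) {}

    void add(int value) {
        std::uint64_t h = mix(static_cast<std::uint32_t>(value));
        std::size_t idx = h >> (64 - p_);          // first p bits pick a register
        std::uint64_t rest = (h << p_) | 1;        // low bit set so clz is bounded
        std::uint8_t rank = static_cast<std::uint8_t>(leading_zeros(rest) + 1);
        if (rank > regs_[idx]) regs_[idx] = rank;  // keep the max rank seen
    }

    double estimate() const {                      // raw HLL estimate, no corrections
        double m = double(regs_.size());
        double sum = 0.0;
        for (std::uint8_t r : regs_) sum += std::ldexp(1.0, -int(r));
        double alpha = 0.7213 / (1.0 + 1.079 / m); // alpha_m, valid for m >= 128
        return alpha * m * m / sum;
    }

private:
    static std::uint64_t mix(std::uint64_t x) {    // splitmix64-style finalizer
        x += 0x9e3779b97f4a7c15ull;
        x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ull;
        x = (x ^ (x >> 27)) * 0x94d049bb133111ebull;
        return x ^ (x >> 31);
    }
    static int leading_zeros(std::uint64_t x) {    // portable clz for this sketch
        int n = 0;
        for (std::uint64_t bit = 1ull << 63; bit && !(x & bit); bit >>= 1) ++n;
        return n;
    }
    unsigned p_;
    std::vector<std::uint8_t> regs_;
};

With 2^14 registers this uses 16 KB and gives roughly ±0.8% relative error, which may well be acceptable for box counting.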


Throwing more hardware at the problem

Note that this is an embarrassingly parallel problem; you could thus easily have multiple threads, each:

  • picking a bundle of IDs from the stream (say, 2^10 at once)
  • merging them into a thread-local set of unique IDs (whatever its implementation)
  • looping until the stream is empty
  • and finally merging their results together

And you could get a speed-up close to the number of threads (with some overhead, obviously).

Note: not too coincidentally, both approaches above can easily be adapted to this scheme, and both support efficient merges.
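For instance, merging two of the HyperLogLog sketches above is just a register-wise max (and merging two bit vectors is a bitwise OR), which is what keeps the final combine step cheap. A sketch:

#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical merge of two HyperLogLog register arrays of equal size:
// element-wise max, after which `into` estimates the union's cardinality.
void merge_hll(std::vector<std::uint8_t>& into, const std::vector<std::uint8_t>& from) {
    for (std::size_t i = 0; i < into.size(); ++i)
        if (from[i] > into[i]) into[i] = from[i];
}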

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow