Question

I'd like to improve the performance of the box-counting method I use in fractal analysis.

About the task

I have a stream of ints (about n=2^24 long) and I have to calculate how many different values are in the stream. There is no upper bound, and negative values are allowed (but the number of negative values is probably less than sqrt(n)). There is a small correlation in the stream, i.e. each element is likely to be equal to or not too far from the previous one. In many cases I have a lot of equal values across the whole range.

Methods I've already tried

vector, sort, unique

My first implementation was to put all the elements into a vector, then apply std::sort followed by std::unique.
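A minimal sketch of that baseline (the function name is mine, just for illustration):

#include <algorithm>
#include <cstddef>
#include <vector>

// Baseline sketch: sort a copy of the data, then count the distinct values.
std::size_t count_unique_sorted(std::vector<int> data) {  // by value: keeps the caller's data intact
    std::sort(data.begin(), data.end());
    return std::unique(data.begin(), data.end()) - data.begin();
}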

The complexity of this method is O(n*log(n)), and I don't think any other algorithm can scale better in general. But I'm sure code must exist that is faster than this while keeping the same scaling properties, i.e. faster only by a constant factor. The reasons are:

  1. I have a lot of equal values stored in the vector, so sorting is not as effective and the vector is unduly large
  2. this method doesn't use the information that the current element and the previous one are close to each other
  3. I don't need to know what the unique values are, I only need the number of different elements

set, insert, size

To eliminate the first inefficiency, I put every element into a set with set::insert, and at the end I counted the number of elements with set::size.
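In code, roughly (again a sketch, with an illustrative name):

#include <cstddef>
#include <set>
#include <vector>

// Sketch of this variant: the tree rejects duplicates by itself.
std::size_t count_unique_set(const std::vector<int>& data) {
    std::set<int> s;
    for (int v : data)
        s.insert(v);
    return s.size();
}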

My expectation was that this code would be faster, because only unique values are stored in the set, so new elements don't have to be compared with lots of equal values. But unfortunately this method was 1.5 times slower than the previous one.

set, emplace_hint, size

To eliminate the second inefficiency as well, I not only put every element into a set, but did so with set::emplace_hint, each time hinting to place the new element next to the previous one. At the end, I asked for the size of the set with set::size.
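A sketch of that, assuming the stream is already in a vector:

#include <cstddef>
#include <set>
#include <vector>

// Sketch: reuse the iterator returned by the previous insertion as the hint,
// since the next value is likely to belong right next to it.
std::size_t count_unique_hinted(const std::vector<int>& data) {
    std::set<int> s;
    auto hint = s.begin();
    for (int v : data)
        hint = s.emplace_hint(hint, v);
    return s.size();
}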

My expectation was that this code would be faster than the previous one, because I can guess the position of the new element and that's better than nothing. But unfortunately this method was 5 times slower than the previous one.

The question

Can you suggest an effective method to calculate the number of different elements (ints) in a stream? Can the code be optimized given that

  1. there's a measurable correlation between consecutive numbers
  2. some numbers appear repeatedly

The target architecture is a modern x86 or x86-64 PC processor (with SSE, SSE2) and only single-threaded code is suitable. I'd prefer to avoid Boost and stick to C++11.

The solution

First, thanks for the many suggestions, patience and understanding, and I'm sorry that I cannot test all of the methods; I'm also sure that effectiveness depends on the details of the stream of ints, which I haven't provided. However, I share the results I got with the VS2013 compiler. (The code was tested under gcc 4.7 but not measured.) This topic is worth a lot more time to investigate, but I've got a solution that fits my needs.

[Time statistics for different methods]

About the methods:

  • vector of bool: the BitVector solution from Dieter Lücking
  • binary lookup: the method suggested by Tony D
  • unordered set: simply putting all elements into an std::unordered_set, then asking for the number of its elements, as suggested by Ixanezis
  • vector insert sorted: using Dieter Lücking's Sorted Vector approach
  • set insert: the method I described in the question
  • radix sort: Ixanezis' suggestion, sorting the vector with a popular radix sort implementation
  • set emplace hint: using set::emplace_hint as described in the question

Solution 2

Just comparing different approaches (not considering radix sort):

#include <algorithm>
#include <chrono>
#include <cstdlib>   // std::rand, RAND_MAX
#include <deque>
#include <iostream>
#include <set>
#include <unordered_set>
#include <vector>

template <template <typename ...> class Container, typename T, typename ... A, typename Comp>
inline bool insert_sorted(Container<T, A...>& container, T const& e, Comp const& comp) {
    auto const it = std::lower_bound(container.begin(), container.end(), e, comp);
    if (it != container.end() and not comp(e, *it)) { return false; }
    container.insert(it, e);
    return true;
}

template <template <typename ...> class Container, typename T, typename ... A>
inline bool insert_sorted(Container<T, A...>& container, T const& e) {
    return insert_sorted(container, e, std::less<T>{});
}

int main() {
    using namespace std::chrono;
    typedef std::vector<int> data_type;

    const unsigned Size = unsigned(1) << 24;
    const unsigned Limit = 1000;
    data_type data;
    data.reserve(Size);
    for(unsigned i = 0; i < Size; ++i) {
        // Random value in [0, Limit), followed by a run of repeats with
        // probability 0.25 per step, mimicking the correlated stream.
        int value = double(Limit) * std::rand() / RAND_MAX - 0.1;
        data.push_back(value);
        while(i < Size - 1 && std::rand() < RAND_MAX * 0.25) {
            data.push_back(value);
            ++i;
        }
    }

    std::cout
        << "Data\n"
        << "====\n"
        << "                Size of data: " << Size << '\n';

    std::cout
        << "Unorderd Set\n"
        << "============\n";
    {
        auto start = system_clock::now();

        typedef std::unordered_set<int> set_type;
        set_type set;
        unsigned i = 0;
        for( ; i < Size - 1; ++i) {
            // Skip a run of equal values (bounds check keeps data[i+1] valid)
            while(i < Size - 1 && data[i] == data[i+1]) ++i;
            set.insert(data[i]);
        }
        if(i < Size)
            set.insert(data[i]);

        auto stop = system_clock::now();

        std::cout
            << "Number of different elements: "
            << set.size() << '\n';
        std::cout
            << "                      Timing: "
            << duration_cast<duration<double>>(stop - start).count()
            << '\n';
    }

    std::cout
        << "Set\n"
        << "===\n";
    {
        auto start = system_clock::now();

        typedef std::set<int> set_type;
        set_type set;
        unsigned i = 0;
        for( ; i < Size - 1; ++i) {
            // Skip a run of equal values (bounds check keeps data[i+1] valid)
            while(i < Size - 1 && data[i] == data[i+1]) ++i;
            set.insert(data[i]);
        }
        if(i < Size)
            set.insert(data[i]);

        auto stop = system_clock::now();

        std::cout
            << "Number of different elements: "
            << set.size() << '\n';
        std::cout
            << "                      Timing: "
            << duration_cast<duration<double>>(stop - start).count()
            << '\n';
    }

    std::cout
        << "Sorted Vector\n"
        << "=============\n";
    {
        auto start = system_clock::now();

        typedef std::vector<int> set_type;
        set_type set;
        unsigned i = 0;
        for( ; i < Size - 1; ++i) {
            // Skip a run of equal values (bounds check keeps data[i+1] valid)
            while(i < Size - 1 && data[i] == data[i+1]) ++i;
            insert_sorted(set, data[i]);
        }
        if(i < Size)
            insert_sorted(set, data[i]);

        auto stop = system_clock::now();

        std::cout
            << "Number of different elements: "
            << set.size() << '\n';
        std::cout
            << "                      Timing: "
            << duration_cast<duration<double>>(stop - start).count()
            << '\n';
    }

    std::cout
        << "BitVector\n"
        << "=========\n";
    {
        auto start = system_clock::now();

        typedef std::vector<bool> set_type;
        set_type set(Limit);
        unsigned i = 0;
        unsigned elements = 0;
        for( ; i < Size; ++i) {
            if( ! set[data[i]]) {
                set[data[i]] = true;
                ++elements;
            }
        }

        auto stop = system_clock::now();

        std::cout
            << "Number of different elements: "
            << elements << '\n';
        std::cout
            << "                      Timing: "
            << duration_cast<duration<double>>(stop - start).count()
            << '\n';
    }

    std::cout
        << "Sorted Data\n"
        << "===========\n";
    {
        auto start = system_clock::now();

        std::sort(data.begin(), data.end());
        auto last = std::unique(data.begin(), data.end());

        auto stop = system_clock::now();

        std::cout
            << "Number of different elements: "
            << last - data.begin() << '\n';
        std::cout
            << "                      Timing: "
            << duration_cast<duration<double>>(stop - start).count()
            << '\n';
    }

    return 0;
}

Compiled with g++ -std=c++11 -O3 gives:

Data
====
                Size of data: 16777216
Unordered Set
============
Number of different elements: 1000
                      Timing: 0.269752
Set
===
Number of different elements: 1000
                      Timing: 1.23478
Sorted Vector
=============
Number of different elements: 1000
                      Timing: 1.13783
BitVector
=========
Number of different elements: 1000
                      Timing: 0.038408
Sorted Data
===========
Number of different elements: 1000
                      Timing: 1.32827

Hence, if memory is no issue or the range of numbers is limited, setting a bit is the best choice. Otherwise, unordered_set is a good one.

Other tips

Since you only deal with a limited range of integer numbers, the radix sort algorithm can be employed effectively here, removing the log(N) factor from the complexity. You can pick a really fast implementation somewhere on the internet. Some of them require SSE support, others are multithreaded or even coded to run on a GPU.
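For illustration, a minimal scalar LSD radix sort for 32-bit ints (no SSE tricks), plus the adjacent-compare pass that then counts the distinct values; the function names are mine:

#include <cstddef>
#include <cstdint>
#include <vector>

// LSD radix sort sketch: one counting-sort pass per byte, four passes total.
void radix_sort(std::vector<int>& v) {
    std::vector<int> buf(v.size());
    for (int shift = 0; shift < 32; shift += 8) {
        std::size_t count[257] = {};
        for (int x : v) {
            // Flip the sign bit so negative values order correctly as unsigned.
            std::uint32_t key = static_cast<std::uint32_t>(x) ^ 0x80000000u;
            ++count[((key >> shift) & 0xFF) + 1];
        }
        for (int i = 0; i < 256; ++i) count[i + 1] += count[i];  // prefix sums
        for (int x : v) {
            std::uint32_t key = static_cast<std::uint32_t>(x) ^ 0x80000000u;
            buf[count[(key >> shift) & 0xFF]++] = x;
        }
        v.swap(buf);  // even number of passes, so the result ends up in v
    }
}

// Distinct count on the sorted data: a single adjacent-compare pass.
std::size_t count_distinct_sorted(const std::vector<int>& v) {
    std::size_t n = v.empty() ? 0 : 1;
    for (std::size_t i = 1; i < v.size(); ++i)
        if (v[i] != v[i - 1]) ++n;
    return n;
}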

If you can use boost::unordered_set or C++11 std::unordered_set, your second approach can easily be modified to use it, also resulting in a linear-complexity algorithm. However, if you have at least a few million numbers in the stream, I believe the first method would be faster.

Assuming 32-bit ints, the worst-case scenario is that you need 2^32 bits to track the seen/not-seen status of each number you might see. That's 4 billion bits, or 512 million bytes, i.e. 512 megabytes, which is not prohibitive for a modern desktop computer. You basically index to byte [n/8] in the array, then bitwise-AND or -OR with 1 << (n % 8) to test or set the number's seen state. Because you say numbers close together in the input tend to be close in value, cache utilisation should be pretty good. You could check for numbers just seen and bypass the bit-array processing.

If you happen to know that you have fewer than 2^32 distinct numbers to track in the input, you should of course reduce the size of the bit set accordingly. (Just read your comment "Negative numbers are allowed but it's very rare (possibility is less than 1/n)." - in that case, you could use a set for the negative numbers and half the memory for the positives.)

(If you're worried about a final iteration over many memory pages that may have no bits set at all, you could create an additional "dirty page" index with one bit per page to guide that iteration. But given the quantity of input, if the input is spread wildly across int's numeric range, that may be insignificant or even counter-productive.)

EDIT: further explanation, as requested in the comments. First, the implementation:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <limits>

template <size_t N>
class Tracker
{
  public:
    Tracker() { std::fill_n(&data_[0], words, 0); }
    void set(int n) { data_[n / 32] |= (1u << (n % 32)); }
    bool test(int n) const { return (data_[n / 32] & (1u << (n % 32))) != 0; }

    template <typename Visitor>
    void visit(Visitor&& visitor) const
    {
        for (size_t word = 0, word_n = 0; word < words; ++word, word_n += 32)
             if (data_[word])
                  for (uint32_t n = 0, value = 1; n < 32; ++n, value *= 2)
                      if (data_[word] & value)
                          visitor(word_n + n);
    }
  private:
    static const int words = N / 32 + (N % 32 ? 1 : 0);
    uint32_t data_[words];
};

Usage:

// At this size the Tracker holds ~256 MB of bits, so avoid placing it on the stack.
static Tracker<size_t(std::numeric_limits<int>::max()) + 1> my_tracker;
int n;
while (std::cin >> n)
    my_tracker.set(n);
my_tracker.visit([](unsigned n) { std::cout << n << '\n'; });

(not tested... probably a few little issues)

Can you explain your answer in more detail, please?

All this does is create what's conceptually just a bool have_seen[] array that can be directly indexed by any integer you're interested in: you just go through the input, setting the boolean elements at the indices you see to true. If you set something to true two or more times, who cares? Purely to save memory and to speed up searching for set bits (and e.g. fill/clear), it manually packs the bool values into bits of a larger integral data type.

I think I can deal with negative values, because I can calculate the largest and smallest values at a total cost of O(n).

Well, maybe, but doing two passes might be faster or slower. With the approach I've documented, you don't need to go over the data twice... you can prepare the answer during the first iteration. Of course, if an initial iteration is fast (e.g. from SSD media) and you're tight enough on memory that you want to size the analysis to the actual data range, then go for it.

It also makes it possible to shrink the width of the ints to the correct range, so that more than half of the pages will be non-empty.

Not sure what you mean there.

The standard data structure for this task is a hash set, a.k.a. std::unordered_set in the STL (by the way, Google's dense_hash_set usually performs slightly better).

You do not need the unique values to be sorted, so std::set is unnecessarily slow for your use case.

Like others suggest, you can also use a bit vector if your universe (the set of possible values) is not too big. If you have negative values, you can just cast them to unsigned and treat them as really big numbers.
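A sketch of that cast trick over a full-range bit vector (this assumes a 64-bit build, since the table is 2^32 bits, i.e. the 512 MB discussed above; the function name is mine):

#include <cstddef>
#include <cstdint>
#include <vector>

// Reinterpret the int's bit pattern as unsigned, so negatives simply become
// large indices in one contiguous 2^32-bit universe.
std::size_t count_distinct(const std::vector<int>& data) {
    std::vector<bool> seen(std::size_t(1) << 32);  // 512 MB of bits
    std::size_t distinct = 0;
    for (int v : data) {
        std::uint32_t u = static_cast<std::uint32_t>(v);  // e.g. -1 -> 4294967295
        if (!seen[u]) { seen[u] = true; ++distinct; }
    }
    return distinct;
}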

Would it help to just compare the current element to the previous one before passing it to the counting method, whatever that is?

Or to keep a small, fast cache of, say, the last 10 elements to discard short-range duplicates?

Or to do the counting in batches (count a sequence of 100 with temporary counters, then merge with the previous counts)?
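Along those lines, a hypothetical pre-filter (the names and the K=10 choice are mine) that drops short-range duplicates before they reach whatever counting structure sits behind it:

#include <array>
#include <cstddef>

// Pre-filter sketch: only values not seen among the last K pushes are
// forwarded to the counting structure `sink` (anything with insert()).
template <std::size_t K, typename Sink>
class RecentFilter {
public:
    explicit RecentFilter(Sink& sink) : sink_(sink) {}
    void push(int v) {
        for (std::size_t i = 0; i < fill_; ++i)
            if (recent_[i] == v) return;   // short-range duplicate: skip
        recent_[pos_] = v;                 // remember it in the ring buffer
        pos_ = (pos_ + 1) % K;
        if (fill_ < K) ++fill_;
        sink_.insert(v);                   // only "new-ish" values go through
    }
private:
    Sink& sink_;
    std::array<int, K> recent_{};
    std::size_t pos_ = 0, fill_ = 0;
};

Since a value is only skipped when an equal value was already forwarded, wrapping e.g. an std::unordered_set<int> this way keeps the count exact.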

I approve of trying to use STL containers (set, unordered_set, ...), but unfortunately you pay a penalty for them: their memory-stability and lightweight-iterator requirements mean they are implemented as node-based containers, with a huge (relatively speaking) overhead for each and every element.

I would propose three approaches instead:

  1. Stick to vector (only efficient with a low proportion of unique items)
  2. Implement Robin-Hood hashing
  3. Use a probabilistic approach

Sorted Vector

For the vector approach: nothing prevents you from keeping the vector sorted as you insert, and thus avoiding the insertion of duplicate elements. An example here:

#include <iostream>

#include <algorithm>
#include <vector>

template <typename T, typename Comp>
void insert_sorted(std::vector<T>& vec, T const& e, Comp const& comp) {
    auto const it = std::lower_bound(vec.begin(), vec.end(), e, comp);

    if (it != vec.end() and not comp(e, *it)) { return; }

    vec.insert(it, e);
}

template <typename T>
void insert_sorted(std::vector<T>& vec, T const& e) {
    insert_sorted(vec, e, std::less<T>{});
}

int main() {
    int const array[] = { 4, 3, 6, 2, 3, 6, 8, 4 };

    std::vector<int> vec;
    for (int a: array) {
        insert_sorted(vec, a);
    }

    std::cout << vec.size() << ":";
    for (int a: vec) { std::cout << " " << a; }
    std::cout << "\n";

    return 0;
}

Displays: 5: 2 3 4 6 8.

This is still O(n log n), obviously; however, it requires less memory:

  • with less memory, more of the vector stays in cache
  • the previous element being close to its successor is exploited: the binary search in lower_bound walks through nearly the same cache lines

It should already be a great improvement.

Note: it has been pointed out that inserting in the middle of a vector is not efficient. Of course, since it involves shifting half the already present elements (on average). Still, benchmarks suggest it can beat the current vector solution when the proportion of unique elements is small (0.1% in my benchmark).


Robin Hood Hashing

More involved, but Robin Hood hashing has very good characteristics, and thus performance. Most notably, it's implemented on top of a single dynamic array (like a vector), so it exhibits good memory locality.

Rust switched to Robin Hood hashing for its default hash table implementation and is very happy with it.

Note: from quick benchmarks, even unordered_set beats the pants off store & sort, and a naive open-addressed hash table is 25% faster still.
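For reference, a sketch of such a naive open-addressed (linear probing) hash set for ints; the names and constants are mine, and this is illustrative rather than production code:

#include <climits>
#include <cstddef>
#include <cstdint>
#include <vector>

class IntSet {
public:
    IntSet() : mask_(kInitialSlots - 1), slots_(kInitialSlots, int(kEmpty)),
               has_sentinel_(false), size_(0) {}

    void insert(int key) {
        if (key == kEmpty) {                 // the slot sentinel is tracked via a flag
            if (!has_sentinel_) { has_sentinel_ = true; ++size_; }
            return;
        }
        if ((size_ + 1) * 10 > slots_.size() * 7)   // keep load factor under ~0.7
            grow();
        std::size_t i = hash(key) & mask_;
        while (slots_[i] != kEmpty) {
            if (slots_[i] == key) return;    // already counted
            i = (i + 1) & mask_;             // linear probing
        }
        slots_[i] = key;
        ++size_;
    }

    std::size_t size() const { return size_; }

private:
    enum : int { kEmpty = INT_MIN };                    // marks an unused slot
    enum : std::size_t { kInitialSlots = 1024 };        // must stay a power of two

    static std::size_t hash(int key) {                  // cheap multiplicative hash
        return static_cast<std::uint32_t>(key) * 2654435761u;
    }

    void grow() {                                       // double and re-place live keys
        std::vector<int> old;
        old.swap(slots_);
        slots_.assign(old.size() * 2, int(kEmpty));
        mask_ = slots_.size() - 1;
        for (int v : old)
            if (v != kEmpty) {
                std::size_t i = hash(v) & mask_;
                while (slots_[i] != kEmpty) i = (i + 1) & mask_;
                slots_[i] = v;
            }
    }

    std::size_t mask_;
    std::vector<int> slots_;
    bool has_sentinel_;
    std::size_t size_;
};

Counting is then just: IntSet s; insert every stream element; read s.size().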


Probabilistic Approach

For very large problems, a very well-known algorithm is HyperLogLog. It has been implemented in Redis recently.

It has a very good ratio of memory used vs. error rate, and it is relatively simple to implement (especially following Antirez' code).
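A bare-bones sketch of the idea, with no small- or large-range bias corrections (for real use, follow the paper or the Redis code; the hash mixer and parameter choices here are mine):

#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

class HyperLogLog {
public:
    explicit HyperLogLog(unsigned p = 14)          // 2^14 registers = 16 KB, ~0.8% error
        : p_(p), regs_(std::size_t(1) << p, 0) {}

    void add(int value) {
        std::uint64_t h = mix(static_cast<std::uint32_t>(value));
        std::size_t idx = h >> (64 - p_);          // first p bits pick a register
        std::uint64_t rest = (h << p_) | 1;        // low bit set so clz is bounded
        std::uint8_t rank = static_cast<std::uint8_t>(leading_zeros(rest) + 1);
        if (rank > regs_[idx]) regs_[idx] = rank;  // keep the max rank seen
    }

    double estimate() const {                      // raw HLL estimate, no corrections
        double m = double(regs_.size());
        double sum = 0.0;
        for (std::uint8_t r : regs_) sum += std::ldexp(1.0, -int(r));
        double alpha = 0.7213 / (1.0 + 1.079 / m); // alpha_m, valid for m >= 128
        return alpha * m * m / sum;
    }

private:
    static std::uint64_t mix(std::uint64_t x) {    // splitmix64-style finalizer
        x += 0x9e3779b97f4a7c15ull;
        x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ull;
        x = (x ^ (x >> 27)) * 0x94d049bb133111ebull;
        return x ^ (x >> 31);
    }
    static int leading_zeros(std::uint64_t x) {    // portable clz for this sketch
        int n = 0;
        for (std::uint64_t bit = 1ull << 63; bit && !(x & bit); bit >>= 1) ++n;
        return n;
    }
    unsigned p_;
    std::vector<std::uint8_t> regs_;
};

With 2^14 registers this uses 16 KB and gives roughly ±0.8% relative error, which may well be acceptable for box counting.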


Throwing more hardware at the problem

Note that this is an embarrassingly parallel problem; you could thus easily have multiple threads, each:

  • picking a bundle of IDs from the stream (say, 2^10 at once)
  • merging them into a thread-local set of unique IDs (whatever its implementation)
  • looping until the stream is empty
  • and finally merging their results together

And you could get a speed-up close to the number of threads (with some overhead, obviously).

Note: not too coincidentally, both approaches above can easily be adapted to this scheme, and both support efficient merges.
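For instance, merging two of the HyperLogLog sketches above is just a register-wise max (and merging two bit vectors is a bitwise OR), which is what keeps the final combine step cheap. A sketch:

#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical merge of two HyperLogLog register arrays of equal size:
// element-wise max, after which `into` estimates the union's cardinality.
void merge_hll(std::vector<std::uint8_t>& into, const std::vector<std::uint8_t>& from) {
    for (std::size_t i = 0; i < into.size(); ++i)
        if (from[i] > into[i]) into[i] = from[i];
}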

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow