Question

Say you have a collection of elements. How can you pick out those that have duplicates and put them into groups with the least number of comparisons? Preferably in C++, but the algorithm matters more than the language. For example, given {E1,E2,E3,E4,E4,E2,E6,E4,E3}, I wish to extract {E2,E2}, {E3,E3}, {E4,E4,E4}. Which data structure and algorithm would you choose? Please also include the cost of setting up the data structure, say, if it's a pre-sorted one like std::multimap.

Updates

To make things clearer, as suggested: there is one constraint. The elements must be compared directly to be certain they are duplicates.

So hashes do not apply, because they merely shift the comparison from heavy elements (e.g. chunks of data) to light ones (integers). They reduce some comparisons but do not do away with them, and in the end we are back to our original problem once we are inside one collision bucket.

Pretend you have a bunch of potentially duplicate files of gigabytes each, and they bear the same hash value under every hash algorithm human beings know. Now you are going to spot the real duplicates.

No, it can't be a real-life problem (even MD5 is enough to generate unique hashes for real-life files). But just pretend, so that we can focus on finding the data structure + algorithm that involves the fewest comparisons.


What I am doing is to

  1. represent the collection in an STL std::list data structure (in that 1) its element deletion is cheaper than, say, a vector's, and 2) its insertion is cheaper, requiring no sort),

  2. pop out one element and compare it with the rest; if a duplicate is found, it is pulled out of the list. Once the end of the list is reached, one group of duplicates has been found, if any,

  3. repeat the above two steps until the list is empty.

It needs N-1 comparisons in the best case, but N(N-1)/2 in the worst.

What are the better alternatives?


My code, using the method explained above:

// algorithm to consume the std::list container,
// supports: list<path_type>, list< pair<std::string, paths_type::const_iterator> >
template<class T>
struct consume_list
{
    groups_type operator()(list<T>& l)
    {
        // remove spurious identicals and group the rest
        // algorithm:  
        // 1. compare the first element with the remaining elements, 
        //    pick out all duplicated files including the first element itself.
        // 2. start over again with the shrunken list
        //     until the list contains one or zero elements.

        groups_type sub_groups;           
        group_type one_group; 
        one_group.reserve(1024);

        while(l.size() > 1)
        {
            T front(l.front());
            l.pop_front();

            item_predicate<T> ep(front);
            typename list<T>::iterator it     = l.begin(); 
            typename list<T>::iterator it_end = l.end();
            while(it != it_end)
            {
                if(ep.equals(*it))
                {
                    one_group.push_back(ep.extract_path(*(it))); // single it out
                    it = l.erase(it);
                }
                else
                {
                    it++;
                }
            }

            // save results
            if(!one_group.empty())
            {
                // save
                one_group.push_back(ep.extract_path(front));                    
                sub_groups.push_back(one_group);

                // clear, memory allocation not freed
                one_group.clear(); 
            }            
        }
        return sub_groups;
    }        
}; 


// type for item-item comparison within a stl container, e.g.  std::list 
template <class T>
struct item_predicate{};

// specialization for type path_type      
template <>
struct item_predicate<path_type>
{
public:
    item_predicate(const path_type& base)/*init list*/            
    {}
public:
    bool equals(const path_type& comparee)
    {
        bool  result;
        /* time-consuming operations here*/
        return result;
    }

    const path_type& extract_path(const path_type& p)
    {
        return p;
    }
private:
    // class members
}; 



Thanks for the answers below; however, they seem to be misled by my example into thinking this is about integers. In fact the elements are type-agnostic (not necessarily integers, strings or any other PODs), and the equality predicates are self-defined, that is, the comparisons can be very heavy.

So maybe my question should be: which data structure + algorithm involves fewer comparisons?

Using a pre-sorted container like std::multiset or std::multimap is not better, according to my test, since

  1. the sorting while inserting already does the comparisons,
  2. the subsequent adjacent-find does comparisons again,
  3. these data structures prefer less-than operations to equality operations; they perform two less-than comparisons (a<b, then b<a) where a single equality test would do (see the sketch below).

I do not see how this can save comparisons.
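
For illustration, this is the standard way ordered containers derive equivalence from operator< (a generic sketch, not any particular library's code):

// Testing "a is equivalent to b" through a strict weak ordering
// costs two less-than comparisons where one operator== would do.
template <class T>
bool equivalent(const T& a, const T& b)
{
    return !(a < b) && !(b < a);
}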


One more thing that is ignored by some answers below: I need to differentiate the duplicate groups from one another, not just keep them in the container.


Conclusion

After all the discussion, there seem to be three ways:

  1. my original naive method as explained above
  2. start with a linear container like std::vector, sort it, and then locate the equal ranges
  3. start with an associative container like std::map<Type, vector<duplicates>> and pick out the duplicates during the setup of the associative container, as suggested by Charles Bailey.

I've coded a sample to test all the methods, as posted below.

The number of duplicates and how they are distributed may influence the best choice.

  • Method 1 is best when the duplicates fall heavily at the front, and worst when they fall at the end. Sorting does not change the distribution, only its direction.
  • Method 3 has the most consistent average performance
  • Method 2 is never the best choice

Thanks to all who participated in the discussion.

One output with 20 sample items from the code below.

Test with [ 20 10 6 5 4 3 2 2 2 2 1 1 1 1 1 1 1 1 1 1 ]

and [ 1 1 1 1 1 1 1 1 1 1 2 2 2 2 3 4 5 6 10 20 ] respectively

using std::vector -> sort() -> adjacent_find():

comparisons: [ '<' = 139, '==' = 23 ]

comparisons: [ '<' = 38, '==' = 23 ]

using std::list -> sort() -> shrink list:

comparisons: [ '<' = 50, '==' = 43 ]

comparisons: [ '<' = 52, '==' = 43 ]

using std::list -> shrink list:

comparisons: [ '<' = 0, '==' = 121 ]

comparisons: [ '<' = 0, '==' = 43 ]

using std::vector -> std::map<Type, vector<Type>>:

comparisons: [ '<' = 79, '==' = 0 ]

comparisons: [ '<' = 53, '==' = 0 ]

using std::vector -> std::multiset -> adjacent_find():

comparisons: [ '<' = 79, '==' = 7 ]

comparisons: [ '<' = 53, '==' = 7 ]

Code

// compile with VC++10: cl.exe /EHsc

#include <vector>
#include <deque>
#include <list>
#include <map>
#include <set>
#include <algorithm>
#include <iostream>
#include <sstream>

#include <boost/foreach.hpp>
#include <boost/tuple/tuple.hpp>
#include <boost/format.hpp>

using namespace std;

struct Type
{
    Type(int i) : m_i(i){}

    bool operator<(const Type& t) const
    {
        ++number_less_than_comparison;
        return m_i < t.m_i;
    }

    bool operator==(const Type& t) const
    {
        ++number_equal_comparison;    
        return m_i == t.m_i;
    }
public:
    static void log(const string& operation)
    {
        cout 
        << "comparison during " <<operation << ": [ "
        << "'<'  = " << number_less_than_comparison
        << ", "
        << "'==' = " << number_equal_comparison
        << " ]\n";

        reset();
    }

    int to_int() const
    {
        return m_i;
    }
private:
    static void reset()
    {
        number_less_than_comparison = 0;
        number_equal_comparison = 0;      
    }

public:
    static size_t number_less_than_comparison;
    static size_t number_equal_comparison;    
private:
    int m_i;
};

size_t Type::number_less_than_comparison = 0;
size_t Type::number_equal_comparison = 0;  

ostream& operator<<(ostream& os, const Type& t) 
{
    os << t.to_int();
    return os;
}

template< class Container >
struct Test
{    
    void recursive_run(size_t n)
    { 
        for(size_t i = 48; i < n; ++i)
        {
            run(i);
        }    
    }

    void run(size_t i)
    {
        cout << 
        boost::format("\n\nTest %1% sample elements\nusing method%2%:\n") 
        % i 
        % Description();

        generate_sample(i);
        sort();
        locate();   

        generate_reverse_sample(i);
        sort();
        locate(); 
    }

private:    
    void print_me(const string& when)
    {
        std::stringstream ss;
        ss << when <<" = [ ";
        BOOST_FOREACH(const typename Container::value_type& v, m_container)
        {
            ss << v << " ";
        }
        ss << "]\n";    
        cout << ss.str();
    }

    void generate_sample(size_t n)
    {
        m_container.clear();
        for(size_t i = 1; i <= n; ++i)
        {
            m_container.push_back(Type(n/i));    
        }
        print_me("init value");
        Type::log("setup");
    }

    void generate_reverse_sample(size_t n)
    {
        m_container.clear();
        for(size_t i = 0; i < n; ++i)
        {
            m_container.push_back(Type(n/(n-i)));     
        }
        print_me("init value(reverse order)");
        Type::log("setup");
    }    

    void sort()
    {    
        sort_it();

        Type::log("sort");
        print_me("after sort");

    }

    void locate()
    {
        locate_duplicates();

        Type::log("locate duplicate");
    }
protected:
    virtual string Description() = 0;
    virtual void sort_it() = 0;
    virtual void locate_duplicates() = 0;
protected:
    Container m_container;    
};

struct Vector : Test<vector<Type> >
{    
    string Description()
    {
        return "std::vector<Type> -> sort() -> adjacent_find()";
    } 

private:           
    void sort_it()
    {    
        std::sort(m_container.begin(), m_container.end()); 
    }

    void locate_duplicates()
    {
        using std::adjacent_find;
        typedef vector<Type>::iterator ITR;
        typedef vector<Type>::value_type  VALUE;

        typedef boost::tuple<VALUE, ITR, ITR> TUPLE;
        typedef vector<TUPLE> V_TUPLE;

        V_TUPLE results;

        ITR itr_begin(m_container.begin());
        ITR itr_end(m_container.end());       
        ITR itr(m_container.begin()); 
        ITR itr_range_begin(m_container.begin());  

        while(itr_begin != itr_end)
        {     
            // find the start of one equal range
            itr = adjacent_find(
            itr_begin, 
            itr_end, 
                []  (VALUE& v1, VALUE& v2)
                {
                    return v1 == v2;
                }
            );
            if(itr_end == itr) break; // end of container

            // find the end of one equal range
            itr_range_begin = itr; // the range begins where adjacent_find stopped
            VALUE start = *itr; 
            while(itr != itr_end)
            {
                if(!(*itr == start)) break;                
                itr++;
            }

            results.push_back(TUPLE(start, itr_range_begin, itr));

            // prepare for next iteration
            itr_begin = itr;
        }  
    }
};

struct List : Test<list<Type> >
{
    List(bool sorted) : m_sorted(sorted){}

    string Description()
    {
        return m_sorted ? "std::list -> sort() -> shrink list" : "std::list -> shrink list";
    }
private:    
    void sort_it()
    {
        if(m_sorted) m_container.sort(); // std::list needs its member sort(); std::sort requires random-access iterators
    }

    void locate_duplicates()
    {       
        typedef list<Type>::value_type VALUE;
        typedef list<Type>::iterator ITR;

        typedef vector<VALUE>  GROUP;
        typedef vector<GROUP>  GROUPS;

        GROUPS sub_groups;
        GROUP one_group; 

        while(m_container.size() > 1)
        {
            VALUE front(m_container.front());
            m_container.pop_front();

            ITR it     = m_container.begin(); 
            ITR it_end = m_container.end();
            while(it != it_end)
            {
                if(front == (*it))
                {
                    one_group.push_back(*it); // single it out
                    it = m_container.erase(it); // shrink list by one
                }
                else
                {
                    it++;
                }
            }

            // save results
            if(!one_group.empty())
            {
                // save
                one_group.push_back(front);                    
                sub_groups.push_back(one_group);

                // clear, memory allocation not freed
                one_group.clear(); 
            }            
        }
    }        

private:
    bool m_sorted;
};

struct Map : Test<vector<Type>>
{    
    string Description()
    {
        return "std::vector -> std::map<Type, vector<Type>>" ;
    }
private:    
    void sort_it() {}

    void locate_duplicates()
    {
        typedef map<Type, vector<Type> > MAP;
        typedef MAP::iterator ITR;

        MAP local_map;

        BOOST_FOREACH(const vector<Type>::value_type& v, m_container)
        {
            pair<ITR, bool> mit; 
            mit = local_map.insert(make_pair(v, vector<Type>(1, v)));   
            if(!mit.second) (mit.first->second).push_back(v); 
         }

        // drop the singleton groups; erasing while iterating must not reuse the erased iterator
        ITR itr(local_map.begin());
        while(itr != local_map.end())
        {
            if(itr->second.size() < 2)
                local_map.erase(itr++); // post-increment hands erase a copy and keeps itr valid
            else
                ++itr;
        }
    }        
};

struct Multiset :  Test<vector<Type>>
{
    string Description()
    {
        return "std::vector -> std::multiset<Type> -> adjacent_find()" ;
    }
private:
    void sort_it() {}

    void locate_duplicates()
    {   
        using std::adjacent_find;

        typedef multiset<Type> SET; // a plain std::set would discard the very duplicates we are looking for
        typedef SET::iterator ITR;
        typedef SET::value_type  VALUE;

        typedef boost::tuple<VALUE, ITR, ITR> TUPLE;
        typedef vector<TUPLE> V_TUPLE;

        V_TUPLE results;

        SET local_set;
        BOOST_FOREACH(const vector<Type>::value_type& v, m_container)
        {
            local_set.insert(v);
        }

        ITR itr_begin(local_set.begin());
        ITR itr_end(local_set.end());       
        ITR itr(local_set.begin()); 
        ITR itr_range_begin(local_set.begin());  

        while(itr_begin != itr_end)
        {     
            // find the start of one equal range
            itr = adjacent_find(
            itr_begin, 
            itr_end, 
            []  (const VALUE& v1, const VALUE& v2)
                {
                    return v1 == v2;
                }
            );
            if(itr_end == itr) break; // end of container

            // find the end of one equal range
            itr_range_begin = itr; // the range begins where adjacent_find stopped
            VALUE start = *itr; 
            while(itr != itr_end)
            {
                if(!(*itr == start)) break;                
                itr++;
            }

            results.push_back(TUPLE(start, itr_range_begin, itr));

            // prepare for next iteration
            itr_begin = itr;
        }  
    } 
};

int main()
{
    size_t N = 20;

    Vector().run(N);
    List(true).run(N);
    List(false).run(N);
    Map().run(N);
    Multiset().run(N);
}

Solution

You could use a map from a representative element to a list/vector/deque of the other elements. This requires relatively few comparisons on insertion into the container and means that you can iterate through the resulting groups without having to perform any comparisons.

This example always inserts the first, representative element into the mapped deque storage, as it makes the subsequent iteration through the group logically simple; but if this duplication proves a problem then it would be easy to perform the push_back only if (!ins_pair.second), as sketched after the code below.

typedef std::map<Type, std::deque<Type> > Storage;

void Insert(Storage& s, const Type& t)
{
    std::pair<Storage::iterator, bool> ins_pair( s.insert(std::make_pair(t, std::deque<Type>())) );
    ins_pair.first->second.push_back(t);
}
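
A sketch of that variant (InsertNoDuplicateKey is a hypothetical name; the representative then lives only in the map key, so an empty deque marks a unique element):

void InsertNoDuplicateKey(Storage& s, const Type& t)
{
    std::pair<Storage::iterator, bool> ins_pair( s.insert(std::make_pair(t, std::deque<Type>())) );
    if (!ins_pair.second)               // key already present: t is a duplicate
        ins_pair.first->second.push_back(t);
}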

Iterating through the groups is then (relatively) simple and cheap:

void Iterate(const Storage& s)
{
    for (Storage::const_iterator i = s.begin(); i != s.end(); ++i)
    {
        for (std::deque<Type>::const_iterator j = i->second.begin(); j != i->second.end(); ++j)
        {
            // do something with *j
        }
    }
}

I performed some experiments for comparison and object counts. In a test with 100000 objects in random order forming 50000 groups (i.e. an average of 2 objects per group), the above method cost the following number of comparisons and copies:

1630674 comparisons, 443290 copies

(I tried bringing the number of copies down, but only really managed to at the expense of comparisons which seem to be the higher cost operation in your scenario.)

Using a multimap, and retaining the previous element in the final iteration to detect the group transitions cost this:

1756208 comparisons, 100000 copies

Using a single list and popping the front element and performing a linear search for other group members cost:

1885879088 comparisons, 100000 copies

Yes, that's ~1.9b comparisons compared to ~1.6m for my best method. To get the list method to perform anywhere near an optimal number of comparisons it would have to be sorted and this is going to cost a similar number of comparisons as building an inherently ordered container would in the first place.

Edit

I took your posted code and ran the implied algorithm (I had to make some assumptions about the code, as there are some assumed definitions) over the same test data set as I used before, and I counted:

1885879088 comparisons, 420939 copies

i.e. exactly the same number of comparisons as my dumb list algorithm, but with more copies. I think that means we are using essentially similar algorithms for this case. I can't see any evidence of an alternative sort order, but it looks like you want a list of the groups that contain more than one equivalent element. This can be achieved simply in my Iterate function by adding an if (i->second.size() > 1) clause, as sketched below.
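
For concreteness, the filtered iteration might look like this (a sketch reusing the Storage typedef from above; the function name is mine):

void IterateDuplicateGroups(const Storage& s)
{
    for (Storage::const_iterator i = s.begin(); i != s.end(); ++i)
    {
        if (i->second.size() > 1)   // skip the singleton groups
        {
            for (std::deque<Type>::const_iterator j = i->second.begin(); j != i->second.end(); ++j)
            {
                // do something with *j
            }
        }
    }
}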

I still can't see any evidence that building a sorted container such as this map of deques isn't a good (even if not optimal) strategy.

OTHER TIPS

Yes, you can do much better.

  1. Sort them (O(n) for simple integers via radix sort, O(n log n) in general); duplicates are then guaranteed to be adjacent, making it a quick O(n) pass to find them.

  2. Use a hash table, also O(n). For each item, check if it's already in the hash table; if so, it's a duplicate; if not, put it in the hash table (see the sketch below).
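
A minimal sketch of that tip using C++11's std::unordered_set (this assumes the element type is hashable, which the question's constraint rules out, but it illustrates the idea):

#include <unordered_set>
#include <vector>

// Collect every occurrence that has been seen before.
template <class T>
std::vector<T> find_duplicates(const std::vector<T>& input)
{
    std::unordered_set<T> seen;
    std::vector<T> dups;
    for (typename std::vector<T>::const_iterator it = input.begin(); it != input.end(); ++it)
    {
        if (!seen.insert(*it).second)   // insert failed: already present
            dups.push_back(*it);
    }
    return dups;
}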

edit

The method you're using seems to do O(N^2) comparisons:

for i = 0; i < length; ++i           // will do length times
    for j = i+1; j < length; ++j     // will do length-i times
        compare

So for a length of 5 you do 4+3+2+1 = 10 compares; for 6 you do 15, etc. It's (N^2)/2 - N/2, to be exact. N*log(N) is smaller for any reasonably high value of N.

How big is N in your case?

As far as reducing hash collisions goes, the best way is to get a better hash function :-D. Assuming that is not possible, if you can make a variant (e.g., a different modulus), you may be able to do a nested hash.

1. Sort the array, O(n log n) at worst - mergesort/heapsort/binary tree sort, etc.

2. Compare neighbors and pull the matches out, O(n) (see the sketch below)
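
A minimal sketch of those two steps (my code, assuming the element type supports operator< and operator==):

#include <algorithm>
#include <cstddef>
#include <vector>

// Sort, then scan adjacent runs; each run longer than 1 is a duplicate group.
template <class T>
std::vector<std::vector<T> > group_duplicates(std::vector<T> v)
{
    std::sort(v.begin(), v.end());                    // O(n log n)
    std::vector<std::vector<T> > groups;
    for (std::size_t i = 0; i < v.size(); )
    {
        std::size_t j = i + 1;
        while (j < v.size() && v[j] == v[i]) ++j;     // extend the equal run
        if (j - i > 1)                                // keep only real duplicates
            groups.push_back(std::vector<T>(v.begin() + i, v.begin() + j));
        i = j;
    }
    return groups;
}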

Keep a hash-table-based structure from value to count; if your C++ implementation doesn't offer std::hash_map (not really part of the C++ standard so far!) use Boost or grab a version off the web. One pass over the collection (i.e., O(N)) lets you build the value->count mapping; one more pass over the hash table (<= O(N), clearly) identifies values with a count > 1 and emits them appropriately. Overall O(N), which is not the case for your suggestion.

Have you tried sorting? For example, using an algorithm like quicksort? If the performance is good enough for you, then that would be an easy approach.

If it is known to be a list of integers, and if it is known that they are all between A and B (e.g. A=0, B=9), make an array of B-A+1 elements, and create B-A+1 containers.

In the very specific case (a list of plain integers), I propose that you merely count them, since you cannot distinguish duplicate integers from one another anyway:

// counter[] has one slot per possible value, all initialized to zero
for(int i = 0; i < countOfNumbers; i++)
    counter[number[i]]++;

If they are distinguishable, create an array of lists, and add them to the respective list.

If they are not numbers, use a std::map or std::hash_map, mapping keys to lists of values.

The simplest is probably to just sort the list, and then to iterate over it looking for dups.

If you know something about the data, more efficient algorithms are possible.

For example, if you knew the list was large, and only contained integers from 1..n where n is fairly small, you could use a pair of boolean arrays (or a bitmap), and do something like this:

std::vector<bool> once(MAXIMUM_VALUE, false);
std::vector<bool> many(MAXIMUM_VALUE, false);
for (size_t i = 0; i < values.size(); ++i)
    if (once[values[i]])
        many[values[i]] = true;
    else
        once[values[i]] = true;

Now, many[] records which values were seen more than once.

Most people mentioning hash/unordered-map solutions are assuming O(1) insertion and query time; however, it could be O(N) in the worst case. Additionally, you incur the cost of hashing the objects.

Personally, I would insert the objects into a binary tree (O(log n) insertion for each) and keep a counter at each node. This yields O(n log n) construction time and an O(n) traversal to identify all duplicates; a sketch follows.
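
A minimal sketch of that approach, using std::map as the balanced binary tree and the mapped value as the per-node counter (the answer gives no code, so this is one interpretation):

#include <cstddef>
#include <map>
#include <vector>

// O(log n) insertion per element; entries with a count > 1 are the duplicates.
template <class T>
std::map<T, std::size_t> count_elements(const std::vector<T>& input)
{
    std::map<T, std::size_t> counts;
    for (typename std::vector<T>::const_iterator it = input.begin(); it != input.end(); ++it)
        ++counts[*it];   // first sight inserts the node with count 0, then increments
    return counts;
}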

If I've understood the question correctly, then this is the simplest solution I can think of:

std::vector<T> input;
typedef std::vector<T>::iterator iter;

std::vector<std::pair<iter, iter> > output;

sort(input.begin(), input.end()); // O(n lg n) comparisons

for (iter cur = input.begin(); cur != input.end();){
  std::pair<iter, iter> range = equal_range(cur, input.end(), *cur); // find all elements equal to the current one O(lg n) comparisons
  cur = range.second;
  output.push_back(range);
}

Total running time: O(n lg n). You have one sorting pass, O(n lg n), and then a second pass where O(lg n) comparisons are performed for each group (so this is done at most n times, also yielding O(n lg n)).

Note that this depends on the input being a vector. Only random access iterators have logarithmic complexity in the second pass. Bidirectional iterators would be linear.

This doesn't rely on hashing (as requested), and it preserves all the original elements (rather than just returning one element for each group, along with a count of how often it occurred).

Of course, a number of smaller constant optimizations are possible. output.reserve(input.size()) on the output vector would be a good idea, to avoid reallocation. input.end() is taken much more often than necessary too, and could be easily cached.

Depending on how big the groups are assumed to be, equal_range may not be the most efficient choice. I assume it does a binary search to get logarithmic complexity, but if each group is only a couple of elements, a simple linear scan would be faster. In any case, the initial sorting dominates the cost.

Just to register that I had the same problem while normalizing a triple store that I am working with. I implemented method 3, summarized by Charles Bailey, in Common Lisp, using the hash-table feature of Allegro Common Lisp.

The function "agent-equal?" is used to test whether two agents in the TS are the same. The function "merge-nodes" merges the nodes in each cluster. In the code below, "..." marks where the less important parts were removed.

(defun agent-equal? (a b)
 (let ((aprops (car (get-triples-list :s a :p !foaf:name)))
       (bprops (car (get-triples-list :s b :p !foaf:name))))
   (upi= (object aprops) (object bprops))))

(defun process-rdf (out-path filename)
 (let* (...
        (table (make-hash-table :test 'agent-equal?)))
  (progn 
   ...
   (let ((agents (mapcar #'subject 
          (get-triples-list :o !foaf:Agent :limit nil))))
     (progn 
       (dolist (a agents)
          (if (gethash a table)
            (push a (gethash a table))
            (setf (gethash a table) (list a))))
       (maphash #'merge-nodes table)
           ...
           )))))

Since C++11, hash tables are provided by the standard library as std::unordered_map. So an O(N) solution is to put your values into an std::unordered_map<T, std::vector<T> >, as sketched below.
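
A minimal sketch (assuming std::hash<T> and operator== exist for T; note this conflicts with the question's no-hashing constraint, but it is what this answer proposes):

#include <unordered_map>
#include <vector>

// Expected O(N): bucket the values by hash, then keep the buckets of size > 1.
template <class T>
std::vector<std::vector<T> > group_by_hash(const std::vector<T>& input)
{
    std::unordered_map<T, std::vector<T> > buckets;
    for (typename std::vector<T>::const_iterator it = input.begin(); it != input.end(); ++it)
        buckets[*it].push_back(*it);

    std::vector<std::vector<T> > groups;
    for (typename std::unordered_map<T, std::vector<T> >::const_iterator b = buckets.begin(); b != buckets.end(); ++b)
        if (b->second.size() > 1)
            groups.push_back(b->second);
    return groups;
}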

Licensed under: CC-BY-SA with attribution