Question

I have a list of ordered items of type A, each of which contains a subset of the items from a list B. For each pair of items in A, I would like to find the number of items from B that they share (the size of their intersection).

For example, if I have this data:

A1 : B1  
A2 : B1 B2 B3  
A3 : B1  

Then I would get the following result:

A1, A2 : 1  
A1, A3 : 1  
A2, A3 : 1  

The problem I'm having is making the algorithm efficient. My dataset has about 8.4K items of type A, which means 8.4K choose 2 = (8400 × 8399) / 2 = 35,275,800 pairs to check. The algorithm I'm using simply goes through each pair and computes a set intersection.

The gist of what I have so far is below. I am storing each count as a key in a map, with the value being a vector of the A pairs that share that count. I'm using a graph data structure to store the data, but the only 'graph' operation I'm using is get_neighbors(), which returns the B subset for an item from A. I happen to know that the elements in the graph are ordered from index 0 to 8.4K.
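(The supporting type definitions aren't shown; to make the snippets below read as self-contained C++, assume something minimal along these lines. These declarations are guesses based on how the code uses the types, not the actual definitions.)

#include <algorithm>   // set_intersection
#include <iterator>    // inserter
#include <map>
#include <set>
#include <utility>
#include <vector>
using namespace std;

typedef pair<size_t, size_t> A_pair;   // a pair of A indices (assumed)
typedef vector<int> EdgeList;          // the B indices for one A item, kept sorted
                                       // (set_intersection requires sorted ranges)
typedef vector<int> VertexList;        // the A indices

struct Graph {                         // assumed interface: only what the code calls
    VertexList vertices() const;
    EdgeList get_neighbors(size_t i) const;
};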

void get_overlap(Graph& g, map<int, vector<A_pair> >& overlap) {

    map<int, vector<A_pair> >::iterator it;

    EdgeList el_i, el_j;
    set<int> intersect;

    size_t i, j;

    VertexList vl = g.vertices();

    for (i = 0; i < vl.size()-1; i++) {
        el_i = g.get_neighbors(i);

        for (j = i+1; j < vl.size(); j++) {
            el_j = g.get_neighbors(j);

            // count the B items shared by A_i and A_j
            intersect.clear();  // must be reset, or counts accumulate across pairs
            set_intersection(el_i.begin(), el_i.end(), el_j.begin(), el_j.end(), inserter(intersect, intersect.begin()));
            int num_overlap = intersect.size();

            it = overlap.find(num_overlap);
            if (it == overlap.end()) {
                vector<A_pair> temp;
                temp.push_back(A_pair(i, j));
                overlap.insert(pair<int, vector<A_pair> >(num_overlap, temp));
            }
            else {
                // note: copies the whole vector and reassigns it (see Edit 2)
                vector<A_pair> temp = it->second;
                temp.push_back(A_pair(i, j));
                overlap[num_overlap] = temp;
            }
        }
    }
}

I have been running this program for nearly 24 hours, and i in the outer loop has only reached iteration 250 (I'm printing each i to a log file). This is, of course, a long way from 8.4K, although I know the inner loop gets shorter as iterations go on (since j starts at i + 1). Is there a more optimal approach?

Edit: To be clear, the goal here is ultimately to find the top k overlapped pairs.

Edit 2: Thanks to @Beta and others for pointing out optimizations. In particular, updating the map directly (instead of copying its contents and resetting the map value) drastically improved the performance. It now runs in a matter of seconds.
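(For concreteness, here is a minimal sketch of both points above, assuming the same types as before: the direct map update that Edit 2 describes, and, for the top-k goal from the first edit, reading the largest counts off the ordered map. print_top_k is an illustrative helper, not part of the original code.)

#include <iostream>

// The find/copy/reassign sequence in the inner loop can collapse to a single
// direct update: operator[] default-constructs an empty vector the first time
// a given count is seen, and push_back then appends in place.
//
//     overlap[num_overlap].push_back(A_pair(i, j));

// Since std::map keeps its keys in ascending order, the top-k pairs can be
// read off by walking the map in reverse (largest count first).
void print_top_k(const map<int, vector<A_pair> >& overlap, size_t k) {
    size_t printed = 0;
    map<int, vector<A_pair> >::const_reverse_iterator rit;
    for (rit = overlap.rbegin(); rit != overlap.rend() && printed < k; ++rit) {
        for (size_t n = 0; n < rit->second.size() && printed < k; ++n, ++printed) {
            cout << rit->second[n].first << ", " << rit->second[n].second
                 << " : " << rit->first << "\n";
        }
    }
}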

Solution

I think you may be able to make things faster by pre-computing a reverse (edge-to-vertex) map. This would allow you to avoid the set_intersection call, which performs a number of costly set insertions. I am missing some declarations, so this is not fully functional code, but hopefully you will get the idea. I am assuming that EdgeList is some sort of vector of ints:

void get_overlap(Graph& g, map<int, vector<A_pair> >& overlap) {

    map<int, vector<A_pair> >::iterator it;

    EdgeList el_i;

    size_t i, j;

    VertexList vl = g.vertices();

    // compute reverse map: for each edge, the set of vertices incident to it
    map<int, set<int> > reverseMap;
    for (i = 0; i < vl.size(); i++) {   // all vertices, including the last
        el_i = g.get_neighbors(i);
        for (auto e : el_i) {
            const auto findIt = reverseMap.find(e);
            if (end(reverseMap) == findIt) {
                reverseMap.emplace(e, set<int>{static_cast<int>(i)});
            } else {
                findIt->second.insert(i);
            }
        }
    }

    for (i = 0; i < vl.size()-1; i++) {
        el_i = g.get_neighbors(i);

        for (j = i+1; j < vl.size(); j++) {
            // count how many of i's edges are also incident to j,
            // using the reverse map instead of set_intersection
            int num_overlap = 0;
            for (auto e : el_i) {
                auto findIt = reverseMap.find(e);
                if (end(reverseMap) != findIt) {
                    if (findIt->second.count(j) > 0) {
                        ++num_overlap;
                    }
                }
            }

            it = overlap.find(num_overlap);
            if (it == overlap.end()) {
                overlap.emplace(num_overlap, vector<A_pair>({ A_pair(i, j) }));
            }
            else {
                it->second.push_back(A_pair(i, j));
            }
        }
    }
}

I didn't do a precise performance analysis, but inside the double loop you replace "at most 4N comparisons" plus some costly set insertions (from set_intersection) with roughly N·(log E + log M) comparisons per pair, where N is the average number of edges per vertex, M is the average number of vertices per edge, and E is the number of edges (for each of the N edges, one map lookup of cost log E and one set lookup of cost log M). So it could be beneficial depending on your data set. Also, if your edge indexes are compact, then you can use a simple vector rather than a map to represent the reverse map, which removes the log E cost, as sketched below.
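(To illustrate that last point, here is a minimal sketch assuming the edge indexes run compactly from 0 to num_edges - 1, where num_edges is a hypothetical count of distinct edges: the reverse map becomes a plain vector, so each lookup is a constant-time subscript instead of an O(log E) map find.)

// Assumes compact edge indexes 0..num_edges-1 (num_edges is hypothetical here);
// slot e holds the set of vertices incident to edge e.
vector<set<int> > reverseMap(num_edges);

for (size_t i = 0; i < vl.size(); i++) {
    EdgeList el = g.get_neighbors(i);
    for (size_t n = 0; n < el.size(); n++) {
        reverseMap[el[n]].insert(i);   // O(1) subscript replaces the O(log E) find
    }
}

// and inside the double loop:
//     if (reverseMap[e].count(j) > 0) ++num_overlap;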

One question, though: since you're talking about vertices and edges, don't you have the additional constraint that an edge always has exactly 2 vertices? That could simplify some computations.
