Question

I have a list of ordered items of type A, each of which contains a subset of the items from a list B. For each pair of items in A, I would like to find the number of items from B that they share (the size of their intersection).

For example, if I have this data:

A1 : B1  
A2 : B1 B2 B3  
A3 : B1  

Then I would get the following result:

A1, A2 : 1  
A1, A3 : 1  
A2, A3 : 1  

The problem I'm having is making the algorithm efficient. My dataset has about 8.4K items of type A, which means 8.4K choose 2 = (8400 × 8399) / 2 = 35,275,800 pairs to check. The algorithm I'm using simply goes through each pair and computes a set intersection.

The gist of what I have so far is below. I am storing each count as a key in a map, with the value being a vector of the A pairs that share that count. I'm using a graph data structure to store the data, but the only 'graph' operation I'm using is get_neighbors(), which returns the B subset for an item from A. I happen to know that the elements in the graph are ordered from index 0 to 8.4K.
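(The supporting type definitions aren't shown; to make the snippets below read as self-contained C++, assume something minimal along these lines. These declarations are guesses based on how the code uses the types, not the actual definitions.)

#include <algorithm>   // set_intersection
#include <iterator>    // inserter
#include <map>
#include <set>
#include <utility>
#include <vector>
using namespace std;

typedef pair<size_t, size_t> A_pair;   // a pair of A indices (assumed)
typedef vector<int> EdgeList;          // the B indices for one A item, kept sorted
                                       // (set_intersection requires sorted ranges)
typedef vector<int> VertexList;        // the A indices

struct Graph {                         // assumed interface: only what the code calls
    VertexList vertices() const;
    EdgeList get_neighbors(size_t i) const;
};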

void get_overlap(Graph& g, map<int, vector<A_pair> >& overlap) {

    map<int, vector<A_pair> >::iterator it;

    EdgeList el_i, el_j;
    set<int> intersect;

    size_t i, j;

    VertexList vl = g.vertices();

    for (i = 0; i < vl.size()-1; i++) {
        el_i = g.get_neighbors(i);

        for (j = i+1; j < vl.size(); j++) {
            el_j = g.get_neighbors(j);

            // count the B items shared by A_i and A_j
            intersect.clear();  // must be reset, or counts accumulate across pairs
            set_intersection(el_i.begin(), el_i.end(), el_j.begin(), el_j.end(), inserter(intersect, intersect.begin()));
            int num_overlap = intersect.size();

            it = overlap.find(num_overlap);
            if (it == overlap.end()) {
                vector<A_pair> temp;
                temp.push_back(A_pair(i, j));
                overlap.insert(pair<int, vector<A_pair> >(num_overlap, temp));
            }
            else {
                // note: copies the whole vector and reassigns it (see Edit 2)
                vector<A_pair> temp = it->second;
                temp.push_back(A_pair(i, j));
                overlap[num_overlap] = temp;
            }
        }
    }
}

I have been running this program for nearly 24 hours, and i in the outer loop has only reached iteration 250 (I'm printing each i to a log file). This is, of course, a long way from 8.4K, although I know the inner loop gets shorter as iterations go on (since j starts at i + 1). Is there a more optimal approach?

Edit: To be clear, the goal here is ultimately to find the top k overlapped pairs.

Edit 2: Thanks to @Beta and others for pointing out optimizations. In particular, updating the map directly (instead of copying its contents and resetting the map value) drastically improved the performance. It now runs in a matter of seconds.
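(For concreteness, here is a minimal sketch of both points above, assuming the same types as before: the direct map update that Edit 2 describes, and, for the top-k goal from the first edit, reading the largest counts off the ordered map. print_top_k is an illustrative helper, not part of the original code.)

#include <iostream>

// The find/copy/reassign sequence in the inner loop can collapse to a single
// direct update: operator[] default-constructs an empty vector the first time
// a given count is seen, and push_back then appends in place.
//
//     overlap[num_overlap].push_back(A_pair(i, j));

// Since std::map keeps its keys in ascending order, the top-k pairs can be
// read off by walking the map in reverse (largest count first).
void print_top_k(const map<int, vector<A_pair> >& overlap, size_t k) {
    size_t printed = 0;
    map<int, vector<A_pair> >::const_reverse_iterator rit;
    for (rit = overlap.rbegin(); rit != overlap.rend() && printed < k; ++rit) {
        for (size_t n = 0; n < rit->second.size() && printed < k; ++n, ++printed) {
            cout << rit->second[n].first << ", " << rit->second[n].second
                 << " : " << rit->first << "\n";
        }
    }
}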

Solution

I think you may be able to make things faster by pre-computing a reverse (edge-to-vertex) map. This would allow you to avoid the set_intersection call, which performs a number of costly set insertions. I am missing some declarations, so this is not fully functional code, but hopefully you will get the idea. I am assuming that EdgeList is some sort of vector of ints:

void get_overlap(Graph& g, map<int, vector<A_pair> >& overlap) {

    map<int, vector<A_pair> >::iterator it;

    EdgeList el_i;

    size_t i, j;

    VertexList vl = g.vertices();

    // compute reverse map: for each edge, the set of vertices incident to it
    map<int, set<int> > reverseMap;
    for (i = 0; i < vl.size(); i++) {   // all vertices, including the last
        el_i = g.get_neighbors(i);
        for (auto e : el_i) {
            const auto findIt = reverseMap.find(e);
            if (end(reverseMap) == findIt) {
                reverseMap.emplace(e, set<int>{static_cast<int>(i)});
            } else {
                findIt->second.insert(i);
            }
        }
    }

    for (i = 0; i < vl.size()-1; i++) {
        el_i = g.get_neighbors(i);

        for (j = i+1; j < vl.size(); j++) {
            // count how many of i's edges are also incident to j,
            // using the reverse map instead of set_intersection
            int num_overlap = 0;
            for (auto e : el_i) {
                auto findIt = reverseMap.find(e);
                if (end(reverseMap) != findIt) {
                    if (findIt->second.count(j) > 0) {
                        ++num_overlap;
                    }
                }
            }

            it = overlap.find(num_overlap);
            if (it == overlap.end()) {
                overlap.emplace(num_overlap, vector<A_pair>({ A_pair(i, j) }));
            }
            else {
                it->second.push_back(A_pair(i, j));
            }
        }
    }
}

I didn't do a precise performance analysis, but inside the double loop you replace "at most 4N comparisons" plus some costly set insertions (from set_intersection) with roughly N·(log E + log M) comparisons per pair, where N is the average number of edges per vertex, M is the average number of vertices per edge, and E is the number of edges (for each of the N edges, one map lookup of cost log E and one set lookup of cost log M). So it could be beneficial depending on your data set. Also, if your edge indexes are compact, then you can use a simple vector rather than a map to represent the reverse map, which removes the log E cost, as sketched below.
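(To illustrate that last point, here is a minimal sketch assuming the edge indexes run compactly from 0 to num_edges - 1, where num_edges is a hypothetical count of distinct edges: the reverse map becomes a plain vector, so each lookup is a constant-time subscript instead of an O(log E) map find.)

// Assumes compact edge indexes 0..num_edges-1 (num_edges is hypothetical here);
// slot e holds the set of vertices incident to edge e.
vector<set<int> > reverseMap(num_edges);

for (size_t i = 0; i < vl.size(); i++) {
    EdgeList el = g.get_neighbors(i);
    for (size_t n = 0; n < el.size(); n++) {
        reverseMap[el[n]].insert(i);   // O(1) subscript replaces the O(log E) find
    }
}

// and inside the double loop:
//     if (reverseMap[e].count(j) > 0) ++num_overlap;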

One question, though: since you're talking about vertices and edges, don't you have the additional constraint that an edge always has exactly 2 vertices? That could simplify some computations.
