Question

I have a very large (10M+ edges, ~5M vertexes) bipartite undirected users-items graph in format

item1: user1, user2, user3, ... 

or

userX1: itemY1
userX2: itemY2
... 

I need to convert my graph into the items-items graph where weight of edge between i and j vertexes is equal to the number of users using both items simultaneously (i.e number of elements in the intersection of sets of vertexes adjacent to item_i and item_j). And here is the problem it seems it would require me to do $O(n^2)$ operations, where $n$ is the number of edges in graph which would be impossible on simple home pc i current own. Is there any solution to this? Some probabilistic data structure will suit my needs just fine because I'm allowed to lose some small percentage of data.

Was it helpful?

Solution

Notation: m = number of edges in the old graph

  1. Sort the edge list E (version two of your proposed formats) by users (takes O(m log(m))
  2. Go through E and determine all the runs of consecutive edges with same user (O(m))
  3. Each pair of edges within a run gives you an edge in the new graph -> Add it to the edge list F of the new graph (O(|F| = n_user x max_n_items_per_user^2))
  4. Contract all duplicates in F into single edges with the edge weights given by the number of duplicates (O(|F|))

If your graph is sparse, i.e. max_n_items_per_user is small, the above should be fairly efficient. Otherwise, this algorithm has the problem that it enumerates all edges in the new graph before contracting them, so one should think about how to get the edge weight directly.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top