So here's the original paper, a neat lil' six pages, only two of which are about the design & implementation. Here are the cliff notes:
- For a partition of a given graph, the authors define the modularity, `Q`, of the partition to be the fraction of the graph's edges that fall within communities, minus the fraction you'd expect from a completely random partition.
- So it's effectively "how much better is this partition at defining communities than a completely random one?"
- Given two communities `i` and `j` of a partition, they then define `deltaQ_ij` to be how much the modularity of the partition would change if communities `i` and `j` were merged. So if `deltaQ_ij > 0`, merging `i` and `j` will improve the modularity of the partition.
- Which leads to a simple greedy algorithm: start with every node in its own community. Calculate `deltaQ_ij` for every pair of communities. Whichever two communities `i, j` have the largest `deltaQ_ij`, merge those two. Repeat.
- You'll get maximum modularity when the `deltaQ_ij` all turn negative, but in the paper the authors let the algorithm run until there's only one community left.
That's pretty much it for understanding the algorithm. The details are in how to compute `deltaQ_ij` quickly and store the information efficiently.
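To make the greedy loop concrete, here's a deliberately naive Python sketch that recomputes `Q` from scratch for every candidate merge rather than maintaining `deltaQ` incrementally. The function names are mine, and the whole point of the paper is to avoid exactly this brute force:

```python
from itertools import combinations

def modularity(edges, comm):
    """Q = sum over communities c of (e_c/m - (d_c/2m)^2), where e_c is the
    number of edges inside c and d_c is the total degree of c's nodes."""
    m = len(edges)
    e, d = {}, {}
    for u, v in edges:
        d[comm[u]] = d.get(comm[u], 0) + 1
        d[comm[v]] = d.get(comm[v], 0) + 1
        if comm[u] == comm[v]:
            e[comm[u]] = e.get(comm[u], 0) + 1
    return sum(e.get(c, 0) / m - (dc / (2 * m)) ** 2 for c, dc in d.items())

def greedy_modularity(edges):
    comm = {u: u for edge in edges for u in edge}  # every node in its own community
    best_q, best_comm = modularity(edges, comm), dict(comm)
    while len(set(comm.values())) > 1:
        # try every pair of communities, keep the merge with the best resulting Q
        q, i, j = max(
            (modularity(edges, {u: j if c == i else c for u, c in comm.items()}), i, j)
            for i, j in combinations(sorted(set(comm.values())), 2))
        comm = {u: j if c == i else c for u, c in comm.items()}
        if q > best_q:  # remember the best partition seen on the way down
            best_q, best_comm = q, dict(comm)
    return best_q, best_comm
```

On two triangles joined by a single bridge edge, for instance, this recovers the two triangles as the best-scoring partition.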
Edit: Data structure time!
So first off, I think the implementation you're referencing does things a different way to the paper. I'm not quite sure how, because the code is impenetrable, but it seems to use union-find and hashsets in place of the authors' binary trees and multiple heaps. Not a clue why they do it a different way. You might want to email the guy who wrote it and ask.
Anyway, the algorithm in the paper needs several things from the format `deltaQ` is stored in:

- First, it needs to be able to recover the largest value in `deltaQ` quickly.
- Second, it needs to be able to remove all `deltaQ_ik` and `deltaQ_ki` for a fixed `i` quickly.
- Third, it needs to be able to update all `deltaQ_kj` and `deltaQ_jk` for a fixed `j` quickly.
The solution the authors come up with for this is as follows:
- For each community `i`, each non-zero `deltaQ_ik` is stored in a balanced binary tree, indexed by `k` (so elements can be found easily), and in a heap (so the maximum for that community can be found easily).
- The maximum `deltaQ_ik` from each community `i`'s heap is then stored in another heap, so that the overall maximum can be found easily.
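A cheap way to approximate that heap-of-heaps in Python is a single global heap of `(-deltaQ_ij, i, j)` entries with lazy invalidation: stale entries are discarded as they surface, rather than updated in place. To be clear, this is a standard workaround for `heapq` having no decrease-key, not what the paper itself does:

```python
import heapq

def best_merge(dQ, global_heap):
    """Return (deltaQ_ij, i, j) for the current best merge, where dQ[i] maps a
    neighbouring community j to deltaQ_ij, and global_heap holds entries of the
    form (-deltaQ_ij, i, j), some of which may be stale."""
    while global_heap:
        negq, i, j = global_heap[0]
        if i in dQ and dQ[i].get(j) == -negq:
            return -negq, i, j          # entry still matches dQ: it's valid
        heapq.heappop(global_heap)      # stale: community merged or value changed
    return None                         # no merges left
```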
When community `i` is merged with community `j`, several things happen to the binary trees:

- First, each element from the `i`th community's tree is added to the `j`th community's binary tree. If an element with the same index `k` already exists, you sum the old and new values.
- Second, we update all the remaining "old" values in the `j`th community's binary tree to reflect the fact that the `j`th community has just increased in size.
- Third, in each other community `k`'s binary tree, we update any `deltaQ_kj`.
- Finally, the tree for community `i` is thrown away.
And similarly, several things must happen to the heaps:

- First, the heap for community `i` is thrown away.
- Then the heap for community `j` is rebuilt from scratch using the elements from that community's balanced binary tree.
- Next, in each other community `k`'s heap, the position of the entry `deltaQ_kj` is updated.
- Finally, the entry for community `i` in the overall heap is thrown away (causing bubbling) and the entries for community `j` and each community `k` connected to `i` or `j` are updated.
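Here's a rough Python sketch of those merge steps, using a plain dict per community as a stand-in for the balanced binary tree, and lazily-cleaned `heapq` heaps instead of the paper's in-place updatable ones. The `deltaQ` update formulas in the three branches are my reading of the paper's merge rules, so treat them with suspicion:

```python
import heapq

def merge_communities(i, j, dQ, heaps, a):
    """Fold community i into community j.

    dQ[c]    : dict mapping a neighbouring community k to deltaQ_ck
               (stand-in for the paper's balanced binary tree),
    heaps[c] : heap of (-deltaQ_ck, k) pairs, cleaned up lazily,
    a[c]     : fraction of edge ends attached to community c (the paper's a_c).
    """
    for k in set(dQ[i]) | set(dQ[j]):
        if k in (i, j):
            continue
        if k in dQ[i] and k in dQ[j]:        # k was connected to both i and j
            new = dQ[i][k] + dQ[j][k]
        elif k in dQ[i]:                     # k was connected to i only
            new = dQ[i][k] - 2 * a[j] * a[k]
        else:                                # k was connected to j only
            new = dQ[j][k] - 2 * a[i] * a[k]
        dQ[j][k] = dQ[k][j] = new
        dQ[k].pop(i, None)                   # drop the stale deltaQ_ki entry
        heapq.heappush(heaps[k], (-new, j))  # old entries get skipped lazily later
    dQ[j].pop(i, None)
    del dQ[i]                                # throw community i's "tree" away
    heaps[j] = [(-v, k) for k, v in dQ[j].items()]
    heapq.heapify(heaps[j])                  # rebuild j's heap from scratch
    del heaps[i]                             # throw community i's heap away
    a[j] += a[i]                             # j absorbs i's edge ends...
    a[i] = 0                                 # ...and a_i is zeroed out
```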
Strangely, when two communities are merged, there's no mention in the paper of removing the `deltaQ_ki` values from the `k`th community's heap or tree. I think this might be dealt with by the setting of `a_i = 0`, but I don't understand the algorithm well enough to be sure.
Edit: Trying to decipher the implementation you linked. Their primary data structures are:

- `CmtyIdUF`, a union-find structure that keeps track of which nodes are in which community (something that's neglected in the paper, but seems necessary unless you want to reconstruct community membership from a trace of the merges or something).
- `MxQHeap`, a heap to keep track of which `deltaQ_ij` is largest overall. Strangely, when they update the value of a `TFltIntIntTr` in the heap, they don't ask the heap to re-heapify itself. This is worrying. Does it do it automatically or something?
- `CmtyQH`, a hashmap that maps a community ID `i` to a structure `TCmtyDat`, which holds what looks like a heap of the `deltaQ_ik` for that community. I think. Strangely though, the `UpdateMaxQ` method of the `TCmtyDat` structure takes linear time, obviating any need for a heap. What's more, the `UpdateMaxQ` method only appears to be called when an element of the heap is deleted. It should definitely also be getting called when the value of any element in the heap is updated.
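That re-heapify worry is easy to demonstrate in miniature with Python's `heapq`. This is a generic illustration of the heap invariant breaking, not a claim about what their heap class actually does:

```python
import heapq

h = [3, 5, 7]
heapq.heapify(h)    # a valid min-heap
h[2] = 1            # update an entry in place, without telling the heap
popped = heapq.heappop(h)
# popped is 3: with the invariant broken, the true minimum (1) is passed over
```

If their heap doesn't re-heapify on update, it can return a pair whose `deltaQ_ij` is no longer the maximum in exactly this way.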