Question

I am building a bipartite network generator, using the code from "How to filter the result of KNeighborhoodFilter?", and it works perfectly when my network is small (5,000 nodes).

Now I am working with a network of 60,000 nodes and 250,000 links. To speed things up, I am wondering whether it is possible to take a random sample of nodes while extracting the 2-distance neighbors of a node, say just 50% of them...

I really have no clue how to achieve this, nor whether it is possible without hacking the KNeighborhoodFilter class itself (which I know I won't be able to do...).

Right now I take the full result and just pick a random sample from it, but I don't know if I am on the right path:

Predicate<Node> onlyUsers = new Predicate<Node>() {
    @Override
    public boolean apply(Node node) {
        return node.getName().startsWith("u");
    }
};
// retrieve everything at distance 2 from node u
Filter<Node, Edge> filter = new KNeighborhoodFilter<Node, Edge>(u, 2, KNeighborhoodFilter.EdgeType.IN_OUT);
List<Node> twoDistNei = Lists.newArrayList(filter.transform(zpa).getVertices());
// sample the collection (with replacement; duplicates collapse in the set below)
List<Node> sampledUsers = Lists.newArrayList();
for (int i = 0; i < 2000; i++) {
    sampledUsers.add(twoDistNei.get(context.getRNG().nextInt(twoDistNei.size())));
}
Set<Node> sampledNodesHashed = Sets.newHashSet(sampledUsers);
Set<Node> twoDistUsers = Sets.newHashSet(Collections2.filter(sampledNodesHashed, onlyUsers));

My goal is to make this code run faster. Thank you very much for your time.
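As an aside on the sampling step: drawing random indices in a loop samples with replacement, so the final set may contain fewer than 2000 distinct nodes. A shuffle-based sketch (plain Java collections, helper name hypothetical) draws without replacement instead:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class SampleWithoutReplacement {
    // Hypothetical helper: draw up to k distinct elements uniformly at random
    // by shuffling a copy of the list and taking a prefix.
    public static <T> List<T> sample(List<T> items, int k, Random rng) {
        List<T> copy = new ArrayList<>(items);
        Collections.shuffle(copy, rng);
        return copy.subList(0, Math.min(k, copy.size()));
    }

    public static void main(String[] args) {
        List<Integer> nodes = new ArrayList<>();
        for (int i = 0; i < 10; i++) nodes.add(i);
        // 4 distinct nodes, no duplicates possible
        System.out.println(sample(nodes, 4, new Random()));
    }
}
```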

Best regards, Simone


Solution

If you want to go faster, here are a couple of possible approaches. They both involve some hacking. :)

(1) Modify KNeighborhoodFilter to give you a version in which you traverse each edge with some probability p.
A generalization is to specify a probability for each hop, i.e., one probability for the first hop, another for the second hop, and so on.
A further generalization would use a separate probability for each edge, possibly based on some property of the nodes/edges involved (e.g., you might want to follow edges more or less often when the target has high degree).

Note that this solution requires giving some serious thought to what kind of random sampling is appropriate in your case. If what you really want is a uniform random sample of the nodes that would have been in the full result, then you essentially need to generate the full result and filter it afterward.

This is probably the easiest solution; it wouldn't involve much work at all, really.
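A minimal sketch of approach (1), assuming a plain adjacency-map graph rather than JUNG's Graph (the class and method names here are illustrative, not KNeighborhoodFilter's actual API): a hop-by-hop traversal that flips a coin before following each edge, with one probability per hop:

```java
import java.util.*;

public class ProbabilisticNeighborhood {
    // Sketch: BFS out to hopProb.length hops, following each edge with the
    // probability assigned to its hop. Plain adjacency map, hypothetical names.
    public static Set<Integer> neighborhood(Map<Integer, List<Integer>> adj,
                                            int root, double[] hopProb, Random rng) {
        Set<Integer> visited = new HashSet<>();
        visited.add(root);
        Set<Integer> frontier = new HashSet<>(Collections.singleton(root));
        for (double p : hopProb) {
            Set<Integer> next = new HashSet<>();
            for (int u : frontier) {
                for (int v : adj.getOrDefault(u, Collections.<Integer>emptyList())) {
                    // the coin flip: skip this edge with probability 1 - p
                    if (!visited.contains(v) && rng.nextDouble() < p) {
                        visited.add(v);
                        next.add(v);
                    }
                }
            }
            frontier = next;
        }
        visited.remove(root);   // report the neighbors only, not the root itself
        return visited;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> adj = new HashMap<>();
        adj.put(0, Arrays.asList(1, 2));
        adj.put(1, Arrays.asList(3));
        adj.put(2, Arrays.asList(3, 4));
        // probabilities of 1.0 recover the full 2-neighborhood {1, 2, 3, 4}
        System.out.println(neighborhood(adj, 0, new double[]{1.0, 1.0}, new Random()));
    }
}
```

Setting every probability to 1.0 degenerates to the ordinary k-neighborhood, which makes a handy sanity check. Note the caveat above: with p below 1.0 this is not a uniform sample of the full 2-neighborhood, since nodes reachable through many paths survive more often.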

(2) Create a modified version of KNeighborhoodFilter that divides the work up among different processors. You're probably running this on a machine with > 1 core, so you could create multiple threads and let Java schedule them across the available cores; if you really want to speed this up (and you have an appropriate setup), you could farm the work out to different machines, MapReduce-style.
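In the same spirit, a sketch of the threaded variant, again over a plain adjacency map (a JUNG version would query the graph's neighbors inside each task instead): each hop's frontier is split across a fixed thread pool, with a concurrent set deduplicating visits:

```java
import java.util.*;
import java.util.concurrent.*;

public class ParallelNeighborhood {
    // Sketch of approach (2): expand each BFS hop in parallel, one task per
    // frontier node. All names here are illustrative, not JUNG's API.
    public static Set<Integer> twoHop(Map<Integer, List<Integer>> adj, int root, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Set<Integer> visited = ConcurrentHashMap.newKeySet();
            visited.add(root);
            List<Integer> frontier = new ArrayList<>(Collections.singletonList(root));
            for (int hop = 0; hop < 2; hop++) {
                List<Future<List<Integer>>> futures = new ArrayList<>();
                for (int u : frontier) {
                    final int node = u;
                    futures.add(pool.submit(() -> {
                        List<Integer> found = new ArrayList<>();
                        for (int v : adj.getOrDefault(node, Collections.<Integer>emptyList())) {
                            if (visited.add(v)) {   // add() is atomic, so no double visits
                                found.add(v);
                            }
                        }
                        return found;
                    }));
                }
                List<Integer> next = new ArrayList<>();
                for (Future<List<Integer>> f : futures) {
                    next.addAll(f.get());
                }
                frontier = next;
            }
            visited.remove(root);                   // report the neighbors only
            return visited;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> adj = new HashMap<>();
        adj.put(0, Arrays.asList(1, 2));
        adj.put(1, Arrays.asList(3));
        adj.put(2, Arrays.asList(3, 4));
        System.out.println(twoHop(adj, 0, 2));      // nodes within 2 hops of node 0
    }
}
```

One task per frontier node is a very coarse split; with 250,000 edges you would likely batch the frontier into a handful of chunks instead, so the task-submission overhead does not eat the speedup.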

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow