Question

I have implemented an unweighted random walk function for a graph that I built in Python using NetworkX. Below is the snippet of my program that deals with the random walk. Elsewhere in my program, I have a method that creates the graph and a method that runs the various custom graph tests I've written. One of these tests picks two nodes at random from the graph and runs a random walk between them. Two quantities are calculated from this random walk: the hitting time (the number of links traversed from the starting point to the ending point) and the commute time (the number of links traversed from start to end and back to the start).

import random

def unweighted_random_walk(starting_point, ending_point, graph):
    '''
    starting_point: string that represents the starting point in the graph
    ending_point: string that represents the ending point in the graph
    graph: a NetworkX Graph object
    '''
    # Begin the random walk
    current_point = starting_point
    #current_node=graph[current_point]
    current_point_neighbors = list(graph.neighbors(current_point))
    hitting_time = 0

    # Walk until the ending point is reached; each traversed link
    # adds one to the hitting time
    while current_point != ending_point:
        # pick one of the edges out of the current node with equal probability
        possible_destination = random.choice(current_point_neighbors)
        current_point = possible_destination
        current_point_neighbors = list(graph.neighbors(current_point))
        hitting_time += 1
    return hitting_time

My code for the random walk is straightforward: I'm just picking random neighbors until the ending point is reached. However, this implementation is very slow when I run many random walks (I think I'll need to run a million at some point).

My question is: Is there any way I can use Hadoop MapReduce to parallelize some of the operations going on in this random walk? Is there a better way to do my random walk?

Solution

To address your question:

  1. You need to address Ned's comment. He beat me to saying it. Explain your code; more on that later.

  2. I cannot fathom a walking algorithm that could be run in parallel. By its very nature, a walk is a sequential process: each step depends on the previous one. You cannot know which node to jump to next without knowing the previous node (with the exception of the starting node). If your code actually represents a random walk where the choices are all independent of the previous ones, you need to explain that in your question.

  3. Assuming each random walk is independent, however, you can run many random walks simultaneously. We call this scenario embarrassingly parallel, and that's a very lucky thing.

  4. I have no idea why you specifically want to use Hadoop here. The first step should be, "Can I just write this as a basic program and use a qsub (or equivalent) script to farm out a bunch of runs of this program to the server?" If the answer is no, the next step is, "Can I use the multiprocessing module?" If you go with multiprocessing, you might want to take a look at Jesse Noller's multiprocessing presentation from PyCon 2009. A minimal sketch of that route follows this list.
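
For example, a minimal sketch of the multiprocessing route might look like the following. Nothing in it is taken from the original program: the stand-in graph, the run_one_walk helper, and the number of walks are all illustrative placeholders.

    import random
    from multiprocessing import Pool

    import networkx as nx

    # A small stand-in graph; in practice this would be the real graph.
    GRAPH = nx.connected_watts_strogatz_graph(200, 4, 0.3, seed=1)

    def run_one_walk(seed):
        """Run one independent random walk and return its hitting time."""
        rng = random.Random(seed)  # give each task its own RNG stream
        start, end = rng.sample(list(GRAPH.nodes()), 2)
        current, hitting_time = start, 0
        while current != end:
            current = rng.choice(list(GRAPH.neighbors(current)))
            hitting_time += 1
        return hitting_time

    if __name__ == "__main__":
        with Pool() as pool:  # one worker process per CPU core by default
            times = pool.map(run_one_walk, range(10000))
        print("mean hitting time:", sum(times) / len(times))

Because each walk only reads the graph and touches no shared state, the worker processes never need to communicate, which is exactly what makes the problem embarrassingly parallel.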

Now, with regards to your particular code...

  1. You need to explain what the nodes in your graph are. I'm confused about why you're treating them like dictionaries (calling .keys() on them). If they are dictionaries, tell us what the keys and values are. I hope you're not storing the neighbors as keys there, because NetworkX already gives you that via the Graph.neighbors() method. If you're storing the neighbors of the nodes in the nodes themselves, you have a misunderstanding of the NetworkX library. Let the graph do the work for you.

  2. You have the same logic twice in unweighted_random_walk(): once for the trip from the start node to the destination node, and again for the trip from the destination node back to the start node. Why? All you need is the logic for one direction. Call this function twice: once with the start and destination nodes as arguments to get the walk one way, then with the arguments swapped to get the walk the other way. You then have two independent calls and can run them in parallel.

  3. Don't use while True: (not just here, but in general). You should always state the actual condition under which the loop should continue, e.g.,

    while current_point != ending_point:
        ...
    
  4. Don't return a string of the information; return the information directly, e.g.,

    return hitting_time
    

    Note that by following my advice in point 2 above, you only have to return the hitting time, and you can sum the hitting times of the there-call and the back-call to get the total commute time. Convenient, right? A short sketch of this follows.
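
Putting those four points together, here is a rough sketch of what the single-direction function and the commute-time calculation could look like; the toy cycle graph and node choices are only for illustration:

    import random

    import networkx as nx

    def unweighted_random_walk(starting_point, ending_point, graph):
        """Walk from starting_point to ending_point; return the hitting time."""
        current_point = starting_point
        hitting_time = 0
        while current_point != ending_point:
            # Let NetworkX supply the neighbors and pick one uniformly at random.
            current_point = random.choice(list(graph.neighbors(current_point)))
            hitting_time += 1
        return hitting_time

    # Commute time is just the two independent one-way walks added together.
    graph = nx.cycle_graph(10)
    there = unweighted_random_walk(0, 5, graph)
    back = unweighted_random_walk(5, 0, graph)
    commute_time = there + back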

EDIT: Included links to Jesse Noller's presentations and to Disco.

OTHER TIPS

I don't see how map-reduce can help you. It's used where you have a two-part operation: the first part is a calculation that can be performed independently on many different data elements, and the second part is somehow combining all those results. Perhaps there's a clever way to use map-reduce to help with this random walk, but I don't see it.
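
For what it's worth, the two-part shape described above looks like this in miniature; the toy data here has nothing to do with the random-walk problem and is only meant to illustrate the map-then-combine structure:

    from functools import reduce

    data = [1, 2, 3, 4, 5]

    # First part: an independent calculation on each element.
    squared = map(lambda x: x * x, data)

    # Second part: combine all of the per-element results into one value.
    total = reduce(lambda a, b: a + b, squared)
    print(total)  # 55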

Your random walk is completely random: it could end up with many loops, even hopping back and forth between the same two nodes before continuing on. Perhaps you want to somehow constrain it so you don't have so large a space to search?
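
If a constraint like that makes sense for the problem, one simple option is to cap the number of steps and abandon walks that run too long. The max_steps parameter below is hypothetical, and discarding long walks will bias whatever statistic is being estimated, so treat this as a sketch only:

    import random

    def bounded_random_walk(start, end, graph, max_steps=10000):
        """Random walk that gives up after max_steps; returns None on failure."""
        current, steps = start, 0
        while current != end:
            if steps >= max_steps:
                return None  # abandon walks that wander for too long
            current = random.choice(list(graph.neighbors(current)))
            steps += 1
        return steps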

You don't have to actually perform the random walk if you use the formula detailed in this paper.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow