Question

I'm currently doing some performance testing with the FreePastry DHT. FreePastry is an open-source DHT written in Java.

The goal is to monitor the effect on the DHT when a certain number of nodes go down. My problem is that I'm not sure of the best way to take nodes out. At the moment each node is running on a different port on my machine, and I'm destroying nodes using the destroy() method from the Pastry API: http://www.freepastry.org/FreePastry/javadoc21a3/rice/pastry/PastryNode.html#destroy()
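Roughly what that looks like in my test harness (simplified; the `nodes` list and the `NodeKiller` wrapper are my own scaffolding, only destroy() itself comes from the FreePastry API):

```java
import java.util.List;
import java.util.Random;

import rice.pastry.PastryNode;

// Simplified version of how I take nodes down in my test harness.
// 'nodes' is the list of PastryNode instances the harness created,
// one per local port; only destroy() comes from the FreePastry API.
public class NodeKiller {
    private final Random random = new Random();

    // Destroy 'count' randomly chosen live nodes.
    public void killRandomNodes(List<PastryNode> nodes, int count) {
        for (int i = 0; i < count && !nodes.isEmpty(); i++) {
            PastryNode victim = nodes.remove(random.nextInt(nodes.size()));
            victim.destroy();   // tears the node down via the Pastry API
        }
    }
}
```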

I'm worried this may be unrealistic as a simulation of node failure, and I'm wondering whether I should kill the nodes in a different manner, for example with tcpkill.

I'm running Mac OS X Snow Leopard and would be interested in hearing any suggestions.


Solution

There are different forms of node failures.

The most common one is a node simply going offline because the application running the DHT is shut down.

A change of the dynamic IP of a domestic internet connection has a subtly different effect: it invalidates all existing routing-table entries pointing at that node, but the overall node count doesn't go down. You lose one node and gain a new one.

Another common issue is reachability problems caused by NATs. Visibility of a node behind a NAT may depend on the NAT type and on whether you have had recent contact with it.

The resulting effects of churn can actually be quite complex. First of all, the uptime of individual nodes generally follows an exponential distribution: many are only available for a short time, and very few stay up for days or months.

Assume you have a stable core of moderately to long-lived nodes that makes up 90% of the network. The remaining 10% constantly popping in and out of existence will cause some overhead traffic, but they won't harm the network much. You have lots of churn but little impact.

If instead 10% of the node population goes offline every 10 minutes and is replaced by a brand-new set of nodes from the inactive pool, then you're essentially losing 10% of your redundancy every 10 minutes. If data replication between nodes doesn't keep up with that, or doesn't exist at all, your data will decay exponentially. You again have lots of churn, but a huge impact.
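To make "decay exponentially" concrete, here's a back-of-the-envelope model with made-up numbers (10% replica loss per 10-minute interval, no repair, r independent replicas):

```latex
% 10% of un-repaired replicas lost per 10-minute interval:
\[
  P_{\text{replica survives}}(t) = 0.9^{\,t/10}, \qquad
  P_{\text{item lost}}(t) = \left(1 - 0.9^{\,t/10}\right)^{r}
\]
% Example: with r = 3 replicas, after one hour (t = 60 minutes)
\[
  P_{\text{item lost}}(60) = \left(1 - 0.9^{6}\right)^{3} \approx (1 - 0.531)^{3} \approx 0.10
\]
```

So even with three replicas, roughly 10% of items would already be unrecoverable after an hour if nothing re-replicates them.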

I'm not even sure what kind of simulation would reflect reality best. I guess the most realistic constraint is simply having a fixed pool of potential nodes, i.e. the computers that have the DHT implementation installed.

Each node would then have a profile of how long it stays up on average and how long it stays down on average (those two parameters are partially correlated: long-uptime nodes generally don't have very long downtimes, as they're probably always-on machines). Each node acts on these parameters independently. In reality the time of day also plays a role, as can easily be seen here: http://dsn.tm.uni-karlsruhe.de/english/2936.php
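A sketch of what such a profile-driven churn schedule could look like (the exponential sampling, the mean up/down times, and the createNode/destroyNode placeholders are all assumptions of the example, not FreePastry API):

```java
import java.util.Random;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Profile-driven churn: each node alternates between "down" and "up" phases
// whose lengths are drawn from exponential distributions with per-node means.
// createNode()/destroyNode() are placeholders for whatever your harness uses
// (e.g. building a PastryNode and later calling destroy() on it).
public class ChurnSimulator {
    private final ScheduledExecutorService scheduler =
            Executors.newScheduledThreadPool(4);
    private final Random random = new Random();

    // Sample an exponentially distributed duration with the given mean (minutes).
    private long sampleMinutes(double meanMinutes) {
        return Math.max(1, Math.round(-meanMinutes * Math.log(1.0 - random.nextDouble())));
    }

    // One node's cycle: wait a sampled downtime, bring the node up,
    // keep it up for a sampled uptime, take it down, then repeat.
    public void runProfile(int nodeId, double meanUpMinutes, double meanDownMinutes) {
        scheduler.schedule(() -> {
            Object node = createNode(nodeId);                  // placeholder: bring the node up
            scheduler.schedule(() -> {
                destroyNode(node);                             // placeholder: e.g. PastryNode.destroy()
                runProfile(nodeId, meanUpMinutes, meanDownMinutes); // schedule the next cycle
            }, sampleMinutes(meanUpMinutes), TimeUnit.MINUTES);
        }, sampleMinutes(meanDownMinutes), TimeUnit.MINUTES);
    }

    // Placeholders -- wire these up to your actual node creation/teardown code.
    private Object createNode(int nodeId) { /* ... */ return new Object(); }
    private void destroyNode(Object node) { /* ... */ }
}
```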

So... long story short, just running and killing a few nodes randomly won't give you a realistic result about the resilience of a DHT, as the impact will vary widely.

As for the technical part: you'll probably want to run all of them in the same Java VM and use multithreading or non-blocking IO to reduce the overhead of running each instance in a separate VM. This would also allow you to schedule their uptimes and downtimes in a more realistic manner.
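For illustration, a single-JVM driver over a fixed pool could then be as simple as this (it reuses the hypothetical ChurnSimulator from the sketch above; the pool size and profile parameters are arbitrary):

```java
// Launch a fixed pool of node profiles inside a single JVM.
// Reuses the hypothetical ChurnSimulator sketched above; tune the
// profile parameters to match whatever uptime distribution you target.
public class ChurnDriver {
    public static void main(String[] args) {
        ChurnSimulator sim = new ChurnSimulator();
        int poolSize = 1000;                    // fixed pool of potential nodes
        for (int id = 0; id < poolSize; id++) {
            if (id < poolSize * 0.9) {
                sim.runProfile(id, 8 * 60, 30); // stable core: ~8h up, ~30min down
            } else {
                sim.runProfile(id, 10, 60);     // churners: ~10min up, ~1h down
            }
        }
        // the scheduler's non-daemon threads keep the JVM alive
    }
}
```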

Since you can assign multiple IPs to a single computer, you should be able to run hundreds of thousands of nodes on one machine purely in terms of available IP/port combinations. But the resource consumption of the processes will eventually bog down even the fastest system, as very few DHT implementations are actually built to scale that well.

So you'll probably need to run this on a network of machines, with a few thousand nodes per computer, to get anything close to realistic.

Either that or you resort to more mathematical simulations instead of running actual implementations.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow