Question

I'm writing a calculation-intensive program in C# using the TPL. Some preliminary benchmarking shows a good reduction in computation time when using processors with more cores/threads.

However, there is a limit to how many threads are available on a single CPU (I think even the best Xeons money can buy currently top out at around 16).

I've been reading about how render farms, with a 'grid' of multiple inexpensive CPUs each in its own machine, are a good way to increase the overall core count, but I have no idea how to go about implementing one. Is it implemented at the OS level with Microsoft server technology (and if so, how?), or do I also need to modify the C# code itself?

Any help or links to existing information would be greatly appreciated.


Solution

If you want to do this at scale (hundreds of nodes) then developing your own system is hard. You have to handle nodes becoming unavailable, data replication to each node, tracking job progress... it's a long list. You also need to consider the sort of communication you're going to require between your nodes. Remember that the cost of sending a message (data) from one thread to another is tiny compared to the cost of sending it to another machine across a network (even a fast one). You may have to completely rewrite your multithreaded application to run well on a distributed system, even to the point of using a completely different algorithm.

Hadoop

Microsoft had plans to commercialize Dryad as LINQ to HPC, but this project was sidelined a while back (I worked on it before I left Microsoft). I believe you can still get the final "public preview", but it's unsupported. The SQL team opted to work with the Hadoop/Hortonworks people on getting a Windows/Azure/.NET-friendly Hadoop distribution off the ground. As far as I know, the only thing they have delivered so far is HDInsight, a Hadoop service running in Azure.

There is now a Microsoft .NET SDK for Hadoop which will allow you to manage a cluster, submit jobs, etc. It does not seem to allow you to write code that executes on the Hadoop nodes. You can, however, use the Hadoop streaming API. This is fairly low level, but it is language agnostic, so you can use it to integrate map/reduce code written in pretty much any language with Hadoop. More details on this can be found in this blog post:

Hadoop for .NET Developers
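To give a feel for how low level the streaming API is: Hadoop pipes each input line to your mapper's stdin and reads tab-separated key/value pairs back from stdout, then sorts by key before the reduce phase. A minimal word-count mapper sketch in C# (the class name and word-count logic are illustrative, not anything specific the SDK requires):

```csharp
using System;
using System.IO;

// Sketch of a Hadoop streaming mapper: reads raw text lines from stdin,
// writes "key<TAB>value" lines to stdout. Hadoop handles everything else.
public static class StreamingMapper
{
    public static void Map(TextReader input, TextWriter output)
    {
        string line;
        while ((line = input.ReadLine()) != null)
        {
            foreach (var word in line.Split(new[] { ' ', '\t' },
                                            StringSplitOptions.RemoveEmptyEntries))
            {
                // Emit each word with a count of 1; the reducer sums them.
                output.WriteLine($"{word}\t1");
            }
        }
    }

    public static void Main() => Map(Console.In, Console.Out);
}
```

The corresponding reducer is the same shape in reverse: it reads the sorted `key\tvalue` lines from stdin and aggregates per key.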

If you want to do this at a smaller scale (tens of nodes) then I would look for something like MPI .NET. It looks like this project has been abandoned, but something similar is probably what you want.

OTHER TIPS

You might look into something like Dryad - http://research.microsoft.com/en-us/projects/dryadlinq/default.aspx

It might, on the other hand, be a bit too much for your situation, but the ideas in Dryad could be simplified to your needs.

You might also look into making your own TaskScheduler, which could handle the distribution of work to agents running on other boxes, but you would have to implement simple socket-based client/server communication to push and pull the data.
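A custom TaskScheduler only has to override three members. A minimal local sketch of that surface (the class name and single worker thread are illustrative; a distributed version would serialize each queued work item and send it over a socket to a remote agent instead of executing it on a local thread):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Minimal custom TaskScheduler: all queued tasks run on one dedicated
// worker thread. Swapping the worker loop for "serialize and send to a
// remote agent" is where the distribution would happen.
public sealed class SingleThreadTaskScheduler : TaskScheduler, IDisposable
{
    private readonly BlockingCollection<Task> _queue = new BlockingCollection<Task>();
    private readonly Thread _worker;

    public SingleThreadTaskScheduler()
    {
        _worker = new Thread(() =>
        {
            // Blocks until a task is available; exits when CompleteAdding is called.
            foreach (var task in _queue.GetConsumingEnumerable())
                TryExecuteTask(task); // base-class helper that actually runs the task
        }) { IsBackground = true };
        _worker.Start();
    }

    protected override void QueueTask(Task task) => _queue.Add(task);

    // Disallow inlining so every task flows through the queue.
    protected override bool TryExecuteTaskInline(Task task, bool taskWasPreviouslyQueued)
        => false;

    // Only used by debuggers; a snapshot of pending tasks is fine.
    protected override IEnumerable<Task> GetScheduledTasks() => _queue.ToArray();

    public void Dispose() => _queue.CompleteAdding();
}
```

You pass an instance as the scheduler argument to `Task.Factory.StartNew`, so the rest of your TPL code stays unchanged while the scheduler decides where work runs.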

Another, slightly odd, suggestion, which might be fine for investigating things, is to do the following.

  1. Let the master of the calculation split the problem into as many pieces as there are available client computers.
  2. Write the parameters to kick off each client's calculation to a file on a share visible to all machines on the network.
  3. Let the clients watch for files dedicated to them and kick off the calculation for their piece when a file appears. The output is written back to a result file.
  4. The server will sit and listen for all clients completing their jobs.

The files could be replaced with a database, low-level sockets, REST services, Web Services, etc., depending on your needs.
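The four steps above can be sketched in a few dozen lines. Here the "calculation" is just summing a range of integers, and the share path, file names, and chunking scheme are all assumptions for illustration:

```csharp
using System;
using System.IO;
using System.Threading;

// File-based master/worker sketch: the master drops one parameter file
// per client on a shared folder, each client computes its piece and
// writes a result file, and the master polls until all results exist.
public static class FileBasedGrid
{
    // Master (steps 1, 2 and 4): split [0, n) into per-client ranges,
    // write job files, then wait for and sum the result files.
    public static double RunMaster(string share, int clients, int n)
    {
        int chunk = n / clients;
        for (int i = 0; i < clients; i++)
        {
            int to = (i == clients - 1) ? n : (i + 1) * chunk;
            File.WriteAllText(Path.Combine(share, $"job-{i}.txt"), $"{i * chunk} {to}");
        }

        double total = 0;
        for (int i = 0; i < clients; i++)
        {
            string result = Path.Combine(share, $"result-{i}.txt");
            while (!File.Exists(result)) Thread.Sleep(100); // poll for completion
            total += double.Parse(File.ReadAllText(result));
        }
        return total;
    }

    // Client i (step 3): wait for its job file, compute, write the result back.
    public static void RunClient(string share, int i)
    {
        string job = Path.Combine(share, $"job-{i}.txt");
        while (!File.Exists(job)) Thread.Sleep(100);
        var parts = File.ReadAllText(job).Split(' ');
        long from = long.Parse(parts[0]), to = long.Parse(parts[1]);

        double sum = 0;
        for (long k = from; k < to; k++) sum += k; // stand-in for the real calculation

        File.WriteAllText(Path.Combine(share, $"result-{i}.txt"), sum.ToString());
    }
}
```

Crude as it is, this already surfaces the real design questions: what happens if a client dies mid-job, and how the master detects it (here it would poll forever).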

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow