Scaling out Windows Services

Question 1

I solved this by coding all this scalability and redundancy stuff on my own. I will explain what I did and how I did it, should anyone ever need this.

I created a few processes in each instance to keep track of the others and know which records the particular instance can process. On start up, the instance would register in the database (if it's not already) in a table called Instances. This table has the following columns:

Id                 Number
MachineName        Varchar2
LastActive         Timestamp
IsMaster           Number(1)

After registering and creating a row in this table if the instance's MachineName wasn't found, the instance starts pinging this table every second in a separate thread, updating its LastActive column. Then it selects all the rows from this table and makes sure that the Master Instance (more on that later) is still alive - meaning that it's LastActive time is in the last 10 seconds. If the master instance stopped responding, it will assume control and set itself as master. In the next iteration it will make sure that there is only one master (in case another instance decided to assume control as well simultaneously), and if not it will yield to the instance with the lowest Id.

What is the master instance?
The service's job is to scan a logging table and process that data so people can filter and read through it easily. I didn't state this in my question, but it might be relevant here. We have a bunch of ESB servers writing multiple records to the logging table per request, and my service's job is to keep track of them in near real-time. Since they're writing their logs asynchronously, I could potentially get a finished processing request A before started processing request A entry in the log. So, I have some code that sorts those records and makes sure my service processes the data in the correct order. Because I needed to scale out this service, only one instance can do this logic to avoid lots of unnecessary DB queries and possibly insane bugs.
This is where the Master Instance comes in. Only it executes this sorting logic and temporarily saves the log record Id's in another table called ReportAssignment. This table's job is to keep track of which records were processed and by whom. Once processing is complete, the record is deleted. The table looks like this:

RecordId        Number
InstanceId      Number    Nullable

The master instance sorts the log entries and inserts their Id's here. All my service instances check this table in 1 second intervals for new records that aren't being processed by anyone or that are being processed by an inactive instance, and that the [record's Id] % [number of isnstances] == [index of current instance in a sorted array of all the active instances] (that were acquired during the Pinging process). The query looks somewhat like this:

SELECT * FROM ReportAssignment 
WHERE (InstanceId IS NULL OR InstanceId NOT IN (1, 2, 3))   // 1,2,3 are the active instances
AND RecordId % 3 == 0    // 0 is the index of the current instance in the list of active instances

Why do I need to do this?

The other two instances would query for RecordId % 3 == 1 and RecordId % 3 == 2.
RecordId % [instanceCount] == [indexOfCurrentInstance] ensures that the records are distributed evenly between all instances.
InstanceId NOT IN (1,2,3) allows the instances to take over records that were being processed by an instance that crashed, and not process the records of already active instances when a new instance is added.

Once an instance queries for these records, it will execute an update command, setting the InstanceId to its own and query the logging table for records with those Id's. When processing is complete, it deletes the records from ReportAssignment.

Overall I am very pleased with this. It scales nicely, ensures that no data is lost should the instance go down, and there were nearly no alterations to the existing code we have.

Question 2

For your work items, Windows Workflow is probably your quickest means to refactor your service.

Windows Workflow Foundation @ MSDN

The most useful thing you'll get out of WF is workflow persistence, where a properly designed workflow may resume from a Persist point, should anything happen to the workflow from the last point at which it was saved.

Workflow Persistence @ MSDN

This includes the ability for a workflow to be recovered from another process should any other process crash while processing the workflow. The resuming process doesn't need to be on the same machine if you use the shared workflow store. Note that all recoverable workflows require the use of the workflow store.

For work distribution, you have a couple options.

A service to produce messages combined with host-based load balancing via workflow invocation using WCF endpoints via the WorkflowService class. Note that you'll probably want to use the design-mode editor here to construct entry methods rather than manually setup Receive and corresponding SendReply handlers (these map to WCF methods). You would likely call the service for every Message, and perhaps also call the service for every Report. Note that the CanCreateInstance property is important here. Every invocation tied to it will create a running instance that runs independently.
~
WorkflowService Class (System.ServiceModel.Activities) @ MSDN
Receive Class (System.ServiceModel.Activities) @ MSDN
Receive.CanCreateInstance Property (System.ServiceModel.Activities) @ MSDN
SendReply Class (System.ServiceModel.Activities) @ MSDN
Use a service bus that has Queue support. At the minimum, you want something that potentially accepts input from any number of clients, and whose outputs may be uniquely identified and handled exactly once. A few that come to mind are NServiceBus, MSMQ, RabbitMQ, and ZeroMQ. Out of the items mentioned here, NServiceBus is exclusively .NET ready out-of-the-box. In a cloud context, your options also include platform-specific offerings such as Azure Service Bus and Amazon SQS.
~
NServiceBus
MSMQ @ MSDN
RabbitMQ
ZeroMQ
Azure Service Bus @ MSDN
Amazon SQS @ Amazon AWS
~
Note that the service bus is just the glue between a producer that will initiate Messages and a consumer that can exist on any number of machines to read from the queue. Similarly, you can use this indirection for Report generation. Your consumer will create workflow instances that may then use workflow persistence.
Windows AppFabric may be used to host workflows, allowing you to use many techniques that apply to IIS load balancing to distribute your work. I don't personally have any experience with it, so there's not much I can say for it other than it has good monitoring support out-of-the-box.
~
How to: Host a Workflow Service with Windows App Fabric @ MSDN