Question

I know that with Network Load Balancing and Failover Clustering we can make passive services highly available. But what about active apps?

Example: One of my apps retrieves some content from an external resource at a fixed interval. I have imagined the following scenarios:

  1. Run it on a single machine. Problem: if this instance goes down, the content won't be retrieved.
  2. Run it on each machine of the cluster. Problem: the content will be retrieved multiple times.
  3. Have it on each machine of the cluster, but run it on only one of them. Each instance will have to check some sort of common resource to decide whether it is its turn to do the task or not.

While thinking about solution #3, I wondered what the common resource should be. I thought of creating a table in the database that we could use to acquire a global lock.

Is this the best solution? How do people usually do this?

By the way, it's a C# .NET WCF app running on Windows Server 2008.


Solution

Message queues were invented for exactly this kind of problem. Imagine that your clustered application instances all listen to a message queue (itself clustered :-)). At some point one instance receives the initial command to download your external resource. If it succeeds, that instance consumes the message and posts a new one with a later execution time equal to 'the run time' + 'interval'. If the instance dies during processing, that's not a problem: the message is rolled back onto the queue (after a timeout) and another instance can pick it up. A bit of transactions, a bit of message queues.

I am on the Java EE side of the world, so I can't help you with the C# coding details.
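
As a minimal sketch of that pattern in C# (since the app is .NET), here is what it could look like with a transactional MSMQ queue via System.Messaging. The queue path, the 5-minute interval, and the DateTime message body are assumptions for illustration, not part of the answer above.

```csharp
using System;
using System.Messaging; // reference System.Messaging.dll; requires the MSMQ feature
using System.Threading;

class FetchWorker
{
    // Hypothetical queue path; the queue must be created as transactional.
    const string QueuePath = @".\private$\fetchJobs";

    static void Main()
    {
        using (var queue = new MessageQueue(QueuePath))
        {
            queue.Formatter = new XmlMessageFormatter(new[] { typeof(DateTime) });

            while (true)
            {
                using (var tx = new MessageQueueTransaction())
                {
                    tx.Begin();
                    try
                    {
                        // The receive is part of the transaction: if this process
                        // dies before committing, the message reappears on the
                        // queue and another cluster instance can pick it up.
                        Message msg = queue.Receive(tx);
                        var scheduledFor = (DateTime)msg.Body;

                        if (scheduledFor > DateTime.UtcNow)
                        {
                            // Not due yet: put the job back and wait a little.
                            queue.Send(new Message(scheduledFor), tx);
                            tx.Commit();
                            Thread.Sleep(TimeSpan.FromSeconds(10));
                            continue;
                        }

                        FetchExternalResource(); // your actual download logic

                        // Schedule the next run: 'the run time' + 'interval'.
                        queue.Send(new Message(DateTime.UtcNow.AddMinutes(5)), tx);
                        tx.Commit();
                    }
                    catch
                    {
                        tx.Abort(); // roll the message back for another instance
                    }
                }
            }
        }
    }

    static void FetchExternalResource() { /* download the external content here */ }
}
```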

OTHER TIPS

I once implemented something similar using your solution #3.

Create a table called something like resource_lock, with a column (e.g. locking_key) that will contain a locking key.

Then at each interval, all instances of your app will (see the C# sketch after this list):

  1. Run a query like 'update resource_lock set locking_key = 1 where locking_key is null'. (You can of course also insert a server-specific id, a timestamp, etc.)
  2. If 0 rows updated: do nothing - another app instance is already fetching the resource.
  3. If 1 row updated: fetch the resource and set locking_key back to null.
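
A minimal C# sketch of these steps with ADO.NET and SQL Server follows; the connection string is an assumption, and the locking key is a server-specific id as suggested in step 1 (so locking_key is assumed to be a string column here).

```csharp
using System;
using System.Data.SqlClient;

class ResourceLockFetcher
{
    const string ConnectionString = "Server=.;Database=AppDb;Integrated Security=true";

    // Called at each interval on every instance; only one instance
    // will win the UPDATE and actually fetch the resource.
    public static void TryFetchOnce()
    {
        using (var conn = new SqlConnection(ConnectionString))
        {
            conn.Open();

            // Step 1: atomically claim the lock.
            var claim = new SqlCommand(
                "update resource_lock set locking_key = @key where locking_key is null", conn);
            claim.Parameters.AddWithValue("@key", Environment.MachineName);

            // Step 2: 0 rows updated means another instance holds the lock.
            if (claim.ExecuteNonQuery() == 0)
                return;

            try
            {
                // Step 3: we won the lock, so fetch the resource...
                FetchExternalResource();
            }
            finally
            {
                // ...and set locking_key back to null.
                var release = new SqlCommand(
                    "update resource_lock set locking_key = null where locking_key = @key", conn);
                release.Parameters.AddWithValue("@key", Environment.MachineName);
                release.ExecuteNonQuery();
            }
        }
    }

    static void FetchExternalResource() { /* download the external content here */ }
}
```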

This approach has two advantages:

  • If one of your servers fails, the resource will still be fetched by the servers that are still running.
  • You leave the locking to the database, which saves you from implementing it yourself.

There are some requirements that you probably know but that are not described in the question, which makes giving an informed answer challenging. Some of those questions are:

  • Does the task have to complete successfully?
  • If the task does/does not complete successfully, "who" needs to know and what type of actions need to be performed?
  • What is the behavior if the task has not completed when the time comes to run the task again? Should it run or not?
  • How important is it that jobs run at the specified interval? If the interval is every 5 minutes does it have to be every 5 minutes or could the task run after 5 minutes and 10 seconds?

The first step is to decide how the periodic task will be scheduled to run. One option is a Windows Scheduled Task, but that is not inherently highly available, though it may be possible to work around that. If you are using SQL Server, another alternative would be to use SQL Server Agent as the scheduler, since it fails over as part of SQL Server.

The next step is to determine how to invoke the WCF application. The easiest option would be to trigger a job that invokes the WCF service through an NLB IP address. This could be considered a no-no if the database server (or another server in that zone) is calling into the application zone (of course there are always exceptions, such as MSDTC).

Another option would be to use the queue model. This would be the most reliable in most situations. For example, SQL Server Agent could execute a stored procedure to enter a record in a queue table. Then, on each application server, a service could poll for a queued record to process. Access to the record in the queue would be serialized by the database, so the first server in would run the job (and that job would run only once).

Depending on your answers to the opening questions in this answer, you may have to add some more error handling. If retrieval of the external resource is usually fairly short, you may want to simply keep the queue record locked with a select for update and, when the task is completed, update the status (or delete the record if you wish). This blocks other service instances from processing the record while it is being processed on another server, and if a crash occurs during processing, the transaction will be rolled back so another service in the cluster can pick up the record. (You could also increase the transaction timeout to as long as you think you need.)
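
Here is a rough C# sketch of that queue-table pattern for SQL Server. The job_queue table, its columns, and the connection string are assumptions; the UPDLOCK/READPAST hints are one way to get the serialized access described above.

```csharp
using System;
using System.Data.SqlClient;

class QueuePoller
{
    const string ConnectionString = "Server=.;Database=AppDb;Integrated Security=true";

    public static void PollOnce()
    {
        using (var conn = new SqlConnection(ConnectionString))
        {
            conn.Open();
            using (var tx = conn.BeginTransaction())
            {
                // UPDLOCK keeps the row locked until commit/rollback;
                // READPAST lets other instances skip rows already locked,
                // so only the first server in runs the job.
                var pick = new SqlCommand(
                    @"select top 1 id from job_queue with (updlock, readpast)
                      where status = 'queued' order by id", conn, tx);

                object id = pick.ExecuteScalar();
                if (id == null)
                {
                    tx.Rollback(); // nothing queued
                    return;
                }

                // If this throws or the process crashes here, the transaction
                // rolls back and another server can pick up the record.
                FetchExternalResource();

                var done = new SqlCommand(
                    "update job_queue set status = 'done' where id = @id", conn, tx);
                done.Parameters.AddWithValue("@id", id);
                done.ExecuteNonQuery();

                tx.Commit();
            }
        }
    }

    static void FetchExternalResource() { /* download the external content here */ }
}
```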

If keeping a database lock for a long time is not viable, you could change the logic and add some monitoring to the services. Now, when a job starts processing, its status would change from queued to running, and the record would be updated with the server that is processing it. Some sort of service status table could be created, and each service instance would update the current time every time it polls. This would allow other services in the cluster to reprocess jobs that show as running but whose assigned service hasn't "checked in" within a certain period.
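
As a sketch, the "reclaim" step could be a query like the following; the service_status and job_queue tables, their columns, and the 5-minute staleness threshold are all assumptions.

```csharp
using System.Data.SqlClient;

static class JobReclaimer
{
    // Re-queue jobs whose owning server has not checked in recently,
    // so another instance in the cluster can reprocess them.
    public static void ReclaimStaleJobs(SqlConnection conn)
    {
        var reclaim = new SqlCommand(
            @"update job_queue
                 set status = 'queued', server = null
               where status = 'running'
                 and server in (select server
                                  from service_status
                                 where last_seen < dateadd(minute, -5, getutcdate()))",
            conn);
        reclaim.ExecuteNonQuery();
    }
}
```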

This approach also has limitations: what if the task actually completed but database connectivity was somehow lost -- the job could potentially run again. Of course, the problem of combining atomic database actions with non-transactional resources (e.g. a web request, the file system) is not going to be solved easily. I'm assuming you are writing a file or something similar -- if the external content is also placed into the database, then a single transaction will guarantee that everything is consistent.

From the simplicity point of view, the quickest/easiest way to accomplish what you're looking for would be to 'round-robin' your cluster, so that for every request a machine is selected (by a cluster management service or some such) to process it. Actual client requests don't go directly to the machine that handles them; they instead point to a single endpoint, which acts as a proxy that distributes incoming requests to machines based on availability and load. To quote the link referenced below:

Network Load Balancing is a way to configure a pool of machines so they take turns responding to requests. It's most commonly seen implemented in server farms: identically configured machines that spread out the load for a web site, or maybe a Terminal Server farm. You could also use it for a firewall (ISA) farm, VPN access points, really, any time you have TCP/IP traffic that has become too much load for a single machine, but you still want it to appear as a single machine for access purposes.

As for your application being "active", that requirement does not factor into this equation since whether 'active' or 'passive', the application still makes a request to your servers.

Commercial load balancers exist for serving HTTP-style requests, so that may be worth looking into, but with the load balancing features of W2k8, you may be best served tapping into those.

For more info on how to configure that in Win2k8, see this article.

This article is much more technical and focuses on using NLB with Exchange, but the principles should still apply to your situation.

See here for another detailed walk-through of NLB setup and configuration.

Failing that, you may be well served by searching / posting on ServerFault, since your application code is not (and should not be) strictly aware that the NLB even exists.

EDIT: added another link.

EDIT (the 2nd): The OP has corrected my erroneous conclusion about the 'active' vs. 'passive' concept. My answer to that is very similar to my original answer, except that the 'active' service (which, since you're using WCF, could easily be a Windows service) could be split into two parts: the actual processing portion and the management portion. The management portion would run on a single server and act as a round-robin load balancer for the other servers doing the actual processing. It's slightly more complicated than the original scenario, but I believe it would provide a good deal of flexibility as well as a clean separation between your processing and management logic.
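
A rough sketch of that split in C#/WCF might look like this. The IWorker contract, the worker URLs, and the round-robin policy are assumptions for illustration.

```csharp
using System;
using System.ServiceModel;

[ServiceContract]
public interface IWorker
{
    [OperationContract]
    void FetchExternalResource();
}

class Manager
{
    // Hypothetical endpoints for the processing portion on each server.
    static readonly string[] WorkerUrls =
    {
        "http://server1:8000/worker",
        "http://server2:8000/worker",
        "http://server3:8000/worker",
    };

    static int next; // round-robin cursor (single-threaded manager assumed)

    // Called on each interval; dispatches the job to the next reachable worker.
    public static void Dispatch()
    {
        for (int i = 0; i < WorkerUrls.Length; i++)
        {
            string url = WorkerUrls[(next + i) % WorkerUrls.Length];
            var factory = new ChannelFactory<IWorker>(
                new BasicHttpBinding(), new EndpointAddress(url));
            try
            {
                factory.CreateChannel().FetchExternalResource();
                factory.Close();
                next = (next + i + 1) % WorkerUrls.Length;
                return; // success
            }
            catch (CommunicationException)
            {
                factory.Abort(); // this worker is down; try the next one
            }
        }
    }
}
```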

In some cases, people find it useful to have three machines perform every request and then compare the results at the end, to make sure the result is absolutely correct and no hardware failure caused problems while processing it. This is what they do on airplanes, for instance.
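
A toy sketch of that comparison step (a simple majority vote over three independently computed results):

```csharp
using System;

static class Voter
{
    public static string MajorityOf(string a, string b, string c)
    {
        if (a == b || a == c) return a; // a agrees with at least one other result
        if (b == c) return b;           // a was the odd one out
        throw new InvalidOperationException("All three results disagree");
    }
}
```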

At other times, you can live with a single bad result and a small downtime while switching to a new service, and just want the next run to be OK. In that case, solution #3 with a heartbeat monitor is an excellent setup.

Other times again, people just need to be notified by SMS that their service is down, and the application will just use some obsolete data until you manually perform some kind of failover.

In your case, I would say the latter is probably most useful for you. Since you cannot really depend on the service at the other end being available, you would still have to come up with a solution for what to do in that case. Returning obsolete data may be fine for you, or it may not. Sorry to have to say it: it depends.

ZooKeeper is a good fit for distributed locks. ZooKeeper has znodes, which are like directories that can also hold data.

Netflix's Curator library already has a lot of ready-to-use recipes, such as leader election, distributed locks, and many more.

I think there is a ZooKeeper client for C#. You should definitely try this option for your solution #3.

Licensed under: CC-BY-SA with attribution