How to failover Azure ACS if a data center goes down

https://stackoverflow.com/questions/11163647

16-06-2021

Question

We are looking for a way to provide failover for ACS instances, so that if one data center goes offline, authentication via ACS automatically fails over to another data center.

Background:

We use ACS to transform SAML tokens that are provided by a custom-developed STS via the WS-Trust protocol. ACS is used to broker trust between our STS and a number of relying parties that are developed by 3rd parties. The relying parties are currently configured to connect to a specific ACS instance using its DNS URL.

We have looked into the following:

1. Using a DNS CNAME entry to mask the ACS URL. This doesn’t work because the new DNS name will not match the SSL certificate on the instance, and we can’t control the SSL certificate.
2. Using a proxy in front of ACS to route requests to it. This doesn’t work because the To address and Realm in the messages don’t match the ACS namespace.
3. Using Traffic Manager. This doesn’t work because of both 1 and 2, and because it currently won’t let you direct load to an address that doesn’t end in .cloudapp.net.

Solution

I don't think there is a realistic and foolproof solution here. As noted, you can create additional namespaces in other datacenters and take backups of your RP configs and transformation rules. To recover, you would restore a backup to the new namespace, and your clients would then need to reconfigure their apps to use that namespace. This can work in some scenarios (like Google and Yahoo! integration). It can even work (I think) for Active Directory integration. It is very problematic, however, if you don't control the RPs.

A different but blocking problem with this approach (for us at least) is that it won't work for Windows Live name identifier claims. We get a different name identifier per namespace for each of our users. So even if we restored all our settings in another datacenter (and we control the RPs too!), our Windows Live users would be unable to log in correctly, because their name identifiers would no longer match in the new namespace. Google and Yahoo! would not have this problem, as they can use a stable claim (like email).

Basically, it appears you are mostly at the mercy of the datacenter operations team to fail over to the subregion quickly in case of total datacenter loss.

Other tips

Not sure if this helps, but you might be able to build a custom on-premises solution for the event of a data center crash affecting ACS. Using the Windows Azure Cmdlets along with an RSS poll of the Service Bus Dashboard might work.

See below for guidance from Microsoft on this topic for ACS 2.0 namespaces:

ACS 2.0 Namespaces

ACS 2.0 takes backups of all namespaces once per day and stores them in a secure offsite location. When ACS operation staff determines there has been an unrecoverable data loss at one of ACS’s regional data centers, ACS may attempt to recover customers’ subscriptions by restoring the most recent backup. Due to the frequency of backups, data loss of up to 24 hours may occur.

ACS 2.0 customers concerned about potential for data loss are encouraged to review a set of Windows Azure PowerShell Cmdlets available through the Microsoft hosted Codeplex Open Source repository. These scripts allow administrators to manage their namespaces and to import and extract all relevant data. Through use of these scripts, ACS customers have the ability to develop custom backup and restore solutions for a higher level of data consistency than is currently offered by ACS.

Notification

In the event of a disaster, information will be posted at the Windows Azure Service Dashboard describing the current status of all Windows Azure services globally. The dashboard will be updated regularly with information about the disaster. If you want to receive notifications for interruptions to any of the services, you can subscribe to the service’s RSS feed on the Service Dashboard. In addition, you can contact customer support by visiting the Support options for Windows Azure web page and following the instructions to get technical support for your service(s).
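The RSS poll mentioned above could be sketched roughly as follows. This is only an illustration: it assumes the feed is plain RSS 2.0 with `item`, `title`, and `description` elements, which has not been checked against the real Service Dashboard feed. The function name `find_incidents` is hypothetical, and fetching the feed text is left to the caller.

```python
import xml.etree.ElementTree as ET
from typing import List


def find_incidents(rss_xml: str, keyword: str) -> List[str]:
    """Return the titles of feed items that mention `keyword`.

    Assumes a plain RSS 2.0 layout (<item> with <title>/<description>);
    the real dashboard feed may be structured differently.
    """
    root = ET.fromstring(rss_xml)
    needle = keyword.lower()
    titles = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        description = item.findtext("description", default="")
        # Case-insensitive match against both title and description.
        if needle in (title + " " + description).lower():
            titles.append(title)
    return titles
```

A scheduled job could call this with the downloaded feed text and a keyword such as "Access Control", and start a restore into a standby namespace (or simply page someone) when the result is non-empty.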

HTH

First of all, no ACS backup solution exists in Azure, so developers and users are free to build the best one they can come up with. As I understand it, if you want a failover scenario in which your application rolls over from one ACS namespace to another, that can be done in your relying party application (website) as follows:

1. You have ACS1 and ACS2 configured, where ACS2 is the backup. Both are configured to use the same relying party application with an identical Realm and Return URL.
2. In your relying party application, when a login through ACS fails, ACS provides error details to the relying party application as a JSON-encoded HTTP URL parameter.
   2.1 It is possible that the error was within ACS.
   2.2 It is possible that the ACS endpoint was not even found.
3. In both cases you can handle the error in your code and create a retry policy that tries ACS2. You can add logic to decide when to go to ACS2 and when to keep trying ACS1, depending on what you want.

If you end up with two ACS endpoints, keep them identical so that you get exactly the same result no matter which one actually authenticates the request for the RP application.
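A minimal sketch of such a retry policy is shown below. It is not ACS-specific code: `AcsUnavailable`, `request_token_with_failover`, and the `request_token` callable are hypothetical names, and the callable stands in for whatever your RP uses to obtain a token from a given namespace URL. The assumption is that it raises `AcsUnavailable` both when ACS reports an error and when the endpoint cannot be reached.

```python
from typing import Callable, Optional, Sequence, Tuple


class AcsUnavailable(Exception):
    """Hypothetical error: the namespace returned an error or was unreachable."""


def request_token_with_failover(
    endpoints: Sequence[str],
    request_token: Callable[[str], str],
    attempts_per_endpoint: int = 2,
) -> Tuple[str, str]:
    """Try each endpoint in order and return (endpoint, token) on first success.

    The primary namespace is retried `attempts_per_endpoint` times before
    moving on to the backup, so a brief glitch does not trigger failover.
    """
    last_error: Optional[Exception] = None
    for endpoint in endpoints:
        for _ in range(attempts_per_endpoint):
            try:
                return endpoint, request_token(endpoint)
            except AcsUnavailable as exc:
                last_error = exc  # remember the failure and keep going
    raise AcsUnavailable(f"all endpoints failed: {last_error}")
```

Listing ACS1 before ACS2 in `endpoints` gives the primary/backup order described above. Note that this only helps when you control the RP code, and it does not address claims that differ per namespace.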

If you want to back up ACS at the management level, take a look at Windows Azure AppFabric Access Control Service (ACS) – Backup and Restore Resources. It might require you to be available in case of an ACS failure; otherwise, you may want to automate it in your RP application (a big job).

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow