Creating a High Availability AppFabric Cache Cluster

https://stackoverflow.com/questions/12377194

01-07-2021

Question

Is there anything aside from setting Secondaries=1 in the cluster configuration that is needed to enable High Availability, specifically in the cache client configuration?

Our configuration:

Cache Cluster (3 Windows Enterprise hosts using a SQL configuration provider):
Cache Clients

With the above configuration, we see primary and secondary regions created on the three hosts. However, when one of the hosts is stopped, the following exceptions occur:

ErrorCode<ERRCA0018>:SubStatus<ES0001>:The request timed out.
An existing connection was forcibly closed by the remote host
No connection could be made because the target machine actively refused it 192.22.0.34:22233
An existing connection was forcibly closed by the remote host

Isn't the point of High Availability to be able to handle hosts going down without interrupting service? We are using a named region. Does this break High Availability? I read somewhere that named regions can only exist on one host (I did verify that a secondary does exist on another host). I feel like we're missing something in the cache client configuration that would enable High Availability. Any insight on the matter would be greatly appreciated.

Solution

After opening a ticket with Microsoft we narrowed it down to having a static DataCacheFactory object.

public class AppFabricCacheProvider : ICacheProvider
{
    private static readonly object Locker = new object();
    private static AppFabricCacheProvider _instance;
    private static DataCache _cache;

    private AppFabricCacheProvider()
    {
    }

    public static AppFabricCacheProvider GetInstance()
    {
        lock (Locker)
        {
            if (_instance == null)
            {
                _instance = new AppFabricCacheProvider();
                var factory = new DataCacheFactory();
                _cache = factory.GetCache("AdMatter");
            }
        }
        return _instance;
    }
    ...
}

Looking at the trace logs from AppFabric, the clients keep trying to connect to every host, including the one that went down, instead of handling the failure. Resetting IIS on the clients forces a new DataCacheFactory to be created (in our App_Start) and stops the exceptions.

The MS engineers agreed that keeping a single static DataCacheFactory is the recommended best practice (we also found several articles saying the same).

They are continuing to investigate a solution for us. In the meantime, we have come up with the following temporary workaround: we force a new DataCacheFactory object to be created whenever we encounter one of the above exceptions.

public class AppFabricCacheProvider : ICacheProvider
{
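    // Negative on purpose: ForceRefresh adds this to UtcNow to get a cutoff 5 minutes in the past,
    // so the factory is rebuilt at most once every 5 minutes.
    private const int RefreshWindowMinutes = -5;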

    private static readonly object Locker = new object();
    private static AppFabricCacheProvider _instance;
    private static DataCache _cache;
    private static DateTime _lastRefreshDate;

    private AppFabricCacheProvider()
    {
    }

    public static AppFabricCacheProvider GetInstance()
    {
        lock (Locker)
        {
            if (_instance == null)
            {
                _instance = new AppFabricCacheProvider();
                var factory = new DataCacheFactory();
                _cache = factory.GetCache("AdMatter");
                _lastRefreshDate = DateTime.UtcNow;
            }
        }
        return _instance;
    }

    private static void ForceRefresh()
    {
        lock (Locker)
        {
            if (_instance != null && DateTime.UtcNow.AddMinutes(RefreshWindowMinutes) > _lastRefreshDate)
            {
                var factory = new DataCacheFactory();
                _cache = factory.GetCache("AdMatter");
                _lastRefreshDate = DateTime.UtcNow;
            }
        }
    }

    ...

    public T Put<T>(string key, T value)
    {
        try
        {
            _cache.Put(key, value);
        }
        catch (SocketException)
        {
            ForceRefresh();
            _cache.Put(key, value);
        }
        return value;
    }
}

Will update this thread when we learn more.

OTHER TIPS

High Availability is about protecting the data, not about making it available every second (hence the retry exceptions). When a cache host goes down, you get an exception and are supposed to retry. During that time, access to HA caches may throw a retry exception back to you while the cluster is busy shuffling data around and creating an extra copy. Regions complicate this further, since they cause a larger chunk to be copied before the cache is HA again.

Also, the client keeps a connection to every cache host, so when one goes down the client throws an exception to signal that something happened.

Basically, when one host goes down, AppFabric freaks out until two copies of all data exist again in the HA caches. We created a small layer in front of it to handle this logic, then dropped the servers one at a time to make sure it handled all scenarios, so that our app kept working and was just a bit slower.
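
For illustration, a layer like that might look roughly like the sketch below. This is not the code we actually ran: CacheRetry, Execute, isTransient, onFailure, maxAttempts and delayMilliseconds are all made-up names, and the attempt count and delay are arbitrary. It deliberately has no AppFabric dependency; the demo fakes a cache call that fails twice with a SocketException, so it can run on its own.

```csharp
using System;
using System.Net.Sockets;
using System.Threading;

// Hypothetical helper (not part of AppFabric): retries an operation when it
// fails with an exception the caller classifies as transient.
public static class CacheRetry
{
    // Runs operation. On a transient failure it calls onFailure (for example,
    // to rebuild the cache client), waits, and tries again, for at most
    // maxAttempts attempts in total. The last failure is rethrown to the caller.
    public static T Execute<T>(
        Func<T> operation,
        Func<Exception, bool> isTransient,
        Action onFailure,
        int maxAttempts = 3,
        int delayMilliseconds = 200)
    {
        if (maxAttempts < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(maxAttempts));
        }

        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return operation();
            }
            // The filter lets the exception propagate untouched on the final
            // attempt or when the caller does not consider it transient.
            catch (Exception ex) when (attempt < maxAttempts && isTransient(ex))
            {
                onFailure();
                Thread.Sleep(delayMilliseconds);
            }
        }
    }
}

public static class Program
{
    public static void Main()
    {
        int calls = 0;

        // Simulated cache call: fails on the first two calls, then succeeds.
        string result = CacheRetry.Execute(
            () =>
            {
                calls++;
                if (calls < 3)
                {
                    throw new SocketException();
                }
                return "stored";
            },
            ex => ex is SocketException,
            () => Console.WriteLine("refreshing cache client"),
            maxAttempts: 3,
            delayMilliseconds: 10);

        Console.WriteLine(result + " after " + calls + " calls");
    }
}
```

In a provider like the one in the solution above, the operation would wrap the _cache.Put call and onFailure would be something like ForceRefresh. Which exceptions count as transient is an assumption you would have to settle for your own setup, for example by testing against the exceptions listed in the question.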

Licensed under: CC-BY-SA with attribution
