Question

Starting last night around 2:00 am - some 8 hours after anybody touched anything having to do with the website - our Azure website began throwing this error:

Error: ErrorCode:SubStatus:There is a temporary failure. Please retry later. (One or more specified cache servers are unavailable, which could be caused by busy network or servers. For on-premises cache clusters, also verify the following conditions. Ensure that security permission has been granted for this client account, and check that the AppFabric Caching Service is allowed through the firewall on all cache hosts. Also the MaxBufferSize on the server must be greater than or equal to the serialized object size sent from the client.). Additional Information : The client was trying to communicate with the server: net.tcp://payboardprod.cache.windows.net:22233. ( at Microsoft.ApplicationServer.Caching.DataCache.ThrowException(ErrStatus errStatus, Guid trackingId, Exception responseException, Byte[][] payload, EndpointID destination)

Basically, it looks as if our Azure cache server took a dive. But there's no indication of this anywhere on our Azure management console, which indicates that the caching server in question is up and running just fine. Nor is there any indication of a problem on the Azure service availability dashboard (http://azure.microsoft.com/en-us/support/service-dashboard/). The only indication of any sort of a problem is that our Azure cache service started reporting zero requests around 1:00 am.

[Image: Azure cache graph]

Our beta site, which uses a different caching server but is otherwise configured identically, stayed up through this whole episode.

We just have a BizSpark account, and hence no ability to open support tickets with MS.

We've restored service by disabling external caching, but that's obviously not optimal.

Any suggestions for troubleshooting this?

Solution

Wrap your calling code in appropriate protection (try/catch) and then cope with the failure at the app tier. The commodity platform offered in any cloud can (and does) have these sorts of issues from time to time. You need to bake in logging, and write the logs somewhere like Azure Diagnostics (http://msdn.microsoft.com/en-us/library/gg433048.aspx) for later troubleshooting.
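
Since the error message itself says "Please retry later", one way to apply this advice is to retry the cache call a few times and then fall back to the underlying data source. The following is only a minimal sketch under assumptions: CacheGuard, GetOrFallback, and all of the parameter names are hypothetical and not part of any Azure library, and the retry count and delay are arbitrary placeholders.

```csharp
using System;
using System.Threading;

public static class CacheGuard
{
    // Hypothetical helper: runs 'cacheCall' up to 'maxAttempts' times, pausing
    // between attempts. Every failure is passed to 'log'. If all attempts throw,
    // the result of 'fallback' is returned instead.
    public static T GetOrFallback<T>(
        Func<T> cacheCall,
        Func<T> fallback,
        Action<string> log,
        int maxAttempts = 3,
        TimeSpan? delay = null)
    {
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException(nameof(maxAttempts));
        var pause = delay ?? TimeSpan.FromMilliseconds(200); // arbitrary default

        for (var attempt = 1; attempt <= maxAttempts; attempt++)
        {
            try
            {
                return cacheCall();
            }
            catch (Exception ex)
            {
                log("Cache attempt " + attempt + " of " + maxAttempts + " failed: " + ex.Message);
                if (attempt < maxAttempts)
                {
                    Thread.Sleep(pause);
                }
            }
        }

        // The cache is treated as unavailable; exceptions from fallback() propagate.
        return fallback();
    }
}
```

A caller would pass the cache read as the first delegate and the database read as the second, so a cache outage costs a few retries rather than an error page. Exceptions thrown by the fallback are deliberately not swallowed.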

OTHER TIPS

I still haven't figured out what the problem was, and just ended up following Simon W's advice about wrapping everything in try/catches up the wazoo. But because it's not 100% intuitive, and took me several tries to get the code for cache retrieval right, I thought I'd post it here for anybody else who's interested.

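// Note: this method lives in a generic cache wrapper class. The 'as TValue' cast below
// only compiles if that class declares TValue with a 'where TValue : class' constraint.
public TValue Get(string key, Func<TValue> missingFunc)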
{
    // We need to ensure that two processes don't try to calculate the same value at the same time. That just wastes resources.
    // So we pull out a value from the _cacheLocks dictionary, and lock on that before trying to retrieve the object.
    // This does add a bit more locking, and hence the chance for one process to lock up everything else.
    // We may need to add some timeouts here at some point in time. It also doesn't prevent two processes on different
    // machines from trying the same bit o' nonsense. Oh well. It's probably still a worthwhile optimization.
    key = _keyPrefix + "." + key;
    var value = default(TValue);
    object cacheLock;
    lock (_cacheLocks)
    {
        if (!_cacheLocks.TryGetValue(key, out cacheLock))
        {
            cacheLock = new object();
            _cacheLocks[key] = cacheLock;
        }
    }
    lock (cacheLock)
    {
        // Try to get the value from the cache.
        try
        {
            value = _cache.Get(key) as TValue;
        }
        catch (SerializationException ex)
        {
            // This can happen when the app restarts, and we discover that the dynamic entity names have changed, and the desired type 
            // is no longer around, e.g., "Organization_6BA9E1E1184D9B7BDCC50D94471D7A730423456A15BBAFB6A2C6AC0FF94C0D41"
            // If that's the error, we should probably warn about it, but no point in logging it as an error, since it's more-or-less expected.
            _logger.Warn("Error retrieving item '" + key + "' from Azure cache; falling back to missingFunc(). Error = " + ex);
        }
        catch (Exception ex)
        {
            _logger.Error("Error retrieving item '" + key + "' from Azure cache; falling back to missingFunc(). Error = " + ex);
        }

        // If we didn't get anything interesting, then call the function that should be able to retrieve it for us.
        // (TValue is a reference type here, so a plain null check is equivalent to comparing with default(TValue).)
        if (value == null)
        {
            // If that function throws an exception, don't swallow it.
            value = missingFunc();

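            // If we try to put it into the cache, and *that* throws an exception,
            // log it, and then swallow it. Skip the Put entirely when missingFunc()
            // returned null, since there is nothing useful to cache.
            if (value != null)
            {
                try
                {
                    _cache.Put(key, value);
                }
                catch (Exception ex)
                {
                    _logger.Error("Error putting item '" + key + "' into Azure cache. Error = " + ex);
                }
            }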
        }
    }
    return value;
}

You can use it like so:

var user = UserCache.Get(email, () =>
    _db.Users
        .FirstOrDefault(u => u.Email == email)
        .ShallowClone());
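
The comments in Get mention that the per-key lock may need a timeout at some point. One possible way to do that, sketched here under assumptions, is to swap the lock statement for Monitor.TryEnter with a timeout. TimedLock and Run are hypothetical names, and what to do on timeout (here, a caller-supplied delegate) is a design choice the original code does not make.

```csharp
using System;
using System.Threading;

public static class TimedLock
{
    // Hypothetical helper: tries to acquire 'gate' within 'timeout'. On success,
    // runs 'body' while holding the lock; on timeout, runs 'onTimeout' without it.
    public static T Run<T>(object gate, TimeSpan timeout, Func<T> body, Func<T> onTimeout)
    {
        var taken = false;
        try
        {
            Monitor.TryEnter(gate, timeout, ref taken);
            if (!taken)
            {
                // Another thread held the lock for the whole timeout period.
                return onTimeout();
            }
            return body();
        }
        finally
        {
            // Only release the lock if this thread actually acquired it.
            if (taken)
            {
                Monitor.Exit(gate);
            }
        }
    }
}
```

In Get, the body delegate would hold the cache lookup and Put, and the timeout delegate could simply call missingFunc() directly, trading a possible duplicate computation for not blocking indefinitely.
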
Licensed under: CC-BY-SA with attribution