ElastiCache & availability zones

Question 1

You are right that the Multi-AZ feature is not yet supported in ElastiCache. However, this is usually not a big problem, because the latency between AZs is low, around 1 ms.

The purpose of a cache is to serve the results of long-running, frequently issued SQL queries from memory. That is, instead of running a 300 ms SQL query, you serve the result with a single memory lookup. Compared to that, 1 ms of network latency shouldn't be a problem.
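To put rough numbers on that comparison, here is a small back-of-the-envelope sketch. The 300 ms query and the 1 ms between AZs come from the text above; the 90% hit ratio and the 0.5 ms same-AZ round trip are assumptions picked only for illustration.

```python
def expected_latency_ms(hit_ratio, cache_ms, db_ms):
    """Average request latency for a look-aside cache.

    A hit costs one cache round trip; a miss costs the cache round trip
    plus the database query.
    """
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * cache_ms + (1.0 - hit_ratio) * (cache_ms + db_ms)


DB_QUERY_MS = 300.0              # the slow SQL query from the text
SAME_AZ_MS = 0.5                 # assumed cache round trip inside one AZ
CROSS_AZ_MS = SAME_AZ_MS + 1.0   # plus about 1 ms between AZs
HIT_RATIO = 0.9                  # assumed hit ratio

same_az = expected_latency_ms(HIT_RATIO, SAME_AZ_MS, DB_QUERY_MS)
cross_az = expected_latency_ms(HIT_RATIO, CROSS_AZ_MS, DB_QUERY_MS)
print(f"same AZ:  {same_az:.1f} ms on average")
print(f"cross AZ: {cross_az:.1f} ms on average")
```

Under these assumptions the cross-AZ hop adds the same 1 ms to every request, which is small next to the time the cache saves on each avoided database query.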

The second property of a cache such as ElastiCache is that you warm it up and keep it warm with live data from your database. You should never expect your cache as a whole to be current, as the underlying data is changing all the time. Losing a cache node in a cluster is to be expected (like any other failure in a large system), and your system should warm up the newly created cache node rather quickly. ElastiCache will replace the failed node for you, but you need to fill it with cache data again.
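A minimal sketch of how a replaced node gets warm again simply through normal traffic, assuming the application reads through the cache and fills it on a miss. `InMemoryCache`, `slow_query` and `get_user` are hypothetical stand-ins for illustration, not an ElastiCache client API.

```python
class InMemoryCache:
    """Hypothetical stand-in for one cache node (a dict, not a real client)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


def slow_query(user_id, calls):
    """Stand-in for the expensive SQL query; records every call it receives."""
    calls.append(user_id)
    return {"id": user_id, "name": f"user-{user_id}"}


def get_user(cache, user_id, calls):
    """Serve from the cache when possible, otherwise query and fill the cache."""
    key = f"user:{user_id}"
    value = cache.get(key)
    if value is None:                  # miss: the node is cold for this key
        value = slow_query(user_id, calls)
        cache.set(key, value)          # warm the node for the next request
    return value


db_calls = []
cache = InMemoryCache()
get_user(cache, 42, db_calls)   # miss, goes to the database
get_user(cache, 42, db_calls)   # hit, served from memory

cache = InMemoryCache()         # simulate a failed node replaced by an empty one
get_user(cache, 42, db_calls)   # miss again, which refills the new node
get_user(cache, 42, db_calls)   # hit

print(f"database queries issued: {len(db_calls)}")
```

The only cost of the node replacement in this sketch is one extra database query per key, which is why losing a node is usually tolerable.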

Regarding redundancy between availability zones, you can check the AWS description:

Setting up redundant Cache Clusters in different Availability Zones

Amazon ElastiCache monitors the health of your Cache Nodes and replaces them in the event of network partitioning, host hardware or software failure. However, given the ephemeral nature of a cache, Cache Node replacements begin empty (also called “cold”), and depending on your workload pattern, may take some time to be re-populated with data (also called “warming up”). Additionally, the auto-replacement functionality provided by Amazon ElastiCache is restricted to a single Availability Zone. If your application is sensitive to the failure recovery or the “warm up” time of Cache Nodes, or you want enhanced fault-tolerance for Availability Zone level failures, you may wish to deploy redundant ElastiCache Clusters in different Availability Zones.

One of the ways to manage data redundancy is to have your application apply all cache writes to Cache Nodes across these Availability Zones. If one or more of your Cache Nodes in the primary Availability Zone fails, you could direct reads to the corresponding Cache Node(s) in the secondary Availability Zone while Amazon ElastiCache restores the Cache Node(s) in the primary Availability Zone.
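The approach AWS describes could look roughly like the sketch below: every write goes to a node in each Availability Zone, and reads fall over to the secondary node when the primary is unreachable. `FakeNode`, `CacheNodeDown` and `RedundantCache` are hypothetical names used for illustration; a real application would wrap its own cache client and that client's error types.

```python
class CacheNodeDown(Exception):
    """Raised by the hypothetical node when it cannot be reached."""


class FakeNode:
    """Hypothetical stand-in for a cache node in one Availability Zone."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.up = True

    def get(self, key):
        if not self.up:
            raise CacheNodeDown(self.name)
        return self.data.get(key)

    def set(self, key, value):
        if not self.up:
            raise CacheNodeDown(self.name)
        self.data[key] = value


class RedundantCache:
    """Writes to every node; reads from the first node that answers."""

    def __init__(self, primary, secondary):
        self.nodes = [primary, secondary]

    def set(self, key, value):
        for node in self.nodes:
            try:
                node.set(key, value)
            except CacheNodeDown:
                pass  # best effort: a cache write is allowed to be lost

    def get(self, key):
        for node in self.nodes:
            try:
                return node.get(key)
            except CacheNodeDown:
                continue  # try the node in the other AZ
        return None       # all nodes down: the caller treats this as a miss


primary = FakeNode("az-a")
secondary = FakeNode("az-b")
cache = RedundantCache(primary, secondary)

cache.set("session:1", "alice")
primary.up = False                        # simulate a failure in the primary AZ
value_during_outage = cache.get("session:1")
print(value_during_outage)
```

Note that this sketch only fails over when the primary is unreachable. A freshly replaced primary is reachable but empty, so the application would either fall through to the database on those misses or keep reading from the secondary until the primary is warm.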

Question 2

Latency between availability zones is typically around 2ms, so no, it's normally not a problem.

I'd really need to know more about how you're using it to be able to address the second part of your question. Since it's just a caching layer, the application can frequently just run in degraded mode until either AWS fixes the problem or you intervene manually. Alternatively, the application could be designed to fail over automatically to a second cluster in a different availability zone. When that happens, the cache will have to be rebuilt from the persistent datastore. You can either just let the cache misses happen, or prime the cache before the application starts using it.
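As a rough illustration of those options, the sketch below shows a read path that drops to degraded mode when the cache is unreachable, and a priming step for a standby cluster. Everything here is a hypothetical stand-in: `DictCache` and `CacheUnavailable` replace a real cache client, and the `DATASTORE` dict replaces the persistent datastore.

```python
class CacheUnavailable(Exception):
    """Raised by the hypothetical cache client when the cluster is unreachable."""


class DictCache:
    """Hypothetical stand-in for a cache cluster."""

    def __init__(self):
        self.data = {}
        self.available = True

    def get(self, key):
        if not self.available:
            raise CacheUnavailable()
        return self.data.get(key)

    def set(self, key, value):
        if not self.available:
            raise CacheUnavailable()
        self.data[key] = value


# Stand-in for the persistent datastore behind the cache.
DATASTORE = {"product:1": "widget", "product:2": "gadget"}


def read(cache, key):
    """Read through the cache; go straight to the datastore if it is down."""
    try:
        value = cache.get(key)
        if value is None:
            value = DATASTORE[key]
            cache.set(key, value)
        return value
    except CacheUnavailable:
        return DATASTORE[key]   # degraded mode: slower, but still correct


def prime(cache, keys):
    """Preload the given keys so the first requests are not all misses."""
    for key in keys:
        cache.set(key, DATASTORE[key])


primary = DictCache()
standby = DictCache()            # assumed second cluster in another AZ

primary.available = False        # simulate the outage
degraded_value = read(primary, "product:1")

prime(standby, list(DATASTORE))  # option: prime before switching traffic
active = standby                 # fail over to the standby cluster
primed_value = active.get("product:2")
print(degraded_value, primed_value)
```

Skipping the `prime` call corresponds to the other option: the standby starts empty and fills up through ordinary cache misses in `read`.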