Question

I have a number of VMs on Windows Azure (IaaS) hosting a website. There are a number of load-balanced front-end VMs, all connecting to a single VM with SQL Express. It works well.

However!

I'm getting random restarts across all the VMs. As for the front-end VMs (with IIS), since they are load balanced, the site is not affected and the load balancer adjusts accordingly. But when the VM hosting the database is restarted, the site is down until the DB is up again. It takes < 3min to boot up, but that's still unacceptable if it happens frequently enough. Although the restarts are relatively rare (2 a month per VM), sometimes we get a week with 4 restarts per VM, which gets frustratingly annoying. Not all VMs restart as frequently and I cannot figure out a pattern. Restarts are also unexpected (pull-the-power-cable type of restarts, and not shutdowns). Datacenter is West Europe.

Microsoft emphasises that the SLA only covers 2 VMs in an availability set, which I can't have for the database VM (and the enterprise SQL edition costs an arm and three legs). Also, SQL Azure isn't an option as the application is very chatty, and the SQL Azure database was being throttled during peak times (though it works super smooth with SQL Express on a Medium VM!).

My question(s): Is it normal to have so many restarts? Are there other people having the same problem? What is your experience with such an environment on Azure? What can I do to minimise this downtime?

Thanks all!

Solution

Is it normal to have so many restarts?

Yes, this can happen in a given month. You need to stand up SQL Server in high-availability mode to really get this to work.

Yes, it does cost an arm and a leg. ;(

What is your experience with such an environment on Azure?

Some months are really good, some months are bad; it depends on your cluster and which datacenter you are in. MS has a mixed range of hardware out there in their datacenters. That does not mean they are running on old laptops in some datacenters, but in my experience the newer datacenters tend to have better kit in them and thus fewer restarts. For example, we use US East.

What can I do to minimise this downtime?

High availability with a witness is the only way to get availability for the database in a VM, and yes, it costs an arm and a leg.

Other serious options:

Cache. Use a local in-memory cache and Azure Cache, and try to minimize your calls to the database. This might make the app less chatty and let you step back to SQL Azure, or at least give you enough headroom to cover the time the database takes to recover.
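A minimal sketch of the caching idea, in Python for brevity: a read-through cache that keeps serving the last known value when the database call fails. `StaleOkCache`, `loader` and the TTL value are hypothetical names chosen for illustration, not part of any Azure API, and the sketch assumes the database layer signals an outage with `ConnectionError`.

```python
import time


class StaleOkCache:
    """Read-through cache that falls back to stale entries if the loader fails."""

    def __init__(self, loader, ttl_seconds=30.0, clock=time.monotonic):
        self._loader = loader          # hypothetical function that queries the database
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}             # key -> (value, time stored)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]            # fresh hit: no database call at all
        try:
            value = self._loader(key)
        except ConnectionError:
            if entry is not None:
                return entry[0]        # database is down: serve the stale copy
            raise                      # nothing cached, so the caller must handle it
        self._entries[key] = (value, now)
        return value
```

Reads that were cached before a restart keep working during the outage; only keys that were never loaded still fail. With several front-end VMs, a shared cache would be needed for them to see the same entries.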

Queues. Queues would help your application recover and let you show your users a "we are working on it" message.
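As a rough illustration of the queue idea, the Python sketch below buffers writes while the database is unreachable and replays them in order once it is back. `WriteBuffer` and `write_to_db` are hypothetical names, and the in-process `deque` is only a stand-in: a durable queue outside the web VM would be needed for the buffered writes to survive a restart of the front end itself.

```python
from collections import deque


class WriteBuffer:
    """Queues writes while the database is unreachable and replays them later."""

    def __init__(self, write_to_db):
        self._write_to_db = write_to_db   # hypothetical function that performs one write
        self._pending = deque()

    def submit(self, item):
        """Queue the item, try to flush, and report 'saved' or 'pending'."""
        self._pending.append(item)
        return "saved" if self.drain() else "pending"

    def drain(self):
        """Replay queued writes in order; stop at the first failure."""
        while self._pending:
            try:
                self._write_to_db(self._pending[0])
            except ConnectionError:
                return False              # still down: keep the item for the next attempt
            self._pending.popleft()       # remove only after a successful write
        return True
```

A "pending" result is the point where the site can tell the user the request was accepted and is being worked on, instead of showing an error page.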

Use SQL Azure as failover. Sync the data from the SQL Server on your VM (the "on-premises" side) to SQL Azure using SQL Azure Data Sync (not sure this works with Express), and write your app code to catch the connection error and fail over.
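The "catch the connection error and fail over" step could look roughly like this in Python. `run_with_failover`, `connect_primary` and `connect_secondary` are hypothetical names; the sketch assumes a failed connection surfaces as `ConnectionError`, whereas a real database driver raises its own exception type that you would catch instead.

```python
def run_with_failover(operation, connect_primary, connect_secondary):
    """Run operation(connection) on the primary, or on the secondary if the primary is down.

    Returns a (result, source) tuple so the caller knows which database answered.
    """
    try:
        conn = connect_primary()
    except ConnectionError:
        # Primary unreachable (e.g. the VM is rebooting): fall back to the synced copy.
        conn = connect_secondary()
        return operation(conn), "secondary"
    return operation(conn), "primary"
```

Because the secondary is fed by a sync job, it may lag behind the primary, so it is safest to treat it as read-only or to queue writes until the primary returns.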

Look at using other parts of Azure for parts of your app to reduce the number of calls coming into SQL, e.g. can you move some data to Table storage?

Hope this gives you some ideas.

OTHER TIPS

Windows Azure Infrastructure Services (IaaS) has only been in General Availability (GA, or production) about 3 weeks, since April 16 (see announcement here). Prior to GA, there was no SLA and you would have seen more frequent OS restarts as various patches were still being applied to the Host OS. Are you saying that this pattern has continued at the same velocity since April 16?

Now that IaaS is GA, I wouldn't expect 4 restarts in a week. That said: there are several reasons you'd see a restart:

  • Host hardware failure (this takes down all Guest OSs running on that host)
  • Host software update (and only if it requires a restart of the Host OS). Host OS reboots shouldn't be happening at the frequency you're seeing.
  • Guest OS issues. Here's where things depart from PaaS (web/worker role Cloud Services). In IaaS, there's no Guest OS maintenance done by Azure; this is all in your hands. It's possible to get reboots if installing Windows Updates automatically. Possibly you could be running into an application-level issue causing the box to become unresponsive for a long period of time, resulting in the Azure fabric controller rebooting your box as it thinks it's unhealthy. And... your app could be somehow crashing the box.

If you've ruled out application error and are sure the VMs are in good health at the time they're rebooting, you may need to open a support ticket with Microsoft to help diagnose the issue further.

Licensed under: CC-BY-SA with attribution