Surviving unscheduled VM shutdowns on Azure cloud

Here is a proposal for a best practice for handling unscheduled VM shutdowns on Azure with a Couchbase cluster of 2 or more nodes.

Azure virtual machines undergo unscheduled reboots for software updates of the host OS. This can happen as often as once per month. The process workflow for those updates is described here:
http://msdn.microsoft.com/en-us/library/windowsazure/hh543978.aspx
Azure SLAs don't apply to individual VMs. To qualify for the SLA, Azure recommends using their load-balancing system with public endpoints and multiple VMs, and using the 'availability sets' feature to make sure some subset of your VMs survives a host-OS upgrade. However, the load balancing seems a bad match for Couchbase because:

  • We want to use Azure's VPN/VLAN to keep Couchbase on a private network with low latency between Couchbase and our application servers on the same VLAN. The application servers will be load balanced, but Couchbase will not.
  • Couchbase smart clients (like the C# client) will surely get confused if they access the cluster through a load balancer with 1 public IP address.
  • Couchbase already load-balances by its design.

(Windows Azure Host OS Updates) Each virtual machine hosting a Web or Worker Role receives a Stopping event, whereas VM Roles receive a standard Windows shutdown event. Worker, Web, and Virtual machine roles are allowed five minutes to respond to the stopping and shutdown event before they are forcibly stopped.

The proposed idea, which I am going to test as soon as I have a chance, is:

  1. Use the Azure availability sets feature to ensure that 1 or more nodes stay up during any host-OS upgrade.
  2. Five minutes is not enough time to remove a node and complete a rebalance, so failover is the only alternative for us.
  3. On each of the Couchbase nodes, install a shell script in /etc/init.d that responds to OS shutdown events within the allowed 5-minute time frame. Call it couchbase-failover-azure-hosting.
  4. couchbase-failover-azure-hosting will use the Couchbase CLI tools to check how many nodes in the cluster have not been failed over. If there are 1 or more active nodes other than itself, it will fail over the current node immediately. If there is no other active node, it does nothing, but that should never happen in this scenario.
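
To make step 4 concrete, here is a minimal sketch of the decision logic only; it is not a tested init script. It assumes that `couchbase-cli server-list` prints one node per line in the form `<otp-name> <host:port> <status> <membership>` (for example `ns_1@10.0.0.4 10.0.0.4:8091 healthy active`) and that `couchbase-cli failover` accepts `--server-failover=<host:port>`; please check both against your Couchbase version. The variables CB_CLUSTER, CB_USER, CB_PASS and SELF are hypothetical placeholders.

```shell
#!/bin/sh
# Sketch of the failover decision for couchbase-failover-azure-hosting.
# ASSUMPTION: server-list output is "<otp-name> <host:port> <status> <membership>"
# per line; verify this on your Couchbase version before relying on it.

# Reads server-list output on stdin and prints the number of healthy,
# active nodes other than the node given as $1 (host:port).
count_other_active() {
    awk -v self="$1" \
        '$2 != self && $3 == "healthy" && $4 == "active" { n++ } END { print n + 0 }'
}

# Succeeds (exit 0) only if at least one other active node remains.
should_failover() {
    [ "$(count_other_active "$1")" -ge 1 ]
}

# Hypothetical wiring; the shutdown hook would call this function.
# CB_CLUSTER, CB_USER, CB_PASS and SELF are placeholders, not real settings.
failover_self() {
    if couchbase-cli server-list -c "$CB_CLUSTER" -u "$CB_USER" -p "$CB_PASS" \
        | should_failover "$SELF"; then
        couchbase-cli failover -c "$CB_CLUSTER" -u "$CB_USER" -p "$CB_PASS" \
            --server-failover="$SELF"
    fi
}
```

Keeping the counting in a function that reads stdin means it can be checked with canned output before it is trusted in a real shutdown path.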

The end result should be a Couchbase cluster that immediately fails over nodes that are being shut down for a host-OS upgrade.
If this scenario works out OK, I will be glad to share the Linux init script on GitHub or paste it here.

Any feedback or suggestions welcome!

Oops, I cannot delete or edit my comment above... this part was my writing, not a quote:
It's well known and well documented that Azure doesn't even have an SLA for individual virtual machines, because they perform unscheduled reboots for host-OS maintenance.

2 Answers

I think the plan sounds pretty sane. One thing you may want to look into, though it's really a cost question more than anything else, is running with two replicas to protect against a double node loss. This way, even if you fail over a node and then trigger a rebalance, you'll still have a much higher level of safety.

Hello,

You can find some feedback here from a user of Couchbase on Azure:
http://www.couchbase.com/forums/thread/couchbase-and-windows-azure-reboot

regards
Tug
@tgrall

Thanks Tug, yes, I actually read that thread before. I don't think the responses addressed the core issue for a Couchbase cluster, which is that Azure VMs do have unscheduled reboots. That is why I posted the issue here. :)

Oh, and I think the person who wrote this is really misinformed:

"Azure Virtual Machine (launched roughly in mid April in GA = General Availability) shouldn't suffer that restrictions: indeed, we never experienced any reboot/relaunch in the latest couple of months days (the current up-time is only 32 days because we rebooted the servers by our self roughly one month ago)."

It's well known and well documented that Azure doesn't even have an SLA for individual virtual machines, because they reboot them for host-OS maintenance.