Single Node Failure Scenario
We've got a new datacenter with about 20 nodes in it (all 1.7.0, btw), and it's been sustaining a moderate level of traffic for a while. We've had less luck with the hardware: in this case, the RAID controller on one node failed, rendering its disk read-only.
The membase instance on that node is unhappy, needless to say, as its write queues fill up, writes time out, etc. Our ops team hit the Fail Over button, but upon seeing the data-loss warning, realized that the node is still up -- it just can't write to its disk. We don't want to lose that data, especially since it's all still sitting in memory, perfectly accessible.
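For reference, here's roughly how we've been watching the disk write queue back up on the sick node. It's just a sketch over the plain memcached stats interface -- the hostname is made up, 11210 is the usual direct ep-engine port, and ep_queue_size / ep_flusher_todo are the dirty-item counters as I understand them:

    import socket

    def memcached_stats(host, port=11210, group=None):
        # Fetch one stats group over the memcached text protocol.
        cmd = (("stats %s" % group) if group else "stats") + "\r\n"
        sock = socket.create_connection((host, port), timeout=5)
        sock.sendall(cmd.encode("ascii"))
        buf = b""
        while not buf.endswith(b"END\r\n"):
            chunk = sock.recv(4096)
            if not chunk:
                break
            buf += chunk
        sock.close()
        stats = {}
        for line in buf.decode("ascii").splitlines():
            if line.startswith("STAT "):
                _, key, value = line.split(" ", 2)
                stats[key] = value
        return stats

    # Items dirty in memory, waiting on the (now dead) disk flusher.
    stats = memcached_stats("node7.example.com")
    print("ep_queue_size   =", stats.get("ep_queue_size"))
    print("ep_flusher_todo =", stats.get("ep_flusher_todo"))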
They tried Remove, but of course that failed, because the node needs to be healthy to participate in the rebalance.
When this happens in the datacenter, we see a cascading failure of the entire cluster:
* First, we see "I'm not responsible for this vbucket" errors from requests that I presume would route to the failed node (see the key-to-vbucket sketch after this list).
* In the membase logs, we see failures from nodes that replicate to the failed node.
* Eventually (22 minutes later in the logs I'm looking at), we start to see timeouts from membase activity across the board.
* At that point, the datacenter fails over.
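To make the first bullet concrete: my understanding is that the smart clients pick a vbucket for each key with a CRC32-based hash, roughly like the sketch below. 1024 vbuckets is the default we're running with, and the exact fold is my reading of libvbucket, so treat both as assumptions:

    import zlib

    NUM_VBUCKETS = 1024  # default vbucket count; an assumption about our cluster

    def vbucket_for_key(key):
        # CRC32 the key, fold the upper bits, then mod into the vbucket space.
        # This is my reading of the libvbucket hashing scheme, not gospel.
        crc = zlib.crc32(key.encode("utf-8")) & 0xffffffff
        return ((crc >> 16) & 0x7fff) % NUM_VBUCKETS

    # Any key whose vbucket is owned by the failed node gets routed there,
    # which is presumably where the "not responsible" errors come from when
    # the client's map and the cluster's map disagree.
    print(vbucket_for_key("user:1234"))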
For the immediate membase problem, Fail Over is the right answer, but the whole scenario leads to a few other questions:
1. Each of these nodes has 56 GB of RAM, of which only a fraction is used. We could fill queues for a long time before exhausting the memory. Let's say we catch the situation after spooling up 10 GB of data that should live in the local vbuckets. If we hit Fail Over, can we expect a large percentage of that data to have propagated to the replicas, or are we going to lose all 10 GB? Along the same lines, what's the sequence of events involved in the replication? (We've been trying to gauge the backlog ourselves with the tap-stats sketch after this list.)
2. The cascading failure is a little unexpected. How exactly would the failure of one node back up the rest of the nodes in the cluster?
3. It would be great (for my pathological case, and without thinking about complications) if the failing node would simply keep running, reading from the disk but failing to write to it. We'd then manually remove it from the cluster and lose no data. What prevents it from operating like that in this case?
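On question 1, for what it's worth, here's how we've been trying to estimate how much of the queued data has already reached the replicas. It reuses memcached_stats() from the write-queue sketch above; the stat names (ep_tap_total_queue and the per-stream eq_tapq:*:qlen entries) are what I've pieced together about the tap stats group, so they're assumptions too:

    # Reuses memcached_stats() from the write-queue sketch above.
    tap = memcached_stats("node7.example.com", group="tap")

    # Total items still waiting on all tap (replication) streams combined.
    print("ep_tap_total_queue =", tap.get("ep_tap_total_queue"))

    # Per-stream backlog: one replication stream per peer node.
    for key, value in sorted(tap.items()):
        if key.startswith("eq_tapq:") and key.endswith(":qlen"):
            print(key, "=", value)

    # If these sit near zero, most of the 10 GB has already propagated
    # to the replicas and a failover should lose relatively little.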