One node crash will cause several minutes failure of total cluster

We do some tests and also encounter the critical exception of Couchbase cluster in prod environment.

The issue summary
--------
One node shutdown wittingly will cause:
1) Get ops of the cluster becomes 0
2) Some of the putting records will be lost
3) Need several minutes to "auto fail-over", and then the cluster will resume to OK. ( We set auto fail-over time is 30s)

The test environment
------------
1. Couchbase 2.2
2. 4 nodes
3. Replicas: 2
4. Every node: 4CPU, 16G memory
5. C SDK version: 2.0.6

Testcase 1
--------------
Steps:
0. 10m records are set in the cluster, auto fail-over is enabled
1. 3 clients to get from cluster
2. Shutdown one node unexpectedly (crash)
Result:
1. All gets encounter timeouts, cannot get any data.
2. About 4 minutes later, the cluster resume to OK.

Testcase 2
-----------
Steps:
0. 10m records are set in the cluster, auto fail-over is enabled
1. Start clients to write 500k
2. Shutdown one node when writing
Result:
1. Just the node shutdown, writing ops becomes 0
2. After 1 minute the cluster resumes OK
3. When checking the data: a) Writing timeouts count is 13 b) 5087 records cannot be found in the 500k c) Some of the originally set 10m records are lost d) The lost records cannot be found even mannually rebalanced.
Attached screenshots:

ops drop to 0 when one node crash

ops resume after 1 minute

one node is fail-over state

after rebalance action, the records count

The count should be: 10m + 500k. From desc of testcase 2 result, Couchbase lost some new data and also old existed data.

What clients are you using?

C SDK (client) version: 2.0.6

1 Answer

« Back to question.

Hello,

The architecture of Couchbase is build to react as follow during a failure.

"One node shutdown wittingly will cause: ..."
So when the node is down (and until the fail over):
- you cannot read to this node, so a subset of your keys cannot be saved, for example with 4 nodes
- you cannot read the active document from this node, you can use replica read operation ( see https://gist.github.com/tgrall/6540011 )
- since you have replicas, you have not lose any data, you just have no access (except with replica read) and write to a subset of the data. (Couchbase is a CP database if you look at the CAP theorem)

Once again this is the behavior until you do the fail over (or automatically done). The fail over will promote some replicas as active documents (the fail over operation is very fast, just changing flags in memory).
Once the fail over is done, you have access (in read and write) to all the data, but on 1 less node.

Usually you do a rebalance, to be sure you get all the data and its replica well balanced.

So based on this behavior I do not see why you see (or think):
- data loss
- long time failure of the cluster. (the only period when you may get some error is before the fail over and yes when you use automatic failover it can takes some time)

Regards
Tug
@tgrall