Question: what happens before node failover

Hello!

I have a question about data availability in the period between a node becoming unresponsive and the auto-failover and rebalance events.

Prerequisite:

There is a multi-node cluster with 3 nodes, 1 bucket with 1 replica, and auto-failover after 120 seconds (the default value); the Data, Index and Query services are running on every node. Couchbase 6.6 Community Edition.

Scenario:

We make one of the nodes unresponsive. Couchbase then waits 2 minutes before failing the broken node over and rebalancing.

Question:

Is it possible to read data during the period between one of the nodes becoming unresponsive and the failover/rebalance happening? In our scenario we can’t read data via a N1QL query or the REST API until the rebalance completes; we get a timeout response from both. At the same time, write operations seem to work.
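
For illustration, the kind of read we attempt is an ordinary N1QL SELECT against the bucket, run via cbq or the Query REST API on port 8093 (the exact statement here is just a hypothetical example):

/* a simple read of documents from the bucket; this times out while the node is unresponsive */
SELECT * FROM `test-bucket` LIMIT 10;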

Thank you!

As it is CE you’re using, could you try 7.1? There have been improvements in detecting/recovering from communication issues between the Query and Data services in 7.x which might help in this scenario.

Tried CE 7.1.1 (https://packages.couchbase.com/releases/7.1.1/couchbase-server-community_7.1.1-debian10_amd64.deb)

Tried the same scenario and got the same result.

n1ql query returns the following error:

[
  {
    "code": 12008,
    "msg": "Error performing bulk get operation  - cause: {3 errors, starting with dial tcp IP:11210: i/o timeout}",
    "retry": true
  }
]

REST API returns:

"errors": [{"code":12008,"msg":"Error performing bulk get operation  - cause: {2 errors, starting with dial tcp IP:11210: i/o timeout}","retry":true}],
"status": "errors",
"metrics": {"elapsedTime": "57.941964204s","executionTime": "57.941910123s","resultCount": 0,"resultSize": 0,"serviceLoad": 25,"errorCount": 1}
}
* Connection #0 to host couchbase-mtu-stdby.k2-stg.idtm.io left intact

Maybe it’s the designed behavior; I’m trying to understand what, by design, should happen in my scenario before the node failover occurs.

OK, thanks for trying. I do now expect this to be as designed; data on other nodes [1] should remain accessible, but any switch away from attempts to use the unresponsive node requires notification from the cluster orchestrator. The auto-failover setting controls how long the orchestrator will try before changing the cluster layout and notifying services.

The query.log should show when the Query node received notification of a change in the bucket layout.

Just to verify, did you previously get the 12008 error too? (It is what I was expecting, but you noted only a timeout response, so I’m just checking it was the i/o timeout cause of the 12008 and not some other timeout response.)

[1] Yes, not terribly useful unless you’re using specific document keys already known not to reside on the problem node.

Just reproduced the scenario and checked query.log in more detail.

2022-10-15T15:27:52.554+00:00 [Info] GSIC[default/test-bucket-_default-_default-1665846952475081799] logstats "test-bucket" {"gsi_scan_count":7,"gsi_scan_duration":21343447867,"gsi_throttle_duration":32012295,"gsi_prime_duration":23264658,"gsi_blocked_duration":16999267306,"gsi_total_temp_files":7,"gsi_backfill_size":42051}
2022-10-15T15:27:52.978+00:00 [Info] [Queryport-connpool:IP2:9101] active conns 0, free conns 1
2022-10-15T15:27:53.021+00:00 [Info] [Queryport-connpool:IP3:9101] active conns 0, free conns 1
2022-10-15T15:28:04.363+00:00 [INFO] Connection Error test-bucket : dial tcp IP:11210: i/o timeout
2022-10-15T15:28:04.363+00:00 [ERROR] {3 errors, starting with dial tcp IP:11210: i/o timeout}
2022-10-15T15:28:04.364+00:00 [Info] GSIC[default/test-bucket-_default-_default-1665846952475081799] request(6a78c864-cb25-4cf1-aebd-edb605850839) removing temp file /opt/couchbase/var/lib/couchbase/tmp/scan-results15354049291596 ...
2022-10-15T15:28:07.758+00:00 [INFO] Connection Error test-bucket : dial tcp IP:11210: i/o timeout
2022-10-15T15:28:07.759+00:00 [INFO] Connection Error test-bucket : dial tcp IP:11210: i/o timeout
2022-10-15T15:28:07.759+00:00 [ERROR] {2 errors, starting with dial tcp IP:11210: i/o timeout}
2022-10-15T15:28:07.759+00:00 [Info] GSIC[default/test-bucket-_default-_default-1665846952475081799] request(b3565478-167a-4bca-8697-67a3d4c04464) removing temp file /opt/couchbase/var/lib/couchbase/tmp/scan-results15352794603179 ...
2022-10-15T15:28:52.532+00:00 [Info] connected with 3 indexers

I can’t find the 12008 error, or anything else relevant, anywhere in query.log.

test-bucket - the name of the bucket
IP - the IP of the stopped node
IP2, IP3 - the IPs of the remaining nodes

12008 is the SQL error code; you will only see it in the response (code:12008, “Error performing bulk get…”, as you quoted) and not in the query.log. Did you see it with version 6.6 too, with the same response and roughly the same elapsed time? (Not strictly important, but I wanted to check since you didn’t mention the error in the original post but did in the 7.1 repro.)

The sort of messages in the query.log that would tell when a new cluster layout is received will look like:

2022-10-15T22:13:28.899+01:00 [INFO] Bucket updater: Trying with http://127.0.0.1:8091/pools/default/bucketsStreaming/default

and

2022-10-15T22:13:49.057+01:00 [INFO] Bucket updater: switching manifest id from 1 to 2 for bucket default

The expectation remains the same though; Query will not be able to access v-buckets on the “down” node until told by the cluster orchestrator that they can be found elsewhere (i.e. after the rebalance). Data not requiring access to the down node would remain accessible.

Thanks for clarification!

CE 6.6 and CE 7.1 give exactly the same response (I provided it for 7.1 only because I hadn’t posted it earlier), with no 12008 error in query.log.

According to Intra-Cluster Replication, there is only one active set of vBuckets at any given moment.

So with one of the nodes “down”, we need to wait until the failover/rebalance happens and all the “lost” data from the replicas is promoted to the active state.

But I may be understanding it wrong; that is why I came to the forum, to get confirmation or a more detailed explanation of what happens before the node failover/rebalance.
I expected that some data might be “lost” until the rebalance, but not that the whole Query service would be unavailable until the failover/rebalance.

Thanks for confirming; at least you didn’t then encounter something unexpected in 6.6.

The whole Query service is not down; you can still access data not on that down node.

The trouble is this is often very limited as you typically have to know the keys in question to be able to avoid a down v-bucket. You could also successfully run any wholly covered queries - they don’t require access to the data, only to the indices (assuming the indexing service is unaffected by the down node too).

You should be able to prove both of these simply enough. A covered query simply requires that you filter on and return only index key fields in your query (you can check the EXPLAIN output to confirm coverage and the absence of a Fetch operator).
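
For example, a minimal covered-query sketch (the index name ix_type and the type field are hypothetical; the default bucket is used as in the examples below):

CREATE INDEX ix_type ON `default`(type);

/* covered: both the filter and the projection use only the indexed field */
SELECT type FROM `default` WHERE type = "test";

/* the plan should show an IndexScan but no Fetch operator */
EXPLAIN SELECT type FROM `default` WHERE type = "test";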

You can use cbc-hash to determine the v-bucket for a key, e.g.

$ /opt/couchbase/bin/cbc-hash -u Administrator -P password abc123
abc123: [vBucket=770, Index=2] Server: IP2:11210, CouchAPI: http://IP2:8092/default
Replica #0: Index=0, Host=IP1:11210
and using that information insert a key that won’t be affected by your test. You can then use the USE KEYS clause of a SELECT statement to verify it is still accessible (i.e. in this case it is OK for IP1 to go down and the key should still be available); e.g.

select * from `default` use keys ["abc123"];
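
For completeness, the hypothetical test document itself could first be created with something like:

/* the document body is arbitrary; only the key determines the v-bucket */
INSERT INTO `default` (KEY, VALUE) VALUES ("abc123", {"type": "test"});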

Of course, we operate like this deliberately to prevent inaccurate results. If we simply ignored down v-buckets then, if your query was, say, to look up a customer on your system, it could report the customer as “unknown” when in fact they exist. Better that your statement waits and/or gets an error than that it reports inaccurate results.

HTH.

Thank you for the detailed explanation!

Please correct me if I’m wrong (assuming the prerequisite and scenario in the description):

  1. If one of the nodes is down, the whole Query service is not down; we’re able to get “partial” data (I say “partial” because the data on the failed node is not available and the replicas are not in the active state yet) by specifying keys in the query to avoid the “down” vBuckets
  2. Even with bucket replicas, there is only one set of vBuckets in the active state at a time. That means that if one of the nodes fails (assuming part of the requested vBuckets reside on this node), we need to wait for the failover/rebalance until the orchestrator switches the replicas to the active state (or the rebalance re-organizes the vBuckets between the remaining nodes)
  3. Because of the 2nd point, the Query service returns an error (a timeout in my case) and this is the designed behavior. The goal is that it is better to wait and/or get an error than to retrieve partial/corrupted data

So in terms of durability, having bucket replicas means my data is 100% safe (after the failover/rebalance we can retrieve the whole set of data); in terms of HA (high availability) we have a kind of short “down-time” until the node failover happens (either the orchestrator switches the replicas to the active state or the rebalance re-organizes the vBuckets between the remaining nodes - the vBuckets from the failed node will be restored from the replicas).

Am I correct?

Yes, your understanding & summary of the failed node situation is correct.

If you refer to https://docs.couchbase.com/server/current/learn/clusters-and-availability/intra-cluster-replication.html it notes “the chance of data-loss through the failure of an individual node is minimized”. So not a 100% guarantee, but minimized.

The reason for “minimized” is that the writes to the replica v-buckets are not a synchronous part of the write to the active v-bucket; that is, the write completes on the active v-bucket before the change is replicated to the replica(s). This means there is a small window in which a write may happen (and possibly a successful response be sent) but, before it can be replicated, the node goes down. In practical terms this is unlikely to affect you in most situations.

Yes, until the automatic failover triggering criteria are met, you would be expected to encounter these data access issues which may effectively be a sort of down-time for the Query service, at least.

Thanks a lot for the conversation and the detailed explanations!