Rebalance required for cluster map to be refreshed

We have three query nodes and around 12 data nodes in the operational cluster. Our query APIs are deployed on OpenShift, and we are using Spring Data 4.x. We ran JMeter tests with all the nodes up and everything went fine. Then we started the chaos testing as follows:

  1. Shut down 1 node
  2. Trigger JMeter testing
  3. Some of the queries started failing; I think this was because the cluster map had not been refreshed yet. More problems followed in the subsequent steps
  4. We shut down 1 more node
  5. Auto-failover did not kick in
  6. We manually triggered a failover, but did not do a rebalance
  7. Pods continued to fail and we could not serve any query requests
  8. We then did the rebalance
  9. After that the pods came up normally and queries started to be served

Ideally, a rebalance should not be required just to serve queries, so we are not sure why it was needed here. We also do not understand why auto-failover did not kick in automatically.

Are there any known issues with the Spring Data connector?

Hi @rajib761. How are you shutting down the nodes? Unless you “rebalance out” a node, the indexes that were on it do not move to other nodes, so if there were no replicas on other nodes those indexes become unavailable.

I agree with what Kevin mentions; have a look at the docs on index replicas.

Spring Data uses the Java SDK underneath and the scenarios you describe are tested regularly with clusters of various configurations (with/without redundancy) to ensure the right behavior. This is done for every release. Any known issues we have are in the release notes or on issues.couchbase.com, but I can pretty reliably say there aren’t any known issues that would exhibit the behavior you’re observing.

Hope that helps!

Hi Kevin, we are killing the nodes as part of chaos testing. My understanding was that a rebalance is not immediately required when a node goes down: as soon as the auto-failover happens, the cluster map is refreshed (default 2.5 seconds) and the cluster will be available for data access. Production support can then take time to investigate the issue and rebalance the cluster later. Is my understanding not correct?

Thanks Matt. I guess the problem may be happening because of what Kevin mentioned. I have requested a follow-up clarification and will wait for Kevin's reply.

@rajib761 If you are just killing the node, then any indexes or index replicas on that node are gone and will not be rebuilt until a manual rebalance is done. If any surviving Index node contained a replica of an index that was on the killed node, that index remains available from the replica, and the rebalance will also rebuild the lost replica on another Index node, provided there is one available in the cluster that does not already have that index or a replica of it (this is called “replica repair”). Couchbase Server will not rebuild a replica on a node that already has a copy of the same index: putting multiple copies of the same index on the same node has no availability benefit and just consumes extra resources.

If the killed node hosted any indexes that had no replicas on any surviving Index node, then the metadata for those indexes is lost, so you will need to issue a “create index” statement on a surviving Index node to recreate each one.
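For reference, here is a minimal sketch of recreating a lost index through the Java SDK (which Spring Data uses underneath); the cluster address, credentials, bucket, field, and index names are just placeholders, and the same statement could equally be run from the Query Workbench or cbq:

```java
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.query.QueryResult;

public class RecreateIndex {
    public static void main(String[] args) {
        // Hypothetical connection details -- replace with your own cluster address and credentials.
        Cluster cluster = Cluster.connect("couchbase://query-node-1", "Administrator", "password");

        // Recreate an index that was lost when its only hosting Index node was killed.
        // "travel-sample", "idx_airline_country" and "country" are placeholder names.
        QueryResult result = cluster.query(
            "CREATE INDEX `idx_airline_country` ON `travel-sample`(`country`)");

        System.out.println("Index creation status: " + result.metaData().status());
        cluster.disconnect();
    }
}
```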

Regarding the Autofailover feature: if the node you killed had the Index Service but did not also have the Data Service, then Autofailover will not occur for the Index Service, because currently this feature is actually driven by the Data Service. The result is that if Index and Data are co-located on a node, both Index and Data for that node will Autofailover, but if Index is on a node without Data, that node will not Autofailover the Index Service. This is a known limitation; in fact I am currently personally working on implementing Autofailover for the Index Service even when it is not co-located with Data, but this is still in the pipeline for a future release (boilerplate disclaimer: until it is officially announced, Couchbase does not guarantee if or when it will become available, but it is definitely a frequently requested enhancement).

Note that even with Autofailover, for an index to survive the death of a node there has to be at least one surviving Index node that already had that index on it (main or replica; there is really no difference between these, as they are all fully read-write active). If no surviving Index node had the main or a replica of that index, then the metadata for that index is no longer available, and again you have to issue a manual “create index” to recreate the index.

Hope this helps!

@rajib761 Also note that Couchbase does not actually recommend co-locating Data and Index service on a node, because both of these services are heavy consumers of both memory and CPU. Query + Index on the same node is a more copacetic combination.

Hi Kevin, thanks a lot for the detailed explanation. This is very clear now. My Index Service runs on separate nodes, but I have only one replica; I think that to survive two nodes going down I will need to bump the index replicas to two. I also did not understand why auto-failover is a limitation for the Index Service. There is no concept of active and replica vbuckets for indexes, and both main and replica indexes are read-write, so why would an auto-failover be required? We just need to bring the Index node up again, and the index will get automatically rebuilt on that node since a copy of it already exists on another node.

Are you hinting at auto-failover for the Index Service when there are no index replicas?

@rajib761

My Index Service runs on separate nodes, but I have only one replica; I think that to survive two nodes going down I will need to bump the index replicas to two.

That is correct. The service guarantee is that if you have N replicas of an index, that index will survive the loss of N nodes. (Note that N = 3 is the maximum number supported.) You also need to have at least N + 1 Index nodes (main + N replicas), or not all N replicas will be created.
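As an illustration, here is a minimal sketch of creating an index with two replicas via a N1QL statement run through the Java SDK; the connection details, index, bucket, and field names are placeholders, and with num_replica set to 2 you need at least three Index nodes:

```java
import com.couchbase.client.java.Cluster;

public class CreateIndexWithReplicas {
    public static void main(String[] args) {
        // Hypothetical connection details -- replace with your own cluster address and credentials.
        Cluster cluster = Cluster.connect("couchbase://query-node-1", "Administrator", "password");

        // Two replicas means the index survives the loss of two Index nodes,
        // but it requires at least three Index nodes in the cluster (main + 2 replicas).
        cluster.query(
            "CREATE INDEX `idx_airline_country` ON `travel-sample`(`country`) " +
            "WITH {\"num_replica\": 2}");

        cluster.disconnect();
    }
}
```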

I also did not understand why auto-failover is a limitation for the Index Service. There is no concept of active and replica vbuckets for indexes, and both main and replica indexes are read-write, so why would an auto-failover be required?

Autofailover will take the failing-over nodes out of the cluster, so it has to have a way of determining when it should do this. This is a trickier problem than it seems on its face, e.g. if the node becomes unresponsive for a while, how long should Cluster Manager wait to hear from the node before it fails it over? If this is too short, it could trigger a failover of a node that was just experiencing a temporary spike in workload. This will then make the problem worse instead of better, because the remaining nodes need to handle the workload of the failed-over nodes in addition to the workloads they were already handling, making it more likely they get overloaded and unresponsive, thus triggering a cascade of auto-failovers.

So there is input into the Autofailover decision from the individual service, and this has not yet been implemented by the Index Service. We also want to put in place some shock absorbers that can absorb transient workload spikes without making the service unresponsive to the health checks from Cluster Manager that feed the decision on whether to automatically fail over. Currently Autofailover is done at the node level instead of the service level, which is why, if KV and Index are on the same node and KV decides it needs to autofailover, the side effect is that Index is also failed over, as KV is considered the priority service in this situation. We have had some discussions on whether to change Autofailover to be done at the service level instead of the node level product-wide, but that will be a ways out if it happens. Conceptually it makes more sense, as one service may be unhealthy while another on the same node is not; not all failures are node-level like loss of power to the node.
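As a practical check for the original “auto failover did not kick in” symptom, here is a minimal sketch of reading the cluster’s current auto-failover settings from the standard /settings/autoFailover REST endpoint; the host and credentials are placeholders:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class CheckAutoFailoverSettings {
    public static void main(String[] args) throws Exception {
        // Hypothetical node address and credentials -- replace with your own.
        String auth = Base64.getEncoder()
                .encodeToString("Administrator:password".getBytes(StandardCharsets.UTF_8));

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://query-node-1:8091/settings/autoFailover"))
                .header("Authorization", "Basic " + auth)
                .GET()
                .build();

        // The response is JSON containing fields such as "enabled" and "timeout" (seconds),
        // i.e. whether auto-failover is turned on and how long the Cluster Manager waits
        // before failing over an unresponsive node.
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```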

Are you hinting at auto-failover for the Index Service when there are no index replicas?

No, the number of replicas has nothing to do with the decision to autofailover; it only impacts which indexes actually survive the failover. If no indexes have any replicas, losing any Index node will also lose all the indexes it contains, and they will need to be recreated manually via new “create index” statements. The reason for this index loss is that each Index node only has the metadata describing the indexes it hosts (whether the main one or a replica). It is possible we could enhance the metadata handling so that all Index nodes have the metadata for all indexes in the entire cluster, which would eliminate the need to manually recreate any indexes as long as at least one Index node survives. There have been some discussions about this, but it would also be a ways out in the future if it happens.
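If it helps with verification, here is a minimal sketch of listing which index definitions (and their states) the cluster still has metadata for after a node loss, by querying the system:indexes keyspace through the Java SDK; the connection details are placeholders:

```java
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.json.JsonObject;
import com.couchbase.client.java.query.QueryResult;

public class ListSurvivingIndexes {
    public static void main(String[] args) {
        // Hypothetical connection details -- replace with your own cluster address and credentials.
        Cluster cluster = Cluster.connect("couchbase://query-node-1", "Administrator", "password");

        // system:indexes lists the index definitions the cluster still knows about.
        // Any index that was hosted only on the killed node will be missing here and
        // must be recreated with a CREATE INDEX statement.
        QueryResult result = cluster.query(
            "SELECT name, keyspace_id, state FROM system:indexes");

        for (JsonObject row : result.rowsAsObject()) {
            System.out.println(row);
        }
        cluster.disconnect();
    }
}
```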

Thanks a lot Kevin, this is a very detailed clarification. My concepts on this topic are much clearer now. I have been involved with many Couchbase implementations, but I never had this much clarification before. Every new Couchbase engagement teaches me something more :)


@rajib761 Thank you, glad I was able to help.