Data nodes on the cluster need frequent rebalancing

We have set up a cluster with the following configuration:

  • Data Nodes: 3, with a bucket replica configuration of 2.
  • Other Nodes: 2 nodes that manage query, index, and eventing.

Although we have dedicated query nodes, we rely primarily on kvops (Document and SubDocument operations) for data access, which the Couchbase documentation describes as the fastest approach. However, we’ve observed that memory utilization on a data node often spikes to 85%, triggering a graceful failover. To restore the cluster to its normal state, we typically need to perform a rebalance.
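For context, this is roughly the kind of KV and sub-document access we mean. A minimal sketch with the Python SDK (connection string, credentials, bucket name, and document shape are placeholders, not our real setup):

```python
from datetime import timedelta

from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions
import couchbase.subdocument as SD

# Connect to the cluster (hostname, credentials, and bucket are placeholders).
cluster = Cluster(
    "couchbase://cb-node-1",
    ClusterOptions(PasswordAuthenticator("app_user", "app_password")),
)
cluster.wait_until_ready(timedelta(seconds=10))
collection = cluster.bucket("app-bucket").default_collection()

# Full-document KV operations: write and read an entire document.
collection.upsert("user::1001", {"name": "Alice", "visits": 1})
doc = collection.get("user::1001").content_as[dict]

# Sub-document operations: touch only one field instead of the whole document.
collection.mutate_in("user::1001", [SD.upsert("visits", 2)])
name = collection.lookup_in("user::1001", [SD.get("name")]).content_as[str](0)
```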

I have a couple of questions:

  1. Is there a configuration setting that can automatically manage and clear the memory to maintain a healthy data node?
  2. Can the cluster remain operational during rebalancing, even though it sometimes results in failures?

It’s probably best to open an issue with support to identify what’s going on there and get it resolved.

  1. Is there a configuration setting that can automatically manage and clear the memory to maintain a healthy data node?

I don’t know of any. That would likely depend on what is causing the memory spike.
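One thing that may help narrow down the cause is watching per-bucket memory usage against its quota via the cluster REST API on port 8091. A rough sketch (host, credentials, and the 85% threshold are placeholders):

```python
import requests

# Fetch basic per-bucket stats from the cluster manager REST API.
resp = requests.get(
    "http://cb-node-1:8091/pools/default/buckets",
    auth=("admin_user", "admin_password"),
    timeout=10,
)
resp.raise_for_status()

for bucket in resp.json():
    pct = bucket["basicStats"]["quotaPercentUsed"]
    print(f'{bucket["name"]}: {pct:.1f}% of memory quota used')
    if pct > 85:
        # Matches the level at which you reported seeing graceful failovers.
        print(f'  warning: {bucket["name"]} is near the threshold you described')
```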

  2. Can the cluster remain operational during rebalancing, even though it sometimes results in failures?

It should remain operational. The SDK may need to retry operations, for instance if the active copy of a document is on a different node than expected, or when a node is removed from the cluster. But these errors should be transient, and the SDK should be able to recover on its own by retrying.
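As a rough illustration of what that retry can look like on the application side with the Python SDK (the helper name, attempt count, and backoff are placeholders I chose, not an SDK feature):

```python
import time

from couchbase.exceptions import (
    AmbiguousTimeoutException,
    UnambiguousTimeoutException,
)

def get_with_retry(collection, key, attempts=5, backoff=0.5):
    """Retry a KV read across transient errors, e.g. while a rebalance moves vbuckets."""
    for attempt in range(attempts):
        try:
            return collection.get(key)
        except (AmbiguousTimeoutException, UnambiguousTimeoutException):
            if attempt == attempts - 1:
                raise  # give up after the last attempt
            time.sleep(backoff * (2 ** attempt))  # simple exponential backoff
```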

We’ve observed that after removing a node, all applications continue to run. However, after the rebalance completes, the application nodes start throwing unambiguous timeout errors, and we have to restart the application service for it to be fully restored. Is there a parameter we need to set in the client SDK (we are using the Python SDK 4.x) that will automatically recover the client connection after a rebalance?

I found an open issue, PYCBC-1523, that sounds like it could be related to what you describe. If you need more assistance, please open a support case.

It sounds like this is the issue you should address first.
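Until that issue is resolved, one possible application-side workaround (not an SDK setting, and assuming your app can tolerate a brief reconnect) is to rebuild the Cluster object when timeouts persist after a rebalance, rather than restarting the whole service. A hedged sketch with placeholder connection details:

```python
from datetime import timedelta

from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.exceptions import UnambiguousTimeoutException
from couchbase.options import ClusterOptions

def connect():
    # Connection string and credentials are placeholders.
    cluster = Cluster(
        "couchbase://cb-node-1",
        ClusterOptions(PasswordAuthenticator("app_user", "app_password")),
    )
    cluster.wait_until_ready(timedelta(seconds=10))
    return cluster

cluster = connect()
collection = cluster.bucket("app-bucket").default_collection()

def get_with_reconnect(key):
    global cluster, collection
    try:
        return collection.get(key)
    except UnambiguousTimeoutException:
        # If timeouts persist after a rebalance, rebuild the connection
        # instead of restarting the whole application service.
        cluster.close()
        cluster = connect()
        collection = cluster.bucket("app-bucket").default_collection()
        return collection.get(key)
```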
