Unable to rebalance servers

I am unable to rebalance our production database. I get this error:

Also, a few of the indexes go stale and I’m unable to recreate them. Dropping the bucket and recreating the data does not fix the indexing problem.

@elbilo “service_rebalance_failed,index,” means the error message logged here came from the Index service. The Index service found information on disk indicating that a prior rebalance, failover, or index move attempt failed and still needs to be cleaned up. (For example, the failed attempt could have created index metadata on the destination nodes for indexes that never actually got moved, and that metadata would need to be deleted.)

You can look in the indexer.log on nodes that have Index service around the timestamp of the debug.log message you showed above to see if there is more detailed information logged by Index service. E.g. if the cleanup attempts themselves are failing there may be more info logged to help diagnose why.

Some landmarks to look for in indexer.log files:

  1. “PrepareTopologyChange” messages - this is a pre-Rebalance preparation step whose call and outcome (two separate messages containing the quoted string) should be logged on all Index nodes.

  2. “StartTopologyChange” messages (call and outcome) - logged only on the Index “leader” node for the Rebalance attempt, if it got that far.

  3. “onRebalanceDoneLOCKED” - logged on all Index nodes when the Rebalance attempt ends, whether that is through success, failure, or cancellation.

Also, all messages in indexer.log containing the string “RebalanceServiceManager::” or “Rebalancer::” are from the Rebalance component, so you can grep for these.
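If grep is not convenient, the same search can be sketched in a few lines of Python. This is only an illustration, not a Couchbase tool: the function name `find_rebalance_landmarks` is made up for this example, and the patterns are simply the landmark strings quoted above (written loosely so that a dropped “o” in “Topology” still matches).

```python
import re

# Landmark strings quoted in the post above. The "o?" makes the match
# tolerant of a dropped "o" in "Topology" (an assumption for robustness).
LANDMARKS = {
    "PrepareTopologyChange": re.compile(r"PrepareTopo?logyChange"),
    "StartTopologyChange": re.compile(r"StartTopo?logyChange"),
    "onRebalanceDoneLOCKED": re.compile(r"onRebalanceDoneLOCKED"),
    "RebalanceServiceManager::": re.compile(r"RebalanceServiceManager::"),
    "Rebalancer::": re.compile(r"Rebalancer::"),
}


def find_rebalance_landmarks(lines):
    """Return (line_number, landmark_name, text) for each matching line.

    `lines` is any iterable of strings, e.g. an open indexer.log file.
    This helper is hypothetical; it is not part of any Couchbase tooling.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        for name, pattern in LANDMARKS.items():
            if pattern.search(line):
                hits.append((number, name, line.rstrip("\n")))
                break  # report each line once, under the first landmark it matches
    return hits
```

To use it, open the indexer.log from each Index node (the path depends on your install) and pass the file object in, for example `with open(path, errors="replace") as log: hits = find_rebalance_landmarks(log)`, then compare the hit timestamps with the debug.log message.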

So I tried to “rebalance” again this morning and this time I’m seeing a slightly different error.

I’m attaching a copy of indexer.log file from the starting time of the rebalance command.

rebalanceError_indexer.log.zip (7.5 KB)
This log is from the 3rd node in our cluster. I’m also attaching the indexer log from node 1.

rebalanceError_indexer_node1.log.zip (9.5 KB)

In reviewing these logs, I don’t see any references to specific objects causing the problem, so I can’t tell what I need to delete or what specifically to clean up.

@elbilo Rebalance runs one service at a time. The original failure you posted was due to the Index service failing the Rebalance, but the new failure is from the Analytics service (“cbas” in the log message = Couchbase Analytics Service). Index has already succeeded per the screenshot you included, so the indexer.log files will not be helpful here.

I am not familiar with the Analytics area, so hopefully someone from that team will chime in.