We had a node run out of space, so we failed it out of the cluster and started a rebalance. The rebalance ran for a while but now appears to be stalled. In the Rebalance Progress for each server node, the "Total number of keys to be transferred" and "Estimated number of keys transferred" stats have stopped changing, and it has been quite a while since they last moved.
What is the best action to take to prove the rebalance is stalled? Is there a log file or other counters we should be looking at to ensure the rebalance is moving?
If it is stalled, what is the best action to take to get it moving again? Stop then start the rebalance?
@radleta there are a couple of things that you could do here:
Stop the rebalance, wait for some time, and start it again.
Are there any indexes or views defined in your setup? If so, you might want to look at their statistics (Bucket -> statistics) as well. Sometimes compaction of views can stall rebalance progress.
If the rebalance fails, it generally shows up in the UI logs. If not, you can run cbcollect_info on your cluster and then look at diag.log to see when the rebalance started and, by comparing timestamps, what exactly happened.
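If you want something more objective than watching the UI counters, a rough sketch like the one below could poll the rebalance progress over the REST API and flag when nothing has changed for a number of samples. It assumes the cluster manager answers on port 8091 at /pools/default/rebalanceProgress and returns a "status" field plus one entry per node with a "progress" value; please check that against your version. The function names, the sampling interval, and the sample count are made up for illustration.

```python
import base64
import json
import time
import urllib.request


def fetch_progress(host, user, password, port=8091):
    """Fetch one progress snapshot (endpoint path and port are assumptions)."""
    url = f"http://{host}:{port}/pools/default/rebalanceProgress"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def node_progress(payload):
    """Return {node: progress}, skipping non-node fields such as 'status'."""
    return {
        name: value["progress"]
        for name, value in payload.items()
        if isinstance(value, dict) and "progress" in value
    }


def is_stalled(samples, min_samples=3):
    """True if the last `min_samples` snapshots are all running and identical."""
    if len(samples) < min_samples:
        return False
    recent = samples[-min_samples:]
    if any(s.get("status") != "running" for s in recent):
        return False
    first = node_progress(recent[0])
    return all(node_progress(s) == first for s in recent[1:])


def watch(host, user, password, interval_s=60, min_samples=10):
    """Poll until the rebalance stops running or looks stalled."""
    samples = []
    while True:
        samples.append(fetch_progress(host, user, password))
        if samples[-1].get("status") != "running":
            print("rebalance is not running:", samples[-1])
            return
        if is_stalled(samples, min_samples):
            print("no progress change in the last", min_samples, "samples")
            return
        time.sleep(interval_s)
```

Calling watch("your-node", "Administrator", "password") would sample once a minute. An unchanged percentage over a short window is only a hint, since the reported progress can be coarse, so it is still worth confirming against the logs before stopping and restarting the rebalance.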
@aruns1987 Thanks for the suggestions and taking the time to respond.
We ended up having to build a second cluster, migrate the data to it, and then flush the buckets to get the rebalance to finish. It appeared that specific buckets had been corrupted, which prevented the rebalance from finishing.