We are really sorry to hear that your cluster got corrupted, and glad that you were able to recover it yourself.
This can happen if concurrent index definition updates take place while a rebalance is in progress: for example, index definition creates/updates/deletes (CUD operations) issued during a rebalance, or many clients performing concurrent index CUD operations from different nodes at around the same time.
Such concurrent updates can leave the eventual partition-node layout plan inconsistent when the conflicting plans produced by the system's distributed planners are resolved.
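To illustrate the failure mode in the abstract, here is a toy sketch (not the FTS planner code; all names are made up): when two writers each read the current plan, modify it, and write it back without coordination, the stored plan can end up matching neither writer's expected hash.

```python
import hashlib
import json

def plan_hash(plan):
    # Hash a canonical JSON encoding of the plan.
    return hashlib.sha1(json.dumps(plan, sort_keys=True).encode()).hexdigest()

# Shared store with no compare-and-swap: last writer wins.
store = {"plan": {"partitions": {"p0": "node1"}}}

# Writer A and writer B both read the same starting plan...
plan_a = json.loads(json.dumps(store["plan"]))
plan_b = json.loads(json.dumps(store["plan"]))

# ...and make different changes concurrently.
plan_a["partitions"]["p1"] = "node2"   # e.g. a rebalance moves a partition
plan_b["partitions"]["p0"] = "node3"   # e.g. an index definition update

# Both write back; B silently overwrites A's update.
store["plan"] = plan_a
store["plan"] = plan_b

# A's recorded hash no longer matches what is actually stored.
assert plan_hash(plan_a) != plan_hash(store["plan"])
```

The real system resolves such conflicts across distributed planners, but the toy shows why unsynchronized plan writes can leave a stored plan whose hash disagrees with what a participant expects.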
Was that the scenario in your case? If so, we are working on fixing this in the upcoming software version. We are also looking at a better way to unblock the cluster with minimal damage (skipping the index rebuild) if such a situation occurs, for the next release.
Any details about your cluster context or a cbcollect_info would be helpful to us.
Thanks for your quick answer. And no problem, that’s the goal of test clusters: finding problems and learning new things.
We already opened a support ticket (# 38792) with collected logs, but since this is not really an enterprise ticket (no licenses for the test cluster), we tried to fix the cluster ourselves.
It was just a replacement of all nodes, one by one, so nobody performed CUD operations on index definitions. But since this was done with some scripts, maybe something went wrong and a script tried to add/remove a node during a running rebalance, or something like that.
Just to be sure, in case this happens again when we update the production cluster: is the way we fixed it the recommended way, and will everything be fine afterwards? No hidden problems if we do that?
FTS occasionally prints the partition-node layout plan to its logs, and this can be used to resurrect the cluster with minimal service outage.
If you trace up from the “hash mismatch” error in the FTS logs, you should see log entries similar to these:
[INFO] cfg_metakv_lean: setLeanPlan, val: {large plan contents in JSON}
[INFO] cfg_metakv_lean: setLeanPlan, curMetaKvPlanKey set, val: {"path":"/fts/cbgt/cfg/planPIndexesLean/planPIndexesLean-4e6f3436c9042a1c8eb3948bd6079079-1615465985291/","uuid":"414de85530a0cc7f","implVersion":"5.5.0"}
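Locating the last plan key written before the error comes down to a simple log scan. A hedged sketch in Python (the sample log content and file names here are stand-ins; point it at your actual FTS log, whose name and location vary by installation):

```python
# Write a tiny sample log so the sketch is self-contained; real entries
# come from the FTS logs in the Couchbase log directory.
sample = """\
[INFO] cfg_metakv_lean: setLeanPlan, curMetaKvPlanKey set, val: {"path":"/fts/cbgt/cfg/planPIndexesLean/planPIndexesLean-aaa-1/","uuid":"111"}
[INFO] cfg_metakv_lean: setLeanPlan, curMetaKvPlanKey set, val: {"path":"/fts/cbgt/cfg/planPIndexesLean/planPIndexesLean-aaa-2/","uuid":"222"}
[ERRO] cfg_metakv_lean: getLeanPlan, hash mismatch between plan hash
"""
with open("fts_sample.log", "w") as f:
    f.write(sample)

# Scan for the relevant messages; the last setLeanPlan entry before the
# "hash mismatch" line carries the plan key value written just before
# the error.
hits = []
with open("fts_sample.log") as f:
    for lineno, line in enumerate(f, 1):
        if "setLeanPlan" in line or "hash mismatch" in line:
            hits.append((lineno, line.rstrip()))

for lineno, text in hits:
    print(lineno, text)
```

The same filtering can of course be done with grep; the point is simply to find the setLeanPlan entry immediately preceding the hash mismatch.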
If you reset the value of “curMetaKvPlanKey” to the value immediately preceding the hash mismatch error, everything should work without any index partition rebuilds.
But if you only try this at some later point after the error occurred, then resetting the key to an empty value, as you did now, is the recommended approach.
Having said that, this is a distributed system with many moving parts, such as index definitions and node definition changes like those in your cluster, so there are plenty of potential points of failure.
We are working to improve the robustness here in the next release.
The error “cfg_metakv_lean: getLeanPlan, hash mismatch between plan hash” has been addressed in the 6.6.2 release of Couchbase Server. It is recommended to upgrade to this version to avoid the error.