[MB-5020] Rebalance state incorrectly reported as running even when it's not and user is unable to stop it or fail over/add nodes Created: 05/Apr/12 Updated: 13/May/12 Resolved: 05/Apr/12 |
|
| Status: | Resolved |
| Project: | Couchbase Server |
| Component/s: | ns_server, RESTful-APIs |
| Affects Version/s: | 1.8.0 |
| Fix Version/s: | 1.8.1 |
| Security Level: | Public |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Aleksey Kondratenko | Assignee: | Aliaksey Artamonau |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | 1.8.1-release-notes | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Description |
|
A customer's master node failed in the middle of a rebalance. As a result the rebalance was aborted, but the ns_config flag that marks a rebalance as running was left set.
The most notable consequence is that the UI disallows many actions while a rebalance is running. The UI therefore incorrectly believed that a rebalance was in progress and would not allow the broken node to be failed over. Stop rebalance did not work either, because no rebalance was actually running. The customer had to reset the rebalance state manually via a /diag/eval snippet that sets the rebalance_status config variable. I recommended something like this: ns_config:set(rebalance_status, {node, <<"stopped by human">>}).
Note that 1.8.0 does have code to clean up a stale rebalance status, but it is triggered only when all nodes are healthy, which was not the case for this customer. So the decision was to clear the rebalance status whenever the user asks for it, but to warn the user if our orchestrator is clearly not running a rebalance, because in a network partition another partition may still have the old orchestrator trying to run the rebalance. |
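For reference, the manual workaround described above could be scripted roughly as follows. This is only a sketch in Python, not part of the fix. It builds, but does not send, a POST to /diag/eval carrying the Erlang expression quoted in the description, including its trailing period. The helper name `build_diag_eval_request`, the admin port 8091, and the use of HTTP basic auth with administrator credentials are assumptions that the ticket does not state.

```python
import base64
import urllib.request

# Erlang expression copied verbatim from the ticket description.
RESET_EXPR = 'ns_config:set(rebalance_status, {node, <<"stopped by human">>}).'


def build_diag_eval_request(host, user, password, port=8091, expr=RESET_EXPR):
    """Build (without sending) a POST to /diag/eval with `expr` as the body.

    Assumptions not taken from the ticket: the admin port (8091) and
    HTTP basic auth with administrator credentials.
    """
    url = f"http://{host}:{port}/diag/eval"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=expr.encode("utf-8"),
        headers={"Authorization": f"Basic {token}"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical host and credentials; nothing is sent over the network here.
    req = build_diag_eval_request("127.0.0.1", "Administrator", "password")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Passing the returned object to `urllib.request.urlopen` would actually send it. /diag/eval evaluates arbitrary Erlang on the node, so this belongs only in the kind of manual recovery the ticket describes.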
| Comments |
| Comment by Aleksey Kondratenko [ 05/Apr/12 ] |
| Fix merged as a bunch of commits |
| Comment by Thuan Nguyen [ 05/Apr/12 ] |
|
Integrated in github-ns-server-2-0 #329 (see http://qa.hq.northscale.net/job/github-ns-server-2-0/329/)
Store rebalancer PID in config.
Drop rebalance status even when rebalance isn't running.
Add stopRebalanceIsSafe to pool details.
Warn user on unsafe rebalance stop attempt.
Result = SUCCESS
Aliaksey Kandratsenka : Files : * src/ns_janitor.erl * src/ns_orchestrator.erl
Aliaksey Kandratsenka : Files : * src/ns_janitor.erl * src/ns_orchestrator.erl
Aliaksey Kandratsenka : Files : * src/ns_cluster_membership.erl * src/menelaus_web.erl
Aliaksey Kandratsenka : Files : * priv/public/js/servers.js * priv/public/index.html |
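Read together with the description, the commit messages suggest a stop-rebalance flow along the lines sketched below. This is a toy Python model, not the ns_server implementation, which is written in Erlang. Every name in it (`ClusterView`, `stop_rebalance_is_safe`, `stop_rebalance`, `allow_unsafe`, the returned strings) is hypothetical, and the exact rule behind the real stopRebalanceIsSafe field is not given in the ticket.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClusterView:
    """Hypothetical snapshot of what one node knows about the rebalance."""
    status_running: bool            # config flag that marks a rebalance as running
    rebalancer_pid: Optional[str]   # PID recorded in config when a rebalance starts
    local_rebalancer_alive: bool    # is our own orchestrator really running it?


def stop_rebalance_is_safe(view: ClusterView) -> bool:
    # Assumed rule: stopping is "safe" only if our orchestrator owns the rebalance.
    return view.local_rebalancer_alive


def stop_rebalance(view: ClusterView, allow_unsafe: bool = False) -> str:
    if view.local_rebalancer_alive:
        # Normal case: a rebalance is really running here, so stop it.
        view.local_rebalancer_alive = False
        view.status_running = False
        view.rebalancer_pid = None
        return "stopped"
    if not view.status_running:
        return "not_running"
    if not allow_unsafe:
        # Stale flag: another partition may still host the old orchestrator,
        # so ask the user to confirm before dropping the status.
        return "needs_confirmation"
    # User confirmed: drop the status even though no rebalance runs here.
    view.status_running = False
    view.rebalancer_pid = None
    return "status_cleared"


if __name__ == "__main__":
    # Scenario from the ticket: master died mid-rebalance, flag left set.
    view = ClusterView(True, "pid-on-failed-master", False)
    print(stop_rebalance(view))                     # first attempt warns
    print(stop_rebalance(view, allow_unsafe=True))  # confirmed attempt clears
```

The point of the model is the middle branch: a stale status is cleared only after an explicit confirmation, which matches the description's concern that another network partition may still be running the rebalance.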