I am unable to rebalance a production Couchbase 6.5 cluster. I've added a 7.1 node and attempted to rebalance. The first rebalance failed with:
{"completionMessage":"Rebalance stopped by janitor."}
A subsequent rebalance ran for a while and then failed with:
Rebalance exited with reason {service_rebalance_failed,index,
    {agent_died,<29443.456.0>,
     {linked_process_died,<29443.1509.0>,
      {timeout,
       {gen_server,call,
        [<29443.1507.0>,
         {call,"ServiceAPI.GetTaskList",
          #Fun<json_rpc_connection.0.102434519>},
         60000]}}}}}.
The rebalance button is now disabled. Any help is greatly appreciated.
If this is an Enterprise server, please open a case with Customer Support.
Otherwise, look in the server logs for more information about the ‘agent_died’ error.
It might be worthwhile to try adding a 6.5 server to eliminate one variable.
Ensure all the ports are accessible: Couchbase Server Ports | Couchbase Docs
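If you want to script that connectivity check from another node, here is a minimal sketch in Python. The host names are placeholders, and the port list is only an illustrative subset of the ports in that doc (cluster admin, views, query, search, index service admin, data service), not the complete list:

import socket

# Placeholder host list -- replace with your own cluster nodes.
NODES = ["node1.example.com", "node2.example.com"]

# Illustrative subset of Couchbase Server ports; see the ports doc
# linked above for the full list.
PORTS = [8091, 8092, 8093, 8094, 9102, 11210]

for host in NODES:
    for port in PORTS:
        try:
            with socket.create_connection((host, port), timeout=3):
                print(f"{host}:{port} reachable")
        except OSError as exc:
            print(f"{host}:{port} NOT reachable ({exc})")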
Can you please collect logs using cbcollect_info? It will give a complete picture of what is going on. https://docs.couchbase.com/server/current/manage/manage-logging/manage-logging.html shows the different ways to do this.
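For reference, scripting the collection on one node could look something like the sketch below. The binary path assumes a default Linux install, and the output filename and redaction level are just examples:

import subprocess

# Minimal sketch: run cbcollect_info locally with log redaction enabled.
# /opt/couchbase/bin is the default Linux install location; adjust if needed.
CBCOLLECT = "/opt/couchbase/bin/cbcollect_info"

subprocess.run(
    [CBCOLLECT, "--log-redaction-level=partial", "/tmp/node1-collect.zip"],
    check=True,
)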
Thanks for your response and information. This is not an Enterprise server.
-
I’ve collected redacted logs but am still working through our organization’s clearance procedures for distributing them. If cleared, should I use the upload-to-Couchbase feature?
-
Review of the logs around the agent_died event hasn’t yielded much context to identify a root cause so far. Attached at the bottom is a section of the reports.log from a cluster node.
-
We’ve ensured all ports are accessible and there is no firewall between nodes.
Thanks again for your help.
crash-snippet.reports.log.zip (3.4 KB)
Ok. In that file I find what you first posted:
messages: [{'EXIT',<0.26359.955>,
            {linked_process_died,<0.26276.955>,
             {timeout,
              {gen_server,call,
               [<0.26083.955>,
                {call,"ServiceAPI.GetTaskList",
and I search issues.couchbase.com for “linked_process_died ServiceAPI.GetTaskList”. I find an issue there which says the problem is fixed in 7.0.0. So upgrade your existing servers to 7.1, and then add the new node.
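If it helps to confirm which nodes report which version before and after the upgrade, a quick check against the cluster REST API should work. This is only a sketch; the host and credentials are placeholders:

import base64
import json
import urllib.request

# Placeholder host and credentials -- replace with your own.
HOST = "http://node1.example.com:8091"
USER, PASSWORD = "Administrator", "password"

# GET /pools/default lists every node in the cluster along with the
# Couchbase Server version each one reports.
token = base64.b64encode(f"{USER}:{PASSWORD}".encode()).decode()
req = urllib.request.Request(
    f"{HOST}/pools/default",
    headers={"Authorization": f"Basic {token}"},
)

with urllib.request.urlopen(req) as resp:
    cluster = json.load(resp)

for node in cluster["nodes"]:
    print(node["hostname"], node["version"], node["status"])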