CB 4.0 Startup Fail (on 4 nodes independently)

digipigeon · October 13, 2015, 9:07am

I have a 4-node cluster running 4.0.0-4051 Community Edition (build-4051); each node has 32GB RAM and a 240GB SSD. I am using about 17% of the disk space without indexes. About 26GB RAM per server is allocated to buckets; 28GB RAM is set for Data and 1GB RAM for Index.

Each node is set up as Data, Index & Query.

I began setting up a new XDCR but started to get unstable nodes. I did a full restart of the cluster, and now it won't come back up at all: each node stays in amber state or only intermittently goes green.

I have switched off XDCR (which increased the frequency of the green nodes) as well as deleted all the production views.

I have tried restarting the cluster together and tried firing up a single machine, but with no luck.

I am not entirely sure what I am looking for in the error log, but here are a few lines which may be relevant:

[ns_server:error,2015-10-13T10:57:16.769+02:00,ns_1@10.1.34.219:index_status_keeper_worker<0.385.0>:index_rest:get_json:45]Request to http://127.0.0.1:9102/getIndexStatus failed: {error,
                                                         {econnrefused,
                                                          [{lhttpc_client,
                                                            send_request,1,
                                                            [{file,
                                                              "/home/couchbase/jenkins/workspace/sherlock-unix/couchdb/src/lhttpc/lhttpc_client.erl"},
                                                             {line,220}]},
                                                           {lhttpc_client,
                                                            execute,9,
                                                            [{file,
                                                              "/home/couchbase/jenkins/workspace/sherlock-unix/couchdb/src/lhttpc/lhttpc_client.erl"},
                                                             {line,169}]},
                                                           {lhttpc_client,
                                                            request,9,
                                                            [{file,
                                                              "/home/couchbase/jenkins/workspace/sherlock-unix/couchdb/src/lhttpc/lhttpc_client.erl"},
                                                             {line,92}]}]}}

=========================INFO REPORT=========================
{net_kernel,{'EXIT',<0.65.2>,shutdown}}
[error_logger:info,2015-10-13T11:01:57.047+02:00,ns_1@10.1.34.219:error_logger<0.6.0>:ale_error_logger_handler:do_log:203]

=========================INFO REPORT=========================
{net_kernel,{'EXIT',<0.1798.2>,shutdown}}
[ns_server:debug,2015-10-13T11:02:38.024+02:00,ns_1@10.1.34.219:<0.1663.2>:janitor_agent:query_vbucket_states_loop:109]Exception from query_vbucket_states of "geoip":'ns_1@10.1.34.218'
{'EXIT',{{nodedown,'ns_1@10.1.34.218'},
         {gen_server,call,
                     [{'janitor_agent-geoip','ns_1@10.1.34.218'},
                      query_vbucket_states,infinity]}}}
[ns_server:debug,2015-10-13T11:02:38.024+02:00,ns_1@10.1.34.219:<0.1663.2>:janitor_agent:query_vbucket_states_loop_next_step:114]Waiting for "geoip" on 'ns_1@10.1.34.218'
[error_logger:info,2015-10-13T11:02:38.024+02:00,ns_1@10.1.34.219:error_logger<0.6.0>:ale_error_logger_handler:do_log:203]

==> babysitter.log <==
memcached<0.76.0>: 2015-10-13T10:41:18.276859+02:00 WARNING (record) Engine warmup is complete, request to stop loading remaining database

Any help would be much appreciated.

fabien · February 8, 2016, 2:54pm

A process cannot connect to port 9102, which is opened by the indexer. These are the ports that the indexer should open by default:
netstat -lnp | grep indexer
tcp6 0 0 :::9100 :::* LISTEN 30575/indexer
tcp6 0 0 :::9101 :::* LISTEN 30575/indexer
tcp6 0 0 :::9102 :::* LISTEN 30575/indexer

In my case, the indexer could not talk to the projector. The projector had not started because port 9999 was already in use, so the indexer was trying to talk to that other service on port 9999:
netstat -lnp | grep 9999
tcp6 0 0 :::9999 :::* LISTEN 30566/projector
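
If netstat is not handy, the same check can be roughly sketched in Python. This is only an illustration: the port list is copied from the netstat output in this post, 127.0.0.1 is the address ns_server uses in the error log above, and the helper names (port_is_open, check_ports) are made up for this example. It only tells you whether something accepts connections on each port, not which process owns it.

```python
import socket

# Ports taken from this post: 9100-9102 (indexer) and 9999 (projector).
PORTS = {9100: "indexer", 9101: "indexer", 9102: "indexer", 9999: "projector"}


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise
        # (for example ECONNREFUSED when nothing is listening).
        return s.connect_ex((host, port)) == 0


def check_ports(host="127.0.0.1", ports=PORTS):
    """Map each port number to True/False depending on whether it accepts connections."""
    return {port: port_is_open(host, port) for port in ports}


if __name__ == "__main__":
    for port, is_open in check_ports().items():
        state = "accepting connections" if is_open else "refused or unreachable"
        print(f"{port} ({PORTS[port]}): {state}")
```

A refused connection on 9102 would match the econnrefused in the original error log. To see which process holds a port, netstat -lnp as shown above is still needed.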

In /opt/couchbase/var/lib/couchbase/logs/projector.log I found:
[Error] pram[:9999] listen failed listen tcp :9999: bind: address already in use
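
To look for the same symptom in your own logs, here is a minimal sketch that scans a log file for bind failures. It assumes the message has the same shape as the line above ("listen tcp ...:PORT: bind: address already in use"); the regex and the function name ports_already_in_use are my own, not anything Couchbase ships.

```python
import re

# Matches e.g. "listen tcp :9999: bind: address already in use"
# and captures the port number. The pattern is an assumption based
# on the single log line quoted in this post.
BIND_FAILURE = re.compile(r"listen tcp \S*:(\d+): bind: address already in use")


def ports_already_in_use(lines):
    """Return the sorted list of ports that failed to bind in the given log lines."""
    ports = set()
    for line in lines:
        match = BIND_FAILURE.search(line)
        if match:
            ports.add(int(match.group(1)))
    return sorted(ports)


if __name__ == "__main__":
    path = "/opt/couchbase/var/lib/couchbase/logs/projector.log"
    with open(path, errors="replace") as log:
        print(ports_already_in_use(log))
```

Any port it reports can then be passed to netstat -lnp | grep to find out which process is holding it.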

Note: in my case, the problem was with a simple node, and the N1QL query to create a primary index was timing out. I am not sure that fixing your indexer issue will solve your problem, but it will at least make the logs easier to read.
