CB Server 5.1.1 rebalance has made my server unusable

scott · January 22, 2019, 2:31pm

So we have had 2 servers in our cluster running for over 100 days, without any issues. Today I went to add a new server to the cluster.

I was able to add the server successfully, but when I then tried to rebalance, the rebalance got about 18% done and then hit errors (which I unfortunately did not capture). After that, one of the original servers seemed to go into a loop of warming up, and then the second server started to do the same thing.

In the end the servers both became completely unreachable running at 100% (I couldn’t even ssh in).

Now they seem to have come back to being available to the cluster, but they can’t seem to finish warming up.

I am getting a bunch of errors like this one for different buckets:

Compactor for view access/_design/main (pid [{type,view}, {name, <<"access/_design/main">>}, {important,false}, {fa, {#Fun<compaction_new_daemon.25.86110551>, [<<"access">>, <<"_design/main">>, {config, {30,undefined}, {30,undefined}, undefined,false,false, {daemon_config,30,131072, 20971520}}, false, {[{type,bucket}]}]}}]) terminated unexpectedly (ignoring this): {badmatch, {error, {{case_clause, {{error, vbucket_stream_not_found}, {bufsocket, #Port<11670.12123>, <<>>}}}, [{couch_dcp_client, init, 1, [{file, "/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/couch_dcp/src/couch_dcp_client.erl"}, {line, 312}]}, {gen_server, init_it, 6, [{file, "gen_server.erl"}, {line, 304}]}, {proc_lib, init_p_do_apply, 3, [{file, "proc_lib.erl"}, {line, 239}]}]}}}

Plus others like this for “projector” and “indexer”:

Service 'projector' exited with status 134. Restarting. Messages:
github.com/couchbase/indexing/secondary/projector.(*Projector).doMutationTopic(0xc4201260a0, 0xc44cc6dc20, 0xc44cc60016, 0x0, 0x0)
    /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/indexing/secondary/projector/projector.go:375 +0x32b fp=0xc42f0f6f18 sp=0xc42f0f6d98
github.com/couchbase/indexing/secondary/projector.(*Projector).handleRequest(0xc4201260a0, 0x11e0100, 0xc4265f5800, 0x16)
    /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/indexing/secondary/projector/adminport.go:94 +0x562 fp=0xc42f0f6f90 sp=0xc42f0f6f18
runtime.goexit()
    /home/couchbase/.cbdepscache/exploded/x86_64/go-1.7.3/go/src/runtime/asm_amd64.s:2086 +0x1 fp=0xc42f0f6f98 sp=0xc42f0f6f90
created by github.com/couchbase/indexing/secondary/projector.(*Projector).mainAdminPort
    /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/indexing/secondary/projector/adminport.go:70 +0x8dc
[goport(/opt/couchbase/bin/projector)] 2019/01/22 14:24:32 child process exited with status 134
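
A note for anyone reading the projector message above: by the usual POSIX shell convention, an exit status above 128 means "128 plus a signal number", so 134 would correspond to signal 6 (SIGABRT), i.e. the process aborted rather than shutting down cleanly. Here is a minimal Python sketch for pulling these exits out of a log and decoding them. It assumes the log keeps the "Service ... exited with status N" wording quoted above and that the reported status follows that convention. The function names are hypothetical and not part of any Couchbase tool.

```python
import re
import signal

# Matches lines such as: Service 'projector' exited with status 134.
# Straight and curly quotes are both accepted, since forum software
# sometimes rewrites quotes in pasted logs.
SERVICE_EXIT = re.compile(
    r"Service\s+['‘’\"]?(\w+)['‘’\"]?\s+exited with status (\d+)"
)


def signal_from_exit_status(status):
    """Return the signal implied by an exit status, or None.

    Assumes the shell convention that status = 128 + signal number.
    """
    if status <= 128:
        return None
    try:
        return signal.Signals(status - 128)
    except ValueError:
        # Above 128, but not a signal number known on this platform.
        return None


def summarize_service_exits(log_text):
    """Return (service, status, signal_or_None) for each matching log line."""
    return [
        (name, int(status), signal_from_exit_status(int(status)))
        for name, status in SERVICE_EXIT.findall(log_text)
    ]
```

Running `summarize_service_exits` over the text of a log file gives one tuple per service restart, which makes it easier to see whether "projector" and "indexer" are both aborting and how often.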

What does this all mean - what is going on?

All the nodes are on the same version (5.1.1 build 5723). The cluster won’t seem to come out of the warmup state.

HELP!

Thanks,
Scott

mihir.kamdar · January 23, 2019, 4:54am

Hi @scott, is this Enterprise Edition or Community Edition? Can you please collect logs and share them with us? Also, what OS are your servers running, and what size are they (RAM, number of CPUs)? Were there any operations ongoing at the time of the rebalance?

scott · January 23, 2019, 10:59am

Hello Mihir,

It is the Community Edition on Ubuntu 14.04 with 8 GB RAM and 2 CPUs. There were operations going on at the time of the rebalance.

Here is a link to the logs from one of the servers (I have not been able to get the others). Obviously it is big (~256 MB).

https://media.smallcubed.com.s3.amazonaws.com/data/collectinfo-2019-01-22T130850-ns_1%40172.31.60.0.zip

Thanks for any help.
Scott

mihir.kamdar · January 24, 2019, 6:51pm

Thanks @scott. It would be great if you could provide logs from all Couchbase nodes, especially 172.31.68.27, and also from the new node that was being added.
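
To keep track of which nodes still need logs, something like the following Python sketch could help. It assumes every archive is named like the one linked earlier in this thread (`collectinfo-<timestamp>-<node name>.zip`, with `@` URL-encoded as `%40`). That pattern is inferred from that single link, and the function names are hypothetical.

```python
import re
from urllib.parse import unquote

# Assumed naming pattern, inferred from the one archive linked in this thread:
#   collectinfo-2019-01-22T130850-ns_1@172.31.60.0.zip
ARCHIVE_NAME = re.compile(
    r"collectinfo-(?P<stamp>\d{4}-\d{2}-\d{2}T\d{6})-(?P<node>.+)\.zip$"
)


def host_from_archive(name_or_url):
    """Return the host part of the node name in an archive name, or None."""
    # Keep only the last path segment, then undo URL encoding (%40 -> @).
    base = unquote(name_or_url.rsplit("/", 1)[-1])
    match = ARCHIVE_NAME.search(base)
    if match is None:
        return None
    # Node names look like ns_1@<host>; keep what follows the '@'.
    return match.group("node").split("@", 1)[-1]


def missing_hosts(expected_hosts, archives):
    """Return the expected hosts that no archive covers, sorted."""
    covered = {host_from_archive(a) for a in archives}
    return sorted(set(expected_hosts) - covered)
```

With the one archive shared so far, `missing_hosts(["172.31.60.0", "172.31.68.27"], [url])` would report that 172.31.68.27 is still outstanding.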

scott · January 25, 2019, 10:33am

Hello Mihir,

Thanks for trying to help. However, in the end the cluster became unusable and I needed to recreate our cluster and restore from backups. So I cannot get those logs any more.

Scott

bathige · January 30, 2019, 6:06am

This happened to one of the clusters in our environment. Basically rebalance was stuck. Apparently, CB rebalance is not very robust so it is really frustrating.
