Help! Server going berserk

uris2000 · November 27, 2014, 4:52pm

Hi.

I have a 4-node system with three buckets and no views for now. I did do a trial with a view and believe I deleted it, but trying to get the view list from the web GUI now just times out on this bucket.
I started pumping data into it a few weeks ago and now have a few million documents. All was well, but insertion time was not adequate (not keeping up with our data).
In an attempt to improve things, I edited the main bucket I use (‘user_store’), changed the “I/O Priority” to “High”, clicked Save, and got the warning that this can result in some downtime.

It has now been almost 2 days since!!

All nodes are showing yellow with “Pend”. CPU jumps up and down. Expanding them sometimes shows messages such as “Starting ep-engine” or “Initializing” next to the buckets.

All buckets are showing yellow too, and sometimes show those same messages over and over (“Starting ep-engine” or “Initializing”).

Also, calling this (which works fine on my staging system):

http://localhost:8091/pools/default/buckets/user_store/ddocs

returns:

["Unexpected server error, request logged."]

Any help will be appreciated. If I can’t resolve this ASAP I’ll have to delete the buckets and start fresh!

The log is full of messages such as:


    [couchdb:info,2014-11-27T11:47:13.623,ns_1@couch01.colo.com:<0.17407.264>:couch_log:info:41]Started main (prod) set view group `user_store`, group `_design/dev_main`, signature `824a6eff44708e8dce37ca0071a589a1', view count 1
    active partitions:      [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255]
    passive partitions:     []
    cleanup partitions:     []
    unindexable partitions: []
    no replica support
    
    [couchdb:info,2014-11-27T11:47:13.624,ns_1@couch01.colo.com:<0.17407.264>:couch_log:info:41]Flow control buffer size is 20971520 bytes
    [couchdb:error,2014-11-27T11:47:13.625,ns_1@couch01.colo.com:<0.17407.264>:couch_log:error:44]couch_set_view_group error opening set view group `_design/dev_main` (prod), signature `824a6eff44708e8dce37ca0071a589a1', from set `user_store`: {error,
                                                                                                                                                       {error,
                                                                                                                                                        {dcp_socket_connect_failed,
                                                                                                                                                     econnrefused}}}
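
To see how often each kind of entry repeats in a log like the one above, a small script can tally entries by component and level. This is only a sketch: the regular expression is inferred from the two header lines in the excerpt (`[component:level,timestamp,node:<pid>:module:function:line]message`) and may not cover every entry in the real files, and the names `ENTRY_RE` and `count_by_level` are made up for illustration.

```python
import re
from collections import Counter

# Entry header as it appears in the excerpt:
# [component:level,timestamp,node:<pid>:module:function:line]message
ENTRY_RE = re.compile(
    r"^\[(?P<component>\w+):(?P<level>\w+),"
    r"(?P<timestamp>[^,]+),"
    r"(?P<node>[^:]+):<(?P<pid>[\d.]+)>:"
    r"(?P<module>\w+):(?P<function>\w+):(?P<line>\d+)\]"
    r"(?P<message>.*)$"
)


def count_by_level(lines):
    """Count log entries per (component, level) pair.

    Lines that do not start a new entry (partition lists, wrapped
    Erlang terms) do not match the header pattern and are skipped.
    """
    counts = Counter()
    for raw in lines:
        match = ENTRY_RE.match(raw.strip())
        if match:
            counts[(match["component"], match["level"])] += 1
    return counts
```

For example, `count_by_level(open("ns_server.couchdb.log", errors="replace"))` would return a `Counter` keyed by pairs such as `("couchdb", "error")`.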

pvarley · November 27, 2014, 6:21pm

Hi uris2000,

I assume this is Couchbase Server 3.0.1?

It sounds like after changing the “I/O Priority” to “High”, the bucket had problems restarting and warming up. It sounds like the memcached process keeps failing. Can you open a defect and upload the logs, please?

Please drop a link to the defect here.

Thanks,
Patrick

uris2000 · November 27, 2014, 6:45pm

Yes, 3.0.1 - sorry forgot to mention.

I have logs collected last night - I can upload.
2 things:

When I upload the logs, do I need to upload all 4 nodes? They all show the exact same pattern in the log files I looked at.
The other thing is that all 3 buckets are behaving the same. Even a new bucket I created for testing yesterday became yellow and stayed that way (creating and deleting buckets still works), though I only modified one bucket’s priority.

pvarley · November 27, 2014, 7:00pm

Ideally all 4 but one would be a good start.

uris2000 · November 27, 2014, 7:30pm

The ZIP file is ~70MB and the upload limit is 50MB on the Dashboard site.
I tried collecting from a single node, expanded the zip and rezipped with higher compression, but it is still at ~55MB…
Is any file in the list below not needed?

-rw-r--r-- 1 root root  11038459 Nov 27 14:24 couchbase.log
-rw-r--r-- 1 root root      2049 Nov 27 14:24 ddocs.log
-rw-r--r-- 1 root root  35650874 Nov 27 14:24 diag.log
-rw-r--r-- 1 root root     16150 Nov 27 14:24 ini.log
-rw-r--r-- 1 root root       305 Nov 27 14:24 memcached.log
-rw-r--r-- 1 root root 182965900 Nov 27 14:25 ns_server.babysitter.log
-rw-r--r-- 1 root root 187285133 Nov 27 14:25 ns_server.couchdb.log
-rw-r--r-- 1 root root 207348524 Nov 27 14:25 ns_server.debug.log
-rw-r--r-- 1 root root 184903560 Nov 27 14:25 ns_server.error.log
-rw-r--r-- 1 root root  15002860 Nov 27 14:25 ns_server.http_access.log
-rw-r--r-- 1 root root 199982734 Nov 27 14:25 ns_server.info.log
-rw-r--r-- 1 root root       231 Nov 27 14:25 ns_server.mapreduce_errors.log
-rw-r--r-- 1 root root 181907071 Nov 27 14:25 ns_server.reports.log
-rw-r--r-- 1 root root       318 Nov 27 14:25 ns_server.ssl_proxy.log
-rw-r--r-- 1 root root 197128240 Nov 27 14:25 ns_server.stats.log
-rw-r--r-- 1 root root 181879433 Nov 27 14:25 ns_server.views.log
-rw-r--r-- 1 root root       221 Nov 27 14:25 ns_server.xdcr_errors.log
-rw-r--r-- 1 root root      1493 Nov 27 14:25 ns_server.xdcr.log
-rw-r--r-- 1 root root       219 Nov 27 14:25 ns_server.xdcr_trace.log
-rw-r--r-- 1 root root      6956 Nov 27 14:25 stats.log
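
If no file can be dropped, one way to stay under a per-upload limit is to spread the files over several archives. The sketch below groups files into batches with a first-fit-decreasing pass. It assumes you already know each file's compressed size (the sizes in the listing above are uncompressed, so they are not the right input), and `batch_by_size` is a hypothetical helper name, not part of any Couchbase tool.

```python
def batch_by_size(files, limit_bytes):
    """Group (name, size) pairs into batches whose total size is <= limit_bytes.

    First-fit decreasing: take the largest files first and put each one
    into the first batch that still has room. A single file larger than
    the limit ends up alone in its own batch and would still need to be
    split some other way.
    """
    batches = []  # each item is [total_size, [names]]
    for name, size in sorted(files, key=lambda f: f[1], reverse=True):
        for batch in batches:
            if batch[0] + size <= limit_bytes:
                batch[0] += size
                batch[1].append(name)
                break
        else:
            # No existing batch had room, so start a new one.
            batches.append([size, [name]])
    return [names for _, names in batches]
```

Each returned list of names can then be zipped into its own archive and uploaded separately.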

uris2000 · November 27, 2014, 7:56pm

Ok, I just broke it down to several ZIP files.

The issue ID is MB-12796
http://www.couchbase.com/issues/browse/MB-12796

thanks for the help.

uris2000 · November 27, 2014, 9:14pm

Issue resolved. thanks pvarley.

I wonder: if a bucket is heavier on writes than on reads, is there anything that can be done to speed things up?

pvarley · November 27, 2014, 9:27pm

Let’s take a step back. What is the problem you are seeing? Is your disk queue too high?
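
A rough way to answer the disk queue question from a script is to summarize the queue samples in the bucket statistics. This is a sketch under assumptions: it expects the JSON document returned by a bucket stats request (assumed here to be `/pools/default/buckets/user_store/stats` on port 8091, with samples under `op.samples` and a `disk_write_queue` series), so check the endpoint and key names against the REST documentation for your version. `queue_summary` is a made-up helper name.

```python
def queue_summary(stats, key="disk_write_queue"):
    """Summarize one sample series from an already-parsed bucket stats document.

    `stats` is assumed to look like {"op": {"samples": {key: [numbers...]}}}.
    Returns None when the series is missing or empty.
    """
    samples = stats.get("op", {}).get("samples", {}).get(key, [])
    if not samples:
        return None
    return {
        "latest": samples[-1],
        "mean": sum(samples) / len(samples),
        # True when the newest sample is above the oldest one in the window.
        "growing": samples[-1] > samples[0],
    }
```

A queue that keeps growing while the insert rate drops would suggest that disk persistence is not keeping up with the incoming writes.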

uris2000 · November 28, 2014, 3:29pm

I deployed code to query data from a SQL-based DB and insert (or update) documents in Couchbase.
I deployed around Nov 11, made some improvements in the following days, and around Nov 14 was running the code more or less as it is now.
It reached a peak of about 200 ops/sec but has been in constant decline since then.

(I have a graph to explain, but the system here won’t let me upload an image.)

pvarley · November 28, 2014, 8:45pm

We should really open a new question, as this is a different problem.

You can use the “reply as Linked Topic” link on the right-hand side of your last post to do it for you.
