[MB-6992] rebalance hangs after failing over disconnected node Created: 23/Oct/12  Updated: 10/Jan/13  Resolved: 26/Oct/12

Status: Closed
Project: Couchbase Server
Component/s: couchbase-bucket, ns_server
Affects Version/s: 2.0
Fix Version/s: 2.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Tommie McAfee Assignee: Chiyoung Seo
Resolution: Fixed Votes: 0
Labels: system-test
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: build 10.3.3.59

Attachments: GZip Archive 10.3.3.59.debug.tar.gz     GZip Archive 10.3.3.60.debug.tar.gz    

 Description   
One node went down while loading data on a 22-node cluster (possibly related to the Xen hypervisor, as the node could not ping its gateway and the network interface needed to be restarted).
While the node was down I tried to fail it over and rebalance.
However, the rebalance never completes, and there appears to be no rebalance activity occurring on TAP.


Some activity seen in the logs at the time the node went down:

10.3.3.59 sees .60 nodedown:

[user:warn,2012-10-22T11:06:38.896,ns_1@10.3.3.59:ns_node_disco:ns_node_disco:handle_info:168]Node 'ns_1@10.3.3.59' saw that node 'ns_1@10.3.3.60' went down.

At the same timestamp, node .60 shows:

[ns_server:error,2012-10-22T11:06:00.350,ns_1@10.3.3.60:<0.12281.36>:ns_janitor:cleanup_with_states:84]Bucket "default" not yet ready on ['ns_1@10.3.2.84','ns_1@10.3.2.85',
                                   'ns_1@10.3.2.110','ns_1@10.3.2.111',
                                   'ns_1@10.3.2.112','ns_1@10.3.2.113',
                                   'ns_1@10.3.2.114','ns_1@10.3.2.115',
                                   'ns_1@10.3.3.59','ns_1@10.3.3.62',
                                   'ns_1@10.3.3.65','ns_1@10.3.3.66',
                                   'ns_1@10.3.3.69','ns_1@10.3.3.70',
                                   'ns_1@10.3.121.90','ns_1@10.3.121.91',
                                   'ns_1@10.3.2.107','ns_1@10.3.2.108',
                                   'ns_1@10.3.2.109']
[ns_server:debug,2012-10-22T11:06:07.388,ns_1@10.3.3.60:<0.12508.36>:janitor_agent:new_style_query_vbucket_states_loop:116]Exception from query_vbucket_states of "default":'ns_1@10.3.2.85'
{'EXIT',{{nodedown,'ns_1@10.3.2.85'},
         {gen_server,call,
                     [{'janitor_agent-default','ns_1@10.3.2.85'},
                      query_vbucket_states,infinity]}}}


 Comments   
Comment by Tommie McAfee [ 23/Oct/12 ]
Also, when Jin took a look he noticed most nodes are at 100% rebalance status; however, 10.3.2.112 is still at 63%, and the following CRASH report was seen on that host.

[couchdb:error,2012-10-21T23:07:11.144,ns_1@127.0.0.1:couch_view:couch_log:error:42]Exit on non-updater process: config_change
[couchdb:error,2012-10-21T23:07:11.144,ns_1@127.0.0.1:couch_set_view:couch_log:error:42]Exit on non-updater process: config_change
[error_logger:error,2012-10-21T23:07:11.144,ns_1@127.0.0.1:error_logger:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
  crasher:
    initial call: couch_server:init/1
    pid: <0.216.0>
    registered_name: couch_server
    exception exit: {function_clause,
                        [{couch_server,'-terminate/2-fun-0-',
                             [{<<"_replicator">>,<0.444.0>}]},
                         {lists,foreach,2},
                         {gen_server,terminate,6},
                         {proc_lib,init_p_do_apply,3}]}
      in function gen_server:terminate/6
    ancestors: [couch_primary_services,couch_server_sup,cb_couch_sup,
                  ns_server_cluster_sup,<0.60.0>]
    messages: []
    links: [<0.230.0>,<0.444.0>,<0.211.0>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 2584
    stack_size: 24
    reductions: 4530
  neighbours:
Comment by Mike Wiederhold [ 23/Oct/12 ]
Alk,

Replication building has completed on 10.3.2.112, but ns_server has not killed the tap stream. Please take a look at this and let me know if there is anything else I can look at.

Mike-Wiederholds-MacBook-Pro:ep-engine mikewied$ management/cbstats 10.3.2.112:11210 tap | egrep replication_building
 eq_tapq:replication_building_494_'ns_1@10.3.2.108':ack_log_size: 0
 eq_tapq:replication_building_494_'ns_1@10.3.2.108':ack_seqno: 359937
 eq_tapq:replication_building_494_'ns_1@10.3.2.108':ack_window_full: false
 eq_tapq:replication_building_494_'ns_1@10.3.2.108':backfill_completed: true
 eq_tapq:replication_building_494_'ns_1@10.3.2.108':backfill_start_timestamp: 1350926553
 eq_tapq:replication_building_494_'ns_1@10.3.2.108':bg_jobs_completed: 166602
 eq_tapq:replication_building_494_'ns_1@10.3.2.108':bg_jobs_issued: 166602
 eq_tapq:replication_building_494_'ns_1@10.3.2.108':bg_result_size: 0
 eq_tapq:replication_building_494_'ns_1@10.3.2.108':connected: true
 eq_tapq:replication_building_494_'ns_1@10.3.2.108':created: 39985
 eq_tapq:replication_building_494_'ns_1@10.3.2.108':flags: 85 (ack,backfill,vblist,checkpoints)
 eq_tapq:replication_building_494_'ns_1@10.3.2.108':has_queued_item: false
 eq_tapq:replication_building_494_'ns_1@10.3.2.108':idle: true
 eq_tapq:replication_building_494_'ns_1@10.3.2.108':paused: 1
 eq_tapq:replication_building_494_'ns_1@10.3.2.108':pending_backfill: false
 eq_tapq:replication_building_494_'ns_1@10.3.2.108':pending_disconnect: false
 eq_tapq:replication_building_494_'ns_1@10.3.2.108':pending_disk_backfill: false
 eq_tapq:replication_building_494_'ns_1@10.3.2.108':qlen: 0
 eq_tapq:replication_building_494_'ns_1@10.3.2.108':qlen_high_pri: 0
 eq_tapq:replication_building_494_'ns_1@10.3.2.108':qlen_low_pri: 0
 eq_tapq:replication_building_494_'ns_1@10.3.2.108':queue_backfillremaining: 0
 eq_tapq:replication_building_494_'ns_1@10.3.2.108':queue_backoff: 0
 eq_tapq:replication_building_494_'ns_1@10.3.2.108':queue_drain: 359877
 eq_tapq:replication_building_494_'ns_1@10.3.2.108':queue_fill: 0
 eq_tapq:replication_building_494_'ns_1@10.3.2.108':queue_itemondisk: 0
 eq_tapq:replication_building_494_'ns_1@10.3.2.108':queue_memory: 0
 eq_tapq:replication_building_494_'ns_1@10.3.2.108':rec_fetched: 193330
 eq_tapq:replication_building_494_'ns_1@10.3.2.108':recv_ack_seqno: 359936
 eq_tapq:replication_building_494_'ns_1@10.3.2.108':reserved: 1
 eq_tapq:replication_building_494_'ns_1@10.3.2.108':seqno_ack_requested: 359936
 eq_tapq:replication_building_494_'ns_1@10.3.2.108':supports_ack: true
 eq_tapq:replication_building_494_'ns_1@10.3.2.108':suspended: false
 eq_tapq:replication_building_494_'ns_1@10.3.2.108':total_backlog_size: 0
 eq_tapq:replication_building_494_'ns_1@10.3.2.108':total_noops: 3963
 eq_tapq:replication_building_494_'ns_1@10.3.2.108':type: producer
 eq_tapq:replication_building_494_'ns_1@10.3.2.108':vb_filter: { 494 }
 eq_tapq:replication_building_494_'ns_1@10.3.2.108':vb_filters: 1
 eq_tapq:replication_building_494_'ns_1@10.3.3.65':ack_log_size: 0
 eq_tapq:replication_building_494_'ns_1@10.3.3.65':ack_seqno: 97912
 eq_tapq:replication_building_494_'ns_1@10.3.3.65':ack_window_full: false
 eq_tapq:replication_building_494_'ns_1@10.3.3.65':backfill_completed: true
 eq_tapq:replication_building_494_'ns_1@10.3.3.65':backfill_start_timestamp: 0
 eq_tapq:replication_building_494_'ns_1@10.3.3.65':bg_jobs_completed: 0
 eq_tapq:replication_building_494_'ns_1@10.3.3.65':bg_jobs_issued: 0
 eq_tapq:replication_building_494_'ns_1@10.3.3.65':bg_result_size: 0
 eq_tapq:replication_building_494_'ns_1@10.3.3.65':connected: true
 eq_tapq:replication_building_494_'ns_1@10.3.3.65':created: 39985
 eq_tapq:replication_building_494_'ns_1@10.3.3.65':flags: 85 (ack,backfill,vblist,checkpoints)
 eq_tapq:replication_building_494_'ns_1@10.3.3.65':has_queued_item: false
 eq_tapq:replication_building_494_'ns_1@10.3.3.65':idle: true
 eq_tapq:replication_building_494_'ns_1@10.3.3.65':paused: 1
 eq_tapq:replication_building_494_'ns_1@10.3.3.65':pending_backfill: false
 eq_tapq:replication_building_494_'ns_1@10.3.3.65':pending_disconnect: false
 eq_tapq:replication_building_494_'ns_1@10.3.3.65':pending_disk_backfill: false
 eq_tapq:replication_building_494_'ns_1@10.3.3.65':qlen: 0
 eq_tapq:replication_building_494_'ns_1@10.3.3.65':qlen_high_pri: 0
 eq_tapq:replication_building_494_'ns_1@10.3.3.65':qlen_low_pri: 0
 eq_tapq:replication_building_494_'ns_1@10.3.3.65':queue_backfillremaining: 0
 eq_tapq:replication_building_494_'ns_1@10.3.3.65':queue_backoff: 0
 eq_tapq:replication_building_494_'ns_1@10.3.3.65':queue_drain: 97854
 eq_tapq:replication_building_494_'ns_1@10.3.3.65':queue_fill: 0
 eq_tapq:replication_building_494_'ns_1@10.3.3.65':queue_itemondisk: 0
 eq_tapq:replication_building_494_'ns_1@10.3.3.65':queue_memory: 0
 eq_tapq:replication_building_494_'ns_1@10.3.3.65':rec_fetched: 97909
 eq_tapq:replication_building_494_'ns_1@10.3.3.65':recv_ack_seqno: 97911
 eq_tapq:replication_building_494_'ns_1@10.3.3.65':reserved: 1
 eq_tapq:replication_building_494_'ns_1@10.3.3.65':seqno_ack_requested: 97911
 eq_tapq:replication_building_494_'ns_1@10.3.3.65':supports_ack: true
 eq_tapq:replication_building_494_'ns_1@10.3.3.65':suspended: false
 eq_tapq:replication_building_494_'ns_1@10.3.3.65':total_backlog_size: 0
 eq_tapq:replication_building_494_'ns_1@10.3.3.65':total_noops: 3964
 eq_tapq:replication_building_494_'ns_1@10.3.3.65':type: producer
 eq_tapq:replication_building_494_'ns_1@10.3.3.65':vb_filter: { 494 }
 eq_tapq:replication_building_494_'ns_1@10.3.3.65':vb_filters: 1
Comment by Tommie McAfee [ 24/Oct/12 ]
All logs here:
https://docs.google.com/folder/d/0B48zH0LlqjKrNm9UYlhuNzZjSnM/edit

This behavior was noticed during the 1.8.1 use case testing in 2.0 build 1870
http://hub.internal.couchbase.com/confluence/display/QA/2.0+System+Test+Specs#2.0SystemTestSpecs-SystemTest%3A1.8.1usecase%28PlumWorkloadv1.0%29
Comment by Aleksey Kondratenko [ 24/Oct/12 ]
BTW, I don't have a convenient way to download all logs from Google Drive.
Comment by Aleksey Kondratenko [ 24/Oct/12 ]
We're stuck waiting for checkpoint 67 to be persisted on vbucket 494 on node .65.

         {<17848.6247.34>,
          [{registered_name,[]},
           {status,waiting},
           {initial_call,{proc_lib,init_p,3}},
           {backtrace,
               [<<"Program counter: 0x00007f5e10ee9e48 (gen:do_call/4 + 576)">>,
                <<"CP: 0x0000000000000000 (invalid)">>,<<"arity = 0">>,<<>>,
                <<"0x00007f5e0b871858 Return addr 0x00007f5e10f68870 (gen_server:call/3 + 128)">>,
                <<"y(0) #Ref<0.0.50.98207>">>,
                <<"y(1) 'ns_1@10.3.3.65'">>,<<"y(2) []">>,
                <<"y(3) infinity">>,
                <<"y(4) {wait_for_checkpoint_persistence,494,67}">>,
                <<"y(5) '$gen_call'">>,<<"y(6) <0.2607.0>">>,<<>>,
                <<"x00007f5e0b871898 Return addr 0x00007f5e06f9e228 (janitor_agent:do_wait_checkpoint_persist">>,
                <<"y(0) infinity">>,
                <<"y(1) {wait_for_checkpoint_persistence,494,67}">>,
                <<"y(2) 'ns_memcached-default'">>,
                <<"y(3) Catch 0x00007f5e10f68870 (gen_server:call/3 + 128)">>,
                <<>>,

(that's the janitor_agent child on .65 making that call)

And stats indicate that it's actually persisted:

 {<<"vb_494:persisted_checkpoint_id">>,<<"92">>},
 {<<"vb_494:checkpoint_extension">>,<<"false">>},
 {<<"vb_494:num_items_for_persistence">>,<<"0">>},
 {<<"vb_494:num_checkpoints">>,<<"1">>},
 {<<"vb_494:num_open_checkpoint_items">>,<<"0">>},
 {<<"vb_494:num_checkpoint_items">>,<<"1">>},
 {<<"vb_494:num_tap_cursors">>,<<"0">>},
 {<<"vb_494:last_closed_checkpoint_id">>,<<"92">>},
 {<<"vb_494:open_checkpoint_id">>,<<"93">>},
Comment by Mike Wiederhold [ 24/Oct/12 ]
Chiyoung,

This is an issue with the new checkpoint persistence command you added. I think we should add a stat for checkpoint persistence commands in progress, for example a stat that is incremented when we receive a checkpoint persistence request and decremented when the request completes.
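A rough sketch of that kind of gauge (illustrative names only, not existing ep-engine stats), so a leaked request like this one would show up directly in the stats output:

// Illustrative only -- not an existing ep-engine stat -- a gauge of
// checkpoint-persistence requests that have been received but not yet answered.
#include <atomic>
#include <cstddef>

class CheckpointPersistenceStats {
public:
    void onRequestReceived()  { ++inFlight; }           // command arrived, response deferred
    void onRequestCompleted() { --inFlight; }           // cookie notified / response sent
    size_t pending() const { return inFlight.load(); }  // exposed via the stats call

private:
    std::atomic<size_t> inFlight{0};
};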
Comment by Chiyoung Seo [ 24/Oct/12 ]
I dumped connection stats from the memcached layer on 10.3.3.65 and found one connection that is still in the ewouldblock state and corresponds to the vb_checkpoint_persistence command:

STAT conn 0x58cf080
STAT socket 83
STAT protocol binary
STAT transport TCP
STAT nevents 20
STAT sasl_conn 0xcbaf5390
STAT state conn_nread
STAT substate bin_reading_packet
STAT registered_in_libevent 0
STAT ev_flags 12
STAT which 2
STAT rbuf 0x58d0000
STAT rcurr 0x58d0020
STAT rsize 2048
STAT rbytes 0
STAT wbuf 0x58e7800
STAT wcurr 0x58fb000
STAT wsize 2048
STAT wbytes 24
STAT write_and_go 0x4104f0
STAT write_and_free (nil)
STAT ritem 0x58d0020
STAT rlbytes 0
STAT item (nil)
STAT store_op 0
STAT sbytes 0
STAT iov 0x58e0800
STAT iovsize 400
STAT iovused 0
STAT msglist 0x58d5440
STAT msgsize 10
STAT msgused 1
STAT msgcurr 0
STAT msgbytes 0
STAT ilist 0x58d9100
STAT isize 200
STAT icurr 0x58d9100
STAT ileft 0
STAT suffixlist 0x58539a0
STAT suffixsize 20
STAT suffixcurr 0x58539a0
STAT suffixleft 0
STAT noreply 0
STAT refcount 1
STAT dynamic_buffer.buffer (nil)
STAT dynamic_buffer.size 2048
STAT dynamic_buffer.offset 24
STAT engine_storage 0xcd6fb0a0
STAT cas 0
STAT cmd 177
STAT opaque 0
STAT keylen 0
STAT list_state 0
STAT next (nil)
STAT thread 0x10c55f0
STAT aiostat 0
STAT ewouldblock 1
STAT tap_iterator (nil)

I'm further debugging it now.
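For reference, ewouldblock 1 together with registered_in_libevent 0 means the engine answered this command (cmd 177, the checkpoint-persistence request mentioned above) with EWOULDBLOCK, so memcached parked the connection and will only resume it once the engine notifies that cookie; if the notification never comes, the connection stays in exactly the state dumped above. A simplified illustration of that handshake, with made-up names standing in for the real memcached internals:

// Field names mirror the conn dump above; the functions are made up for
// illustration, not the actual memcached source.
struct Connection {
    bool ewouldblock = false;            // "STAT ewouldblock 1"
    bool registered_in_libevent = true;  // "STAT registered_in_libevent 0"
};

enum EngineStatus { ENGINE_OK, ENGINE_WOULD_BLOCK };

// What the worker thread does with the engine's return code for a command.
void completeCommand(Connection &c, EngineStatus rc) {
    if (rc == ENGINE_WOULD_BLOCK) {
        c.ewouldblock = true;                // park the connection
        c.registered_in_libevent = false;    // stop polling this socket
        // ...nothing happens until the engine notifies this cookie...
    } else {
        c.registered_in_libevent = true;     // send the response, rearm libevent
    }
}

// The wake-up the engine must eventually deliver; if it never does, the
// connection stays in exactly the state captured in the dump above.
void onNotifyIoComplete(Connection &c) {
    c.ewouldblock = false;
    c.registered_in_libevent = true;
}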
Comment by Chiyoung Seo [ 24/Oct/12 ]
Tommie,

As we saw, the rebalance out of 10.3.3.65 was successful. While debugging this issue, I ran into problems on .65 while doing GETs for non-resident items, which put my connection into the ewouldblock state in the memcached layer. I will continue to debug this issue.

Please update the bug if you see the same rebalance hang again.
Comment by Thuan Nguyen [ 26/Oct/12 ]
Integrated in github-ep-engine-2-0 #451 (See [http://qa.hq.northscale.net/job/github-ep-engine-2-0/451/])
    MB-6992 Add more informative logs to checkpoint prioritization (Revision 3c719d47ca41285bbcbc61817f719180448f1042)

     Result = SUCCESS
Chiyoung Seo :
Files :
* src/ep.cc
Comment by Chiyoung Seo [ 26/Oct/12 ]
http://review.couchbase.org/#/c/22022/
Comment by Thuan Nguyen [ 26/Oct/12 ]
Integrated in github-ep-engine-2-0 #452 (See [http://qa.hq.northscale.net/job/github-ep-engine-2-0/452/])
    MB-6992 Control the flusher execution by the transaction size (Revision b327be09a1f971145fda5c249b4fa7a8304b1920)

     Result = SUCCESS
Chiyoung Seo :
Files :
* tests/ep_testsuite.cc
* src/ep.cc
* src/ep.hh
* src/flusher.hh
* src/flusher.cc
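The actual patch is in the Gerrit review linked above; purely as an illustration of what controlling the flusher by transaction size could look like (hypothetical names, not the real src/flusher.cc change), a flusher pass might commit dirty items in bounded transactions and keep iterating until the queue drains, so checkpoint persistence keeps advancing in small steps:

// Hypothetical sketch, not the actual fix: commit dirty items in
// transactions of at most txnSize and keep iterating until the queue
// drains, so persistence advances in bounded steps.
#include <cstddef>
#include <deque>

struct DirtyItem { /* item waiting to be persisted */ };

void runFlusherPass(std::deque<DirtyItem> &dirtyQueue, size_t txnSize) {
    while (!dirtyQueue.empty()) {
        size_t batch = dirtyQueue.size() < txnSize ? dirtyQueue.size() : txnSize;
        // beginTransaction();
        for (size_t i = 0; i < batch; ++i) {
            // persist(dirtyQueue.front());
            dirtyQueue.pop_front();
        }
        // commitTransaction();  // checkpoint id can advance and any
        //                       // checkpoint-persistence waiters be notified here
    }
}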