[MB-4517] Rebalance Stuck on 1.8 after adding nodes (rebalance gets stuck even if ack_seqno is correct and has_queued_item is true and total_backlog_size > 1000) Created: 07/Dec/11  Updated: 09/Jan/13  Due: 23/Dec/11  Resolved: 20/Feb/12

Status: Closed
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 1.8.0
Fix Version/s: 1.8.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Karan Kumar (Inactive) Assignee: Karan Kumar (Inactive)
Resolution: Fixed Votes: 0
Labels: 1.8.0-release-notes, customer
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Zip Archive ns-diag-20111207190443.txt.zip     PNG File Screen Shot 2011-12-07 at 10.58.41 AM.png     PNG File Screen Shot 2011-12-07 at 10.59.10 AM.png     PNG File Screen Shot 2011-12-07 at 11.01.17 AM.png    
Issue Links:
Duplicate
is duplicated by MB-4367 rebalance gets stuck even if ack_seqn... Closed

 Description   
Added 5 nodes to a 10-node cluster.

TAP stats from the node that has one TAP connection stuck:
/opt/couchbase/bin/cbstats localhost:11210 tap |grep rebalance

 eq_tapq:rebalance_725:ack_log_size: 248
 eq_tapq:rebalance_725:ack_playback_size: 248
 eq_tapq:rebalance_725:ack_seqno: 58249
 eq_tapq:rebalance_725:ack_window_full: false
 eq_tapq:rebalance_725:backfill_completed: false
 eq_tapq:rebalance_725:bg_backlog_size: 0
 eq_tapq:rebalance_725:bg_jobs_completed: 43790
 eq_tapq:rebalance_725:bg_jobs_issued: 43790
 eq_tapq:rebalance_725:bg_queue_size: 0
 eq_tapq:rebalance_725:bg_queued: 43790
 eq_tapq:rebalance_725:bg_result_size: 0
 eq_tapq:rebalance_725:bg_results: 0
 eq_tapq:rebalance_725:bg_wait_for_results: false
 eq_tapq:rebalance_725:complete: false
 eq_tapq:rebalance_725:connected: true
 eq_tapq:rebalance_725:created: 5423
 eq_tapq:rebalance_725:empty: false
 eq_tapq:rebalance_725:flags: 93 (ack,backfill,vblist,takeover,checkpoints)
 eq_tapq:rebalance_725:has_item: false
 eq_tapq:rebalance_725:has_queued_item: true
 eq_tapq:rebalance_725:idle: false
 eq_tapq:rebalance_725:num_tap_nack: 0
 eq_tapq:rebalance_725:num_tap_tmpfail_survivors: 0
 eq_tapq:rebalance_725:paused: 1
 eq_tapq:rebalance_725:pending_backfill: false
 eq_tapq:rebalance_725:pending_disconnect: false
 eq_tapq:rebalance_725:pending_disk_backfill: false
 eq_tapq:rebalance_725:qlen: 0
 eq_tapq:rebalance_725:qlen_high_pri: 0
 eq_tapq:rebalance_725:qlen_low_pri: 1
 eq_tapq:rebalance_725:queue_backfillremaining: 0
 eq_tapq:rebalance_725:queue_backoff: 0
 eq_tapq:rebalance_725:queue_drain: 58240
 eq_tapq:rebalance_725:queue_fill: 0
 eq_tapq:rebalance_725:queue_itemondisk: 0
 eq_tapq:rebalance_725:queue_memory: 0
 eq_tapq:rebalance_725:rec_fetched: 14710
 eq_tapq:rebalance_725:recv_ack_seqno: 58000
 eq_tapq:rebalance_725:reserved: 1
 eq_tapq:rebalance_725:seqno_ack_requested: 58000
 eq_tapq:rebalance_725:supports_ack: true
 eq_tapq:rebalance_725:suspended: false
 eq_tapq:rebalance_725:total_backlog_size: 10327
 eq_tapq:rebalance_725:total_noops: 836
 eq_tapq:rebalance_725:type: producer
 eq_tapq:rebalance_725:vb_filter: { 725 }
 eq_tapq:rebalance_725:vb_filters: 1
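
As a side note, the pattern in the summary (has_queued_item true and total_backlog_size > 1000 on a rebalance TAP stream) can be spotted mechanically. Below is a minimal, hypothetical helper (not part of the product) that parses the cbstats tap output above; it assumes Python is available on the node and the cbstats path shown:

#!/usr/bin/env python
# Hypothetical helper: flag rebalance TAP streams that match this ticket's
# pattern (has_queued_item true, total_backlog_size > 1000).
import subprocess
from collections import defaultdict

CBSTATS = "/opt/couchbase/bin/cbstats"   # /opt/membase/bin/mbstats on membase installs

def tap_stats(host="localhost:11210"):
    out = subprocess.check_output([CBSTATS, host, "tap"]).decode()
    streams = defaultdict(dict)
    for line in out.splitlines():
        line = line.strip()
        if not line.startswith("eq_tapq:rebalance_"):
            continue
        key, value = line.rsplit(":", 1)       # stat value follows the last colon
        stream, stat = key.rsplit(":", 1)      # e.g. eq_tapq:rebalance_725 / ack_seqno
        streams[stream][stat] = value.strip()
    return streams

def looks_stuck(s):
    return (s.get("has_queued_item") == "true"
            and int(s.get("total_backlog_size", "0")) > 1000)

if __name__ == "__main__":
    for name, stats in tap_stats().items():
        if looks_stuck(stats):
            print("possibly stuck:", name,
                  "backlog=", stats.get("total_backlog_size"),
                  "ack_seqno=", stats.get("ack_seqno"),
                  "recv_ack_seqno=", stats.get("recv_ack_seqno"))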


/opt/couchbase/bin/cbstats localhost:11210 all |grep mem
 ep_diskqueue_memory: 0
 ep_mem_high_wat: 7864320000
 ep_mem_low_wat: 6291456000
 mem_used: 6270355533
 vb_active_ht_memory: 25611040
 vb_active_itm_memory: 4777690775
 vb_active_perc_mem_resident: 32
 vb_active_queue_memory: 0
 vb_pending_ht_memory: 0
 vb_pending_itm_memory: 0
 vb_pending_perc_mem_resident: 0
 vb_pending_queue_memory: 0
 vb_replica_ht_memory: 17336480
 vb_replica_itm_memory: 1221834713
 vb_replica_perc_mem_resident: 11
 vb_replica_queue_memory: 0

 Comments   
Comment by Karan Kumar (Inactive) [ 07/Dec/11 ]
Server error during processing: ["web request failed",
{path,"//pools/default/buckets/default"},
{type,error},
{what,{case_clause,rebalance_running}},
{trace,
[{menelaus_web_buckets,
handle_bucket_delete,3},
{menelaus_web,loop,3},
{mochiweb_http,headers,5},
{proc_lib,init_p_do_apply,3}]}]

Comment by Farshid Ghods (Inactive) [ 07/Dec/11 ]
Unrelated. This says somebody was trying to delete a bucket while rebalance was still running (probably your test script).
Comment by Karan Kumar (Inactive) [ 07/Dec/11 ]
Attaching the diagnostic report.
Comment by Karan Kumar (Inactive) [ 11/Dec/11 ]
On 1.8.0r-36 with Chiyoung's and Mark's fixes, rebalance is still stuck:

 eq_tapq:rebalance_880:ack_log_size: 6
 eq_tapq:rebalance_880:ack_playback_size: 6
 eq_tapq:rebalance_880:ack_seqno: 61327
 eq_tapq:rebalance_880:ack_window_full: false
 eq_tapq:rebalance_880:backfill_completed: false
 eq_tapq:rebalance_880:bg_backlog_size: 0
 eq_tapq:rebalance_880:bg_jobs_completed: 44109
 eq_tapq:rebalance_880:bg_jobs_issued: 44109
 eq_tapq:rebalance_880:bg_queued: 44109
 eq_tapq:rebalance_880:bg_result_size: 0
 eq_tapq:rebalance_880:bg_results: 0
 eq_tapq:rebalance_880:bg_wait_for_results: false
 eq_tapq:rebalance_880:complete: false
 eq_tapq:rebalance_880:connected: true
 eq_tapq:rebalance_880:created: 6501
 eq_tapq:rebalance_880:empty: false
 eq_tapq:rebalance_880:flags: 93 (ack,backfill,vblist,takeover,checkpoints)
 eq_tapq:rebalance_880:has_item: false
 eq_tapq:rebalance_880:has_queued_item: true
 eq_tapq:rebalance_880:idle: false
 eq_tapq:rebalance_880:num_tap_nack: 0
 eq_tapq:rebalance_880:num_tap_tmpfail_survivors: 0
 eq_tapq:rebalance_880:paused: 1
 eq_tapq:rebalance_880:pending_backfill: false
 eq_tapq:rebalance_880:pending_disconnect: false
 eq_tapq:rebalance_880:pending_disk_backfill: false
 eq_tapq:rebalance_880:qlen: 0
 eq_tapq:rebalance_880:qlen_high_pri: 0
 eq_tapq:rebalance_880:qlen_low_pri: 1
 eq_tapq:rebalance_880:queue_backfillremaining: 0
 eq_tapq:rebalance_880:queue_backoff: 0
 eq_tapq:rebalance_880:queue_drain: 61454
 eq_tapq:rebalance_880:queue_fill: 0
 eq_tapq:rebalance_880:queue_itemondisk: 0
 eq_tapq:rebalance_880:queue_memory: 0
 eq_tapq:rebalance_880:rec_fetched: 17488
 eq_tapq:rebalance_880:recv_ack_seqno: 61320
 eq_tapq:rebalance_880:reserved: 1
 eq_tapq:rebalance_880:seqno_ack_requested: 61320
 eq_tapq:rebalance_880:supports_ack: true
 eq_tapq:rebalance_880:suspended: false
 eq_tapq:rebalance_880:total_backlog_size: 2321
 eq_tapq:rebalance_880:total_noops: 1031
 eq_tapq:rebalance_880:type: producer
 eq_tapq:rebalance_880:vb_filter: { 880 }
 eq_tapq:rebalance_880:vb_filters: 1
Comment by Dipti Borkar [ 16/Dec/11 ]
There is a workaround for this problem: stop the rebalance, wait for 10 minutes, and start it again.
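For reference, a rough sketch of scripting that workaround against the REST API, assuming the standard /controller/stopRebalance and /controller/rebalance endpoints; the host, credentials, and node names below are placeholders:

# Hypothetical workaround script; node names (ns_1@...) must match the cluster.
import base64, time, urllib.parse, urllib.request

HOST = "http://127.0.0.1:8091"            # any cluster node (placeholder)
AUTH = base64.b64encode(b"Administrator:password").decode()

def post(path, params=None):
    data = urllib.parse.urlencode(params or {}).encode()
    req = urllib.request.Request(HOST + path, data=data)
    req.add_header("Authorization", "Basic " + AUTH)
    return urllib.request.urlopen(req).read()

post("/controller/stopRebalance")          # stop the stuck rebalance
time.sleep(600)                            # wait ~10 minutes, per the workaround
post("/controller/rebalance", {            # restart with the same node lists
    "knownNodes": "ns_1@10.1.1.1,ns_1@10.1.1.2",   # placeholder node names
    "ejectedNodes": "",
})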
Comment by Karan Kumar (Inactive) [ 20/Dec/11 ]
Encountered again during large-scale testing. This one has a large queue_backfillremaining and seems similar to the one we have been seeing:

 /opt/membase/bin/mbstats localhost:11210 tap |grep rebalance
 eq_tapq:rebalance_393:ack_log_size: 9730
 eq_tapq:rebalance_393:ack_playback_size: 9730
 eq_tapq:rebalance_393:ack_seqno: 18149
 eq_tapq:rebalance_393:ack_window_full: false
 eq_tapq:rebalance_393:backfill_completed: false
 eq_tapq:rebalance_393:bg_backlog_size: 0
 eq_tapq:rebalance_393:bg_jobs_completed: 48108
 eq_tapq:rebalance_393:bg_jobs_issued: 48108
 eq_tapq:rebalance_393:bg_queue_size: 0
 eq_tapq:rebalance_393:bg_queued: 48108
 eq_tapq:rebalance_393:bg_result_size: 34962
 eq_tapq:rebalance_393:bg_results: 0
 eq_tapq:rebalance_393:bg_wait_for_results: false
 eq_tapq:rebalance_393:complete: false
 eq_tapq:rebalance_393:connected: true
 eq_tapq:rebalance_393:created: 78383
 eq_tapq:rebalance_393:empty: false
 eq_tapq:rebalance_393:flags: 93 (ack,backfill,vblist,takeover,checkpoints)
 eq_tapq:rebalance_393:has_item: true
 eq_tapq:rebalance_393:has_queued_item: true
 eq_tapq:rebalance_393:idle: false
 eq_tapq:rebalance_393:num_tap_nack: 2085
 eq_tapq:rebalance_393:num_tap_tmpfail_survivors: 2085
 eq_tapq:rebalance_393:paused: 0
 eq_tapq:rebalance_393:pending_backfill: false
 eq_tapq:rebalance_393:pending_disconnect: false
 eq_tapq:rebalance_393:pending_disk_backfill: false
 eq_tapq:rebalance_393:qlen: 2442
 eq_tapq:rebalance_393:qlen_high_pri: 0
 eq_tapq:rebalance_393:qlen_low_pri: 1
 eq_tapq:rebalance_393:queue_backfillremaining: 37404
 eq_tapq:rebalance_393:queue_backoff: 2085
 eq_tapq:rebalance_393:queue_drain: 18140
 eq_tapq:rebalance_393:queue_fill: 0
 eq_tapq:rebalance_393:queue_itemondisk: 0
 eq_tapq:rebalance_393:queue_memory: 33360
 eq_tapq:rebalance_393:rec_fetched: 4998
 eq_tapq:rebalance_393:recv_ack_seqno: 8418
 eq_tapq:rebalance_393:reserved: 1
 eq_tapq:rebalance_393:supports_ack: true
 eq_tapq:rebalance_393:suspended: false
 eq_tapq:rebalance_393:total_backlog_size: 37405
 eq_tapq:rebalance_393:total_noops: 0
 eq_tapq:rebalance_393:type: producer
 eq_tapq:rebalance_393:vb_filter: { 393 }
 eq_tapq:rebalance_393:vb_filters: 1

netstat -ntp
tcp 252160 257024 10.6.117.231:33624 10.6.117.231:11210 ESTABLISHED 28275/beam.smp


(gdb) t a a bt

Thread 14 (Thread 0x7f69109a8700 (LWP 28579)):
#0 0x00007f691119e4dd in read () from /lib64/libc.so.6
#1 0x00007f6911136fd8 in _IO_new_file_underflow () from /lib64/libc.so.6
#2 0x00007f6911138ade in _IO_default_uflow_internal () from /lib64/libc.so.6
#3 0x00007f6911133fcb in getc () from /lib64/libc.so.6
#4 0x00007f69109a9875 in check_stdin_thread (arg=0x4034c0) at extensions/daemon/stdin_check.c:19
#5 0x00007f691145c7e1 in start_thread () from /lib64/libpthread.so.0
#6 0x00007f69111ab77d in clone () from /lib64/libc.so.6

Thread 13 (Thread 0x7f69101a7700 (LWP 28580)):
#0 0x00007f6911170bed in nanosleep () from /lib64/libc.so.6
#1 0x00007f6911170a60 in sleep () from /lib64/libc.so.6
#2 0x0000000000414f68 in check_isasl_db_thread (arg=<value optimized out>) at daemon/isasl.c:233
#3 0x00007f691145c7e1 in start_thread () from /lib64/libpthread.so.0
#4 0x00007f69111ab77d in clone () from /lib64/libc.so.6

Thread 12 (Thread 0x7f690f79d700 (LWP 28581)):
#0 0x00007f69111abd73 in epoll_wait () from /lib64/libc.so.6
#1 0x00007f6911f6dc28 in epoll_dispatch (base=0x2461500, tv=<value optimized out>) at epoll.c:404
#2 0x00007f6911f5ca4c in event_base_loop (base=0x2461500, flags=0) at event.c:1558
#3 0x00000000004133a4 in worker_libevent (arg=0x2443500) at daemon/thread.c:304
#4 0x00007f691145c7e1 in start_thread () from /lib64/libpthread.so.0
#5 0x00007f69111ab77d in clone () from /lib64/libc.so.6

Thread 11 (Thread 0x7f690ef9c700 (LWP 28582)):
#0 0x00007f69111abd73 in epoll_wait () from /lib64/libc.so.6
#1 0x00007f6911f6dc28 in epoll_dispatch (base=0x2461280, tv=<value optimized out>) at epoll.c:404
#2 0x00007f6911f5ca4c in event_base_loop (base=0x2461280, flags=0) at event.c:1558
#3 0x00000000004133a4 in worker_libevent (arg=0x2443600) at daemon/thread.c:304
#4 0x00007f691145c7e1 in start_thread () from /lib64/libpthread.so.0
#5 0x00007f69111ab77d in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x7f690e79b700 (LWP 28583)):
#0 0x00007f69111abd73 in epoll_wait () from /lib64/libc.so.6
#1 0x00007f6911f6dc28 in epoll_dispatch (base=0x2461c80, tv=<value optimized out>) at epoll.c:404
#2 0x00007f6911f5ca4c in event_base_loop (base=0x2461c80, flags=0) at event.c:1558
#3 0x00000000004133a4 in worker_libevent (arg=0x2443700) at daemon/thread.c:304
#4 0x00007f691145c7e1 in start_thread () from /lib64/libpthread.so.0
#5 0x00007f69111ab77d in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x7f690df9a700 (LWP 28584)):
#0 0x00007f69111abd73 in epoll_wait () from /lib64/libc.so.6
#1 0x00007f6911f6dc28 in epoll_dispatch (base=0x2461a00, tv=<value optimized out>) at epoll.c:404
#2 0x00007f6911f5ca4c in event_base_loop (base=0x2461a00, flags=0) at event.c:1558
#3 0x00000000004133a4 in worker_libevent (arg=0x2443800) at daemon/thread.c:304
#4 0x00007f691145c7e1 in start_thread () from /lib64/libpthread.so.0
#5 0x00007f69111ab77d in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x7f690d799700 (LWP 28585)):
#0 0x00007f69111abd73 in epoll_wait () from /lib64/libc.so.6
#1 0x00007f6911f6dc28 in epoll_dispatch (base=0x2461780, tv=<value optimized out>) at epoll.c:404
#2 0x00007f6911f5ca4c in event_base_loop (base=0x2461780, flags=0) at event.c:1558
#3 0x00000000004133a4 in worker_libevent (arg=0x2443900) at daemon/thread.c:304
#4 0x00007f691145c7e1 in start_thread () from /lib64/libpthread.so.0
#5 0x00007f69111ab77d in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7f690cbd8700 (LWP 28586)):
#0 0x00007f69114603cc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f690cc2a233 in wait (this=0x3772540) at syncobject.hh:31
#2 Dispatcher::run (this=0x3772540) at dispatcher.cc:85
#3 0x00007f690cc2ae43 in launch_dispatcher_thread (arg=0x3772594) at dispatcher.cc:28
#4 0x00007f691145c7e1 in start_thread () from /lib64/libpthread.so.0
#5 0x00007f69111ab77d in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7f690c3d7700 (LWP 28587)):
#0 0x00007f691146074b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f690cc280c0 in wait (this=0x2475380, d=...) at syncobject.hh:42
#2 IdleTask::run (this=0x2475380, d=...) at dispatcher.cc:244
#3 0x00007f690cc2a40f in Dispatcher::run (this=0x27f41c0) at dispatcher.cc:119
#4 0x00007f690cc2ae43 in launch_dispatcher_thread (arg=0x27f4214) at dispatcher.cc:28
#5 0x00007f691145c7e1 in start_thread () from /lib64/libpthread.so.0
#6 0x00007f69111ab77d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7f690bbd6700 (LWP 28588)):
#0 0x00007f691146074b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f690cc280c0 in wait (this=0x2475a00, d=...) at syncobject.hh:42
#2 IdleTask::run (this=0x2475a00, d=...) at dispatcher.cc:244
#3 0x00007f690cc2a40f in Dispatcher::run (this=0x3772380) at dispatcher.cc:119
#4 0x00007f690cc2ae43 in launch_dispatcher_thread (arg=0x37723d4) at dispatcher.cc:28
#5 0x00007f691145c7e1 in start_thread () from /lib64/libpthread.so.0
#6 0x00007f69111ab77d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f690b3d5700 (LWP 28589)):
#0 0x00007f691146074b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f690cc48c77 in wait (this=0x2538000) at syncobject.hh:42
#2 wait (this=0x2538000) at tapconnmap.hh:207
#3 EventuallyPersistentEngine::notifyTapIoThread (this=0x2538000) at ep_engine.cc:3578
#4 0x00007f690cc48de1 in EvpNotifyTapIo (arg=0x2538000) at ep_engine.cc:946
#5 0x00007f691145c7e1 in start_thread () from /lib64/libpthread.so.0
#6 0x00007f69111ab77d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f6908bd0700 (LWP 2972)):
#0 0x00007f691146074b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f690f7a1aa4 in engine_shutdown_thread (arg=0x1833c2a80) at bucket_engine.c:1432
#2 0x00007f691145c7e1 in start_thread () from /lib64/libpthread.so.0
#3 0x00007f69111ab77d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f6907fc7700 (LWP 3958)):
#0 0x00007f691146074b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f690f7a1aa4 in engine_shutdown_thread (arg=0x1a92c93c0) at bucket_engine.c:1432
#2 0x00007f691145c7e1 in start_thread () from /lib64/libpthread.so.0
#3 0x00007f69111ab77d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f69125a2720 (LWP 28570)):
#0 0x00007f69111abd73 in epoll_wait () from /lib64/libc.so.6
#1 0x00007f6911f6dc28 in epoll_dispatch (base=0x2461000, tv=<value optimized out>) at epoll.c:404
#2 0x00007f6911f5ca4c in event_base_loop (base=0x2461000, flags=0) at event.c:1558
#3 0x0000000000409984 in main (argc=<value optimized out>, argv=<value optimized out>) at daemon/memcached.c:7459

Comment by Chiyoung Seo [ 21/Dec/11 ]
This is the netstat output from the node that caused this rebalance to hang:

[ec2-user@ip-10-6-87-130 ~]$ netstat -ntp
(No info could be read for "-p": geteuid()=222 but you should be root.)
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 10.6.87.130:60709 10.191.7.242:21100 ESTABLISHED -
tcp 0 0 10.6.87.130:47520 10.6.117.231:21100 ESTABLISHED -
tcp 0 0 10.6.87.130:48473 10.6.89.137:21100 ESTABLISHED -
tcp 0 0 127.0.0.1:55531 127.0.0.1:4369 ESTABLISHED -
tcp 252160 260992 10.6.87.130:34078 10.6.87.130:11210 ESTABLISHED -
tcp 0 0 10.6.87.130:11210 10.6.87.130:41679 ESTABLISHED -
tcp 0 0 10.6.87.130:11210 10.6.87.130:37700 ESTABLISHED -
tcp 0 0 10.6.87.130:47403 10.6.87.130:11210 ESTABLISHED -
tcp 0 0 10.6.87.130:21100 10.6.133.36:50821 ESTABLISHED -
tcp 0 0 10.6.87.130:44795 10.6.151.112:11210 ESTABLISHED -
tcp 0 0 10.6.87.130:22 10.68.154.26:42398 ESTABLISHED -
tcp 0 0 127.0.0.1:8091 127.0.0.1:53757 ESTABLISHED -
tcp 0 0 10.6.87.130:41679 10.6.87.130:11210 ESTABLISHED -
tcp 74460 485120 10.6.87.130:11210 10.6.87.130:34078 ESTABLISHED -
Comment by Karan Kumar (Inactive) [ 26/Dec/11 ]
Not able to reproduce this with the 1.8.0r-51 build. Will keep running more rebalance tests on this build.
Comment by Aleksey Kondratenko [ 28/Dec/11 ]
We believe this was caused by ebucketmigrator not buffering enough nacks in its upstream direction, causing a deadlock: ebucketmigrator was waiting for the upstream memcached/ep-engine to consume the nacks it had sent, while memcached/ep-engine was waiting for ebucketmigrator to consume the TAP messages sent to it.

The fix is here: http://review.couchbase.org/11859 and is already merged.
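To make the failure mode concrete, here is a minimal sketch (in Python; the actual fix is in ns_server's Erlang ebucketmigrator) of the buffering pattern: queue upstream-bound acks and send them opportunistically instead of blocking on the socket, so a full kernel send buffer (as in the netstat output above, where both Recv-Q and Send-Q on the loopback connection are non-zero) can never stall the reader:

# Illustrative sketch only, not the actual fix; socket names and sizes are hypothetical.
import collections, socket

class AckBuffer:
    """Queue upstream-bound ack bytes and flush them without ever blocking."""
    def __init__(self, upstream_sock):
        self.sock = upstream_sock
        self.sock.setblocking(False)      # send() must never block the event loop
        self.pending = collections.deque()

    def queue_ack(self, ack_bytes):
        # Buffer instead of writing synchronously; a synchronous write against a
        # full kernel send buffer is exactly the mutual wait described above.
        self.pending.append(ack_bytes)
        self.flush()

    def flush(self):
        while self.pending:
            try:
                sent = self.sock.send(self.pending[0])
            except BlockingIOError:
                return                    # kernel buffer full; retry on the next writable event
            if sent < len(self.pending[0]):
                self.pending[0] = self.pending[0][sent:]
                return
            self.pending.popleft()

# Tiny demonstration over a local socket pair.
if __name__ == "__main__":
    a, b = socket.socketpair()
    buf = AckBuffer(a)
    buf.queue_ack(b"TAP_ACK seqno=58000\n")
    print(b.recv(64))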
Comment by Karan Kumar (Inactive) [ 03/Jan/12 ]
Have not been able to hit this issue again.
Comment by Karan Kumar (Inactive) [ 14/Feb/12 ]
Closing old tickets.
Comment by Perry Krug [ 20/Feb/12 ]
Reopening temporarily