[MB-4828] rebalancing multiple nodes can hang if a bucket has less than 100k items due to a race condition in tap take-over Created: 23/Feb/12 Updated: 10/Jan/13 Resolved: 24/Feb/12 |
|
| Status: | Closed |
| Project: | Couchbase Server |
| Component/s: | couchbase-bucket |
| Affects Version/s: | 1.8.0 |
| Fix Version/s: | 1.8.1, 1.8.2, 2.0 |
| Security Level: | Public |
| Type: | Bug | Priority: | Major |
| Reporter: | Farshid Ghods | Assignee: | Chiyoung Seo |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | 1.8.0-release-notes, 1.8.1-release-notes, customer | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Description |
|
This was observed by one of our users, who had 10 buckets. Some buckets had fewer than 10k items, and tap takeover got stuck.
tap stats:
vb_1014:cursor_checkpoint_id:eq_tapq:rebalance_1014: 1
eq_tapq:rebalance_1014:ack_log_size: 0
eq_tapq:rebalance_1014:ack_playback_size: 0
eq_tapq:rebalance_1014:ack_seqno: 10
eq_tapq:rebalance_1014:ack_window_full: false
eq_tapq:rebalance_1014:backfill_completed: false
eq_tapq:rebalance_1014:bg_backlog_size: 0
eq_tapq:rebalance_1014:bg_jobs_completed: 0
eq_tapq:rebalance_1014:bg_jobs_issued: 0
eq_tapq:rebalance_1014:bg_queued: 0
eq_tapq:rebalance_1014:bg_result_size: 0
eq_tapq:rebalance_1014:bg_results: 0
eq_tapq:rebalance_1014:bg_wait_for_results: false
eq_tapq:rebalance_1014:complete: false
eq_tapq:rebalance_1014:connected: true
eq_tapq:rebalance_1014:created: 1272317
eq_tapq:rebalance_1014:empty: false
eq_tapq:rebalance_1014:flags: 93 (ack,backfill,vblist,takeover,checkpoints)
eq_tapq:rebalance_1014:has_item: false
eq_tapq:rebalance_1014:has_queued_item: true
eq_tapq:rebalance_1014:idle: false
eq_tapq:rebalance_1014:num_tap_nack: 0
eq_tapq:rebalance_1014:num_tap_tmpfail_survivors: 0
eq_tapq:rebalance_1014:paused: 1
eq_tapq:rebalance_1014:pending_backfill: false
eq_tapq:rebalance_1014:pending_disconnect: false
eq_tapq:rebalance_1014:pending_disk_backfill: false
eq_tapq:rebalance_1014:qlen: 0
eq_tapq:rebalance_1014:qlen_high_pri: 0
eq_tapq:rebalance_1014:qlen_low_pri: 1
eq_tapq:rebalance_1014:queue_backfillremaining: 0
eq_tapq:rebalance_1014:queue_backoff: 0
eq_tapq:rebalance_1014:queue_drain: 0
eq_tapq:rebalance_1014:queue_fill: 0
eq_tapq:rebalance_1014:queue_itemondisk: 0
eq_tapq:rebalance_1014:queue_memory: 0
eq_tapq:rebalance_1014:rec_fetched: 5
eq_tapq:rebalance_1014:recv_ack_seqno: 8
eq_tapq:rebalance_1014:reserved: 1
eq_tapq:rebalance_1014:seqno_ack_requested: 9
eq_tapq:rebalance_1014:supports_ack: true
eq_tapq:rebalance_1014:suspended: false
eq_tapq:rebalance_1014:total_backlog_size: 10
eq_tapq:rebalance_1014:total_noops: 20036
eq_tapq:rebalance_1014:type: producer
eq_tapq:rebalance_1014:vb_filter: { 1014 }
eq_tapq:rebalance_1014:vb_filters: 1 |
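For anyone checking whether another cluster is in the same state, the sketch below parses tap stats for one connection and flags the pattern seen in this report: a takeover connection whose backfill is never marked complete even though nothing is pending or queued. It assumes the stats are available as text with one `name: value` pair per line. The function names and the heuristic itself are illustrative assumptions drawn only from the values above, not an official diagnostic.

```python
def parse_tap_stats(text, conn="eq_tapq:rebalance_1014"):
    """Return {stat_name: value} for lines shaped like '<conn>:<stat>: <value>'."""
    stats = {}
    prefix = conn + ":"
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith(prefix):
            continue  # skip stats that belong to other connections
        name, sep, value = line[len(prefix):].partition(": ")
        if sep:
            stats[name] = value.strip()
    return stats


def looks_stuck(stats):
    """Heuristic (an assumption): takeover stream with nothing left to send,
    yet backfill was never flagged as completed."""
    return (
        "takeover" in stats.get("flags", "")
        and stats.get("backfill_completed") == "false"
        and stats.get("pending_backfill") == "false"
        and stats.get("bg_jobs_issued") == "0"
        and stats.get("qlen") == "0"
    )


# A subset of the stats from this report, one per line.
SAMPLE = """\
eq_tapq:rebalance_1014:backfill_completed: false
eq_tapq:rebalance_1014:bg_jobs_issued: 0
eq_tapq:rebalance_1014:flags: 93 (ack,backfill,vblist,takeover,checkpoints)
eq_tapq:rebalance_1014:pending_backfill: false
eq_tapq:rebalance_1014:qlen: 0
"""

if __name__ == "__main__":
    print(looks_stuck(parse_tap_stats(SAMPLE)))
```

A match is only a hint: a healthy takeover can briefly show the same values, so the check is meaningful only if the state persists across repeated samples.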
| Comments |
| Comment by Farshid Ghods [ 23/Feb/12 ] |
| Workaround: pad the bucket with more items (100k). |
| Comment by Chiyoung Seo [ 23/Feb/12 ] |
|
There is a very small time window in which a race condition can occur when detecting backfill completion for a vbucket takeover with a small number of items (e.g., 10 items per vbucket). The fix for this issue is now in Gerrit for review:
http://review.couchbase.org/#change,13562
Farshid plans to reproduce this issue on a Windows cluster. |
| Comment by Thuan Nguyen [ 24/Feb/12 ] |
|
Integrated in github-ep-engine-2-0 #205 (See [http://qa.hq.northscale.net/job/github-ep-engine-2-0/205/]) Result = SUCCESS Chiyoung Seo : Files : * tapconnection.cc |
| Comment by Chiyoung Seo [ 24/Feb/12 ] |
| http://review.couchbase.org/#change,13562 |