[MB-4864] rebalancing can get stuck due to a bug detecting the backfill completion during vbucket takeover Created: 02/Mar/12 Updated: 09/Jan/13 Resolved: 14/Mar/12 |
|
| Status: | Closed |
| Project: | Couchbase Server |
| Component/s: | couchbase-bucket |
| Affects Version/s: | 1.8.0 |
| Fix Version/s: | 1.8.1, 1.8.2 |
| Security Level: | Public |
| Type: | Bug | Priority: | Major |
| Reporter: | Thuan Nguyen | Assignee: | Chiyoung Seo |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | 1.8.0-release-notes, 1.8.1-release-notes | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | windows 2008 R2 64bit | ||
| Attachments: |
|
| Description |
|
Install the Couchbase Server 1.8.0 release with hotfix MB-4738 on a 4-node cluster (12 GB RAM per node) in EC2.
Load 38 million items into the cluster. Resident ratio: 41%. Data size on disk: 78 GB.
Remove a node and rebalance: OK.
Add another node (not the one that was removed) and rebalance: the rebalance hangs at around 80+%. |
| Comments |
| Comment by Thuan Nguyen [ 08/Mar/12 ] |
|
Rebalanced out 2 nodes and the rebalance hung. TAP stats for the stuck connection (vbucket 169):

eq_tapq:rebalance_169:ack_log_size: 0
eq_tapq:rebalance_169:ack_playback_size: 0
eq_tapq:rebalance_169:ack_seqno: 41291
eq_tapq:rebalance_169:ack_window_full: false
eq_tapq:rebalance_169:backfill_completed: false
eq_tapq:rebalance_169:bg_backlog_size: 0
eq_tapq:rebalance_169:bg_jobs_completed: 37724
eq_tapq:rebalance_169:bg_jobs_issued: 37724
eq_tapq:rebalance_169:bg_queued: 37724
eq_tapq:rebalance_169:bg_result_size: 0
eq_tapq:rebalance_169:bg_results: 0
eq_tapq:rebalance_169:bg_wait_for_results: false
eq_tapq:rebalance_169:complete: false
eq_tapq:rebalance_169:connected: true
eq_tapq:rebalance_169:created: 1019807
eq_tapq:rebalance_169:empty: false
eq_tapq:rebalance_169:flags: 93 (ack,backfill,vblist,takeover,checkpoints)
eq_tapq:rebalance_169:has_item: false
eq_tapq:rebalance_169:has_queued_item: true
eq_tapq:rebalance_169:idle: false
eq_tapq:rebalance_169:num_tap_nack: 0
eq_tapq:rebalance_169:num_tap_tmpfail_survivors: 0
eq_tapq:rebalance_169:paused: 1
eq_tapq:rebalance_169:pending_backfill: false
eq_tapq:rebalance_169:pending_disconnect: false
eq_tapq:rebalance_169:pending_disk_backfill: false
eq_tapq:rebalance_169:qlen: 0
eq_tapq:rebalance_169:qlen_high_pri: 0
eq_tapq:rebalance_169:qlen_low_pri: 1
eq_tapq:rebalance_169:queue_backfillremaining: 0
eq_tapq:rebalance_169:queue_backoff: 0
eq_tapq:rebalance_169:queue_drain: 41329
eq_tapq:rebalance_169:queue_fill: 0
eq_tapq:rebalance_169:queue_itemondisk: 0
eq_tapq:rebalance_169:queue_memory: 0
eq_tapq:rebalance_169:rec_fetched: 3609
eq_tapq:rebalance_169:recv_ack_seqno: 41290
eq_tapq:rebalance_169:reserved: 1
eq_tapq:rebalance_169:seqno_ack_requested: 41290
eq_tapq:rebalance_169:supports_ack: true
eq_tapq:rebalance_169:suspended: false
eq_tapq:rebalance_169:total_backlog_size: 1
eq_tapq:rebalance_169:total_noops: 7935
eq_tapq:rebalance_169:type: producer
eq_tapq:rebalance_169:vb_filter: { 169 }
eq_tapq:rebalance_169:vb_filters: 1 |
| Comment by Chiyoung Seo [ 14/Mar/12 ] |
| Fixed in the 1.8.1 branch. |
| Comment by Farshid Ghods [ 11/Apr/12 ] |
|
The cause was a bug in detecting backfill completion for a vbucket takeover during rebalance.
For example, the following are the TAP stats for the vbucket 685 takeover:

eq_tapq:rebalance_685:ack_window_full: false
eq_tapq:rebalance_685:backfill_completed: false
...
eq_tapq:rebalance_685:pending_backfill: false
eq_tapq:rebalance_685:pending_disconnect: false
eq_tapq:rebalance_685:pending_disk_backfill: false
eq_tapq:rebalance_685:queue_backfillremaining: 0

These stats show that no items remain for backfill, yet the "backfill_completed" flag is still false, which caused the takeover operation to get stuck. |
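
The symptom described in this ticket can be checked mechanically against a TAP stats dump. The following is a minimal Python sketch, not part of the fix and not a Couchbase API: the function names `parse_tap_stats` and `takeover_looks_stuck` are hypothetical, and the check only encodes the combination of stat values quoted in the comments above (no pending or remaining backfill, but `backfill_completed` still false). A healthy connection might match this pattern briefly, so treat a match as a hint to look closer rather than proof of the bug.

```python
def parse_tap_stats(text):
    """Parse lines like 'eq_tapq:rebalance_685:backfill_completed: false'
    into {connection_name: {stat_name: value_string}}."""
    stats = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("eq_tapq:"):
            continue  # skip blanks, "..." elisions, and unrelated lines
        key, sep, value = line.partition(": ")
        parts = key.split(":", 2)  # prefix, connection name, stat name
        if not sep or len(parts) != 3:
            continue  # malformed line, ignore it
        _, conn, stat = parts
        stats.setdefault(conn, {})[stat] = value.strip()
    return stats


def takeover_looks_stuck(conn_stats):
    """Heuristic taken from the symptom in this ticket (an assumption,
    not the server's own logic): nothing is left to backfill, but the
    backfill_completed flag was never set."""
    return (
        conn_stats.get("backfill_completed") == "false"
        and conn_stats.get("pending_backfill") == "false"
        and conn_stats.get("pending_disk_backfill") == "false"
        and conn_stats.get("queue_backfillremaining") == "0"
    )


if __name__ == "__main__":
    # Stats copied from the vbucket 685 example in this ticket.
    sample = """
    eq_tapq:rebalance_685:ack_window_full: false
    eq_tapq:rebalance_685:backfill_completed: false
    eq_tapq:rebalance_685:pending_backfill: false
    eq_tapq:rebalance_685:pending_disconnect: false
    eq_tapq:rebalance_685:pending_disk_backfill: false
    eq_tapq:rebalance_685:queue_backfillremaining: 0
    """
    for conn, conn_stats in parse_tap_stats(sample).items():
        print(conn, "stuck?", takeover_looks_stuck(conn_stats))
```

The parser keeps every value as a string so that entries such as `flags: 93 (ack,backfill,vblist,takeover,checkpoints)` and `vb_filter: { 169 }` survive unchanged; only the four stats used by the check are compared.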