[MB-4864] rebalancing can get stuck due to a bug detecting the backfill completion during vbucket takeover Created: 02/Mar/12  Updated: 31/Jan/14  Resolved: 14/Mar/12

Status: Closed
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 1.8.0
Fix Version/s: 1.8.1, 2.0-beta
Security Level: Public

Type: Bug Priority: Major
Reporter: Thuan Nguyen Assignee: Chiyoung Seo
Resolution: Fixed Votes: 0
Labels: 1.8.0-release-notes, 1.8.1-release-notes
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Windows 2008 R2 64-bit


Install Couchbase Server 1.8.0 release with hotfix MB-4738 on a 4-node cluster (12 GB RAM per node) in EC2.
Load 38 million items into the cluster.
Resident ratio: 41%
Data size on disk: 78GB
Remove a node.
Rebalance OK.
Add another node (not the removed node).
Rebalance hangs at around 80+%.

Comment by Thuan Nguyen [ 08/Mar/12 ]
Rebalance out 2 nodes and the rebalance hangs.

eq_tapq:rebalance_169:ack_log_size: 0
 eq_tapq:rebalance_169:ack_playback_size: 0
 eq_tapq:rebalance_169:ack_seqno: 41291
 eq_tapq:rebalance_169:ack_window_full: false
 eq_tapq:rebalance_169:backfill_completed: false
 eq_tapq:rebalance_169:bg_backlog_size: 0
 eq_tapq:rebalance_169:bg_jobs_completed: 37724
 eq_tapq:rebalance_169:bg_jobs_issued: 37724
 eq_tapq:rebalance_169:bg_queued: 37724
 eq_tapq:rebalance_169:bg_result_size: 0
 eq_tapq:rebalance_169:bg_results: 0
 eq_tapq:rebalance_169:bg_wait_for_results: false
 eq_tapq:rebalance_169:complete: false
 eq_tapq:rebalance_169:connected: true
 eq_tapq:rebalance_169:created: 1019807
 eq_tapq:rebalance_169:empty: false
 eq_tapq:rebalance_169:flags: 93 (ack,backfill,vblist,takeover,checkpoints)
 eq_tapq:rebalance_169:has_item: false
 eq_tapq:rebalance_169:has_queued_item: true
 eq_tapq:rebalance_169:idle: false
 eq_tapq:rebalance_169:num_tap_nack: 0
 eq_tapq:rebalance_169:num_tap_tmpfail_survivors: 0
 eq_tapq:rebalance_169:paused: 1
 eq_tapq:rebalance_169:pending_backfill: false
 eq_tapq:rebalance_169:pending_disconnect: false
 eq_tapq:rebalance_169:pending_disk_backfill: false
 eq_tapq:rebalance_169:qlen: 0
 eq_tapq:rebalance_169:qlen_high_pri: 0
 eq_tapq:rebalance_169:qlen_low_pri: 1
 eq_tapq:rebalance_169:queue_backfillremaining: 0
 eq_tapq:rebalance_169:queue_backoff: 0
 eq_tapq:rebalance_169:queue_drain: 41329
 eq_tapq:rebalance_169:queue_fill: 0
 eq_tapq:rebalance_169:queue_itemondisk: 0
 eq_tapq:rebalance_169:queue_memory: 0
 eq_tapq:rebalance_169:rec_fetched: 3609
 eq_tapq:rebalance_169:recv_ack_seqno: 41290
 eq_tapq:rebalance_169:reserved: 1
 eq_tapq:rebalance_169:seqno_ack_requested: 41290
 eq_tapq:rebalance_169:supports_ack: true
 eq_tapq:rebalance_169:suspended: false
 eq_tapq:rebalance_169:total_backlog_size: 1
 eq_tapq:rebalance_169:total_noops: 7935
 eq_tapq:rebalance_169:type: producer
 eq_tapq:rebalance_169:vb_filter: { 169 }
 eq_tapq:rebalance_169:vb_filters: 1
Comment by Chiyoung Seo [ 14/Mar/12 ]
Fixed in 1.8.1 branch
Comment by Farshid Ghods (Inactive) [ 11/Apr/12 ]
The reason was a bug in detecting backfill completion for a vbucket takeover during rebalance.

For example, the following are the TAP stats for vbucket 685 takeover:

 eq_tapq:rebalance_685:ack_window_full: false
 eq_tapq:rebalance_685:backfill_completed: false
 eq_tapq:rebalance_685:pending_backfill: false
 eq_tapq:rebalance_685:pending_disconnect: false
 eq_tapq:rebalance_685:pending_disk_backfill: false
 eq_tapq:rebalance_685:queue_backfillremaining: 0

You can see that there are no items remaining for backfill, but the "backfill_completed" flag is still false, which caused the takeover operation to get stuck.
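
The symptom described above can be checked for mechanically in a captured TAP stats dump. The following is a minimal Python sketch, assuming the stats are saved as text in the "eq_tapq:<name>:<stat>: <value>" form shown in this ticket. The function names parse_tap_stats and stuck_takeovers are hypothetical helpers, not part of any Couchbase tool, and a single matching snapshot is only a hint that a takeover may be stuck, not proof.

```python
# Sample taken from the vbucket 685 stats quoted in this ticket.
SAMPLE = """
 eq_tapq:rebalance_685:ack_window_full: false
 eq_tapq:rebalance_685:backfill_completed: false
 eq_tapq:rebalance_685:pending_backfill: false
 eq_tapq:rebalance_685:pending_disconnect: false
 eq_tapq:rebalance_685:pending_disk_backfill: false
 eq_tapq:rebalance_685:queue_backfillremaining: 0
"""


def parse_tap_stats(text):
    """Group 'eq_tapq:<name>:<stat>: <value>' lines into {name: {stat: value}}."""
    streams = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("eq_tapq:"):
            continue
        # Split the stat key from its value at the first ': '.
        key, sep, value = line.partition(": ")
        if not sep:
            continue
        parts = key.split(":", 2)
        if len(parts) != 3:
            continue
        _, name, stat = parts
        streams.setdefault(name, {})[stat] = value.strip()
    return streams


def stuck_takeovers(streams):
    """Return TAP stream names that show the symptom from this ticket:
    nothing left to backfill, yet backfill_completed is still false."""
    stuck = []
    for name, stats in streams.items():
        if (stats.get("backfill_completed") == "false"
                and stats.get("pending_backfill") == "false"
                and stats.get("pending_disk_backfill") == "false"
                and stats.get("queue_backfillremaining") == "0"):
            stuck.append(name)
    return sorted(stuck)


if __name__ == "__main__":
    print(stuck_takeovers(parse_tap_stats(SAMPLE)))
```

Streams that lack any of the four stats are never reported, so a truncated dump yields no false matches.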
Generated at Tue Jul 29 02:14:08 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.