[MB-4864] rebalancing can get stuck due to a bug detecting the backfill completion during vbucket takeover Created: 02/Mar/12  Updated: 31/Jan/14  Resolved: 14/Mar/12

Status: Closed
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 1.8.0
Fix Version/s: 1.8.1, 2.0-beta
Security Level: Public

Type: Bug Priority: Major
Reporter: Thuan Nguyen Assignee: Chiyoung Seo
Resolution: Fixed Votes: 0
Labels: 1.8.0-release-notes, 1.8.1-release-notes
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: windows 2008 R2 64bit

Attachments: Text File 107.22.11.161_tap.txt     Text File 107.22.70.136_tap.txt     Zip Archive 107.22.84.123-8091-diag.txt.zip     Text File 107.22.84.123_stat.txt     Text File 107.22.84.123_tap.txt     Text File 107.22.84.123_tap.txt     Zip Archive 23.20.45.23-8091-diag.txt.zip     Text File 23.20.45.23_stat.txt     Text File 23.20.45.23_tap.txt     Text File 23.20.45.23_tap.txt     Zip Archive 23.20.50.242-8091-diag.txt.zip     Text File 23.20.50.242_stat.txt     Text File 23.20.50.242_tap.txt     Text File 23.20.50.242_tap.txt     Zip Archive 50.17.157.98-8091-diag.txt.zip     Text File 50.17.157.98_stat.txt     Text File 50.17.157.98_tap.txt    

 Description   
Install Couchbase Server 1.8.0 release with hotfix MB-4738 on a 4-node cluster (12 GB RAM per node) in EC2.
Load 38 million items into the cluster.
Resident ratio: 41%
Data size on disk: 78GB
Remove a node.
Rebalance OK.
Add another node in (not the removed node).
Rebalance hangs at a little over 80%.


 Comments   
Comment by Thuan Nguyen [ 08/Mar/12 ]
Rebalanced out 2 nodes and the rebalance hung. TAP stats for the vbucket 169 takeover connection:

eq_tapq:rebalance_169:ack_log_size: 0
 eq_tapq:rebalance_169:ack_playback_size: 0
 eq_tapq:rebalance_169:ack_seqno: 41291
 eq_tapq:rebalance_169:ack_window_full: false
 eq_tapq:rebalance_169:backfill_completed: false
 eq_tapq:rebalance_169:bg_backlog_size: 0
 eq_tapq:rebalance_169:bg_jobs_completed: 37724
 eq_tapq:rebalance_169:bg_jobs_issued: 37724
 eq_tapq:rebalance_169:bg_queued: 37724
 eq_tapq:rebalance_169:bg_result_size: 0
 eq_tapq:rebalance_169:bg_results: 0
 eq_tapq:rebalance_169:bg_wait_for_results: false
 eq_tapq:rebalance_169:complete: false
 eq_tapq:rebalance_169:connected: true
 eq_tapq:rebalance_169:created: 1019807
 eq_tapq:rebalance_169:empty: false
 eq_tapq:rebalance_169:flags: 93 (ack,backfill,vblist,takeover,checkpoints)
 eq_tapq:rebalance_169:has_item: false
 eq_tapq:rebalance_169:has_queued_item: true
 eq_tapq:rebalance_169:idle: false
 eq_tapq:rebalance_169:num_tap_nack: 0
 eq_tapq:rebalance_169:num_tap_tmpfail_survivors: 0
 eq_tapq:rebalance_169:paused: 1
 eq_tapq:rebalance_169:pending_backfill: false
 eq_tapq:rebalance_169:pending_disconnect: false
 eq_tapq:rebalance_169:pending_disk_backfill: false
 eq_tapq:rebalance_169:qlen: 0
 eq_tapq:rebalance_169:qlen_high_pri: 0
 eq_tapq:rebalance_169:qlen_low_pri: 1
 eq_tapq:rebalance_169:queue_backfillremaining: 0
 eq_tapq:rebalance_169:queue_backoff: 0
 eq_tapq:rebalance_169:queue_drain: 41329
 eq_tapq:rebalance_169:queue_fill: 0
 eq_tapq:rebalance_169:queue_itemondisk: 0
 eq_tapq:rebalance_169:queue_memory: 0
 eq_tapq:rebalance_169:rec_fetched: 3609
 eq_tapq:rebalance_169:recv_ack_seqno: 41290
 eq_tapq:rebalance_169:reserved: 1
 eq_tapq:rebalance_169:seqno_ack_requested: 41290
 eq_tapq:rebalance_169:supports_ack: true
 eq_tapq:rebalance_169:suspended: false
 eq_tapq:rebalance_169:total_backlog_size: 1
 eq_tapq:rebalance_169:total_noops: 7935
 eq_tapq:rebalance_169:type: producer
 eq_tapq:rebalance_169:vb_filter: { 169 }
 eq_tapq:rebalance_169:vb_filters: 1
Comment by Chiyoung Seo [ 14/Mar/12 ]
Fixed in 1.8.1 branch
Comment by Farshid Ghods (Inactive) [ 11/Apr/12 ]
The cause was a bug in detecting backfill completion for a vbucket takeover during rebalance.

For example, the following are the TAP stats for vbucket 685 takeover:

 eq_tapq:rebalance_685:ack_window_full: false
 eq_tapq:rebalance_685:backfill_completed: false
 ...
 eq_tapq:rebalance_685:pending_backfill: false
 eq_tapq:rebalance_685:pending_disconnect: false
 eq_tapq:rebalance_685:pending_disk_backfill: false
 eq_tapq:rebalance_685:queue_backfillremaining: 0

You can see that there are no items remaining for backfill, but the "backfill_completed" flag is still false, which caused the takeover operation to get stuck.
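
The symptom described above can be checked mechanically against a saved TAP stats dump such as the attached *_tap.txt files. The following is a minimal Python sketch, assuming the dump uses the same "eq_tapq:<connection>:<stat>: <value>" line format shown in this issue. The function names (parse_tap_stats, looks_stuck, find_stuck_takeovers) are hypothetical helpers written for illustration; they are not part of Couchbase Server or its tools, and this is a diagnostic heuristic only, not the fix that went into the 1.8.1 branch.

```python
from collections import defaultdict

# Sample taken from the vbucket 685 stats quoted in this issue.
SAMPLE = """
 eq_tapq:rebalance_685:ack_window_full: false
 eq_tapq:rebalance_685:backfill_completed: false
 ...
 eq_tapq:rebalance_685:pending_backfill: false
 eq_tapq:rebalance_685:pending_disconnect: false
 eq_tapq:rebalance_685:pending_disk_backfill: false
 eq_tapq:rebalance_685:queue_backfillremaining: 0
"""


def parse_tap_stats(text):
    """Group 'eq_tapq:<conn>:<stat>: <value>' lines by connection name."""
    conns = defaultdict(dict)
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("eq_tapq:"):
            continue  # skip blank lines, "..." and unrelated stats
        key, sep, value = line.partition(": ")
        if not sep:
            continue
        parts = key.split(":", 2)
        if len(parts) != 3:
            continue
        _, name, stat = parts
        conns[name][stat] = value.strip()
    return dict(conns)


def looks_stuck(stats):
    """True when nothing is left to backfill but completion was never flagged."""
    return (
        stats.get("backfill_completed") == "false"
        and stats.get("pending_backfill") == "false"
        and stats.get("pending_disk_backfill") == "false"
        and stats.get("queue_backfillremaining") == "0"
    )


def find_stuck_takeovers(conns):
    """Return the sorted names of connections matching the symptom."""
    return sorted(name for name, stats in conns.items() if looks_stuck(stats))


if __name__ == "__main__":
    print(find_stuck_takeovers(parse_tap_stats(SAMPLE)))
```

A single snapshot matching this pattern is only a hint. Comparing two dumps taken some time apart, and confirming that the connection still matches, gives more confidence that the takeover is actually stuck rather than about to complete.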
Generated at Sat Sep 20 01:50:29 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.