[MB-4828] rebalancing multiple nodes can hang if a bucket has less than 100k items due to a race condition in tap take-over Created: 23/Feb/12  Updated: 31/Jan/14  Resolved: 24/Feb/12

Status: Closed
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 1.8.0
Fix Version/s: 1.8.1, 2.0-beta, 2.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Farshid Ghods (Inactive) Assignee: Chiyoung Seo
Resolution: Fixed Votes: 0
Labels: 1.8.0-release-notes, 1.8.1-release-notes, customer
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
This was observed by one of our users, who had 10 buckets. Some buckets had fewer than 10k items, and tap takeover got stuck.


Tap stats:

   6179: vb_1014:cursor_checkpoint_id:eq_tapq:rebalance_1014: 1
  97312: eq_tapq:rebalance_1014:ack_log_size: 0
  97313: eq_tapq:rebalance_1014:ack_playback_size: 0
  97314: eq_tapq:rebalance_1014:ack_seqno: 10
  97315: eq_tapq:rebalance_1014:ack_window_full: false
  97316: eq_tapq:rebalance_1014:backfill_completed: false
  97317: eq_tapq:rebalance_1014:bg_backlog_size: 0
  97318: eq_tapq:rebalance_1014:bg_jobs_completed: 0
  97319: eq_tapq:rebalance_1014:bg_jobs_issued: 0
  97320: eq_tapq:rebalance_1014:bg_queued: 0
  97321: eq_tapq:rebalance_1014:bg_result_size: 0
  97322: eq_tapq:rebalance_1014:bg_results: 0
  97323: eq_tapq:rebalance_1014:bg_wait_for_results: false
  97324: eq_tapq:rebalance_1014:complete: false
  97325: eq_tapq:rebalance_1014:connected: true
  97326: eq_tapq:rebalance_1014:created: 1272317
  97327: eq_tapq:rebalance_1014:empty: false
  97328: eq_tapq:rebalance_1014:flags: 93 (ack,backfill,vblist,takeover,checkpoints)
  97329: eq_tapq:rebalance_1014:has_item: false
  97330: eq_tapq:rebalance_1014:has_queued_item: true
  97331: eq_tapq:rebalance_1014:idle: false
  97332: eq_tapq:rebalance_1014:num_tap_nack: 0
  97333: eq_tapq:rebalance_1014:num_tap_tmpfail_survivors: 0
  97334: eq_tapq:rebalance_1014:paused: 1
  97335: eq_tapq:rebalance_1014:pending_backfill: false
  97336: eq_tapq:rebalance_1014:pending_disconnect: false
  97337: eq_tapq:rebalance_1014:pending_disk_backfill: false
  97338: eq_tapq:rebalance_1014:qlen: 0
  97339: eq_tapq:rebalance_1014:qlen_high_pri: 0
  97340: eq_tapq:rebalance_1014:qlen_low_pri: 1
  97341: eq_tapq:rebalance_1014:queue_backfillremaining: 0
  97342: eq_tapq:rebalance_1014:queue_backoff: 0
  97343: eq_tapq:rebalance_1014:queue_drain: 0
  97344: eq_tapq:rebalance_1014:queue_fill: 0
  97345: eq_tapq:rebalance_1014:queue_itemondisk: 0
  97346: eq_tapq:rebalance_1014:queue_memory: 0
  97347: eq_tapq:rebalance_1014:rec_fetched: 5
  97348: eq_tapq:rebalance_1014:recv_ack_seqno: 8
  97349: eq_tapq:rebalance_1014:reserved: 1
  97350: eq_tapq:rebalance_1014:seqno_ack_requested: 9
  97351: eq_tapq:rebalance_1014:supports_ack: true
  97352: eq_tapq:rebalance_1014:suspended: false
  97353: eq_tapq:rebalance_1014:total_backlog_size: 10
  97354: eq_tapq:rebalance_1014:total_noops: 20036
  97355: eq_tapq:rebalance_1014:type: producer
  97356: eq_tapq:rebalance_1014:vb_filter: { 1014 }
  97357: eq_tapq:rebalance_1014:vb_filters: 1
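
The stats above show the pattern of the hang: the connection has the takeover flag set, no backfill is pending or remaining, and yet backfill_completed is still false while total_noops is already high. Below is a minimal sketch of a script that flags this pattern in saved tap stats. The function names (parse_tap_stats, looks_stalled) and the choice of conditions are assumptions made for illustration only; this is a heuristic derived from the single dump in this ticket, not an official diagnostic.

```python
import re

# A few lines copied from the dump above, used as sample input.
SAMPLE = """
   6179: vb_1014:cursor_checkpoint_id:eq_tapq:rebalance_1014: 1
  97316: eq_tapq:rebalance_1014:backfill_completed: false
  97318: eq_tapq:rebalance_1014:bg_jobs_completed: 0
  97319: eq_tapq:rebalance_1014:bg_jobs_issued: 0
  97328: eq_tapq:rebalance_1014:flags: 93 (ack,backfill,vblist,takeover,checkpoints)
  97335: eq_tapq:rebalance_1014:pending_backfill: false
  97337: eq_tapq:rebalance_1014:pending_disk_backfill: false
  97341: eq_tapq:rebalance_1014:queue_backfillremaining: 0
"""


def parse_tap_stats(text, conn):
    """Collect the stats of one tap connection into a dict of strings."""
    prefix = conn + ":"
    stats = {}
    for line in text.splitlines():
        # Drop an optional leading "NNNN:" line number, as seen in the dump.
        line = re.sub(r"^\s*\d+:\s*", "", line).strip()
        if ": " not in line:
            continue
        key, value = line.rsplit(": ", 1)
        if key.startswith(prefix):
            stats[key[len(prefix):]] = value.strip()
    return stats


def looks_stalled(stats):
    """Heuristic (an assumption, not an official check): a takeover
    connection with nothing left to backfill that has not been marked
    backfill-complete."""
    return (
        "takeover" in stats.get("flags", "")
        and stats.get("backfill_completed") == "false"
        and stats.get("pending_backfill") == "false"
        and stats.get("pending_disk_backfill") == "false"
        and stats.get("queue_backfillremaining") == "0"
        and stats.get("bg_jobs_issued") == stats.get("bg_jobs_completed")
    )


if __name__ == "__main__":
    tap = parse_tap_stats(SAMPLE, "eq_tapq:rebalance_1014")
    print("possibly stalled takeover:", looks_stalled(tap))
```

A single snapshot can match this pattern briefly during a healthy takeover, so it is more meaningful when the same result shows up across several dumps taken some time apart.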

 Comments   
Comment by Farshid Ghods (Inactive) [ 23/Feb/12 ]
Workaround: pad the bucket with more items (100k).
Comment by Chiyoung Seo [ 23/Feb/12 ]
There is a very small time window in which a race condition can occur when detecting backfill completion for a vbucket takeover with a small number of items (e.g., 10 items per vbucket). The fix for this issue is now in Gerrit for review:

http://review.couchbase.org/#change,13562

Farshid plans to reproduce this issue on a Windows cluster.
Comment by Thuan Nguyen [ 24/Feb/12 ]
Integrated in github-ep-engine-2-0 #205 (See [http://qa.hq.northscale.net/job/github-ep-engine-2-0/205/])
    MB-4828 Check backfill completion in TapProducer::nextFgFetched() (Revision 4140edecc57912851998f30e9bbe076ddff96fc5)

     Result = SUCCESS
Chiyoung Seo :
Files :
* tapconnection.cc
Comment by Chiyoung Seo [ 24/Feb/12 ]
http://review.couchbase.org/#change,13562
Generated at Sat Aug 30 03:10:22 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.