[MB-4595] Rebalancing in a new 1.8 node when only 1 checkpoint is open results in an imbalanced cluster because ep-engine on the source node does not backfill items from the open checkpoint Created: 28/Dec/11  Updated: 31/Jan/14  Resolved: 25/Mar/12

Status: Closed
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 1.8.0
Fix Version/s: 1.8.1, 2.0-beta
Security Level: Public

Type: Bug Priority: Major
Reporter: Karan Kumar (Inactive) Assignee: Chiyoung Seo
Resolution: Fixed Votes: 0
Labels: 1.8.0-release-notes, 1.8.1-release-notes, 2.0-dev-preview-4-release-notes
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: CentOS 54-64 bit

Attachments: Zip Archive 10.112.27.9-diag.txt.zip     Zip Archive 10.12.27.94-diag.txt.zip     Zip Archive 10.34.78.189-diag.txt.zip     Zip Archive 10.38.90.119-diag.txt.zip     Zip Archive 10.83.47.43-diag.txt.zip     JPEG File rebalance.jpg    

 Description   
Keeping this cluster alive.
http://ec2-67-202-63-126.compute-1.amazonaws.com:8091/index.html#sec=overview

Steps
1) Create a 4-node cluster on 1.7.2.
2) Upgrade the 4-node cluster to 1.8.0.
3) Rebalance in 2 new 1.8.0 nodes at the same time (10.83.47.43 and 10.112.27.9 are the brand-new 1.8.0 nodes that were added to the cluster).
4) After the rebalance, the active/replica item count on the newly added nodes (10.83.47.43 and 10.112.27.9) is 0.
5) Post rebalance, I was able to load data into 10.83.47.43 and 10.112.27.9 using the Python loader.

Attaching logs from all the nodes.

 Comments   
Comment by Karan Kumar (Inactive) [ 28/Dec/11 ]
I was able to reproduce this on a separate cluster.

Attaching the screenshot after rebalance.
Comment by Karan Kumar (Inactive) [ 28/Dec/11 ]
Also, before the rebalance the cluster had 54K keys in total (on 4 nodes).
After the rebalance I see only 36K in total (in the screenshot).
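
The 54K and 36K figures are consistent with the new nodes holding no items. A minimal arithmetic sketch, assuming this separate cluster also grew from 4 to 6 nodes (as in the original steps) and that items are spread evenly across nodes; the function name is hypothetical and the comment itself does not state either assumption:

```python
def visible_items_after_rebalance(total_items, old_nodes, new_nodes):
    """Items still visible if the newly added nodes end up holding 0 items."""
    total_nodes = old_nodes + new_nodes
    # Assumption: active vbuckets (and therefore items) are spread evenly
    # over all nodes after the rebalance.
    lost = total_items * new_nodes // total_nodes
    return total_items - lost


# 2 of 6 nodes own one third of the data; if they are empty, 36K of 54K remain.
print(visible_items_after_rebalance(54_000, old_nodes=4, new_nodes=2))  # 36000
```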

Comment by Karan Kumar (Inactive) [ 28/Dec/11 ]
Assigning this to Chiyoung.
This turned out to be a bug in ep-engine with checkpoint management.

We want to make sure that, before rebalancing new nodes into the cluster, the old nodes have a checkpoint ID greater than 1.

Verified that the issue does not occur when the checkpoint ID is greater than 1.
Comment by Farshid Ghods (Inactive) [ 29/Dec/11 ]
This bug is not a regression and exists in current installations, but it only happens if the user restarts a node and then attempts to add and rebalance new nodes while the existing nodes have only 1 checkpoint (the cluster has been running for less than an hour, or there are fewer than 500k items in the cluster).
Comment by Chiyoung Seo [ 29/Dec/11 ]
This is a bug in checkpoint synchronization, but not a blocker for the 1.8 release.

This issue happens in the following scenario:

1) Set up a 1.7.x cluster and add a very small number of items (e.g., 100K items) to it. At this point, each active vbucket has only one checkpoint, with ID 1.
2) Shut down the 1.7.x cluster, upgrade it to 1.8, and restart the cluster.
3) During warmup, each vbucket is loaded from disk and its open checkpoint ID is restored from the vbucket_state table on disk. In this case, each active vbucket's open checkpoint still has ID 1 but contains no items.
4) Add a brand-new 1.8 node to the cluster and rebalance.
5) Some active vbuckets are taken over by this new node. However, the takeover does not trigger backfill operations on the existing nodes because they have open checkpoint ID 1 for all active vbuckets and the new node starts with open checkpoint 1 as well.
6) After rebalance, the new node has 0 items on its active vbuckets.
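
The scenario above can be sketched as a toy model. This is not ep-engine code: the class, the function names, and the exact rule (backfill from disk only when the client's checkpoint ID is behind the source's) are assumptions made to illustrate the steps as described.

```python
from dataclasses import dataclass, field


@dataclass
class VBucket:
    """Toy vbucket: items persisted on disk plus one open checkpoint."""
    disk_items: dict = field(default_factory=dict)
    open_checkpoint_id: int = 1
    open_checkpoint_items: dict = field(default_factory=dict)


def warmup(disk_items, persisted_checkpoint_id):
    """Step 3: items come back from disk, the open checkpoint ID is restored,
    but the open checkpoint itself starts out empty."""
    return VBucket(disk_items=dict(disk_items),
                   open_checkpoint_id=persisted_checkpoint_id)


def needs_backfill(source, client_checkpoint_id):
    """Assumed pre-fix rule: backfill only if the client is behind the source."""
    return client_checkpoint_id < source.open_checkpoint_id


def takeover(source, client_checkpoint_id=1):
    """Step 5: return the items a new node would receive for this vbucket."""
    received = {}
    if needs_backfill(source, client_checkpoint_id):
        received.update(source.disk_items)      # backfill from disk
    received.update(source.open_checkpoint_items)  # stream the open checkpoint
    return received


items = {f"key{i}": i for i in range(100)}
print(len(takeover(warmup(items, 1))))  # 0: IDs match, no backfill, empty checkpoint
print(len(takeover(warmup(items, 2))))  # 100: client is behind, backfill runs
```

In this model the new node ends up with 0 items only when both sides report open checkpoint 1, which matches the observation that the issue does not occur once the checkpoint ID is greater than 1.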

The above scenario is a corner case: we tested it with a 1.7.x cluster that had been running for a very short time, less than the 10-minute interval for creating a new checkpoint. It is therefore unlikely to happen in our customers' clusters, which have been running 1.7.x membase for a long time and have large open checkpoint IDs for active vbuckets.
Comment by Aleksey Kondratenko [ 14/Feb/12 ]
I'm able to reproduce that deterministically on 2.0 with lots of items. With checkpoint ids greater than 0 and 1.
Comment by Farshid Ghods (Inactive) [ 14/Feb/12 ]
Then that's a different bug. Can you please open a new issue and mark it as a blocker for 2.0 and 1.8.1 (since 1.8 is now based on the master branch)?
Comment by Chiyoung Seo [ 27/Feb/12 ]
As Mike is booked with QA-related tasks, I will work on this issue for the 1.8.1 release.
Comment by Chiyoung Seo [ 25/Mar/12 ]
http://review.couchbase.org/#change,14297
Comment by Thuan Nguyen [ 26/Mar/12 ]
Integrated in github-ep-engine-2-0 #230 (See [http://qa.hq.northscale.net/job/github-ep-engine-2-0/230/])
    MB-4595 Schedule backfill for a fresh client with empty data (Revision 303ab54372e422b122116a85f2f084071b1491ff)

     Result = SUCCESS
Chiyoung Seo :
Files :
* checkpoint.cc
* tapconnection.cc
* ep_testsuite.cc
* checkpoint.hh
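
The change title above ("Schedule backfill for a fresh client with empty data") suggests a decision rule along the following lines. This is only a sketch of the idea, not the code in checkpoint.cc or tapconnection.cc; the function name, its parameters, and the "zero items means fresh" test are all hypothetical.

```python
def should_schedule_backfill(source_open_ckpt_id, client_open_ckpt_id,
                             client_item_count):
    """Hypothetical rule: backfill when the client is behind OR has no data."""
    # Assumed original condition: the client's checkpoint ID lags the source's.
    client_is_behind = client_open_ckpt_id < source_open_ckpt_id
    # Assumed added condition: a brand-new node that holds no items yet.
    client_is_fresh = client_item_count == 0
    return client_is_behind or client_is_fresh


# Both sides at checkpoint 1, new node empty: backfill is now scheduled.
print(should_schedule_backfill(1, 1, client_item_count=0))   # True
# Both sides at checkpoint 1, client already has data: no backfill needed.
print(should_schedule_backfill(1, 1, client_item_count=10))  # False
```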
Generated at Thu Aug 28 20:19:54 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.