[MB-7290] Rebalance-in operation failed twice with "bulk_set_vbucket_state" failing with heavy front end load on an XDCR set up and with system in DGM (~65% resident ratio)
Created: 29/Nov/12  Updated: 19/Aug/14  Resolved: 10/Apr/13

Status: Closed
Project: Couchbase Server
Component/s: couchbase-bucket, ns_server
Affects Version/s: 2.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug
Priority: Major
Reporter: Abhinav Dangeti
Assignee: Mike Wiederhold
Resolution: Cannot Reproduce
Votes: 0
Labels: 2.0-release-notes
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: - 5:5 uni & bidirectional XDCR
- ec2 nodes with 15G RAM
- 12.04 Ubuntu LTS
- 400G disk space on each node
- http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_x86_64_2.0.0-1967-rel.deb.manifest.xml

Issue Links:
Relates to: MB-9636 Rebalance fails with reason : bulk_se... (Resolved)

Description
At the time of the rebalance failure:

+ 5 nodes rebalance in on each cluster
Cluster setup: c1:c2::10:10
biXDCR_bucket: c1 <---> c2
uniXDCR_src: c1 ---> c2 :uniXDCR_dest
Front end loads on c1 and c2 for biXDCR_bucket, and on c1 for uniXDCR_src.
c1: http://ec2-177-71-230-72.sa-east-1.compute.amazonaws.com:8091/
c2: http://ec2-175-41-186-167.ap-southeast-1.compute.amazonaws.com:8091/

On C1, the rebalance operation failed with this reason in the UI logs:

Rebalance exited with reason {{bulk_set_vbucket_state_failed,
  [{'ns_1@ec2-177-71-170-44.sa-east-1.compute.amazonaws.com',
    {'EXIT',
      {{timeout,
         {gen_server,call,
           ['ns_memcached-biXDCR_bucket',
            {set_vbucket,544,replica},
            180000]}},
       {gen_server,call,
         [{'janitor_agent-biXDCR_bucket',
           'ns_1@ec2-177-71-170-44.sa-east-1.compute.amazonaws.com'},
          {if_rebalance,<0.10136.88>,
            {update_vbucket_state,544,replica,
              undefined,undefined}},
          infinity]}}}}]},
 [{janitor_agent,bulk_set_vbucket_state,4},
  {ns_vbucket_mover,
    update_replication_post_move,3},
  {ns_vbucket_mover,handle_info,2},
  {gen_server,handle_msg,5},
  {proc_lib,init_p_do_apply,3}]}

The second time, rebalance failed with the following UI log message:

Rebalance exited with reason {{timeout,
    {gen_server,call,
      ['ns_memcached-biXDCR_bucket',
       {set_vbucket,849,active},
       180000]}},
  {gen_server,call,
    [{'janitor_agent-biXDCR_bucket',
      'ns_1@ec2-177-71-230-72.sa-east-1.compute.amazonaws.com'},
     {if_rebalance,<0.21090.114>,
       {update_vbucket_state,849,active,paused,
         undefined}},
     infinity]}}
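
Both failure messages wrap the same inner call: a set_vbucket request to ns_memcached that timed out after 180000 ms. As a rough triage aid, here is a minimal Python sketch that pulls the bucket, vbucket id, target state, and timeout out of such log text. It is not part of the original report; the function name find_set_vbucket_timeouts and the regular expression are illustrative assumptions based only on the two messages shown in this ticket, and other ns_server log shapes may not match.

```python
import re

# Hypothetical pattern, derived only from the two messages in this ticket:
#   ['ns_memcached-<bucket>', {set_vbucket,<id>,<state>}, <timeout_ms>]
SET_VBUCKET_TIMEOUT = re.compile(
    r"\['ns_memcached-(?P<bucket>[^']+)',\s*"
    r"\{set_vbucket,(?P<vbucket>\d+),(?P<state>\w+)\},\s*"
    r"(?P<timeout_ms>\d+)\]"
)


def find_set_vbucket_timeouts(log_text):
    """Return one dict per set_vbucket call found in log_text."""
    results = []
    for match in SET_VBUCKET_TIMEOUT.finditer(log_text):
        timeout_ms = int(match.group("timeout_ms"))
        results.append({
            "bucket": match.group("bucket"),
            "vbucket": int(match.group("vbucket")),
            "state": match.group("state"),
            "timeout_ms": timeout_ms,
            # Convert milliseconds to minutes for easier reading.
            "timeout_min": timeout_ms / 60000,
        })
    return results


# Excerpt of the second failure message, copied from this ticket.
SAMPLE = """
Rebalance exited with reason {{timeout,
{gen_server,call,
['ns_memcached-biXDCR_bucket',
{set_vbucket,849,active},
180000]}},
"""

if __name__ == "__main__":
    for hit in find_set_vbucket_timeouts(SAMPLE):
        print(hit)
```

Run against the two messages in this description, the sketch should pick out vbucket 544 (replica) and vbucket 849 (active), each with a 180000 ms (3 minute) timeout on bucket biXDCR_bucket.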

After giving it some time, the third rebalance did complete successfully.

Will attach the grabbed diags from one of the nodes at C1 in a bit.

Comments
Comment by Abhinav Dangeti [ 29/Nov/12 ]
Grabbed diags from C1's ec2-177-71-230-72.sa-east-1.compute.amazonaws.com :-
https://s3.amazonaws.com/bugdb/MB-7290/ec2-177-71-230-72.sa-east-1.compute.amazonaws.com-8091-diag.txt.gz
Comment by Junyi Xie (Inactive) [ 29/Nov/12 ]
Abhinav, the error was raised when ns_server was trying to set the vbucket state during rebalance under a heavy workload. Please talk to the ns_server team. Thanks.
Comment by Junyi Xie (Inactive) [ 29/Nov/12 ]
Please assign to ns_server team.
Comment by Aleksey Kondratenko [ 03/Dec/12 ]
Please explain what exactly is needed from me here. This looks like an ordinary timeout, in fact a memcached timeout (3 minutes is no joke).
Comment by Farshid Ghods (Inactive) [ 03/Dec/12 ]
filing this under memcached timeouts then
Comment by kzeller [ 05/Dec/12 ]
Added to RN:

Under a heavy load of write operations on two clusters and both bi-directional and uni-directional replications occurring via XDCR, Couchbase Server 2.0 may fail during rebalance.
Comment by Junyi Xie (Inactive) [ 06/Dec/12 ]
it has nothing to do with XDCR core code, remove xdcr from the component.
Comment by Farshid Ghods (Inactive) [ 10/Dec/12 ]
deferring to 2.1 per bug scrub meeting (Dipti & Farshid, December 7th)
Comment by Chiyoung Seo [ 15/Feb/13 ]
For the bug distributions in the engine team.
Comment by Mike Wiederhold [ 10/Apr/13 ]
This issue is 5 months old. Please open a new issue against the latest build if you see this issue again.
Generated at Fri Oct 31 14:12:56 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.