[MB-7290] Rebalance-in operation failed twice with "bulk_set_vbucket_state" failing with heavy front end load on an XDCR set up and with system in DGM (~65% resident ratio)
Created: 29/Nov/12 Updated: 10/Apr/13 Resolved: 10/Apr/13 |
|
| Status: | Resolved |
| Project: | Couchbase Server |
| Component/s: | couchbase-bucket, ns_server |
| Affects Version/s: | 2.0 |
| Fix Version/s: | 2.1 |
| Security Level: | Public |
| Type: | Bug | Priority: | Major |
| Reporter: | Abhinav Dangeti | Assignee: | Mike Wiederhold |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | 2.0-release-notes | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
- 5:5 uni & bidirectional XDCR
- ec2 nodes with 15G RAM
- 12.04 Ubuntu LTS
- 400G disk space on each node
- http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_x86_64_2.0.0-1967-rel.deb.manifest.xml |
||
| Description |
|
At the time of the rebalance failure:
+ 5 nodes rebalance in on each cluster

Cluster setup: c1:c2::10:10
biXDCR_bucket: c1 <---> c2
uniXDCR_src: c1 ---> c2 :uniXDCR_dest

Front end loads on c1 and c2 for biXDCR_bucket, and on c1 for uniXDCR_src.

c1: http://ec2-177-71-230-72.sa-east-1.compute.amazonaws.com:8091/
c2: http://ec2-175-41-186-167.ap-southeast-1.compute.amazonaws.com:8091/

On C1, Rebalance operation failed with this reason on the UI logs:

Rebalance exited with reason {{bulk_set_vbucket_state_failed,
    [{'ns_1@ec2-177-71-170-44.sa-east-1.compute.amazonaws.com',
      {'EXIT',
       {{timeout,
         {gen_server,call,
          ['ns_memcached-biXDCR_bucket',
           {set_vbucket,544,replica},
           180000]}},
        {gen_server,call,
         [{'janitor_agent-biXDCR_bucket',
           'ns_1@ec2-177-71-170-44.sa-east-1.compute.amazonaws.com'},
          {if_rebalance,<0.10136.88>,
           {update_vbucket_state,544,replica,
            undefined,undefined}},
          infinity]}}}}]},
    [{janitor_agent,bulk_set_vbucket_state,4},
     {ns_vbucket_mover,
      update_replication_post_move,3},
     {ns_vbucket_mover,handle_info,2},
     {gen_server,handle_msg,5},
     {proc_lib,init_p_do_apply,3}]}

The second time, rebalance failed with the following UI log message:

Rebalance exited with reason {{timeout,
    {gen_server,call,
     ['ns_memcached-biXDCR_bucket',
      {set_vbucket,849,active},
      180000]}},
    {gen_server,call,
     [{'janitor_agent-biXDCR_bucket',
       'ns_1@ec2-177-71-230-72.sa-east-1.compute.amazonaws.com'},
      {if_rebalance,<0.21090.114>,
       {update_vbucket_state,849,active,paused,
        undefined}},
      infinity]}}

After giving it some time, the third rebalance did complete successfully.
Will attach the grabbed diags from one of the nodes at C1 in a bit. |
| Comments |
| Comment by Abhinav Dangeti [ 29/Nov/12 ] |
|
Grabbed diags from C1's ec2-177-71-230-72.sa-east-1.compute.amazonaws.com :-
https://s3.amazonaws.com/bugdb/MB-7290/ec2-177-71-230-72.sa-east-1.compute.amazonaws.com-8091-diag.txt.gz |
| Comment by Junyi Xie [ 29/Nov/12 ] |
| Abhinav, the error was raised when ns_server was trying to set vbucket state during rebalance under heavy workload. Please talk to the ns_server team. Thanks. |
| Comment by Junyi Xie [ 29/Nov/12 ] |
| Please assign to ns_server team. |
| Comment by Aleksey Kondratenko [ 03/Dec/12 ] |
| Please explain what exactly is needed here from me. Looks like ordinary timeout. Memcached timeout in fact (3 minutes is no joke). |
| Comment by Farshid Ghods [ 03/Dec/12 ] |
| filing this under memcached timeouts then |
| Comment by Karen Zeller [ 05/Dec/12 ] |
|
Added to RN:
With a heavy load of write operations on two clusters and both bi-directional and uni-directional replication occurring via XDCR, rebalance may fail in Couchbase Server 2.0. |
| Comment by Junyi Xie [ 06/Dec/12 ] |
| it has nothing to do with XDCR core code, remove xdcr from the component. |
| Comment by Farshid Ghods [ 10/Dec/12 ] |
| deferring to 2.1 per bug scrub meeting (Dipti & Farshid, December 7th) |
| Comment by Chiyoung Seo [ 15/Feb/13 ] |
| For the bug distributions in the engine team. |
| Comment by Mike Wiederhold [ 10/Apr/13 ] |
| This issue is 5 months old. Please open a new issue against the latest build if you see this issue again. |