Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Cannot Reproduce
- Affects Version/s: 2.0
- Fix Version/s: 2.1
- Component/s: couchbase-bucket, ns_server
- Security Level: Public
- Labels:
- Environment:
  - 5:5 uni & bidirectional XDCR
  - ec2 nodes with 15G RAM
  - 12.04 Ubuntu LTS
  - 400G disk space on each node
  - http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_x86_64_2.0.0-1967-rel.deb.manifest.xml
Description
At the time of the rebalance failure:
+ 5 nodes rebalance in on each cluster
Cluster setup: c1:c2::10:10
biXDCR_bucket: c1 <---> c2
uniXDCR_src: c1 ---> c2 :uniXDCR_dest
Front end loads on c1 and c2 for biXDCR_bucket, and on c1 for uniXDCR_src.
c1: http://ec2-177-71-230-72.sa-east-1.compute.amazonaws.com:8091/
c2: http://ec2-175-41-186-167.ap-southeast-1.compute.amazonaws.com:8091/
On c1, the rebalance operation failed with this reason in the UI logs:
Rebalance exited with reason {{bulk_set_vbucket_state_failed,
[{'ns_1@ec2-177-71-170-44.sa-east-1.compute.amazonaws.com',
{'EXIT',
{{timeout,
{gen_server,call,
['ns_memcached-biXDCR_bucket',
{set_vbucket,544,replica},
180000]}},
{gen_server,call,
[{'janitor_agent-biXDCR_bucket',
'ns_1@ec2-177-71-170-44.sa-east-1.compute.amazonaws.com'},
{if_rebalance,<0.10136.88>,
{update_vbucket_state,544,replica,
undefined,undefined}},
infinity]}}}}]},
[{janitor_agent,bulk_set_vbucket_state,4},
{ns_vbucket_mover,
update_replication_post_move,3},
{ns_vbucket_mover,handle_info,2},
{gen_server,handle_msg,5},
{proc_lib,init_p_do_apply,3}]}
The second time, rebalance failed with the following UI log message:
Rebalance exited with reason {{timeout,
{gen_server,call,
['ns_memcached-biXDCR_bucket',
{set_vbucket,849,active},
180000]}},
{gen_server,call,
[{'janitor_agent-biXDCR_bucket',
'ns_1@ec2-177-71-230-72.sa-east-1.compute.amazonaws.com'},
{if_rebalance,<0.21090.114>,
{update_vbucket_state,849,active,paused,
undefined}},
infinity]}}
After some time had passed, a third rebalance attempt completed successfully.
I will attach the diags grabbed from one of the c1 nodes shortly.
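Both failure messages above share one pattern: a gen_server call to 'ns_memcached-biXDCR_bucket' for set_vbucket that timed out after 180000 ms. As a rough triage aid, a sketch like the following could pull the bucket, vbucket id, target state, and timeout out of a pasted UI log message. The function name `find_set_vbucket_timeouts` and the regular expression are illustrative assumptions, not part of ns_server or any Couchbase tool.

```python
import re

# Matches the inner call that timed out, for example:
#   ['ns_memcached-biXDCR_bucket', {set_vbucket,544,replica}, 180000]
# Whitespace and line breaks between the terms are tolerated.
SET_VBUCKET_TIMEOUT = re.compile(
    r"\['ns_memcached-(?P<bucket>[^']+)',\s*"
    r"\{set_vbucket,\s*(?P<vbucket>\d+),\s*(?P<state>\w+)\},\s*"
    r"(?P<timeout_ms>\d+)\]"
)


def find_set_vbucket_timeouts(log_text):
    """Return one dict per set_vbucket call found in a rebalance failure message."""
    return [
        {
            "bucket": m.group("bucket"),
            "vbucket": int(m.group("vbucket")),
            "state": m.group("state"),
            "timeout_s": int(m.group("timeout_ms")) / 1000,
        }
        for m in SET_VBUCKET_TIMEOUT.finditer(log_text)
    ]


if __name__ == "__main__":
    # Excerpt of the second failure message from this report.
    sample = """Rebalance exited with reason {{timeout,
    {gen_server,call,
    ['ns_memcached-biXDCR_bucket',
    {set_vbucket,849,active},
    180000]}},"""
    for hit in find_set_vbucket_timeouts(sample):
        print(hit)
```

Run over the full log history of a node, the extracted vbucket ids and states may help show whether the timeouts cluster on particular vbuckets or state transitions.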
Activity
Abhinav Dangeti
made changes -
| Field | Original Value | New Value |
|---|---|---|
| Fix Version/s | 2.0.1 [ 10399 ] | |
| Component/s | ns_server [ 10019 ] |
Abhinav Dangeti
made changes -
| Summary | Rebalance-in operation failed twice with heavy front end load on an XDCR set up and with system in DGM | Rebalance-in operation failed twice with "bulk_set_vbucket_state" failing with heavy front end load on an XDCR set up and with system in DGM |
Abhinav Dangeti
made changes -
| Summary | Rebalance-in operation failed twice with "bulk_set_vbucket_state" failing with heavy front end load on an XDCR set up and with system in DGM | Rebalance-in operation failed twice with "bulk_set_vbucket_state" failing with heavy front end load on an XDCR set up and with system in DGM (~65% resident ratio) |
| Description | (description text as given in the Description section above, beginning "+ 5 nodes rebalance in on each cluster") | (same text, with "At the time of the rebalance failure:" prepended) |
Junyi Xie
made changes -
| Assignee | Junyi Xie [ junyi ] | Abhinav Dangeti [ abhinav ] |
Farshid Ghods
made changes -
| Fix Version/s | 2.0.1 [ 10399 ] | |
| Fix Version/s | 2.0 [ 10114 ] |
Farshid Ghods
made changes -
| Assignee | Abhinav Dangeti [ abhinav ] | Aleksey Kondratenko [ alkondratenko ] |
Aleksey Kondratenko
made changes -
| Assignee | Aleksey Kondratenko [ alkondratenko ] | Farshid Ghods [ farshid ] |
Farshid Ghods
made changes -
| Assignee | Farshid Ghods [ farshid ] | Chiyoung Seo [ chiyoung ] |
Karen Zeller
made changes -
| Comment | [ Added to RN: Under a heavy load of write operations on two clusters and both bi-directional and uni-directional replications occurring via XDCR, Couchbase Server 2.0 may fail during rebalance. ] |
Karen Zeller
made changes -
| Comment | [ Added to RN: Under a heavy load of write operations on two clusters and both bi-directional and uni-directional replications occurring via XDCR, Couchbase Server 2.0 may fail during rebalance. ] |
Junyi Xie
made changes -
| Component/s | couchbase-bucket [ 10173 ] | |
| Component/s | cross-datacenter-replication [ 10136 ] |
Farshid Ghods
made changes -
| Fix Version/s | 2.1 [ 10414 ] | |
| Fix Version/s | 2.0.1 [ 10399 ] |
Mike Wiederhold
made changes -
| Sprint Status | Current Sprint |
Chiyoung Seo
made changes -
| Assignee | Chiyoung Seo [ chiyoung ] | Mike Wiederhold [ mikew ] |
Chiyoung Seo
made changes -
| Planned End | (re-schedule end date based on new assignee) |
Mike Wiederhold
made changes -
| Status | Open [ 1 ] | Resolved [ 5 ] |
| Resolution | Cannot Reproduce [ 5 ] |