Environment:
** 4 physical servers. Each server has 32GB RAM, a 4-core CPU and 2 regular spinning disks
10.2.1.61
10.2.1.62
10.2.1.63
10.2.1.64
Cluster setup:
** Create cluster with 2 nodes (10.2.1.63 [master] and 10.2.1.64)
** Data path on c:/data and view path on d:/view or e:/view
** Couchbase RAM quota: 28GB
** Create 2 buckets, a 14GB default bucket and a 10GB sasl bucket (each bucket has 1 replica and index enabled)
** Each bucket has one doc and 2 views
** Load 30 million items into the default bucket and 20 million items into the sasl bucket to bring the active resident ratio down to around 70%
(as specified in http://hub.internal.couchbase.com/confluence/display/QA/views-test, but modified to fit a 2-node cluster)
** Run the access phase for 3 hours
Then run RB-1:
** Add node 10.2.1.61 to the cluster and rebalance. The sasl bucket was rebalanced first.
** Monitor the rebalance process. Rebalance on node 61 reached 50% while the other 2 nodes (63 and 64) were at 45%
** A few minutes later, rebalance failed with the error "Resetting rebalance status since it's not really running"
[user:info,2013-02-02T1:58:40.905,ns_1@10.2.1.61:ns_config<0.3600.6>:ns_janitor:maybe_stop_rebalance_status:147]Resetting rebalance status since it's not really running
** In the diags of node 61, node 61 could not talk to the orchestrator, so it took over as orchestrator
[ns_server:error,2013-02-02T1:58:27.895,ns_1@10.2.1.61:<0.5315.14>:ns_orchestrator:rebalance_progress:176]Couldn't talk to orchestrator: {exit,
{timeout,
{gen_fsm,sync_send_event,
[{global,ns_orchestrator},
rebalance_progress,2000]}}}
[ns_server:error,2013-02-02T1:58:27.895,ns_1@10.2.1.61:<0.499.14>:ns_orchestrator:rebalance_progress:176]Couldn't talk to orchestrator: {exit,
{timeout,
{gen_fsm,sync_send_event,
[{global,ns_orchestrator},
rebalance_progress,2000]}}}
[ns_server:error,2013-02-02T1:58:28.004,ns_1@10.2.1.61:<0.7772.14>:ns_orchestrator:rebalance_progress:176]Couldn't talk to orchestrator: {exit,
{timeout,
{gen_fsm,sync_send_event,
[{global,ns_orchestrator},
rebalance_progress,2000]}}}
[ns_server:error,2013-02-02T1:58:28.066,ns_1@10.2.1.61:<0.6957.14>:ns_orchestrator:rebalance_progress:176]Couldn't talk to orchestrator: {exit,
{timeout,
{gen_fsm,sync_send_event,
[{global,ns_orchestrator},
rebalance_progress,2000]}}}
[ns_server:error,2013-02-02T1:58:32.029,ns_1@10.2.1.61:<0.7772.14>:ns_orchestrator:rebalance_progress:176]Couldn't talk to orchestrator: {exit,
{timeout,
{gen_fsm,sync_send_event,
[{global,ns_orchestrator},
rebalance_progress,2000]}}}
[ns_server:error,2013-02-02T1:58:32.434,ns_1@10.2.1.61:<0.6957.14>:ns_orchestrator:rebalance_progress:176]Couldn't talk to orchestrator: {exit,
{timeout,
{gen_fsm,sync_send_event,
[{global,ns_orchestrator},
rebalance_progress,2000]}}}
[ns_server:error,2013-02-02T1:58:34.494,ns_1@10.2.1.61:<0.7772.14>:ns_orchestrator:rebalance_progress:176]Couldn't talk to orchestrator: {exit,
{timeout,
{gen_fsm,sync_send_event,
[{global,ns_orchestrator},
rebalance_progress,2000]}}}
[ns_server:error,2013-02-02T1:58:34.556,ns_1@10.2.1.61:<0.6957.14>:ns_orchestrator:rebalance_progress:176]Couldn't talk to orchestrator: {exit,
{timeout,
{gen_fsm,sync_send_event,
[{global,ns_orchestrator},
rebalance_progress,2000]}}}
[user:info,2013-02-02T1:58:36.756,ns_1@10.2.1.61:mb_master<0.14711.6>:mb_master:handle_info:219]Haven't heard from a higher priority node or a master, so I'm taking over.
[ns_server:debug,2013-02-02T1:58:36.756,ns_1@10.2.1.61:mb_master_sup<0.9505.14>:misc:start_singleton:854]start_singleton(gen_fsm, ns_orchestrator, [], []): monitoring <20150.3131.0> from 'ns_1@10.2.1.61'
[ns_server:debug,2013-02-02T1:58:36.756,ns_1@10.2.1.61:mb_master_sup<0.9505.14>:misc:start_singleton:854]start_singleton(gen_server, ns_tick, [], []): monitoring <20150.3132.0> from 'ns_1@10.2.1.61'
[ns_server:debug,2013-02-02T1:58:36.756,ns_1@10.2.1.61:mb_master_sup<0.9505.14>:misc:start_singleton:854]start_singleton(gen_server, auto_failover, [], []): monitoring <20150.3134.0> from 'ns_1@10.2.1.61'
[error_logger:info,2013-02-02T1:58:36.756,ns_1@10.2.1.61:error_logger<0.6.0>:ale_error_logger_handler:log_report:72]
** In the diags of node 63 (the old orchestrator), I saw many vbucket mover crashes around the time the rebalance failed
========================CRASH REPORT=========================
crasher:
initial call: ns_single_vbucket_mover:mover/6
pid: <0.15041.759>
registered_name: []
exception exit: {unexpected_exit,{'EXIT',<0.7018.476>,shutdown}}
in function ns_single_vbucket_mover:spawn_and_wait/1
in call from ns_single_vbucket_mover:mover_inner/6
in call from misc:try_with_maybe_ignorant_after/2
in call from ns_single_vbucket_mover:mover/6
ancestors: [<0.7018.476>,<0.5453.476>]
messages: [{'EXIT',<0.7018.476>,shutdown}]
links: [<0.7018.476>]
dictionary: [{cleanup_list,[<0.15227.759>,<0.16464.759>]}]
trap_exit: true
status: running
heap_size: 2584
stack_size: 24
reductions: 5950
neighbours:
** Timestamps of this test:
Bucket "sasl" rebalance does not seem to be swap rebalance ns_vbucket_mover000 ns_1@10.2.1.63 20:17:01 - Fri Feb 1, 2013
Started rebalancing bucket sasl ns_rebalancer000 ns_1@10.2.1.63 20:16:58 - Fri Feb 1, 2013
Starting rebalance, KeepNodes = ['ns_1@10.2.1.64','ns_1@10.2.1.63','ns_1@10.2.1.61'], EjectNodes = [] ns_orchestrator004 ns_1@10.2.1.63 20:16:58 - Fri Feb 1, 2013
Rebalance failed in this phase (RB-1), right before the rebalance of the first bucket completed.
Resetting rebalance status since it's not really running ns_janitor000 ns_1@10.2.1.61 01:58:40 - Sat Feb 2, 2013
Haven't heard from a higher priority node or a master, so I'm taking over. mb_master000 ns_1@10.2.1.61 01:58:36 - Sat Feb 2, 2013
Haven't heard from a higher priority node or a master, so I'm taking over. mb_master000 ns_1@10.2.1.63 02:12:25 - Sat Feb 2, 2013
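To make the gap between the rebalance start and the failure explicit, here is a small Python sketch that computes the elapsed time for the events listed above. The timestamps are copied from the UI log entries; the event labels are my own, and the calculation assumes the clocks on nodes 61 and 63 are in sync and in the same time zone, which the logs do not confirm.

```python
from datetime import datetime

# Rebalance start, as logged by ns_1@10.2.1.63 (ns_orchestrator004).
start = datetime(2013, 2, 1, 20, 16, 58)

# Later events, timestamps copied from the UI log entries above.
# Labels are illustrative; they are not part of the log output.
events = [
    ("node 61: mb_master takes over", datetime(2013, 2, 2, 1, 58, 36)),
    ("node 61: janitor resets rebalance status", datetime(2013, 2, 2, 1, 58, 40)),
    ("node 63: mb_master takes over", datetime(2013, 2, 2, 2, 12, 25)),
]

# Elapsed time since the rebalance started, assuming synchronized clocks.
timeline = [(label, ts - start) for label, ts in events]

for label, delta in timeline:
    print(f"{delta}  {label}")
```

Under that assumption, the rebalance had been running for about 5 hours 41 minutes when node 61 took over and the janitor reset the rebalance status.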
Link to manifest file: http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_x86_64_2.0.1-144-rel.setup.exe.manifest.xml
Link to collect info of all nodes: https://s3.amazonaws.com/packages.couchbase/collect_info/2_0_1/201301/3nodes-phy-servers-201-144-reb-stopped-by-janitor-20130201-183832.tgz
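For anyone going through the collected diags, a rough Python sketch like the one below can count the "Couldn't talk to orchestrator" timeouts per process and measure how long they lasted before the takeover. It assumes each log entry header sits on a single line in the form `[logger:level,timestamp,node:process:module:function:line]message`; the regular expression and the helper name `parse_entries` are my own and are not part of ns_server. The sample text is a subset of the node 61 entries quoted above.

```python
import re
from collections import Counter
from datetime import datetime

# Header pattern inferred from the excerpts in this report (an assumption,
# not an official ns_server log grammar). The hour is not zero-padded.
HEADER = re.compile(
    r"^\[(?P<logger>\w+):(?P<level>\w+),"
    r"(?P<date>\d{4}-\d{2}-\d{2})T(?P<h>\d{1,2}):(?P<m>\d{2}):(?P<s>\d{2})\.(?P<ms>\d{3}),"
    r"\s*(?P<node>ns_1@[\d.]+):(?P<proc>[^:]*):(?P<module>\w+):(?P<func>\w+):(?P<line>\d+)\]"
    r"(?P<msg>.*)$"
)

def parse_entries(text):
    """Return one dict per header line; continuation lines are skipped."""
    entries = []
    for line in text.splitlines():
        m = HEADER.match(line)
        if m is None:
            continue  # part of a multi-line Erlang term
        year, month, day = map(int, m["date"].split("-"))
        ts = datetime(year, month, day, int(m["h"]), int(m["m"]),
                      int(m["s"]), int(m["ms"]) * 1000)
        entries.append({"time": ts, "node": m["node"], "proc": m["proc"],
                        "module": m["module"], "msg": m["msg"]})
    return entries

SAMPLE = """\
[ns_server:error,2013-02-02T1:58:27.895,ns_1@10.2.1.61:<0.5315.14>:ns_orchestrator:rebalance_progress:176]Couldn't talk to orchestrator: {exit,
{timeout,
[ns_server:error,2013-02-02T1:58:28.004,ns_1@10.2.1.61:<0.7772.14>:ns_orchestrator:rebalance_progress:176]Couldn't talk to orchestrator: {exit,
[ns_server:error,2013-02-02T1:58:32.029,ns_1@10.2.1.61:<0.7772.14>:ns_orchestrator:rebalance_progress:176]Couldn't talk to orchestrator: {exit,
[user:info,2013-02-02T1:58:36.756,ns_1@10.2.1.61:mb_master<0.14711.6>:mb_master:handle_info:219]Haven't heard from a higher priority node or a master, so I'm taking over.
"""

entries = parse_entries(SAMPLE)
timeouts = [e for e in entries if e["msg"].startswith("Couldn't talk to orchestrator")]
takeover = next(e for e in entries if e["module"] == "mb_master")

# Number of timeouts per calling process, and time from first timeout to takeover.
per_proc = Counter(e["proc"] for e in timeouts)
gap_seconds = (takeover["time"] - timeouts[0]["time"]).total_seconds()

print(per_proc)
print(f"first timeout to takeover: {gap_seconds:.3f}s")
```

Running the same function over the full ns_server log of node 63 would show whether the old orchestrator logged anything in that window, though the file layout inside the collect_info archive is not described here.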