Details
Description
Add 6 nodes to the cluster and rebalance.
Load 200K items.
Remove 5 nodes and start a rebalance.
The rebalance fails:
Rebalance exited with reason {bulk_set_vbucket_state_failed,
[{'ns_1@10.2.2.60',
{'EXIT',
{timeout,
{gen_server,call,
[{'janitor_agent-default','ns_1@10.2.2.60'},
{update_vbucket_state,<0.26047.9>,65,
replica,passive,undefined},
30000]}}}}]}
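The failure reason above packs the node, the vbucket, the requested state, and the call timeout into one nested Erlang term. As a reading aid, here is a minimal Python sketch that pulls those fields out with regular expressions. The function name `summarize_reason` and the patterns are illustrative assumptions written against this one log excerpt; they are not part of ns_server.

```python
import re

# Failure reason copied from the rebalance log above.
REASON = """{bulk_set_vbucket_state_failed,
[{'ns_1@10.2.2.60',
{'EXIT',
{timeout,
{gen_server,call,
[{'janitor_agent-default','ns_1@10.2.2.60'},
{update_vbucket_state,<0.26047.9>,65,
replica,passive,undefined},
30000]}}}}]}"""


def summarize_reason(reason):
    """Extract bucket, node, vbucket, target state and timeout (hypothetical helper)."""
    # {'janitor_agent-<bucket>','<node>'} names the process that was called.
    target = re.search(r"\{'janitor_agent-([^']+)','([^']+)'\}", reason)
    # {update_vbucket_state,<pid>,<vbucket>,<state>,...} is the request itself.
    request = re.search(r"\{update_vbucket_state,<[\d.]+>,(\d+),\s*(\w+),", reason)
    # The last argument of gen_server:call is the timeout in milliseconds.
    timeout = re.search(r"\},\s*(\d+)\]", reason)
    return {
        "bucket": target.group(1),
        "node": target.group(2),
        "vbucket": int(request.group(1)),
        "state": request.group(2),
        "timeout_ms": int(timeout.group(1)),
    }


summary = summarize_reason(REASON)
print(summary)
```

Read this way, the term says that a call to janitor_agent for bucket default on ns_1@10.2.2.60, asking to make vbucket 65 a replica, did not return within 30000 ms.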
- 10.2.2.108-8091-diag.txt.gz, 30/Jul/12 7:54 PM, 3.25 MB, Iryna Mironava
- 10.2.2.60-8091-diag.txt.gz, 30/Jul/12 7:54 PM, 5.53 MB, Iryna Mironava
- 10.2.2.63-8091-diag.txt.gz, 30/Jul/12 7:54 PM, 1.99 MB, Iryna Mironava
- 10.2.2.64-8091-diag.txt.gz, 30/Jul/12 7:54 PM, 1.96 MB, Iryna Mironava
- 10.2.2.65-8091-diag.txt.gz, 30/Jul/12 7:54 PM, 1.77 MB, Iryna Mironava
- 10.2.2.67-8091-diag.txt.gz, 30/Jul/12 7:54 PM, 1.75 MB, Iryna Mironava
- cbcollect_60.zip, 30/Jul/12 7:54 PM, 10.58 MB, Iryna Mironava
- cbcollect_info_20120730-233116/couchbase.log 693 kB
- cbcollect_info_20120730-233116/ns_server.couchdb.log 28.30 MB
- cbcollect_info_20120730-233116/stats.log 5 kB
- cbcollect_info_20120730-233116/ns_server.error.log 406 kB
- cbcollect_info_20120730-233116/ns_server.info.log 69.39 MB
- cbcollect_info_20120730-233116/ns_server.views.log 24.96 MB
- cbcollect_info_20120730-233116/diag.log 5.00 MB
- cbcollect_info_20120730-233116/ns_server.debug.log 89.84 MB
Activity
Thuan Nguyen
added a comment - edited
I hit this bug again in a longevity test with build 2.0.0-1554 on an 8-node CentOS 6.2 64-bit cluster.
After the rebalance failed as in bug MB-6137, I clicked rebalance again.
The rebalance failed after running for a few minutes.
{proc_lib,init_p_do_apply,3}]}]
2012-08-10 11:33:51.954 mb_master:0:info:message(ns_1@10.3.121.15) - Haven't heard from a higher priority node or a master, so I'm taking over.
2012-08-10 12:14:10.170 ns_orchestrator:4:info:message(ns_1@10.3.121.13) - Starting rebalance, KeepNodes = ['ns_1@10.3.121.13','ns_1@10.3.121.14',
'ns_1@10.3.121.15','ns_1@10.3.121.16',
'ns_1@10.3.121.17','ns_1@10.3.121.20',
'ns_1@10.3.121.22','ns_1@10.3.121.23'], EjectNodes = []
2012-08-10 12:14:14.130 ns_rebalancer:0:info:message(ns_1@10.3.121.13) - Started rebalancing bucket default
2012-08-10 12:15:41.223 ns_orchestrator:2:info:message(ns_1@10.3.121.13) - Rebalance exited with reason {{bulk_set_vbucket_state_failed,
[{'ns_1@10.3.121.14',
{'EXIT',
{{timeout,
{gen_server,call,
[<15325.24474.64>,
{start_vbucket_filter_change,
[516,517,715,716,717,764,765,766,767,
768,769,785,786,787,905,906,907,908]},
30000]}},
{gen_server,call,
[{'janitor_agent-default',
'ns_1@10.3.121.14'},
{update_vbucket_state,<0.20111.35>,908,
replica,undefined,'ns_1@10.3.121.20'},
60000]}}}}]},
[{janitor_agent,bulk_set_vbucket_state,4},
{ns_vbucket_mover,
update_replication_post_move,3},
{ns_vbucket_mover,handle_info,2},
{gen_server,handle_msg,5},
{proc_lib,init_p_do_apply,3}]}
The diags file is at the following link:
https://s3.amazonaws.com/packages.couchbase/diag-logs/large_cluster_2_0/8nodes-1554-rebalance-exited-bulk_set_vbucket_state_failed-20120810.tgz
Farshid Ghods
added a comment -
Is this a timeout issue?
If so, Tony said views are not running here and memcached is working fine, so I'm not sure why this happened.
Aleksey Kondratenko
added a comment -
Most likely the same paging issue. Here are some results. We started changing the vbucket filter, and the vbucket_filter_change request succeeded:
[ns_server:info] [2012-08-10 12:15:08] [ns_1@10.3.121.14:<0.24474.64>:ebucketmigrator_srv:handle_call:292] Successfully changed vbucket filter on tap stream `replication_ns_1@10.3.121.14`.
After that we wait for TAP_OPAQUE_VB_FILTER_CHANGE_COMPLETE on the socket, which signals that all messages from the old set of vbuckets have been consumed.
That only happens 33 seconds later:
[ns_server:info] [2012-08-10 12:15:41] [ns_1@10.3.121.14:<0.24474.64>:ebucketmigrator_srv:handle_info:361] Got vbucket filter change completion message. Completing state transition to a new ebucketmigrator.
Meanwhile, the timeout for the vbucket filter change is 30 seconds, which expires at 12:15:38:
[ns_server:debug] [2012-08-10 12:15:38] [ns_1@10.3.121.14:<0.6425.23>:ns_process_registry:handle_info:98] Got exit msg: {'EXIT',<0.24732.64>,
{#Ref<0.0.415.179297>,exit,
{timeout,
{gen_server,call,
[<0.24474.64>,
{start_vbucket_filter_change,
[516,517,715,716,717,764,765,766,767,768,769,
785,786,787,905,906,907,908]},
30000]}},
[{gen_server,call,3},
{ns_vbm_new_sup,
'-perform_vbucket_filter_change/6-fun-1-',7},
{misc,'-executing_on_new_process/1-fun-0-',3}]}}
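The timing argument above can be checked with a few lines of arithmetic. This is a minimal Python sketch that takes the two log timestamps and the 30000 ms gen_server call timeout from the excerpts; the helper name `check_deadline` is hypothetical, and it assumes the call started at roughly the time of the "Successfully changed vbucket filter" log line.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"


def check_deadline(started, completed, timeout_ms):
    """Return (deadline, elapsed seconds, timed_out) for a call with a fixed timeout."""
    start = datetime.strptime(started, FMT)
    end = datetime.strptime(completed, FMT)
    deadline = start + timedelta(milliseconds=timeout_ms)
    elapsed = (end - start).total_seconds()
    return deadline, elapsed, end > deadline


# Timestamps from the ebucketmigrator_srv log lines quoted in this comment.
deadline, elapsed, timed_out = check_deadline(
    "2012-08-10 12:15:08",  # filter change request performed
    "2012-08-10 12:15:41",  # completion message received
    30000,                  # timeout argument of gen_server:call
)
print(deadline.strftime(FMT), elapsed, timed_out)
```

The completion message arrives 33 seconds after the request, 3 seconds past the 12:15:38 deadline, which matches the timestamp of the exit message in the debug log.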
There's another issue that I'll fix: an ebucketmigrator that is in the vbucket state-transition state needs to die when the 'vbucket filter change transaction' dies. That does not happen here, which causes the next janitor run to fail.
[rebalance:warn] [2012-08-10 12:15:38] [ns_1@10.3.121.14:<0.24474.64>:ebucketmigrator_srv:handle_info:396] Unexpected handle_info({'EXIT',<0.24732.64>,
{#Ref<0.0.415.179297>,exit,
{timeout,
{gen_server,call,
[<0.24474.64>,
{start_vbucket_filter_change,
[516,517,715,716,717,764,765,766,767,768,769,785,
786,787,905,906,907,908]},
30000]}},
[{gen_server,call,3},
{ns_vbm_new_sup,
'-perform_vbucket_filter_change/6-fun-1-',7},
{misc,'-executing_on_new_process/1-fun-0-',3}]}}, {state,
#Port<0.2227749>,
#Port<0.2227745>,
#Port<0.2227750>,
#Port<0.2227747>,
<0.24476.64>,
<<>>,
<<>>,
{set,
17,
16,
16,
8,
80,
48,
{[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[]},
{{[766],
[],
[907,
516],
[],
[785,
769],
[765,
717],
[],
[906],
[],
[768],
[787,
764,
716],
[767],
[905],
[517],
[],
[786,
715]}}},
659492,
false,
false,
0,
{1344,
626121,
496177},
started,
{<0.24732.64>,
#Ref<0.0.415.179308>},
<<"replication_ns_1@10.3.121.14">>,
<0.24474.64>,
{had_backfill,
false,
undefined,
[]}})
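The second issue in this comment is about lifetime coupling: a migrator that is waiting for the filter change to complete should stop as soon as the process driving that change exits, rather than logging an unexpected message and carrying on. The following is only a toy Python illustration of that idea using threads and events; the `Migrator` class and `run_scenario` helper are hypothetical, and the real code is an Erlang gen_server that would react to an 'EXIT' message instead.

```python
import threading


class Migrator:
    """Toy stand-in for a process that is in the middle of a filter change."""

    def __init__(self):
        self.completed = threading.Event()   # completion message arrived
        self.owner_died = threading.Event()  # controlling transaction exited
        self.outcome = None

    def run(self, poll_interval=0.01):
        while True:
            # Give up as soon as the owner is gone instead of ignoring it.
            if self.owner_died.is_set():
                self.outcome = "aborted"
                return
            # Otherwise keep waiting for the completion message.
            if self.completed.wait(poll_interval):
                self.outcome = "completed"
                return


def run_scenario(signal):
    """Start a migrator, fire one signal ('completed' or 'owner_died'), return the outcome."""
    migrator = Migrator()
    worker = threading.Thread(target=migrator.run)
    worker.start()
    getattr(migrator, signal).set()
    worker.join(timeout=5)
    return migrator.outcome


print(run_scenario("owner_died"))
print(run_scenario("completed"))
```

With this coupling, a timed-out transaction leaves no half-transitioned migrator behind for the next janitor run to trip over.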
Aleksey Kondratenko
added a comment -
How about a simple mlockall instead? That, however, assumes we run as root.
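For readers unfamiliar with it, mlockall(2) asks the kernel to keep all of a process's pages resident so they cannot be swapped out. Below is a minimal Python sketch of the call through ctypes, assuming a Linux host (the MCL_* values shown are the common Linux ones and differ on some other platforms). The wrapper name `lock_all_memory` is hypothetical, and the thread does not say where in the server such a call would go.

```python
import ctypes
import os

# Flag values as defined by Linux on common architectures (assumption: Linux host).
MCL_CURRENT = 1  # lock pages that are currently mapped
MCL_FUTURE = 2   # also lock pages mapped later


def lock_all_memory():
    """Try to pin the whole address space in RAM; return (ok, error message)."""
    libc = ctypes.CDLL(None, use_errno=True)  # symbols of the running process, incl. libc
    if libc.mlockall(MCL_CURRENT | MCL_FUTURE) != 0:
        # Typically EPERM or ENOMEM when the caller is unprivileged
        # and RLIMIT_MEMLOCK is too small.
        return False, os.strerror(ctypes.get_errno())
    return True, ""
```

Calling `lock_all_memory()` as an unprivileged user usually fails because of the locked-memory limit, which is the root requirement mentioned in the comment.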
Aleksey Kondratenko
added a comment -
I proposed lowering the bucket's quota as, IMHO, a reasonable treatment. We can increase timeouts, but that's not going to fix anything. The real fix is to avoid swapping.
Thuan Nguyen
added a comment -
I hit this timeout bug again with swap disabled and a bucket size smaller than system RAM (node RAM is 9 GB, bucket size is 6 GB).
I set up a 10-node cluster with Couchbase Server 2.0.0-1573 installed:
10.3.121.13
10.3.121.14
10.3.121.15
10.3.121.16
10.3.121.17
10.3.121.26
10.3.121.22
10.3.121.26
10.3.121.28
10.3.121.24
10.3.121.25
10.3.121.23
Load 40+ million items into a bucket with 3 views.
Do some rebalance in, rebalance out, and failover operations under a load of about 10K ops (set, get, delete, expire, query view). Total resident ratio is 55%.
Then I did a swap rebalance with a server reboot: add node 25 to the cluster and remove node 22 from the cluster.
During the rebalance, reboot node 25. The rebalance failed as expected.
After warmup completed on node 25, I clicked remove on node 22 and rebalanced. The rebalance failed within seconds with a timeout error:
2012-08-13 11:46:35.173 ns_orchestrator:2:info:message(ns_1@10.3.121.13) - Rebalance exited with reason {{bulk_set_vbucket_state_failed,
[{'ns_1@10.3.121.26',
{'EXIT',
{{{unexpected_reason,
{{badmatch,{error,closed}},
[{mc_binary,quick_stats_recv,3},
{mc_binary,quick_stats_loop,5},
{mc_binary,quick_stats,5},
{ebucketmigrator_srv,handle_call,3},
{gen_server,handle_msg,5},
{proc_lib,init_p_do_apply,3}]}},
[{misc,executing_on_new_process,1},
{ns_vbm_new_sup,
local_change_vbucket_filter,4},
{replication_changes,
change_vbucket_filter,4},
{replication_changes,
'-set_incoming_replication_map/3-lc$^5/1-5-',
2},
{replication_changes,
set_incoming_replication_map,3},
{janitor_agent,handle_call,3},
{gen_server,handle_msg,5},
{proc_lib,init_p_do_apply,3}]},
{gen_server,call,
[{'janitor_agent-default',
'ns_1@10.3.121.26'},
{if_rebalance,<0.4981.53>,
{update_vbucket_state,366,replica,
undefined,'ns_1@10.3.121.25'}},
60000]}}}}]},
[{janitor_agent,bulk_set_vbucket_state,4},
{ns_vbucket_mover,
update_replication_post_move,3},
{ns_vbucket_mover,handle_info,2},
{gen_server,handle_msg,5},
{proc_lib,init_p_do_apply,3}]}
Here is the link to the diags:
https://s3.amazonaws.com/packages.couchbase/diag-logs/large_cluster_2_0/10nodes-1573-swap-reb-failed-after-reboot-add-node-20120813.tgz
After collecting the diags, I tried the swap rebalance again. Every time I click rebalance, it fails within a few seconds with this error:
Rebalance exited with reason {bulk_set_vbucket_state_failed,
[{'ns_1@10.3.121.25',
{'EXIT',
{{{{{badmatch,{not_found,no_db_file}},
[{couch_set_view_group,
monitor_partitions,3},
{couch_set_view_group,
monitor_partitions,2},
{couch_set_view_group,handle_call,3},
{gen_server,handle_msg,5},
{proc_lib,init_p_do_apply,3}]},
{gen_server,call,
[<18552.29669.0>,
{add_replicas,
6641967501501417383806791632523375489625984170218009161305639732734490564036031833581652368858531824236549515887750274907866037489577590312730316390889342644699189815496146066239026917192134394144920743793928168199194568457626425643944116224},
infinity]}},
{gen_server,call,
['capi_set_view_manager-default',
{set_vbucket_states,
[missing,missing,missing,missing,
missing,missing,missing,missing,
missing,missing,missing,missing,
missing,missing,missing,missing,
missing,missing,missing,missing,
missing,missing,missing,missing,
missing,missing,missing,missing,
missing,missing,missing,missing,
missing,missing,missing,missing,
missing,missing,missing,missing,
missing,missing,missing,missing,
missing,missing,missing,replica,
replica,replica,replica,replica,
replica,missing,missing,missing,
missing,missing,missing,missing,
missing,missing,missing,missing,
missing,missing,missing,missing,
The cluster is in a failed state now.
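The very large integer passed to add_replicas in the trace above is hard to read as a decimal number. Assuming it is a bitmask in which bit N stands for partition (vbucket) N, which is an assumption not confirmed anywhere in this thread, it can be decoded into a list of partition ids with a short Python sketch. Both helper names are hypothetical.

```python
# Argument of add_replicas, copied from the trace above.
ADD_REPLICAS_ARG = 6641967501501417383806791632523375489625984170218009161305639732734490564036031833581652368858531824236549515887750274907866037489577590312730316390889342644699189815496146066239026917192134394144920743793928168199194568457626425643944116224


def partitions_from_bitmask(mask):
    """List the positions of the set bits, lowest first (assumes bit N = partition N)."""
    ids = []
    bit = 0
    while mask:
        if mask & 1:
            ids.append(bit)
        mask >>= 1
        bit += 1
    return ids


def bitmask_from_partitions(ids):
    """Inverse of partitions_from_bitmask."""
    mask = 0
    for partition in ids:
        mask |= 1 << partition
    return mask


partitions = partitions_from_bitmask(ADD_REPLICAS_ARG)
print(len(partitions), "partitions requested as replicas")
```

If the bitmask assumption holds, the decoded ids are the partitions whose database files couch_set_view_group could not find ({not_found,no_db_file}).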
Aleksey Kondratenko
added a comment - I need atop recordings too
Aleksey Kondratenko
added a comment -
Actually, it's an entirely different issue. Or possibly two different issues.
Thuan Nguyen
added a comment -
The atop files are at the following link:
https://s3.amazonaws.com/packages.couchbase/atop-files/2.0.0/atop-10nodes-1573-swap-reb-reboot-failed-20120813.tgz
Aleksey Kondratenko
added a comment - The first issue is new. Filed bug MB-6216.
Going to look at the second problem, which is perhaps different.
Farshid Ghods
added a comment -
Moving this out of the cblock list for now.
Let's look at MB-6284, which is about rebalance failures at a much smaller scale.
Andrei Baranouski
added a comment - Iryna, could you verify whether it still reproduces?
Iryna Mironava
added a comment - not reproduced
MB-6058: increased janitor_agent set_vbucket_state timeout (Revision e35e15c611adeb83cec127f42cba5ed2f07502bb)
Result = SUCCESS
Aliaksey Artamonau:
Files:
* src/janitor_agent.erl