Details
Description
- Create a 2:2 cluster.
- Setup unidirectional replication of 2M items
- Rebalance out one of the destination nodes
Example scenario:
Source: 10.1.3.236, 10.3.2.54
Destination: 10.1.3.238, 10.3.2.55
Rebalance out 10.3.2.55
Towards the end of rebalancing on the destination, rebalance fails with the following reason:
Rebalance exited with reason {timeout,
{gen_server,call,
[{'ns_memcached-bucket','ns_1@10.1.3.238'},
{get_vbucket,749},
60000]}}
- Setup unidirectional replication of 2M items
- Rebalance out one of the destination nodes
Example scenario:
Source: 10.1.3.236, 10.3.2.54
Destination: 10.1.3.238, 10.3.2.55
Rebalance out 10.3.2.55
Towards the end of rebalancing on the destination, rebalance fails with the following reason:
Rebalance exited with reason {timeout,
{gen_server,call,
[{'ns_memcached-bucket','ns_1@10.1.3.238'},
{get_vbucket,749},
60000]}}
-
- 10.1.3.236-8091-diag.txt.gz
- 30/Aug/12 4:19 PM
- 1.62 MB
- Abhinav Dangeti
-
- 10.1.3.238-8091-diag.txt.gz
- 30/Aug/12 4:19 PM
- 5.45 MB
- Abhinav Dangeti
-
- 10.3.2.54-8091-diag.txt.gz
- 30/Aug/12 4:19 PM
- 10.30 MB
- Abhinav Dangeti
-
- 10.3.2.55-8091-diag.txt.gz
- 30/Aug/12 4:19 PM
- 5.45 MB
- Abhinav Dangeti
Activity
- All
- Comments
- Work Log
- History
- Activity
- Gerrit Reviews
Hide
Junyi Xie
added a comment -
Looks like a bunch of ns_server crash in 10.1.3.238-8091-diag.txt.
ns_server:debug,2012-08-30T14:17:41.431,ns_1@10.1.3.238:<0.7445.3>:supervisor_cushion:init:43]starting ns_port_server with delay of 5000
[error_logger:error,2012-08-30T14:17:41.431,ns_1@10.1.3.238:error_logger:ale_error_logger_handler:log_report:72]
=========================SUPERVISOR REPORT=========================
Supervisor: {local,ns_port_sup}
Context: child_terminated
Reason: {abnormal,71}
Offender: [{pid,<0.7380.3>},
{name,
{memcached,"/opt/couchbase/bin/memcached",
["-X",
"/opt/couchbase/lib/memcached/stdin_term_handler.so",
"-X",
"/opt/couchbase/lib/memcached/file_logger.so,cyclesize=104857600;sleeptime=19;filename=/opt/couchbase/var/lib/couchbase/logs/memcached.log",
"-l","0.0.0.0:11210,0.0.0.0:11209:1000","-p",
"11210","-E",
"/opt/couchbase/lib/memcached/bucket_engine.so",
"-B","binary","-r","-c","10000","-e",
"admin=_admin;default_bucket_name=default;auto_create=false",
[]],
[{env,
[{"EVENT_NOSELECT","1"},
{"MEMCACHED_TOP_KEYS","100"},
{"ISASL_PWFILE",
"/opt/couchbase/var/lib/couchbase/data/isasl.pw"},
{"ISASL_DB_CHECK_TIME","1"}]},
use_stdio,stderr_to_stdout,exit_status,
port_server_send_eol,stream]}},
{mfargs,
{erlang,apply,
[#Fun<ns_port_sup.3.119727222>,
[memcached,"/opt/couchbase/bin/memcached",
["-X",
"/opt/couchbase/lib/memcached/stdin_term_handler.so",
"-X",
"/opt/couchbase/lib/memcached/file_logger.so,cyclesize=104857600;sleeptime=19;filename=/opt/couchbase/var/lib/couchbase/logs/memcached.log",
"-l","0.0.0.0:11210,0.0.0.0:11209:1000","-p",
"11210","-E",
"/opt/couchbase/lib/memcached/bucket_engine.so",
"-B","binary","-r","-c","10000","-e",
"admin=_admin;default_bucket_name=default;auto_create=false",
[]],
[{env,
[{"EVENT_NOSELECT","1"},
{"MEMCACHED_TOP_KEYS","100"},
{"ISASL_PWFILE",
"/opt/couchbase/var/lib/couchbase/data/isasl.pw"},
{"ISASL_DB_CHECK_TIME","1"}]},
use_stdio,stderr_to_stdout,exit_status,
port_server_send_eol,stream]]]}},
{restart_type,permanent},
{shutdown,86400000},
{child_type,worker}]
[error_logger:info,2012-08-30T14:17:41.433,ns_1@10.1.3.238:error_logger:ale_error_logger_handler:log_report:72]
=========================PROGRESS REPORT=========================
supervisor: {local,ns_port_sup}
started: [{pid,<0.7445.3>},
{name,
{memcached,"/opt/couchbase/bin/memcached",
["-X",
"/opt/couchbase/lib/memcached/stdin_term_handler.so",
"-X",
"/opt/couchbase/lib/memcached/file_logger.so,cyclesize=104857600;sleeptime=19;filename=/opt/couchbase/var/lib/couchbase/logs/memcached.log",
"-l","0.0.0.0:11210,0.0.0.0:11209:1000","-p",
"11210","-E",
"/opt/couchbase/lib/memcached/bucket_engine.so",
"-B","binary","-r","-c","10000","-e",
"admin=_admin;default_bucket_name=default;auto_create=false",
[]],
[{env,
[{"EVENT_NOSELECT","1"},
{"MEMCACHED_TOP_KEYS","100"},
{"ISASL_PWFILE",
"/opt/couchbase/var/lib/couchbase/data/isasl.pw"},
{"ISASL_DB_CHECK_TIME","1"}]},
use_stdio,stderr_to_stdout,exit_status,
port_server_send_eol,stream]}},
{mfargs,
{erlang,apply,
[#Fun<ns_port_sup.3.119727222>,
[memcached,"/opt/couchbase/bin/memcached",
["-X",
"/opt/couchbase/lib/memcached/stdin_term_handler.so",
"-X",
"/opt/couchbase/lib/memcached/file_logger.so,cyclesize=104857600;sleeptime=19;filename=/opt/couchbase/var/lib/couchbase/logs/memcached.log",
"-l","0.0.0.0:11210,0.0.0.0:11209:1000",
"-p","11210","-E",
"/opt/couchbase/lib/memcached/bucket_engine.so",
"-B","binary","-r","-c","10000","-e",
"admin=_admin;default_bucket_name=default;auto_create=false",
[]],
[{env,
[{"EVENT_NOSELECT","1"},
{"MEMCACHED_TOP_KEYS","100"},
{"ISASL_PWFILE",
"/opt/couchbase/var/lib/couchbase/data/isasl.pw"},
{"ISASL_DB_CHECK_TIME","1"}]},
use_stdio,stderr_to_stdout,exit_status,
port_server_send_eol,stream]]]}},
{restart_type,permanent},
{shutdown,86400000},
{child_type,worker}]
[user:info,2012-08-30T14:17:41.518,ns_1@10.1.3.238:ns_port_memcached:ns_port_server:handle_info:107]Port server memcached on node 'ns_1@10.1.3.238' exited with status 71. Restarting. Messages: Thu Aug 30 21:17:41.512134 3: failed to listen on TCP port 11210: Address already in use
[ns_server:info,2012-08-30T14:17:41.518,ns_1@10.1.3.238:ns_port_memcached:ns_port_server:log:169]memcached<0.7446.3>: Thu Aug 30 21:17:41.512134 3: failed to listen on TCP port 11210: Address already in use
[ns_server:info,2012-08-30T14:17:41.519,ns_1@10.1.3.238:<0.7445.3>:supervisor_cushion:handle_info:62]Cushion managed supervisor for memcached failed: {abnormal,71}
[user:warn,2012-08-30T14:17:41.519,ns_1@10.1.3.238:<0.7445.3>:supervisor_cushion:die_slowly:81]Service memcached exited on node 'ns_1@10.1.3.238' in 0.09s
[error_logger:error,2012-08-30T14:17:41.519,ns_1@10.1.3.238:error_logger:ale_error_logger_handler:log_msg:76]** Generic server <0.7446.3> terminating
** Last message in was {#Port<0.127099>,{exit_status,71}}
** When Server state == {state,#Port<0.127099>,memcached,
{["Thu Aug 30 21:17:41.512134 3: failed to listen on TCP port 11210: Address already in use",
empty,empty,empty,empty,empty,empty,empty,
empty,empty,empty,empty,empty,empty,empty,
empty,empty,empty,empty,empty,empty,empty,
empty,empty,empty,empty,empty,empty,empty,
empty,empty,empty,empty,empty,empty,empty,
empty,empty,empty,empty,empty,empty,empty,
empty,empty,empty,empty,empty,empty,empty,
empty,empty,empty,empty,empty,empty,empty,
empty,empty,empty,empty,empty,empty,empty,
empty,empty,empty,empty,empty,empty,empty,
empty,empty,empty,empty,empty,empty,empty,
empty,empty,empty,empty,empty,empty,empty,
empty,empty,empty,empty,empty,empty,empty,
empty,empty,empty,empty,empty,empty,empty],
[empty]},
{ok,{1346361461712443,#Ref<0.0.35.116743>}},
["Thu Aug 30 21:17:41.512134 3: failed to listen on TCP port 11210: Address already in use"],
0,true}
** Reason for termination ==
** {abnormal,71}
[ns_server:info,2012-08-30T14:17:41.519,ns_1@10.1.3.238:ns_log:ns_log:handle_cast:125]suppressing duplicate log supervisor_cushion:1("Service memcached exited on node 'ns_1@10.1.3.238' in 0.09s\n") because it's been seen 1 times in the past 36.083474 secs (last seen 36.083474 secs ago
[error_logger:error,2012-08-30T14:17:41.521,ns_1@10.1.3.238:error_logger:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
crasher:
initial call: ns_port_server:init/1
pid: <0.7446.3>
registered_name: ns_port_memcached
exception exit: {abnormal,71}
in function gen_server:terminate/6
ancestors: [<0.7445.3>,ns_port_sup,ns_server_sup,ns_server_cluster_sup,
<0.60.0>]
messages: [{'EXIT',#Port<0.127099>,normal}]
links: [<0.7445.3>]
dictionary: []
trap_exit: true
status: running
heap_size: 1597
stack_size: 24
reductions: 2399
neighbours:
[stats:error,2012-08-30T14:17:41.626,ns_1@10.1.3.238:<0.5564.3>:stats_reader:log_bad_responses:191]Some nodes didn't respond: ['ns_1@10.1.3.238']
[stats:error,2012-08-30T14:17:41.628,ns_1@10.1.3.238:<0.5533.3>:stats_reader:log_bad_responses:191]Some nodes didn't respond: ['ns_1@10.1.3.238']
[stats:error,2012-08-30T14:17:41.630,ns_1@10.1.3.238:<0.5557.3>:stats_reader:log_bad_responses:191]Some nodes didn't respond: ['ns_1@10.1.3.238']
[error_logger:error,2012-08-30T14:17:42.564,ns_1@10.1.3.238:error_logger:ale_error_logger_handler:log_report:72]
=========================SUPERVISOR REPORT=========================
Supervisor: {local,menelaus_sup}
Context: child_terminated
Reason: {noproc,
{gen_server,call,
[{'stats_reader-bucket','ns_1@10.1.3.238'},
{latest,minute,1}]}}
Offender: [{pid,<0.7432.3>},
{name,menelaus_web_alerts_srv},
{mfargs,{menelaus_web_alerts_srv,start_link,[]}},
{restart_type,permanent},
{shutdown,5000},
{child_type,worker}]
[error_logger:info,2012-08-30T14:17:42.564,ns_1@10.1.3.238:error_logger:ale_error_logger_handler:log_report:72]
=========================PROGRESS REPORT=========================
supervisor: {local,menelaus_sup}
started: [{pid,<0.7458.3>},
{name,menelaus_web_alerts_srv},
{mfargs,{menelaus_web_alerts_srv,start_link,[]}},
{restart_type,permanent},
{shutdown,5000},
{child_type,worker}]
[ns_server:debug,2012-08-30T14:17:43.182,ns_1@10.1.3.238:ns_heart:ns_heart:current_status:155]Ignoring failure to get stats for bucket: "bucket":
{'EXIT',{noproc,{gen_server,call,
[{'stats_reader-bucket','ns_1@10.1.3.238'},
{latest,minute}]}}}
ns_server:debug,2012-08-30T14:17:41.431,ns_1@10.1.3.238:<0.7445.3>:supervisor_cushion:init:43]starting ns_port_server with delay of 5000
[error_logger:error,2012-08-30T14:17:41.431,ns_1@10.1.3.238:error_logger:ale_error_logger_handler:log_report:72]
=========================SUPERVISOR REPORT=========================
Supervisor: {local,ns_port_sup}
Context: child_terminated
Reason: {abnormal,71}
Offender: [{pid,<0.7380.3>},
{name,
{memcached,"/opt/couchbase/bin/memcached",
["-X",
"/opt/couchbase/lib/memcached/stdin_term_handler.so",
"-X",
"/opt/couchbase/lib/memcached/file_logger.so,cyclesize=104857600;sleeptime=19;filename=/opt/couchbase/var/lib/couchbase/logs/memcached.log",
"-l","0.0.0.0:11210,0.0.0.0:11209:1000","-p",
"11210","-E",
"/opt/couchbase/lib/memcached/bucket_engine.so",
"-B","binary","-r","-c","10000","-e",
"admin=_admin;default_bucket_name=default;auto_create=false",
[]],
[{env,
[{"EVENT_NOSELECT","1"},
{"MEMCACHED_TOP_KEYS","100"},
{"ISASL_PWFILE",
"/opt/couchbase/var/lib/couchbase/data/isasl.pw"},
{"ISASL_DB_CHECK_TIME","1"}]},
use_stdio,stderr_to_stdout,exit_status,
port_server_send_eol,stream]}},
{mfargs,
{erlang,apply,
[#Fun<ns_port_sup.3.119727222>,
[memcached,"/opt/couchbase/bin/memcached",
["-X",
"/opt/couchbase/lib/memcached/stdin_term_handler.so",
"-X",
"/opt/couchbase/lib/memcached/file_logger.so,cyclesize=104857600;sleeptime=19;filename=/opt/couchbase/var/lib/couchbase/logs/memcached.log",
"-l","0.0.0.0:11210,0.0.0.0:11209:1000","-p",
"11210","-E",
"/opt/couchbase/lib/memcached/bucket_engine.so",
"-B","binary","-r","-c","10000","-e",
"admin=_admin;default_bucket_name=default;auto_create=false",
[]],
[{env,
[{"EVENT_NOSELECT","1"},
{"MEMCACHED_TOP_KEYS","100"},
{"ISASL_PWFILE",
"/opt/couchbase/var/lib/couchbase/data/isasl.pw"},
{"ISASL_DB_CHECK_TIME","1"}]},
use_stdio,stderr_to_stdout,exit_status,
port_server_send_eol,stream]]]}},
{restart_type,permanent},
{shutdown,86400000},
{child_type,worker}]
[error_logger:info,2012-08-30T14:17:41.433,ns_1@10.1.3.238:error_logger:ale_error_logger_handler:log_report:72]
=========================PROGRESS REPORT=========================
supervisor: {local,ns_port_sup}
started: [{pid,<0.7445.3>},
{name,
{memcached,"/opt/couchbase/bin/memcached",
["-X",
"/opt/couchbase/lib/memcached/stdin_term_handler.so",
"-X",
"/opt/couchbase/lib/memcached/file_logger.so,cyclesize=104857600;sleeptime=19;filename=/opt/couchbase/var/lib/couchbase/logs/memcached.log",
"-l","0.0.0.0:11210,0.0.0.0:11209:1000","-p",
"11210","-E",
"/opt/couchbase/lib/memcached/bucket_engine.so",
"-B","binary","-r","-c","10000","-e",
"admin=_admin;default_bucket_name=default;auto_create=false",
[]],
[{env,
[{"EVENT_NOSELECT","1"},
{"MEMCACHED_TOP_KEYS","100"},
{"ISASL_PWFILE",
"/opt/couchbase/var/lib/couchbase/data/isasl.pw"},
{"ISASL_DB_CHECK_TIME","1"}]},
use_stdio,stderr_to_stdout,exit_status,
port_server_send_eol,stream]}},
{mfargs,
{erlang,apply,
[#Fun<ns_port_sup.3.119727222>,
[memcached,"/opt/couchbase/bin/memcached",
["-X",
"/opt/couchbase/lib/memcached/stdin_term_handler.so",
"-X",
"/opt/couchbase/lib/memcached/file_logger.so,cyclesize=104857600;sleeptime=19;filename=/opt/couchbase/var/lib/couchbase/logs/memcached.log",
"-l","0.0.0.0:11210,0.0.0.0:11209:1000",
"-p","11210","-E",
"/opt/couchbase/lib/memcached/bucket_engine.so",
"-B","binary","-r","-c","10000","-e",
"admin=_admin;default_bucket_name=default;auto_create=false",
[]],
[{env,
[{"EVENT_NOSELECT","1"},
{"MEMCACHED_TOP_KEYS","100"},
{"ISASL_PWFILE",
"/opt/couchbase/var/lib/couchbase/data/isasl.pw"},
{"ISASL_DB_CHECK_TIME","1"}]},
use_stdio,stderr_to_stdout,exit_status,
port_server_send_eol,stream]]]}},
{restart_type,permanent},
{shutdown,86400000},
{child_type,worker}]
[user:info,2012-08-30T14:17:41.518,ns_1@10.1.3.238:ns_port_memcached:ns_port_server:handle_info:107]Port server memcached on node 'ns_1@10.1.3.238' exited with status 71. Restarting. Messages: Thu Aug 30 21:17:41.512134 3: failed to listen on TCP port 11210: Address already in use
[ns_server:info,2012-08-30T14:17:41.518,ns_1@10.1.3.238:ns_port_memcached:ns_port_server:log:169]memcached<0.7446.3>: Thu Aug 30 21:17:41.512134 3: failed to listen on TCP port 11210: Address already in use
[ns_server:info,2012-08-30T14:17:41.519,ns_1@10.1.3.238:<0.7445.3>:supervisor_cushion:handle_info:62]Cushion managed supervisor for memcached failed: {abnormal,71}
[user:warn,2012-08-30T14:17:41.519,ns_1@10.1.3.238:<0.7445.3>:supervisor_cushion:die_slowly:81]Service memcached exited on node 'ns_1@10.1.3.238' in 0.09s
[error_logger:error,2012-08-30T14:17:41.519,ns_1@10.1.3.238:error_logger:ale_error_logger_handler:log_msg:76]** Generic server <0.7446.3> terminating
** Last message in was {#Port<0.127099>,{exit_status,71}}
** When Server state == {state,#Port<0.127099>,memcached,
{["Thu Aug 30 21:17:41.512134 3: failed to listen on TCP port 11210: Address already in use",
empty,empty,empty,empty,empty,empty,empty,
empty,empty,empty,empty,empty,empty,empty,
empty,empty,empty,empty,empty,empty,empty,
empty,empty,empty,empty,empty,empty,empty,
empty,empty,empty,empty,empty,empty,empty,
empty,empty,empty,empty,empty,empty,empty,
empty,empty,empty,empty,empty,empty,empty,
empty,empty,empty,empty,empty,empty,empty,
empty,empty,empty,empty,empty,empty,empty,
empty,empty,empty,empty,empty,empty,empty,
empty,empty,empty,empty,empty,empty,empty,
empty,empty,empty,empty,empty,empty,empty,
empty,empty,empty,empty,empty,empty,empty,
empty,empty,empty,empty,empty,empty,empty],
[empty]},
{ok,{1346361461712443,#Ref<0.0.35.116743>}},
["Thu Aug 30 21:17:41.512134 3: failed to listen on TCP port 11210: Address already in use"],
0,true}
** Reason for termination ==
** {abnormal,71}
[ns_server:info,2012-08-30T14:17:41.519,ns_1@10.1.3.238:ns_log:ns_log:handle_cast:125]suppressing duplicate log supervisor_cushion:1("Service memcached exited on node 'ns_1@10.1.3.238' in 0.09s\n") because it's been seen 1 times in the past 36.083474 secs (last seen 36.083474 secs ago
[error_logger:error,2012-08-30T14:17:41.521,ns_1@10.1.3.238:error_logger:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
crasher:
initial call: ns_port_server:init/1
pid: <0.7446.3>
registered_name: ns_port_memcached
exception exit: {abnormal,71}
in function gen_server:terminate/6
ancestors: [<0.7445.3>,ns_port_sup,ns_server_sup,ns_server_cluster_sup,
<0.60.0>]
messages: [{'EXIT',#Port<0.127099>,normal}]
links: [<0.7445.3>]
dictionary: []
trap_exit: true
status: running
heap_size: 1597
stack_size: 24
reductions: 2399
neighbours:
[stats:error,2012-08-30T14:17:41.626,ns_1@10.1.3.238:<0.5564.3>:stats_reader:log_bad_responses:191]Some nodes didn't respond: ['ns_1@10.1.3.238']
[stats:error,2012-08-30T14:17:41.628,ns_1@10.1.3.238:<0.5533.3>:stats_reader:log_bad_responses:191]Some nodes didn't respond: ['ns_1@10.1.3.238']
[stats:error,2012-08-30T14:17:41.630,ns_1@10.1.3.238:<0.5557.3>:stats_reader:log_bad_responses:191]Some nodes didn't respond: ['ns_1@10.1.3.238']
[error_logger:error,2012-08-30T14:17:42.564,ns_1@10.1.3.238:error_logger:ale_error_logger_handler:log_report:72]
=========================SUPERVISOR REPORT=========================
Supervisor: {local,menelaus_sup}
Context: child_terminated
Reason: {noproc,
{gen_server,call,
[{'stats_reader-bucket','ns_1@10.1.3.238'},
{latest,minute,1}]}}
Offender: [{pid,<0.7432.3>},
{name,menelaus_web_alerts_srv},
{mfargs,{menelaus_web_alerts_srv,start_link,[]}},
{restart_type,permanent},
{shutdown,5000},
{child_type,worker}]
[error_logger:info,2012-08-30T14:17:42.564,ns_1@10.1.3.238:error_logger:ale_error_logger_handler:log_report:72]
=========================PROGRESS REPORT=========================
supervisor: {local,menelaus_sup}
started: [{pid,<0.7458.3>},
{name,menelaus_web_alerts_srv},
{mfargs,{menelaus_web_alerts_srv,start_link,[]}},
{restart_type,permanent},
{shutdown,5000},
{child_type,worker}]
[ns_server:debug,2012-08-30T14:17:43.182,ns_1@10.1.3.238:ns_heart:ns_heart:current_status:155]Ignoring failure to get stats for bucket: "bucket":
{'EXIT',{noproc,{gen_server,call,
[{'stats_reader-bucket','ns_1@10.1.3.238'},
{latest,minute}]}}}
Show
Junyi Xie
added a comment - Looks like a bunch of ns_server crash in 10.1.3.238-8091-diag.txt.
ns_server:debug,2012-08-30T14:17:41.431, ns_1@10.1.3.238 :<0.7445.3>:supervisor_cushion:init:43]starting ns_port_server with delay of 5000
[error_logger:error,2012-08-30T14:17:41.431, ns_1@10.1.3.238 :error_logger:ale_error_logger_handler:log_report:72]
=========================SUPERVISOR REPORT=========================
Supervisor: {local,ns_port_sup}
Context: child_terminated
Reason: {abnormal,71}
Offender: [{pid,<0.7380.3>},
{name,
{memcached,"/opt/couchbase/bin/memcached",
["-X",
"/opt/couchbase/lib/memcached/stdin_term_handler.so",
"-X",
"/opt/couchbase/lib/memcached/file_logger.so,cyclesize=104857600;sleeptime=19;filename=/opt/couchbase/var/lib/couchbase/logs/memcached.log",
"-l","0.0.0.0:11210,0.0.0.0:11209:1000","-p",
"11210","-E",
"/opt/couchbase/lib/memcached/bucket_engine.so",
"-B","binary","-r","-c","10000","-e",
"admin=_admin;default_bucket_name=default;auto_create=false",
[]],
[{env,
[{"EVENT_NOSELECT","1"},
{"MEMCACHED_TOP_KEYS","100"},
{"ISASL_PWFILE",
"/opt/couchbase/var/lib/couchbase/data/isasl.pw"},
{"ISASL_DB_CHECK_TIME","1"}]},
use_stdio,stderr_to_stdout,exit_status,
port_server_send_eol,stream]}},
{mfargs,
{erlang,apply,
[#Fun<ns_port_sup.3.119727222>,
[memcached,"/opt/couchbase/bin/memcached",
["-X",
"/opt/couchbase/lib/memcached/stdin_term_handler.so",
"-X",
"/opt/couchbase/lib/memcached/file_logger.so,cyclesize=104857600;sleeptime=19;filename=/opt/couchbase/var/lib/couchbase/logs/memcached.log",
"-l","0.0.0.0:11210,0.0.0.0:11209:1000","-p",
"11210","-E",
"/opt/couchbase/lib/memcached/bucket_engine.so",
"-B","binary","-r","-c","10000","-e",
"admin=_admin;default_bucket_name=default;auto_create=false",
[]],
[{env,
[{"EVENT_NOSELECT","1"},
{"MEMCACHED_TOP_KEYS","100"},
{"ISASL_PWFILE",
"/opt/couchbase/var/lib/couchbase/data/isasl.pw"},
{"ISASL_DB_CHECK_TIME","1"}]},
use_stdio,stderr_to_stdout,exit_status,
port_server_send_eol,stream]]]}},
{restart_type,permanent},
{shutdown,86400000},
{child_type,worker}]
[error_logger:info,2012-08-30T14:17:41.433, ns_1@10.1.3.238 :error_logger:ale_error_logger_handler:log_report:72]
=========================PROGRESS REPORT=========================
supervisor: {local,ns_port_sup}
started: [{pid,<0.7445.3>},
{name,
{memcached,"/opt/couchbase/bin/memcached",
["-X",
"/opt/couchbase/lib/memcached/stdin_term_handler.so",
"-X",
"/opt/couchbase/lib/memcached/file_logger.so,cyclesize=104857600;sleeptime=19;filename=/opt/couchbase/var/lib/couchbase/logs/memcached.log",
"-l","0.0.0.0:11210,0.0.0.0:11209:1000","-p",
"11210","-E",
"/opt/couchbase/lib/memcached/bucket_engine.so",
"-B","binary","-r","-c","10000","-e",
"admin=_admin;default_bucket_name=default;auto_create=false",
[]],
[{env,
[{"EVENT_NOSELECT","1"},
{"MEMCACHED_TOP_KEYS","100"},
{"ISASL_PWFILE",
"/opt/couchbase/var/lib/couchbase/data/isasl.pw"},
{"ISASL_DB_CHECK_TIME","1"}]},
use_stdio,stderr_to_stdout,exit_status,
port_server_send_eol,stream]}},
{mfargs,
{erlang,apply,
[#Fun<ns_port_sup.3.119727222>,
[memcached,"/opt/couchbase/bin/memcached",
["-X",
"/opt/couchbase/lib/memcached/stdin_term_handler.so",
"-X",
"/opt/couchbase/lib/memcached/file_logger.so,cyclesize=104857600;sleeptime=19;filename=/opt/couchbase/var/lib/couchbase/logs/memcached.log",
"-l","0.0.0.0:11210,0.0.0.0:11209:1000",
"-p","11210","-E",
"/opt/couchbase/lib/memcached/bucket_engine.so",
"-B","binary","-r","-c","10000","-e",
"admin=_admin;default_bucket_name=default;auto_create=false",
[]],
[{env,
[{"EVENT_NOSELECT","1"},
{"MEMCACHED_TOP_KEYS","100"},
{"ISASL_PWFILE",
"/opt/couchbase/var/lib/couchbase/data/isasl.pw"},
{"ISASL_DB_CHECK_TIME","1"}]},
use_stdio,stderr_to_stdout,exit_status,
port_server_send_eol,stream]]]}},
{restart_type,permanent},
{shutdown,86400000},
{child_type,worker}]
[user:info,2012-08-30T14:17:41.518, ns_1@10.1.3.238 :ns_port_memcached:ns_port_server:handle_info:107]Port server memcached on node ' ns_1@10.1.3.238 ' exited with status 71. Restarting. Messages: Thu Aug 30 21:17:41.512134 3: failed to listen on TCP port 11210: Address already in use
[ns_server:info,2012-08-30T14:17:41.518, ns_1@10.1.3.238 :ns_port_memcached:ns_port_server:log:169]memcached<0.7446.3>: Thu Aug 30 21:17:41.512134 3: failed to listen on TCP port 11210: Address already in use
[ns_server:info,2012-08-30T14:17:41.519, ns_1@10.1.3.238 :<0.7445.3>:supervisor_cushion:handle_info:62]Cushion managed supervisor for memcached failed: {abnormal,71}
[user:warn,2012-08-30T14:17:41.519, ns_1@10.1.3.238 :<0.7445.3>:supervisor_cushion:die_slowly:81]Service memcached exited on node ' ns_1@10.1.3.238 ' in 0.09s
[error_logger:error,2012-08-30T14:17:41.519, ns_1@10.1.3.238 :error_logger:ale_error_logger_handler:log_msg:76]** Generic server <0.7446.3> terminating
** Last message in was {#Port<0.127099>,{exit_status,71}}
** When Server state == {state,#Port<0.127099>,memcached,
{["Thu Aug 30 21:17:41.512134 3: failed to listen on TCP port 11210: Address already in use",
empty,empty,empty,empty,empty,empty,empty,
empty,empty,empty,empty,empty,empty,empty,
empty,empty,empty,empty,empty,empty,empty,
empty,empty,empty,empty,empty,empty,empty,
empty,empty,empty,empty,empty,empty,empty,
empty,empty,empty,empty,empty,empty,empty,
empty,empty,empty,empty,empty,empty,empty,
empty,empty,empty,empty,empty,empty,empty,
empty,empty,empty,empty,empty,empty,empty,
empty,empty,empty,empty,empty,empty,empty,
empty,empty,empty,empty,empty,empty,empty,
empty,empty,empty,empty,empty,empty,empty,
empty,empty,empty,empty,empty,empty,empty,
empty,empty,empty,empty,empty,empty,empty],
[empty]},
{ok,{1346361461712443,#Ref<0.0.35.116743>}},
["Thu Aug 30 21:17:41.512134 3: failed to listen on TCP port 11210: Address already in use"],
0,true}
** Reason for termination ==
** {abnormal,71}
[ns_server:info,2012-08-30T14:17:41.519, ns_1@10.1.3.238 :ns_log:ns_log:handle_cast:125]suppressing duplicate log supervisor_cushion:1("Service memcached exited on node ' ns_1@10.1.3.238 ' in 0.09s\n") because it's been seen 1 times in the past 36.083474 secs (last seen 36.083474 secs ago
[error_logger:error,2012-08-30T14:17:41.521, ns_1@10.1.3.238 :error_logger:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
crasher:
initial call: ns_port_server:init/1
pid: <0.7446.3>
registered_name: ns_port_memcached
exception exit: {abnormal,71}
in function gen_server:terminate/6
ancestors: [<0.7445.3>,ns_port_sup,ns_server_sup,ns_server_cluster_sup,
<0.60.0>]
messages: [{'EXIT',#Port<0.127099>,normal}]
links: [<0.7445.3>]
dictionary: []
trap_exit: true
status: running
heap_size: 1597
stack_size: 24
reductions: 2399
neighbours:
[stats:error,2012-08-30T14:17:41.626, ns_1@10.1.3.238 :<0.5564.3>:stats_reader:log_bad_responses:191]Some nodes didn't respond: [' ns_1@10.1.3.238 ']
[stats:error,2012-08-30T14:17:41.628, ns_1@10.1.3.238 :<0.5533.3>:stats_reader:log_bad_responses:191]Some nodes didn't respond: [' ns_1@10.1.3.238 ']
[stats:error,2012-08-30T14:17:41.630, ns_1@10.1.3.238 :<0.5557.3>:stats_reader:log_bad_responses:191]Some nodes didn't respond: [' ns_1@10.1.3.238 ']
[error_logger:error,2012-08-30T14:17:42.564, ns_1@10.1.3.238 :error_logger:ale_error_logger_handler:log_report:72]
=========================SUPERVISOR REPORT=========================
Supervisor: {local,menelaus_sup}
Context: child_terminated
Reason: {noproc,
{gen_server,call,
[{'stats_reader-bucket',' ns_1@10.1.3.238 '},
{latest,minute,1}]}}
Offender: [{pid,<0.7432.3>},
{name,menelaus_web_alerts_srv},
{mfargs,{menelaus_web_alerts_srv,start_link,[]}},
{restart_type,permanent},
{shutdown,5000},
{child_type,worker}]
[error_logger:info,2012-08-30T14:17:42.564, ns_1@10.1.3.238 :error_logger:ale_error_logger_handler:log_report:72]
=========================PROGRESS REPORT=========================
supervisor: {local,menelaus_sup}
started: [{pid,<0.7458.3>},
{name,menelaus_web_alerts_srv},
{mfargs,{menelaus_web_alerts_srv,start_link,[]}},
{restart_type,permanent},
{shutdown,5000},
{child_type,worker}]
[ns_server:debug,2012-08-30T14:17:43.182, ns_1@10.1.3.238 :ns_heart:ns_heart:current_status:155]Ignoring failure to get stats for bucket: "bucket":
{'EXIT',{noproc,{gen_server,call,
[{'stats_reader-bucket',' ns_1@10.1.3.238 '},
{latest,minute}]}}}
Hide
Abhinav Dangeti
added a comment -
Both ns_server and memcached seem to have crashsed, aleksey can you please take a look at this?
Show
Abhinav Dangeti
added a comment - Both ns_server and memcached seem to have crashsed, aleksey can you please take a look at this?
Hide
Ketaki Gangal
added a comment -
Repeated rebalance on this cluster also fails with other errors. Abhinav can you post other rebalance failures as well?
Show
Ketaki Gangal
added a comment - Repeated rebalance on this cluster also fails with other errors. Abhinav can you post other rebalance failures as well?
Hide
Abhinav Dangeti
added a comment -
Also seen, once rebalance failed if we try to rebalance again,
rebalance "is failed with reason not_all_nodes_are_ready_yet".
rebalance "is failed with reason not_all_nodes_are_ready_yet".
Show
Abhinav Dangeti
added a comment - Also seen, once rebalance failed if we try to rebalance again,
rebalance "is failed with reason not_all_nodes_are_ready_yet".
Hide
Aleksey Kondratenko
added a comment -
From logs I'm seeing there's some swap usage.
That's causing long memcached request execution times. And we're doing quite a lot of memcached activity that understandably times out in such condition.
That's causing long memcached request execution times. And we're doing quite a lot of memcached activity that understandably times out in such condition.
Show
Aleksey Kondratenko
added a comment - From logs I'm seeing there's some swap usage.
That's causing long memcached request execution times. And we're doing quite a lot of memcached activity that understandably times out in such condition.
Hide
Hi Peter,
I am not sure what I can fix. By Alk, it is "understandably times out in such condition". I am not sure why rebalance should fail even memcached takes long execution time when using swap.
But it sounds like a generic rebalancing problem, that rebalancing may fail if there is swap usage.
Assume it is expected that rebalance fails in this case, I am not sure what XDCR can do from source node. Please advise.
Thanks,
Junyi
I am not sure what I can fix. By Alk, it is "understandably times out in such condition". I am not sure why rebalance should fail even memcached takes long execution time when using swap.
But it sounds like a generic rebalancing problem, that rebalancing may fail if there is swap usage.
Assume it is expected that rebalance fails in this case, I am not sure what XDCR can do from source node. Please advise.
Thanks,
Junyi
Show
Junyi Xie
added a comment - - edited Hi Peter,
I am not sure what I can fix. By Alk, it is "understandably times out in such condition". I am not sure why rebalance should fail even memcached takes long execution time when using swap.
But it sounds like a generic rebalancing problem, that rebalancing may fail if there is swap usage.
Assume it is expected that rebalance fails in this case, I am not sure what XDCR can do from source node. Please advise.
Thanks,
Junyi
Hide
Farshid Ghods
added a comment -
i see ns-server tried to restart memcached , then i see memcached crashed and then i also see the bug is also being closed as timeout issue.
I am going to assign this bug back to QE to dig deeper.
I am going to assign this bug back to QE to dig deeper.
Show
Farshid Ghods
added a comment - i see ns-server tried to restart memcached , then i see memcached crashed and then i also see the bug is also being closed as timeout issue.
I am going to assign this bug back to QE to dig deeper.
Hide
Farshid Ghods
added a comment -
also i find it very hard to believe that when memcached mem_used is reported as only 400 MB and the box has 2 GB ram
[stats:debug,2012-08-30T13:39:15.175,ns_1@10.1.3.238:<0.1206.0>:stats_collector:maybe_log_stats:126](at {{1970,1,16},{5,59,19}}) Stats for bucket "bucket":
curr_items 287939
curr_items_tot 635438
mem_used 429145136
[couchdb:error,2012-08-30T14:16:53.148,ns_1@10.1.3.238:<0.3227.3>:couch_log:error:42]Uncaught error in HTTP request: {exit,
{timeout,
{gen_server,call,
['ns_memcached-bucket',
{stats,<<>>},
60000]}}}
As alk mentioned there are some swap usage
730582: {swap_used,629829632}]},
743695: {swap_used,629829632}]},
743872: {swap_used,629829632}]},
but again there are as you see 500 MB free memory available and we dont have indexing going on here so not sure why we are using swap
1549173: {free_memory,5756739584},
1550434: {free_memory,5756342272},
1550518: {free_memory,5757095936},
1550601: {free_memory,579661824},
1550682: {free_memory,5757095936},
1550765: {free_memory,579661824},
1551091: {free_memory,5143658496},
1761876: {free_memory,321335296},
1761946: {free_memory,5732896768},
yes this machine has only 2 GB ram but memcached is only using 400 MB and we have 500 MB free space.
math does not add up for me and we can not close this bug until QE digs deepers into logs and find out more about what went wrong
box also has 4 cpus
processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 44
model name : Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
stepping : 2
cpu MHz : 2660.000
cache size : 12288 KB
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc ida nonstop_tsc arat pni ssse3 cx16 sse4_1 sse4_2 popcnt lahf_lm
bogomips : 5320.00
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: [8]
[stats:debug,2012-08-30T13:39:15.175,ns_1@10.1.3.238:<0.1206.0>:stats_collector:maybe_log_stats:126](at {{1970,1,16},{5,59,19}}) Stats for bucket "bucket":
curr_items 287939
curr_items_tot 635438
mem_used 429145136
[couchdb:error,2012-08-30T14:16:53.148,ns_1@10.1.3.238:<0.3227.3>:couch_log:error:42]Uncaught error in HTTP request: {exit,
{timeout,
{gen_server,call,
['ns_memcached-bucket',
{stats,<<>>},
60000]}}}
As alk mentioned there are some swap usage
730582: {swap_used,629829632}]},
743695: {swap_used,629829632}]},
743872: {swap_used,629829632}]},
but again there are as you see 500 MB free memory available and we dont have indexing going on here so not sure why we are using swap
1549173: {free_memory,5756739584},
1550434: {free_memory,5756342272},
1550518: {free_memory,5757095936},
1550601: {free_memory,579661824},
1550682: {free_memory,5757095936},
1550765: {free_memory,579661824},
1551091: {free_memory,5143658496},
1761876: {free_memory,321335296},
1761946: {free_memory,5732896768},
yes this machine has only 2 GB ram but memcached is only using 400 MB and we have 500 MB free space.
math does not add up for me and we can not close this bug until QE digs deepers into logs and find out more about what went wrong
box also has 4 cpus
processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 44
model name : Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
stepping : 2
cpu MHz : 2660.000
cache size : 12288 KB
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc ida nonstop_tsc arat pni ssse3 cx16 sse4_1 sse4_2 popcnt lahf_lm
bogomips : 5320.00
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: [8]
Show
Farshid Ghods
added a comment - also i find it very hard to believe that when memcached mem_used is reported as only 400 MB and the box has 2 GB ram
[stats:debug,2012-08-30T13:39:15.175, ns_1@10.1.3.238 :<0.1206.0>:stats_collector:maybe_log_stats:126](at {{1970,1,16},{5,59,19}}) Stats for bucket "bucket":
curr_items 287939
curr_items_tot 635438
mem_used 429145136
[couchdb:error,2012-08-30T14:16:53.148, ns_1@10.1.3.238 :<0.3227.3>:couch_log:error:42]Uncaught error in HTTP request: {exit,
{timeout,
{gen_server,call,
['ns_memcached-bucket',
{stats,<<>>},
60000]}}}
As alk mentioned there are some swap usage
730582: {swap_used,629829632}]},
743695: {swap_used,629829632}]},
743872: {swap_used,629829632}]},
but again there are as you see 500 MB free memory available and we dont have indexing going on here so not sure why we are using swap
1549173: {free_memory,5756739584},
1550434: {free_memory,5756342272},
1550518: {free_memory,5757095936},
1550601: {free_memory,579661824},
1550682: {free_memory,5757095936},
1550765: {free_memory,579661824},
1551091: {free_memory,5143658496},
1761876: {free_memory,321335296},
1761946: {free_memory,5732896768},
yes this machine has only 2 GB ram but memcached is only using 400 MB and we have 500 MB free space.
math does not add up for me and we can not close this bug until QE digs deepers into logs and find out more about what went wrong
box also has 4 cpus
processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 44
model name : Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
stepping : 2
cpu MHz : 2660.000
cache size : 12288 KB
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc ida nonstop_tsc arat pni ssse3 cx16 sse4_1 sse4_2 popcnt lahf_lm
bogomips : 5320.00
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: [8]
Hide
Farshid Ghods
added a comment -
Junyi/Alk,
feel free to declare this as won't fix but please look at my latest comment regarding the environment and stats gathered here.
feel free to declare this as won't fix but please look at my latest comment regarding the environment and stats gathered here.
Show
Farshid Ghods
added a comment - Junyi/Alk,
feel free to declare this as won't fix but please look at my latest comment regarding the environment and stats gathered here.
Hide
Peter Wansch
added a comment -
Damien is changing the XDCR replication manager initialization code which may address this problem.
Show
Peter Wansch
added a comment - Damien is changing the XDCR replication manager initialization code which may address this problem.
Hide
Aleksey Kondratenko
added a comment -
Apparently Damien is working on this. Thus getting this off my list
Show
Aleksey Kondratenko
added a comment - Apparently Damien is working on this. Thus getting this off my list
Hide
Thuan Nguyen
added a comment -
Integrated in github-ns-server-2-0 #461 (See [http://qa.hq.northscale.net/job/github-ns-server-2-0/461/])
MB-6493: Throttle xdc_vbucket_rep initialization (Revision ff26c4fda40322ba2c0eb5d039edf5fedbc282f5)
MB-6493: Add supervisor2 and use for restart of xdc_vbucker_rep (Revision 3f5ee80e654f7b83621e25b4f0f5c52c3a84b6b3)
MB-6493: Fix vb map cache regression when xdcr init fails (Revision 807184854ac3c67740830dc83b5b17170282fe60)
Result = SUCCESS
pwansch :
Files :
* src/xdc_replication.erl
* src/xdc_vbucket_rep.erl
pwansch :
Files :
* src/xdc_replication.erl
* src/xdc_vbucket_rep.erl
* src/ns_config_default.erl
* src/supervisor2.erl
* include/xdc_replicator.hrl
* src/xdc_vbucket_rep_sup.erl
pwansch :
Files :
* src/xdc_vbucket_rep.erl
Result = SUCCESS
pwansch :
Files :
* src/xdc_replication.erl
* src/xdc_vbucket_rep.erl
pwansch :
Files :
* src/xdc_replication.erl
* src/xdc_vbucket_rep.erl
* src/ns_config_default.erl
* src/supervisor2.erl
* include/xdc_replicator.hrl
* src/xdc_vbucket_rep_sup.erl
pwansch :
Files :
* src/xdc_vbucket_rep.erl
Show
Thuan Nguyen
added a comment - Integrated in github-ns-server-2-0 #461 (See [ http://qa.hq.northscale.net/job/github-ns-server-2-0/461/ ])
MB-6493 : Throttle xdc_vbucket_rep initialization (Revision ff26c4fda40322ba2c0eb5d039edf5fedbc282f5)
MB-6493 : Add supervisor2 and use for restart of xdc_vbucker_rep (Revision 3f5ee80e654f7b83621e25b4f0f5c52c3a84b6b3)
MB-6493 : Fix vb map cache regression when xdcr init fails (Revision 807184854ac3c67740830dc83b5b17170282fe60)
Result = SUCCESS
pwansch :
Files :
* src/xdc_replication.erl
* src/xdc_vbucket_rep.erl
pwansch :
Files :
* src/xdc_replication.erl
* src/xdc_vbucket_rep.erl
* src/ns_config_default.erl
* src/supervisor2.erl
* include/xdc_replicator.hrl
* src/xdc_vbucket_rep_sup.erl
pwansch :
Files :
* src/xdc_vbucket_rep.erl
Hide
Damien Katz
added a comment -
Abhinav, can you verify if this is still a problem with my latest changes?
Show
Damien Katz
added a comment - Abhinav, can you verify if this is still a problem with my latest changes?
Hide
Abhinav Dangeti
added a comment -
Reproduced the scenario on build 1695: Problem doesn't persist.
Show
Abhinav Dangeti
added a comment - Reproduced the scenario on build 1695: Problem doesn't persist.
Show
Farshid Ghods
added a comment - Ketaki,
are you still running more tests to verify the fix ?
Hide
Ketaki Gangal
added a comment -
2.0-1700. Rebalance is working as expected, seeing no timeout errors.
Show
Ketaki Gangal
added a comment - 2.0-1700. Rebalance is working as expected, seeing no timeout errors.
MB-6385.