Details
- Type: Bug
- Status: Closed
- Priority: Blocker
- Resolution: Fixed
- Affects Version/s: 2.0
- Fix Version/s: 2.0
- Component/s: cross-datacenter-replication, ns_server
- Security Level: Public
- Labels: None
- Environment: Ubuntu 12.04 LTS ec2 xlarge instances (15GB Memory)
http://builds.hq.northscale.net/latestbuilds/couchbase-server-community_x86_64_2.0.0-1944-rel.deb.manifest.xml
Live clusters:
C1: http://ec2-177-71-167-196.sa-east-1.compute.amazonaws.com:8091/
C2: http://ec2-122-248-217-156.ap-southeast-1.compute.amazonaws.com:8091/
biXDCR_bucket: C1 <--> C2
uniXDCR_src: C1 --> C2
Description
- Front-end loads were running on biXDCR_bucket on C1 and C2 and on uniXDCR_src on C1, with replication in progress.
- On C2:
  - 3 nodes down, with erl_crash.dump files generated (will be attached).
  - 2 nodes with Erlang possibly hung, in pending state. (In top, beam.smp keeps appearing and disappearing, using up 1.0G of resident memory; no cores and no erl_crash.dump files were generated, and memcached still seems to be running.)
  - Unable to grab diags off any of these nodes.
- Result: all items in biXDCR_bucket on C2 lost (?).
- Half the items in uniXDCR_dest on C2 lost.
Many crash reports like the following appeared on one of the "Pending" nodes on C2:
** Reason for termination ==
** {noproc,
{gen_server,call,
[remote_clusters_info,
{get_remote_bucket,
[{hostname,
"ec2-177-71-147-19.sa-east-1.compute.amazonaws.com:8091"},
{uuid,<<"0b3a63d5d8805e0c6670c619cc346299">>},
{name,"SANPAULO (C2)"},
{username,"Administrator"},
{password,"password"}],
"biXDCR_bucket",false,30000},
infinity]}}
[error_logger:error,2012-11-07T5:57:56.025,ns_1@ec2-54-251-5-97.ap-southeast-1.compute.amazonaws.com:error_logger<0.5.0>:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
crasher:
initial call: xdc_vbucket_rep:init/1
pid: <0.28161.8>
registered_name: []
exception exit: {noproc,
{gen_server,call,
[remote_clusters_info,
{get_remote_bucket,
[{hostname,
"ec2-177-71-147-19.sa-east-1.compute.amazonaws.com:8091"},
{uuid,
<<"0b3a63d5d8805e0c6670c619cc346299">>},
{name,"SANPAULO (C2)"},
{username,"Administrator"},
{password,"password"}],
"biXDCR_bucket",false,30000},
infinity]}}
in function gen_server:terminate/6
ancestors: [<0.3608.5>,<0.3603.5>,xdc_replication_sup,ns_server_sup,
ns_server_cluster_sup,<0.64.0>]
messages: []
links: [<0.3608.5>]
dictionary: []
trap_exit: true
status: running
heap_size: 514229
stack_size: 24
reductions: 35035
neighbours:
** Reason for termination ==
** killed
[error_logger:error,2012-11-07T5:58:41.704,ns_1@ec2-54-251-5-97.ap-southeast-1.compute.amazonaws.com:error_logger<0.5.0>:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
crasher:
initial call: couch_db:init/1
pid: <0.19405.4>
registered_name: []
exception exit: killed
in function gen_server:terminate/6
ancestors: [couch_server,couch_primary_services,couch_server_sup,
cb_couch_sup,ns_server_cluster_sup,<0.64.0>]
messages: []
links: []
dictionary: []
trap_exit: true
status: running
heap_size: 1597
stack_size: 24
reductions: 11968
neighbours:
Attached are the grabbed diags from one of the non-down nodes on C2.
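To get a rough sense of which processes are crashing most often in logs like the ones above, the crash reports can be tallied by their "initial call" line. The following is a minimal sketch, not part of Couchbase or ns_server: the function name `count_crashes` and the `SAMPLE` text are hypothetical, and it assumes each report starts with a `CRASH REPORT` banner followed by an `initial call:` line, as in the excerpts in this ticket.

```python
import re
from collections import Counter

# Hypothetical helper (not a Couchbase tool): tally crash reports in an
# ns_server-style log by the "initial call" of the crashed process.
# Assumption: every report begins with a "====CRASH REPORT====" banner
# and contains an "initial call:" line shortly after it.
CRASH_HEADER = re.compile(r"^=+CRASH REPORT=+\s*$")
INITIAL_CALL = re.compile(r"^\s*initial call:\s*(\S+)")


def count_crashes(log_text):
    """Return a Counter mapping initial-call names to crash report counts."""
    counts = Counter()
    in_report = False
    for line in log_text.splitlines():
        if CRASH_HEADER.match(line):
            # A new report begins; look for its "initial call" line next.
            in_report = True
            continue
        if in_report:
            match = INITIAL_CALL.match(line)
            if match:
                counts[match.group(1)] += 1
                in_report = False
    return counts


# Abbreviated sample built from the two reports quoted in this ticket.
SAMPLE = """\
=========================CRASH REPORT=========================
crasher:
initial call: xdc_vbucket_rep:init/1
pid: <0.28161.8>
exception exit: {noproc,
=========================CRASH REPORT=========================
crasher:
initial call: couch_db:init/1
pid: <0.19405.4>
exception exit: killed
"""

if __name__ == "__main__":
    for call, n in count_crashes(SAMPLE).most_common():
        print(f"{call}\t{n}")
```

Run against a full log file (for example by passing the result of `open(path).read()` to `count_crashes`), this would show whether the `xdc_vbucket_rep:init/1` crashes dominate on a given node.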
https://s3.amazonaws.com/bugdb/MB-7129/erl_crash.dump.11-07-2012-03%3A19%3A26.753
https://s3.amazonaws.com/bugdb/MB-7129/erl_crash.dump.11-07-2012-03%3A19%3A46.743