XDCR saw many timeouts from the underlying ns_server and ep_engine, and as a result many vb replicators
crashed, as expected. Timeouts like these are not expected at this scale of test.
Pavel mentioned that this happened only for a short period of time. What happened to ns_server or ep_engine during that period that made them so slow, or too busy, to serve XDCR requests?
I am not sure there is anything I can fix on the XDCR side. It looks to me like the ep_engine or ns_server team needs to triage the issue.
[xdcr:error,2012-12-02T8:16:20.578,
ns_1@10.2.3.33:<0.15416.0>:xdc_vbucket_rep:terminate:298]Replication `41d1101e89da1d590261faef5067a4e8/bucket-1/bucket-1` (`bucket-1/412` -> `http://Administrator:password@10.2.3.31:8092/bucket-1%2f412%3b2b6f9272ff82c9cc0dfcdd22ce77b9d6`) failed: {timeout,{gen_server,call,[ns_config,get]}}
[xdcr:error,2012-12-02T8:20:01.874,
ns_1@10.2.3.33:<0.1527.128>:capi_replication:update_replicated_docs:100][Bucket:"bucket-0", Vb:505]: update 170 docs takes too long to finish!(total time spent: 189 secs, default connection time out: 180 secs)
[xdcr:error,2012-12-02T8:20:11.406,
ns_1@10.2.3.33:<0.1531.128>:capi_replication:update_replicated_docs:100][Bucket:"bucket-0", Vb:432]: update 172 docs takes too long to finish!(total time spent: 199 secs, default connection time out: 180 secs)
[xdcr:error,2012-12-02T8:33:13.109,
ns_1@10.2.3.33:<0.14775.128>:xdc_vbucket_rep:terminate:284]Shutting xdcr vb replicator ({init_state,
{rep,
<<"41d1101e89da1d590261faef5067a4e8/bucket-1/bucket-1">>,
<<"bucket-1">>,
<<"/remoteClusters/41d1101e89da1d590261faef5067a4e8/buckets/bucket-1">>,
[{connection_timeout,180000},
{continuous,true},
{http_connections,20},
{retries,2},
{socket_options,
[{keepalive,true},{nodelay,false}]},
{worker_batch_size,500},
{worker_processes,4}]},
488,<0.15362.0>,<0.15363.0>,<0.15357.0>}) down without ever successfully initializing: {badmatch,
{error,
all_nodes_failed,
<<"Failed to grab remote bucket info from any of known nodes">>}}
https://s3.amazonaws.com/bugdb/jira/MB-7321/d052ea5a/192.168.162.30-1222012-849-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-7321/d052ea5a/192.168.162.31-1222012-855-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-7321/d052ea5a/192.168.162.32-1222012-852-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-7321/d052ea5a/192.168.162.33-1222012-858-diag.zip
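
For whoever triages this from the diags, a rough Python sketch that pulls the "takes too long to finish" messages out of a log and reports how far each one overran the connection timeout may help to see when the slow period started and ended. The message wording is copied from the excerpt in this comment, and it is an assumption that every such message in the diags uses the same wording. The names `SLOW_UPDATE`, `find_slow_updates` and `over_by` are made up for this illustration and are not part of ns_server or XDCR.

```python
import re

# Message format copied from the capi_replication errors quoted in this
# comment. Assumption: other occurrences in the diags use the same wording.
SLOW_UPDATE = re.compile(
    r'\[Bucket:"(?P<bucket>[^"]+)", Vb:(?P<vb>\d+)\]: '
    r'update (?P<docs>\d+) docs takes too long to finish!'
    r'\(total time spent: (?P<spent>\d+) secs, '
    r'default connection time out: (?P<timeout>\d+) secs\)'
)


def find_slow_updates(log_text):
    """Return one dict per slow-update message found in log_text."""
    # Re-join long lines in case they were wrapped with a trailing backslash.
    joined = log_text.replace("\\\n", "")
    results = []
    for match in SLOW_UPDATE.finditer(joined):
        spent = int(match.group("spent"))
        timeout = int(match.group("timeout"))
        results.append({
            "bucket": match.group("bucket"),
            "vb": int(match.group("vb")),
            "docs": int(match.group("docs")),
            "spent": spent,
            "timeout": timeout,
            # Seconds by which the update exceeded the connection timeout.
            "over_by": spent - timeout,
        })
    return results


# The two slow-update errors from this comment; the first is left wrapped
# with a trailing backslash to exercise the re-joining step.
SAMPLE = r'''[xdcr:error,2012-12-02T8:20:01.874,
ns_1@10.2.3.33:<0.1527.128>:capi_replication:update_replicated_docs:100][Bucket:"bucket-0", Vb:505]: update 170 docs takes too long to finish!(total time spent: 189 secs, defaul\
t connection time out: 180 secs)
[xdcr:error,2012-12-02T8:20:11.406,
ns_1@10.2.3.33:<0.1531.128>:capi_replication:update_replicated_docs:100][Bucket:"bucket-0", Vb:432]: update 172 docs takes too long to finish!(total time spent: 199 secs, default connection time out: 180 secs)
'''

if __name__ == "__main__":
    for entry in find_slow_updates(SAMPLE):
        print(entry["bucket"], entry["vb"], entry["docs"], entry["over_by"])
```

This only covers the capi_replication slow-update message. The `ns_config` gen_server timeout and the `all_nodes_failed` error have different formats and would each need their own pattern.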