Consistent XDCR error when replicating to more than 2 clusters in the new 2.2 release

I saw this similar problem when I first tested with the 2.1 release. I'm in the amazon cloud environment. I am replicating from one CS 2.2 cluster to several other CS 2.2 clusters using XDCR. When i replicate from 1 cluster to another, I never see an issue. When i replicate from 1 cluster to 2 clusters, I still don't see an issue. However, when i replicate from 1 to 3 or more clusters, I fairly consistently see the following behavior:

The replication is going fine until all of a sudden errors start displaying in the console of the source (my replication is always one way). The behavior is for the replication rate to all destinations to slow down dramatically. In fact, in looking at the destination console (for all 3 destinations), there is some activity, then for about 30 seconds there is NO activity, then again some activity (which may last for 30 seconds or so, then no activity (again, for about 30 seconds). The end result is that all the documents are successfully replicated, but once I get into this state, the transfer rate is reduced dramatically and performs suffers.

So, i looked in the log files. What is I see in the source log file (xdcr_errors.#) is the following:

[xdcr:error,2013-09-21T13:37:43.853,ns_1@machineName.compute-1.amazonaws.com:<0.27099.25>:xdc_vbucket_rep:terminate:398]Replication (XMem mode) `ba6ef
7badea24f4fcd2e1b9e938b7b99/cust1_app1/cust1_app1` (`cust1_app1/250` -> `http://*****@ec2-54-246-81-99.eu-west-1.compute.amazonaws.com:8092/cust1...
3b7857d92c94752330262c7567bcf17792`) failed.Please see ns_server debug log for complete state dump
[xdcr:error,2013-09-21T13:37:43.858,ns_1@ec2-107-22-165-101.compute-1.amazonaws.com:<0.13879.208>:xdc_vbucket_rep:handle_info:90]Error initializing vb replic
ator ({init_state,
{rep,
<<"ba6ef7badea24f4fcd2e1b9e938b7b99/cust1_app1/cust1_app1">>,
<<"cust1_app1">>,
<<"/remoteClusters/ba6ef7badea24f4fcd2e1b9e938b7b99/buckets/cust1_app1">>,
"xmem",
[{optimistic_replication_threshold,256},
{worker_batch_size,500},
{failure_restart_interval,30},
{doc_batch_size_kb,2048},
{checkpoint_interval,1800},
{max_concurrent_reps,32},
{connection_timeout,180},
{worker_processes,4},
{http_connections,20},
{retries_per_request,2},
{xmem_worker,1},
{enable_pipeline_ops,true},
{local_conflict_resolution,false},
{socket_options,
[{keepalive,true},{nodelay,false}]},
{trace_dump_invprob,1000}]},
250,"xmem",<0.12755.0>,<0.12756.0>,
<0.12751.0>}):{error,
{badmatch,
{error,all_nodes_failed,
<<"Failed to grab remote bucket `cust1_app1`from any of known nodes">>}}}
[

Now, in each of the destination CS instances (all clusters are of size 1), I see the following error at around the same time as the above error in the source:

[user:info,2013-09-21T14:52:36.205,ns_1@machineName.us-west-2.compute.amazonaws.com:<0.1808.0>:ns_log:crash_consumpti
on_loop:64]Port server moxi on node 'babysitter_of_ns_1@127.0.0.1' exited with status 0. Restarting. Messages: 2013-09-20 14
:03:32: (cproxy_config.c.315) env: MOXI_SASL_PLAIN_USR (9)
2013-09-20 14:03:32: (cproxy_config.c.324) env: MOXI_SASL_PLAIN_PWD (9)
2013-09-20 14:03:35: (agent_config.c.703) ERROR: bad JSON configuration from http://127.0.0.1:8091/pools/default/saslBuckets
Streaming: Number of vBuckets must be a power of two > 0 and <= 65536
2013-09-20 14:03:48: (agent_config.c.703) ERROR: bad JSON configuration from http://127.0.0.1:8091/pools/default/saslBuckets
Streaming: Number of vBuckets must be a power of two > 0 and <= 65536
EOL on stdin. Exiting
[ns_server:info,2013-09-21T14:52:40.996,ns_1@ec2-54-214-254-175.us-west-2.compute.amazonaws.com:<0.8271.0>:mc_connection:run
_loop:202]mccouch connection was normally closed
[user:info,2013-09-21T14:52:40.997,ns_1@ec2-54-214-254-175.us-west-2.compute.amazonaws.com:ns_memcached-cust1_app1<0.8258.0>
:ns_memcached:terminate:738]Control connection to memcached on 'ns_1@ec2-54-214-254-175.us-west-2.compute.amazonaws.com' dis
connected: {{badmatch,
{error,
closed}},
[{mc_client_bina
ry,
cmd_binary_voc
al_recv,
5},
{mc_client_bina
ry,
select_bucket,
2},
{ns_memcached,
ensure_bucket,
2},
{ns_memcached,
handle_info,
2},
{gen_server,
handle_msg,
5},
{ns_memcached,
init,
1},
{gen_server,
init_it,
6},
{proc_lib,
init_p_do_appl
y,
3}]}
[ns_server:info,2013-09-21T14:52:40.996,ns_1@ec2-54-214-254-175.us-west-2.compute.amazonaws.com:<0.2008.0>:mc_connection:run
_loop:202]mccouch connection was normally closed
[error_logger:error,2013-09-21T14:52:40.997,ns_1@ec2-54-214-254-175.us-west-2.compute.amazonaws.com:error_logger<0.6.0>:ale_
error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
crasher:
initial call: erlang:apply/2
pid: <0.2012.0>
registered_name: []
exception error: no match of right hand side value {error,closed}
in function mc_binary:quick_stats_recv/3
in call from mc_binary:quick_stats_loop/5
in call from mc_binary:quick_stats/5
in call from ns_memcached:do_handle_call/3
in call from ns_memcached:worker_loop/3
ancestors: ['ns_memcached-default','single_bucket_sup-default',
<0.1984.0>]
messages: []
links: [<0.1998.0>]
dictionary: []
trap_exit: false
status: running
heap_size: 4181
stack_size: 24
reductions: 280781421
neighbours:

As stated above, I saw this problem in 2.1 and now in 2.2, and when i replicate from 1 to 3 clusters, it happens about 50% of the time.

1 Answer

« Back to question.

[{optimistic_replication_threshold,256},
{worker_batch_size,500},
{failure_restart_interval,30},
{doc_batch_size_kb,2048},
{checkpoint_interval,1800},
{max_concurrent_reps,32},

looks like your are hitting the failure restart interval
{failure_restart_interval,30},

In 2.2 they have a new replication method that goes from Memecached to memcached instead of disk to disk between the two clusters. its called "VERSION 2" in XDCR option.

you might be having CPU usage issues or bandwidth issues.
{max_concurrent_reps,32}, = the number of vbucket at the source that is sending data to the target at any given time.

When you replicate 3x you are really doing 32 x 3. Try lowing it to 10.
Also install iftop to track how much data is going out.

To check CPU usage during I like using htop.

If it is bandwidth you can solve it by adding more nodes. because the 32x3 vBuckets sending data is spread out through more machines.

Also if you are doing Master->Slave change
[{optimistic_replication_threshold,256},
to 20MB instead of 256KB.