[MB-7321] XDCR: constant crashes/time-outs/mb_master restarts during perf. tests on Windows Created: 03/Dec/12  Updated: 19/Feb/14  Resolved: 24/Jan/13

Status: Closed
Project: Couchbase Server
Component/s: cross-datacenter-replication, ns_server
Affects Version/s: 2.0
Fix Version/s: 2.0.1
Security Level: Public

Type: Bug Priority: Major
Reporter: Pavel Paulau Assignee: Pavel Paulau
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: VMs, Windows 64-bit, 24GB, 4 cores
build 1969


 Description   
2 <-> 2 nodes, 2 buckets per cluster, unidirectional replication
4K ops/sec/cluster (50/50 gets/sets), no views

 Comments   
Comment by Pavel Paulau [ 03/Dec/12 ]
diags:

https://s3.amazonaws.com/bugdb/jira/MB-7321/d052ea5a/192.168.162.30-1222012-849-diag.zip

https://s3.amazonaws.com/bugdb/jira/MB-7321/d052ea5a/192.168.162.31-1222012-855-diag.zip

https://s3.amazonaws.com/bugdb/jira/MB-7321/d052ea5a/192.168.162.32-1222012-852-diag.zip

https://s3.amazonaws.com/bugdb/jira/MB-7321/d052ea5a/192.168.162.33-1222012-858-diag.zip
Comment by Farshid Ghods (Inactive) [ 03/Dec/12 ]
Pavel,

To understand the severity of this issue:

Does this impact the rate of replication? Is the cluster usable?
Does this problem go away when you reduce the load on the cluster?
Comment by Pavel Paulau [ 03/Dec/12 ]
It does have an impact, but only slightly and for a short period of time. For the given light workload it results in small queue spikes.

Symptoms are similar to issues with scheduler threads whereas _async_ threads were enabled in recent builds.
Comment by Abhinav Dangeti [ 03/Dec/12 ]
Do you see these timeouts right at the start when you just set up the replication?
For example, there would be XDCR errors if replication is set up immediately after creating a replication reference, and you would see "Failures in grabbing vbucket stats".

I tried reproducing your scenario, but in my case I started replication a couple of minutes after I set up the replication reference, and I noticed no crashes or drop in the replication rate at any point until the finish. My load had a similar ratio of sets and gets as well.

I am not sure what fixed delay we need to give the cluster between setting up the replication reference and actually starting the replication, but if we give it a couple of minutes I am pretty sure that we shouldn't see any XDCR errors in grabbing vbucket stats.
Comment by Pavel Paulau [ 04/Dec/12 ]
No, it happens after 4-5 hours of test run time.

The last run was the most troubling so far: 2 nodes were still marked as down even after the test. The logs are here if you are curious:
http://qa.hq.northscale.net/job/xperf-win/32/
Comment by Junyi Xie (Inactive) [ 06/Dec/12 ]
XDCR saw a lot of timeouts from the underlying ns_server and ep_engine, and therefore a lot of vb replicators
crashed, as expected. Such timeouts are not expected for a test of this scale.

Pavel mentioned this happened for a short period of time. What happened to ns_server or ep_engine during that time that made it so slow, or even too busy to serve XDCR requests?


I am not sure what I can fix on the XDCR side. Looks to me ep_engine or ns_server team need to triage the issue.



[xdcr:error,2012-12-02T8:16:20.578,ns_1@10.2.3.33:<0.15416.0>:xdc_vbucket_rep:terminate:298]Replication `41d1101e89da1d590261faef5067a4e8/bucket-1/bucket-1` (`bucket-1/412` -> `http://Administrator:password@10.2.3.31:8092/bucket-1%2f412%3b2b6f9272ff82c9cc0dfcdd22ce77b9d6`) failed: {timeout,{gen_server,call,[ns_config,get]}}

[xdcr:error,2012-12-02T8:20:01.874,ns_1@10.2.3.33:<0.1527.128>:capi_replication:update_replicated_docs:100][Bucket:"bucket-0", Vb:505]: update 170 docs takes too long to finish!(total time spent: 189 secs, default connection time out: 180 secs)
[xdcr:error,2012-12-02T8:20:11.406,ns_1@10.2.3.33:<0.1531.128>:capi_replication:update_replicated_docs:100][Bucket:"bucket-0", Vb:432]: update 172 docs takes too long to finish!(total time spent: 199 secs, default connection time out: 180 secs)


[xdcr:error,2012-12-02T8:33:13.109,ns_1@10.2.3.33:<0.14775.128>:xdc_vbucket_rep:terminate:284]Shutting xdcr vb replicator ({init_state,
                              {rep,
                               <<"41d1101e89da1d590261faef5067a4e8/bucket-1/bucket-1">>,
                               <<"bucket-1">>,
                               <<"/remoteClusters/41d1101e89da1d590261faef5067a4e8/buckets/bucket-1">>,
                               [{connection_timeout,180000},
                                {continuous,true},
                                {http_connections,20},
                                {retries,2},
                                {socket_options,
                                 [{keepalive,true},{nodelay,false}]},
                                {worker_batch_size,500},
                                {worker_processes,4}]},
                              488,<0.15362.0>,<0.15363.0>,<0.15357.0>}) down without ever successfully initializing: {badmatch,
                                                                                                                      {error,
                                                                                                                       all_nodes_failed,
                                                                                                                       <<"Failed to grab remote bucket info from any of known nodes">>}}




Comment by Junyi Xie (Inactive) [ 06/Dec/12 ]
From the log above, XDCR saw lots of timeouts at different stages of the replicator, e.g., fetching an ns_config parameter during initialization, posting data during replication, and even fetching the remote vbucket map.
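
To see which replicator stage produces most of the errors, the log excerpts can be grouped by the module and function that logged them. The following is only a minimal Python sketch: the header layout it parses is inferred from the three excerpts quoted in this ticket, the function names `count_errors_by_site` and `timeout_overruns` are hypothetical helpers and not part of any Couchbase tooling, and it assumes each log entry has been rejoined onto a single line.

```python
import re
from collections import Counter

# Header layout assumed from the excerpts in this ticket:
# [<component>:<level>,<timestamp>,<node>:<pid>:<module>:<function>:<line>]<message>
HEADER = re.compile(
    r"^\[(?P<component>\w+):(?P<level>\w+),"
    r"(?P<timestamp>[^,]+),"
    r"(?P<node>[^:]+):(?P<pid><[\d.]+>):"
    r"(?P<module>\w+):(?P<function>\w+):(?P<line>\d+)\]"
)

# Message text of the capi_replication "takes too long" errors quoted above.
SLOW_UPDATE = re.compile(
    r"total time spent: (\d+) secs, default connection time out: (\d+) secs"
)


def count_errors_by_site(lines):
    """Count error-level entries per (module, function) that logged them."""
    counts = Counter()
    for line in lines:
        match = HEADER.match(line)
        if match and match.group("level") == "error":
            counts[(match.group("module"), match.group("function"))] += 1
    return counts


def timeout_overruns(lines):
    """Return, per slow update, how many seconds it exceeded the timeout."""
    overruns = []
    for line in lines:
        match = SLOW_UPDATE.search(line)
        if match:
            spent, limit = int(match.group(1)), int(match.group(2))
            overruns.append(spent - limit)
    return overruns


# The three single-line entries quoted in this ticket, rejoined.
SAMPLE = [
    '[xdcr:error,2012-12-02T8:16:20.578,ns_1@10.2.3.33:<0.15416.0>:xdc_vbucket_rep:terminate:298]Replication `41d1101e89da1d590261faef5067a4e8/bucket-1/bucket-1` (`bucket-1/412` -> `http://Administrator:password@10.2.3.31:8092/bucket-1%2f412%3b2b6f9272ff82c9cc0dfcdd22ce77b9d6`) failed: {timeout,{gen_server,call,[ns_config,get]}}',
    '[xdcr:error,2012-12-02T8:20:01.874,ns_1@10.2.3.33:<0.1527.128>:capi_replication:update_replicated_docs:100][Bucket:"bucket-0", Vb:505]: update 170 docs takes too long to finish!(total time spent: 189 secs, default connection time out: 180 secs)',
    '[xdcr:error,2012-12-02T8:20:11.406,ns_1@10.2.3.33:<0.1531.128>:capi_replication:update_replicated_docs:100][Bucket:"bucket-0", Vb:432]: update 172 docs takes too long to finish!(total time spent: 199 secs, default connection time out: 180 secs)',
]

if __name__ == "__main__":
    for site, count in count_errors_by_site(SAMPLE).most_common():
        print(site, count)
    print(timeout_overruns(SAMPLE))
```

Run over a full diag log, a breakdown like this would show whether the timeouts cluster in one stage (for example `ns_config` reads) or are spread across all of them, which may help decide which team should triage first.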
Comment by Aleksey Kondratenko [ 07/Dec/12 ]
Please, be more specific about what exactly you want me to help with.
Comment by Pavel Paulau [ 07/Dec/12 ]
From Junyi:

"Looks to me ep_engine or ns_server team need to triage the issue."

Both the XDCR and ns_server teams gave the runaround. You are the last candidate.
Comment by Farshid Ghods (Inactive) [ 09/Jan/13 ]
Per bug scrub: please rerun with the latest 2.0.1 build.
Comment by Farshid Ghods (Inactive) [ 22/Jan/13 ]
per bug scrub

Ronnie,

Are there results available from XDCR performance testing?

Comment by Ronnie Sun (Inactive) [ 22/Jan/13 ]
I don't think so. Reassigning to Pavel.

Hi Pavel,

Is there a place where we summarize XDCR results?

Thanks,
Ronnie
Comment by Pavel Paulau [ 24/Jan/13 ]
Not reproduced in 2.0.1 so far.