[MB-7657] [RN 2.0.1] XDCR docs to replicate for bucket queue is draining unevenly. Created: 31/Jan/13  Updated: 19/Jun/13  Resolved: 19/Jun/13

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket, cross-datacenter-replication
Affects Version/s: 2.0.1
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Chisheng Hong (Inactive) Assignee: Ketaki Gangal
Resolution: Cannot Reproduce Votes: 0
Labels: regression
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: couchbase-2.0.1-144-rel on Centos 5.4 on both source and destination

Attachments: PNG File 201_kvonly_drain02.png     PNG File 201_kvonly_drain.png     PNG File Screen Shot 2013-01-31 at 8.33.52 PM.png     PNG File Screen Shot 2013-02-12 at 9.21.19 PM.png     PNG File Screen Shot 2013-02-12 at 9.21.50 PM.png     PNG File Screen Shot 2013-02-13 at 12.00.22 PM.png     PNG File Screen Shot 2013-02-13 at 12.01.13 PM.png     PNG File Screen Shot 2013-02-13 at 12.28.15 PM.png     PNG File Screen Shot 2013-02-13 at 12.29.01 PM.png     PNG File Screen Shot 2013-02-16 at 5.36.01 PM.png     PNG File Screen Shot 2013-02-16 at 6.19.49 PM.png     PNG File Screen Shot 2013-02-18 at 7.13.05 PM.png     PNG File Screen Shot 2013-02-18 at 7.31.00 PM.png     PNG File Screen Shot 2013-02-18 at 7.43.45 PM.png     PNG File Screen Shot 2013-02-18 at 8.05.19 PM.png     PNG File Screen Shot 2013-02-18 at 8.06.04 PM.png    
Flagged:
Release Note

 Description   
linux cluster
22-node-ec2-10GB-Linux

Source cluster (15 nodes)
create default bucket - 3.5 GB per node
sasl bucket - 2.0 GB per node

Destination cluster (7 nodes)
create default bucket - 3.5 GB per node
sasl bucket - 2.0 GB per node

**Loading phase
Source cluster:
Default bucket
define a workload that loads 40M JSON items, key size 128-512 bytes, to the default bucket.
The workload should push the system into a light DGM (active resident ratio at 90 percent).

**Access Phase:
Source cluster:
Default bucket
create:5%, update:10%, get:80%, delete:5% with cache miss:5%, opsPerSec:30000, running for 8 hours

Then start replication:
set up bi-directional XDCR on the default bucket (a setup sketch follows this phase description)
Source cluster:
default bucket
create:5%, update:10%, get:80%, delete:5% with cache miss:5%, opsPerSec:30000, running for 2 hours

Destination cluster:
default bucket
define a workload that loads 30M JSON items (non-conflicting key sets), key size 128-512 bytes, to the default bucket.
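For reference, a minimal sketch of how the bi-directional replication above could be set up through the REST API (Couchbase 2.x). Hostnames and credentials are placeholders, and the endpoint/parameter names should be checked against the documentation for the build under test:

import requests

SRC = "http://source-node:8091"   # placeholder: any node of the source cluster
DST = "http://dest-node:8091"     # placeholder: any node of the destination cluster
AUTH = ("Administrator", "password")

def create_replication(from_base, to_base, cluster_ref_name):
    """Register the remote cluster, then start a continuous replication of 'default' toward it."""
    requests.post(f"{from_base}/pools/default/remoteClusters",
                  auth=AUTH,
                  data={"name": cluster_ref_name,
                        "hostname": to_base.split("//")[1],
                        "username": AUTH[0],
                        "password": AUTH[1]}).raise_for_status()
    requests.post(f"{from_base}/controller/createReplication",
                  auth=AUTH,
                  data={"fromBucket": "default",
                        "toCluster": cluster_ref_name,
                        "toBucket": "default",
                        "replicationType": "continuous"}).raise_for_status()

# Bi-directional: one replication in each direction.
create_replication(SRC, DST, "destination")
create_replication(DST, SRC, "source")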

After all the loads on both clusters stop, wait for "XDCR docs to replicate for bucket" to drop, but it drops really slowly.
Saw the "XDCR docs to replicate for bucket" queue draining unevenly on the default bucket on the source.
The screenshot is from the source cluster.
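A minimal sketch of pulling the per-node "XDCR docs to replicate" backlog from the REST API instead of reading it off the UI graphs; the stat name (replication_changes_left), the per-node stats layout, and the host/credentials are assumptions based on the 2.x stats endpoints:

import requests

BASE = "http://source-node:8091"   # placeholder: any node of the source cluster
AUTH = ("Administrator", "password")
BUCKET = "default"

def node_hostnames():
    pool = requests.get(f"{BASE}/pools/default", auth=AUTH).json()
    return [n["hostname"] for n in pool["nodes"]]

def docs_to_replicate(node):
    url = f"{BASE}/pools/default/buckets/{BUCKET}/nodes/{node}/stats"
    samples = requests.get(url, auth=AUTH).json()["op"]["samples"]
    # UI "XDCR docs to replicate" is assumed to map to replication_changes_left.
    return samples.get("replication_changes_left", [0])[-1]

backlog = {n: docs_to_replicate(n) for n in node_hostnames()}
for node, left in sorted(backlog.items(), key=lambda kv: -kv[1]):
    print(f"{node:45s} docs to replicate: {left}")
print("spread (max - min):", max(backlog.values()) - min(backlog.values()))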

 Comments   
Comment by Chisheng Hong (Inactive) [ 31/Jan/13 ]
Diags links:
https://s3.amazonaws.com/bugdb/jira/MB-6799/ns-diag-20130131231808.txt.zip
Comment by Chisheng Hong (Inactive) [ 31/Jan/13 ]
Forget about the bug number in the diags link path; it doesn't matter. The above link gives you the right diags.
Comment by Chisheng Hong (Inactive) [ 31/Jan/13 ]
The destination cluster diags (the previous link is for the source):
https://s3.amazonaws.com/bugdb/jira/MB-6799/ns-diag-20130201000242.txt.zip
Comment by Frank Weigel [ 04/Feb/13 ]
How commonly does this happen, and how closely is it tied to the level of load on the source cluster (i.e. the 30k ops)?
Comment by Jin Lim [ 04/Feb/13 ]
Bumping this to "critical" for prompt engineering investigation. It could turn out to be a common case (i.e. users of XDCR can easily run into this) and thus negatively impact the overall level of load on the source cluster.
Comment by Jin Lim [ 11/Feb/13 ]
Sync up with tomorrow's system test and please make sure whether it is a regression or an existing issue.
Comment by Jin Lim [ 11/Feb/13 ]
Junyi and Chisheng, please communicate to see if this is a real regression.
Comment by Ketaki Gangal [ 12/Feb/13 ]
Will post results based on System test on build 153.
Comment by Ketaki Gangal [ 12/Feb/13 ]
Adding results from the 2.0.1-153-rel build (screenshot for docs-to-replicate).

Still seeing unevenness in replication from source nodes (11-node -> 5-node cluster, bidirectional replication).

Clusters are accessible here
Source : http://ec2-23-20-140-198.compute-1.amazonaws.com:8091/index.html
Destination : http://ec2-23-20-108-25.compute-1.amazonaws.com:8091/
Comment by Jin Lim [ 12/Feb/13 ]
Thanks Ketaki. If you can, could you also provide some input for Frank's questions above? We really want to understand how likely (easily) it is that users can run into this issue.
Comment by Ketaki Gangal [ 12/Feb/13 ]
Seeing this fairly commonly on large-scale runs for XDCR. With a higher front-end ops/sec (30k ops/sec and above for an 11-node cluster), we see unevenness in the distribution of the docs-to-replicate queue.

The tests reduce the front-end load in the successive phases of the runs, and all the XDCR data is synced as expected.

We typically see this unevenness on initial replication and in between phases.
Comment by Ketaki Gangal [ 12/Feb/13 ]
Logs from some source nodes
https://s3.amazonaws.com/bugdb/MB-7657/bug.tar
Comment by Junyi Xie (Inactive) [ 13/Feb/13 ]
Ketaki,

Thanks a lot for your test and logs.

I looked at the two clusters on EC2 just now (12:29PM EST, Feb 13); your bi-directional XDCR test has been running for a bit short of 20 hours, and in general it looks healthy to me.

From the per-node "docs to replicate" stats on the two clusters, at day scale, I do not see unevenness. Please look at the 1st and 2nd screenshots I uploaded.

I also looked at the log from the source that you uploaded. Around 2013-02-12T23:08 and 2013-02-12T23:21 there are several timeout errors which caused the source replicator to crash; in all these errors the destination is ec2-23-20-217-189. That means that in a short time period, that node was not able to respond in a timely manner to source nodes in XDCR. But the source node is able to recover under this circumstance. Other than that, I do not see any other errors in the log.

It looks like around 2013-02-12T23:08 and 2013-02-12T23:21 is the peak time where you have the highest ops/sec and the most docs to replicate. In particular, at that time you have around 30k ops/sec on the source and >30k ops/sec on the destination. On the source, the update rate is >10k ops/sec (see the 3rd and 4th screenshots I uploaded). Cannot judge any small-scale unevenness from day-scale stats.

In short, from your clusters and logs, there is no sustained unevenness throughout the test.

The timeout errors may explain why you see the unevenness over a short period of time; other factors like compactors and the environment may also contribute to that. But I am not sure how serious it is.


Comment by Ketaki Gangal [ 13/Feb/13 ]
Do we know why we see these timeouts under heavier load btw?

Comment by Junyi Xie (Inactive) [ 13/Feb/13 ]
Yes, basically the revs_diff and bulk_docs requests are unable to finish on the destination cluster within the connection timeout. I remember that back last December we increased the timeout to 180 seconds.
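For anyone tuning this, a small sketch that lists whatever XDCR-related knobs (timeouts, batch sizes, replicator counts) the build exposes under /internalSettings, without assuming specific key names; the host and credentials are placeholders:

import requests

BASE = "http://source-node:8091"   # placeholder
AUTH = ("Administrator", "password")

settings = requests.get(f"{BASE}/internalSettings", auth=AUTH).json()
for key, value in sorted(settings.items()):
    if key.lower().startswith("xdcr"):
        # e.g. timeouts, batch sizes, max concurrent replicators (names vary by build)
        print(f"{key} = {value}")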
Comment by Jin Lim [ 13/Feb/13 ]
Based on the discussions so far, we find that:
* users can run into this issue pretty easily in a large-scale environment
* restarting replicators after the described timeout crash is fairly cheap as far as Erlang process management is concerned
* eventually, as load pressure on the destination cluster improves, the unevenness subsides thanks to faster replication turnaround
* the obvious negative impact is that restarted replicators start from the last checkpoint that previously completed (thus replication falls behind by one checkpoint interval, approx. 5 minutes)

Action plan:
* will move this to 2.0.2
* xdcr team, please continue investigating it. We cannot afford to just blame the environment or other plausible factors unless they are proven with solid evidence.
* qe (Ketaki) is currently running the same tests with 1/4 of the replicators, validating whether this reduces the chance of running into the symptom (a sketch of this setting change follows below)
* if a smaller number of replicators helps, we will update the documentation with a clear description of the symptom and the workaround of reducing the number of replicators.

Thanks,
Jin
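A hedged sketch of the "1/4 of the replicators" experiment from the action plan; xdcrMaxConcurrentReps is the name used in the 2.x internal settings for the per-replication replicator count (assumed default 32), but the key and default should be verified on the build under test before changing anything:

import requests

BASE = "http://source-node:8091"   # placeholder
AUTH = ("Administrator", "password")

current = requests.get(f"{BASE}/internalSettings", auth=AUTH).json()
print("current xdcrMaxConcurrentReps:", current.get("xdcrMaxConcurrentReps"))

# Drop to a quarter of the assumed default of 32, as in the QE experiment.
resp = requests.post(f"{BASE}/internalSettings", auth=AUTH,
                     data={"xdcrMaxConcurrentReps": 8})
resp.raise_for_status()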
Comment by Jin Lim [ 13/Feb/13 ]
Hi Ketaki, please review the above comment and update this bug with your findings after running with a smaller number of replicators. Afterward, please assign it back to Junyi or Jin for the next action.
Comment by Junyi Xie (Inactive) [ 13/Feb/13 ]
Jin,

Can you please explain a bit what you mean by "users can run into this issue pretty easily in a large-scale environment"? From Ketaki's test, it looks to me like that is not the case (please see the screenshots from her clusters; where is the unevenness?).

Also, if it is due to the timeout (very few timeouts in Ketaki's test), then it has nothing to do with the size of the cluster.
Comment by Jin Lim [ 14/Feb/13 ]
Hi Junyi, the summary was based on previous conversations among us. Do you suggest that there is no unevenness issue, based on the screenshots? I am not sure that unevenness (a spike or drop) can be easily identified from normalized day-scale graphs. Playing with the number of replicators is a workaround to see if that reduces the number of timeouts and thus alleviates the unevenness.
Comment by Junyi Xie (Inactive) [ 14/Feb/13 ]
Hi Jin,

Probably there is some miscommunication. The spike/drop in Ketaki's test is NOT unevenness. Here the unevenness means that different nodes drain the queue at very different speeds, as demonstrated by Chisheng's original test (please look at Chisheng's 1st screenshot: some nodes have 0 mutations to replicate, some have millions). The spike/drop in Ketaki's test comes from the significant change in front-end workload, which is expected. Also, the spike/drop in Ketaki's test is in "mutations to replicate", NOT the replication speed, so it does not mean the bucket is draining unevenly.

From the screenshots I uploaded, overall each node was working at the same pace and I do not see significant unevenness. Therefore I am not sure whether we are able to easily reproduce what Chisheng observed.

That is why I am not sure about your note that "users can run into this issue pretty easily in a large-scale environment".


Thanks.


 
Comment by Jin Lim [ 14/Feb/13 ]
Hi Junyi, thanks for your detailed explanation. I will sync up with QE and take their latest update and input on your comment. We should then be able to move this to 2.0.2. Will discuss more during the xdcr-kv test meeting this afternoon. The action items below are still valid though:

Action plan:
* will move this to 2.0.2
* xdcr team, please continue investigating it (unless there are 2.0.1 hot issues)
* qe (Ketaki) is currently running the same tests with 1/4 of the replicators
* if a smaller number of replicators helps, we will update the documentation with a clear description of the symptom and the workaround of reducing the number of replicators.

 
Comment by Junyi Xie (Inactive) [ 14/Feb/13 ]
As discussed in the meeting, if a customer hits this issue, then other than the regular diag logs we need them to provide:

1. stat screenshots, at least at minute scale, for all nodes, from which we can tell where the unevenness happens (a collection sketch follows below)
2. a description of the workload: 1) how long the test has been running, 2) the front-end ops/sec on both source and destination clusters, how many of them are updates of existing items, how many are new items (inserts), etc.

Reducing the number of replicators and increasing the timeout may help.
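A rough sketch of collecting item 1 above programmatically: minute-scale stats for every node, written to JSON files that can be attached to the bug. The zoom=minute query parameter, the response layout, and the host/credentials are assumptions based on the 2.x stats API:

import json
import requests

BASE = "http://source-node:8091"   # placeholder: any node of the affected cluster
AUTH = ("Administrator", "password")
BUCKET = "default"

pool = requests.get(f"{BASE}/pools/default", auth=AUTH).json()
for node in (n["hostname"] for n in pool["nodes"]):
    url = f"{BASE}/pools/default/buckets/{BUCKET}/nodes/{node}/stats"
    stats = requests.get(url, auth=AUTH, params={"zoom": "minute"}).json()
    fname = f"stats_{node.replace(':', '_')}.json"
    with open(fname, "w") as f:
        json.dump(stats, f, indent=2)
    print("wrote", fname)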

Comment by Jin Lim [ 15/Feb/13 ]
Discussed with QE, and it appears that users can still easily run into the symptom when running in a large-scale environment. Junyi, do you concur with this assessment?
Comment by Jin Lim [ 15/Feb/13 ]
Documentation team, please review the comments starting from Jin's 2/13 comment for information regarding this bug. Thanks.
Comment by Junyi Xie (Inactive) [ 15/Feb/13 ]
I can tolerate some unevenness due to the environment and a skewed workload, but if that unevenness is due to a destination node crashing or going down (as we have seen in another bug, where 1-2 of 4 nodes at the destination were not reachable, causing some source nodes to get stuck and unable to replicate), that is a different story.

 
Comment by Junyi Xie (Inactive) [ 15/Feb/13 ]
I looked at Ketaki's 10:5 bi-directional xdcr test at

cluster 1
http://ec2-23-20-140-198.compute-1.amazonaws.com:8091/index.html#sec=buckets
cluster 2
http://ec2-23-20-108-25.compute-1.amazonaws.com:8091/index.html#sec=buckets

so far they look healthy and no unevenness is found.


From a bug-fix perspective, I cannot exclude the possibility of unevenness at this time, therefore I will add some defensive code which I hope can prevent unevenness across nodes from happening, or recover from it in case it happens.

Jin,

You can determine whether you want to merge it into 2.0.1 or 2.0.2. Thanks.




Comment by Junyi Xie (Inactive) [ 15/Feb/13 ]
fix on gerrit

http://review.couchbase.org/#/c/24645/
Comment by Junyi Xie (Inactive) [ 16/Feb/13 ]
I managed to reproduce the uneven "mutations to replicate" stats across nodes.

It looks to me like the root cause (at least in my test case) is that the disk queues at different nodes drain at VERY different rates. My test case is a simple 2->2 uni-directional setup with a tiny front-end insert workload (on average 250 inserts/sec/node, no get ops at all). Over time, the disk write queue sizes at the two source nodes are quite skewed: one is almost empty, while the other holds 150k-200k items.

Consequently, the mutations to replicate at these two nodes are also quite different: the node with the empty write queue has 150K more items to replicate than the other, because the ep_engine on that node has dumped all its items to disk.

The other node, with a 150K-item backlog in its disk write queue, has almost 0 "mutations to replicate" because on that node XDCR is faster than the write queue drain rate.

This is consistent with Chisheng's original test. Please see the screenshots of the disk write queue that I grabbed at different times during the test.



ep_engine team, please advise. Thanks.
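A small sketch of checking this correlation from the stats API: for every source node, print the latest disk write queue sample next to the XDCR backlog, so a node whose write queue is backed up while its "mutations to replicate" sits near zero (or the reverse) stands out. The stat names (disk_write_queue / ep_queue_size, replication_changes_left) and the host/credentials are assumptions from the 2.x UI mapping:

import requests

BASE = "http://source-node:8091"   # placeholder: any node of the source cluster
AUTH = ("Administrator", "password")
BUCKET = "default"

pool = requests.get(f"{BASE}/pools/default", auth=AUTH).json()
for node in (n["hostname"] for n in pool["nodes"]):
    url = f"{BASE}/pools/default/buckets/{BUCKET}/nodes/{node}/stats"
    samples = requests.get(url, auth=AUTH).json()["op"]["samples"]
    dwq = samples.get("disk_write_queue", samples.get("ep_queue_size", [0]))[-1]
    left = samples.get("replication_changes_left", [0])[-1]
    print(f"{node:45s} disk write queue: {dwq:>9}   docs to replicate: {left:>9}")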

Comment by Junyi Xie (Inactive) [ 16/Feb/13 ]
assign to ep_engine team for comment
Comment by Mike Wiederhold [ 16/Feb/13 ]
If the root cause is what Junyi mentioned above, then this seems like expected behavior: if one node in the cluster is slow to drain items to disk, we will see this unevenness. What likely caused the disk write queue to be high on one node is compaction. In our team's testing we have found that disk persistence gets very slow while compaction is happening, and if compaction only happens on one of the nodes I would expect this situation. I can take a look at the logs later if anyone wants me to look into this more deeply. Just let me know.

Also note that Aaron is doing some work on the compactor logic in couchstore to try to speed up persistence when compaction is taking place.
Comment by Jin Lim [ 18/Feb/13 ]
Thanks for the updates. As we previously discussed, we will be moving this to 2.0.2 for further investigation and retesting with the mentioned optimizations. There is a toy build currently waiting for test that should help alleviate the unevenness from the XDCR side, and QE is also planning to test with a workaround (a smaller number of replicators).

Ketaki, please update with the latest test results from either the workaround or the toy build (http://review.couchbase.org/#/c/24645/). Thanks much!
Comment by Ketaki Gangal [ 18/Feb/13 ]
Tested this on the 2.0.1-160-rel build and discussed it with Junyi.

- Seeing some nodes (1 node in particular) lagging behind at the start of xdcr-replication.
The drain rate and disk write queue on this node are very uneven; screenshot attached.

- Uneven CPU across the cluster: some nodes show avg. 80 percent, some show avg. 50 percent, and some show less than 20 percent avg. over a period of time. Screenshot attached.

We've compared the compaction timing with the drain/disk timings and they are not in exact sync.

There is some resource contention (I/O or otherwise) on this node, causing it to vary in disk draining and show this intermittent behaviour.

- On another note: bidirectional replication on this cluster causes the clusters to become unresponsive, with several timeout errors on the stats display.

Adding logs from nodes on the source (11-node cluster).
Comment by Ketaki Gangal [ 18/Feb/13 ]
Source : http://ec2-107-22-40-124.compute-1.amazonaws.com:8091/
Destination :http://ec2-54-242-239-237.compute-1.amazonaws.com:8091/


Comment by Junyi Xie (Inactive) [ 18/Feb/13 ]
Thanks Ketaki for the notes.

One more thing: the node with 20% CPU usage is exactly the one lagging behind in XDCR; it is also the one which demonstrated quite different flusher behavior (disk write queue, drain rate, etc.) from the other nodes.

I am not claiming the flusher is the culprit; on the contrary, I think both the flusher and XDCR are victims of some I/O-heavy job (possibly compaction, needs verification): the flusher was unable to flush quickly and XDCR was unable to read and replicate docs at a rate similar to the other nodes.

Given the low CPU usage on that node, it looks to me like that node is pretty I/O-bound.
Comment by Ketaki Gangal [ 18/Feb/13 ]
https://s3.amazonaws.com/bugdb/bug0/bug.tar
Comment by Jin Lim [ 18/Feb/13 ]
Thanks for the many updates. Per the bug scrub this morning, we want to try the same test with compaction disabled and see if it alleviates the issue. Also, it may be worth trying to simply increase the compaction threshold from the default 30% to something higher like 75-80% (see the sketch below). I will leave it up to QE to decide how to test it with respect to compaction.
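A hedged sketch of that threshold experiment, using the 2.x auto-compaction REST endpoints; the parameter names (and whether parallelDBAndViewCompaction is mandatory), along with the host and credentials, are assumptions to confirm on the build before running this against a test cluster:

import requests

BASE = "http://source-node:8091"   # placeholder
AUTH = ("Administrator", "password")

resp = requests.post(f"{BASE}/controller/setAutoCompaction", auth=AUTH,
                     data={"databaseFragmentationThreshold[percentage]": 75,
                           "viewFragmentationThreshold[percentage]": 75,
                           "parallelDBAndViewCompaction": "false"})
resp.raise_for_status()
# Read back the cluster-wide auto-compaction settings to confirm the change.
print(requests.get(f"{BASE}/settings/autoCompaction", auth=AUTH).json())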
Comment by Junyi Xie (Inactive) [ 19/Feb/13 ]
Thanks Jin.

We can repeat the test without compaction, but the question yet to be answered is why only 1 node demonstrated such different behavior from the other nodes
(2 other nodes also behaved a bit differently, but not that badly). IMHO compaction should behave more or less similarly on all nodes in the cluster, since we have a randomly distributed workload.

I am not sure whether it is a bug or not, but since it is fairly easy to reproduce, we should keep an eye on it.

Other than the test with compaction turned on/off, I am also interested in a simple KV test: turn off all XDCR/view/query/rebalance activity completely, just put some pressure on one 10-node cluster with a non-trivial front-end workload for several hours, and observe how the flusher behaves across all nodes. If QE has done such a run before, please share.

Comment by Farshid Ghods (Inactive) [ 19/Feb/13 ]
Junyi,

Thanks for the further investigation. QE should be able to extract such data from the k/v test case and provide you access to such a cluster.

Ketaki,

Can we extract the information that Junyi has asked for from the k/v cluster?
Comment by Junyi Xie (Inactive) [ 19/Feb/13 ]
Farshid,

Thanks. Please share the data or point me to the live cluster (if possible) in the KV system test.

It will be interesting to see how our CB server behaves and adapts when we put different pressure on the source at different times (for example, we can apply small, moderate, heavy, spike, and burst workloads at different stages of the KV test).
Comment by Ketaki Gangal [ 19/Feb/13 ]
Sure, adding Tommie to this discussion.

@Tommie - Do we have information on overall cluster vs. individual node behaviour for the KV use case currently?
Comment by Ketaki Gangal [ 19/Feb/13 ]
Hi Junyi,

For this setup, would the drain rate on the source node affect the xdcr-replication?

- All the items are persisted to disk in the load phase.
- In the access phase, 5 percent of the items are sets; the rest (90+ percent) are reads.

So even if this particular node is not draining evenly... how would that impact XDCR's ability to read from disk and replicate forward?

-Ketaki
Comment by Junyi Xie (Inactive) [ 19/Feb/13 ]
As I mentioned, the uneven drain rate may not be what makes it harder for XDCR to read from disk; instead, both the flusher and XDCR can be victims of other activity (probably the compactor, but I am not 100% sure) that was competing for I/O with them, making it harder for the flusher to flush and for XDCR to read docs.

That is why we saw correlated abnormal activity of the flusher and XDCR.
Comment by Jin Lim [ 20/Feb/13 ]
* Still waiting for the test results from Tommie for the KV-only system.
* Disabling compaction didn't help much on Ketaki's rerun.
Comment by Tommie McAfee [ 20/Feb/13 ]
As far as draining goes in the kv-only use case, it looks like all nodes are draining at the same rate... (screen 201_kvonly_drain)
Comment by Jin Lim [ 21/Feb/13 ]
Tommie's result is in: unevenness is NOT seen in his kv-only test. As discussed in the bug scrub, the engineering team will continue to investigate this in 2.0.2, as it appears to be intermittent behavior without major timeouts or performance degradation.

This bug has taken way too much QE time and too many cycles to prove 1) that the symptom is reproducible, 2) that none of the suggested workarounds (compactor, # of replicators, etc.) worked, and 3) every single data point for engineering debugging. Many thanks for the hard work on this, QE team!
Comment by Junyi Xie (Inactive) [ 21/Feb/13 ]
Please look at the screenshot Tommie posted more carefully. The scale is 2-3M, not 10k, so even a very small bump will cause a 50-100K difference in the disk queue. Obviously node 10.3.3.215 shows quite BIG unevenness during the first 60-70% of the test, before increasing. If you convert the figure to a logarithmic scale, it would be much easier to see. The disk write queue unevenness in our XDCR test is only 10K, but over time it can accumulate quite a lot of backlog. Also, please note that Tommie's test is without XDCR, so there is less I/O competition for the compactor and ep_engine, and we still see a several-hundred-K difference in the disk write queue.


Comment by kzeller [ 21/Feb/13 ]
Added to 2.0.1 RN:

You may experience an even rate of replication for nodes in a cluster. This is
intermittent behavior that is due to different disk drain rates at different
nodes and will subside by itself over time. There is no need to change any
settings on your cluster to try to resolve this behavior.
Comment by Junyi Xie (Inactive) [ 21/Feb/13 ]
Hi Karen,

Is it "even rate" or "uneven rate"? I think users may experience uneven replication rate.
Comment by kzeller [ 21/Feb/13 ]
You're right "uneven"
Comment by Maria McDuff (Inactive) [ 21/Mar/13 ]
Will re-visit once multiple reader/writer support is in. This is a 2.0.2 must-fix.
Comment by Maria McDuff (Inactive) [ 25/Mar/13 ]
Note to QE (Maria): once multiple reader/writer support is in, QE to run the test.
Comment by Mike Wiederhold [ 27/Mar/13 ]
Moving to 2.1 since we need to retest with the multi-reader/writer to see if that solves the problem. We can move this back to 2.0.2 if the ep-engine team has time to look at it.
Comment by Jin Lim [ 11/Apr/13 ]
QE will retest this once the large-scale test has started for 2.0.2, and we will determine whether the improvements made in 2.0.2 (mrw, io batching, etc.) addressed the issue or not. Thanks.