[MB-6939] XDC queue grows and checkpoint commit failures in bi-directional XDCR with front-end workload Created: 16/Oct/12  Updated: 12/Nov/12  Resolved: 22/Oct/12

Status: Closed
Project: Couchbase Server
Component/s: cross-datacenter-replication
Affects Version/s: 2.0
Fix Version/s: 2.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Ketaki Gangal Assignee: Junyi Xie (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 2.0-1856
Bidirectional replication
1024 vbuckets
EC2 centos

Attachments: PNG File Screen Shot 2012-10-16 at 6.37.34 PM.png     PNG File Screen Shot 2012-10-16 at 6.38.08 PM.png    

 Description   
- Set up bidirectional replication between two 8-node clusters (8:8) on bucket b1.
- Set up a small front-end load on cluster1 and cluster2: 4K ops/sec and 6K ops/sec respectively.
[Load contains creates, updates, and deletes.]

- For the first 40M items, replication works as expected and the replication lag is small.

- Delete the replication from cluster2 to cluster1, then recreate it.
[Expected behaviour: stop/start replication.]

We expect XDCR to stop and restart replication with the above step: the last committed checkpoint should be looked up and replication should continue from that checkpoint.

Noticing a huge number of gets (~30K ops/sec) and far fewer sets (2-3K ops/sec) on the other cluster.

- The XDC queue is continuously growing, from < 500K to nearly 7M over a period of 2-3 hours.
- Seeing continuous checkpoint_failures on both XDC queues.

The disk write queue on cluster1 is high (~2-3M items), while the drain rate is fairly small (~30K items/sec).

Items are not drained fast enough, so the disk write queue keeps filling faster than it empties.

Adding screenshots from both clusters.

The default values currently are:
XDCR_CHECKPOINT_INTERVAL: 300 (seconds)
XDCR_CAPI_CHECKPOINT_TIMEOUT: 10 (seconds)
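
(For a rough sanity check, a back-of-the-envelope sketch in Python combining these defaults with the queue and drain figures reported above; the numbers come from this ticket, not new measurements.)

    # Back-of-the-envelope check using the figures reported above.
    disk_write_queue = 2_500_000   # items, reported range ~2-3M
    drain_rate = 30_000            # items/sec, reported drain rate
    checkpoint_timeout = 10        # seconds, XDCR_CAPI_CHECKPOINT_TIMEOUT default

    time_to_drain = disk_write_queue / drain_rate
    print(f"~{time_to_drain:.0f}s to drain vs a {checkpoint_timeout}s checkpoint timeout")
    # prints: ~83s to drain vs a 10s checkpoint timeout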


@Junyi: I've stopped the front-end load on both clusters now and have passed on the cluster access.
Let me know if you need additional information.


 Comments   
Comment by Ketaki Gangal [ 16/Oct/12 ]
Cluster2: http://ec2-54-245-1-10.us-west-2.compute.amazonaws.com:8091/
Cluster1: http://ec2-50-18-16-89.us-west-1.compute.amazonaws.com/
Comment by Junyi Xie (Inactive) [ 17/Oct/12 ]
The title is misleading. It is NOT deleting/restarting XDCR that caused the checkpoint commit failure (see my explanation below). Please modify the title.
Comment by Junyi Xie (Inactive) [ 17/Oct/12 ]
Look at your clusters. In your testcase, both clusters have some workload on top of ongoing bidirectional XDCR on the same bucket. The XDCR workload is much heavier than your front-end workload. Also, there is little chance for de-duplication in ep_engine since the front-end workloads have different key sets. The write queue on both clusters is constantly around 100K-500K items per node, while the disk drain rate is only 2-4K items/sec per node. Today the XDCR checkpointing timeout is 10 seconds; that is, if ep_engine is unable to persist the open checkpoint issued by XDCR within 10 seconds, we give up, skip this checkpoint, raise a "target commit failure" error, and move on without a checkpoint. In your testcase we are apparently unable to persist the XDCR checkpoint within 10 seconds in most cases, which is why you see checkpoint failures in the UI on both sides. The root cause is that the drain rate is unable to keep up with your workload (both XDCR and front-end).
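
(Illustrative only: a minimal Python sketch of the checkpoint decision described above, using the per-node figures from this comment. The names and structure are hypothetical, not the actual ns_server/ep_engine code.)

    CHECKPOINT_TIMEOUT = 10  # seconds ep_engine gets to persist the open checkpoint

    def attempt_checkpoint(items_ahead: int, drain_rate: float) -> bool:
        """Hypothetical model: the checkpoint commits only if everything ahead
        of it in the disk write queue can be persisted within the timeout."""
        seconds_needed = items_ahead / drain_rate
        if seconds_needed <= CHECKPOINT_TIMEOUT:
            return True   # checkpoint persisted in time
        return False      # give up, skip checkpoint, raise "target commit failure"

    # A per-node write queue of 100K-500K items with a 2-4K items/sec drain rate
    # needs roughly 25-250 seconds, so nearly every 10-second attempt fails.
    print(attempt_checkpoint(items_ahead=200_000, drain_rate=3_000))  # False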

Per Chiyoung, ep_engine now has a priority checkpoint command, which is used in rebalance. However, the XDCR use case is a bit different since we need to issue 32 concurrent checkpoints at the same time per node to ep_engine. If XDCR issues priority checkpoints directly, the performance impact on ep_engine as well as on rebalance is unknown to both Chiyoung and me. Given the risk, it seems better to us to postpone this issue to post-2.0. Again, the root cause is that the disk drain rate is not fast enough compared with the workload in the testcase.

At this time, I think what you can do is to increase the checkpoint interval and timeout. Say,

XDCR_CHECKPOINT_INTERVAL = 1800
XDCR_CAPI_CHECKPOINT_TIMEOUT = 60

That means we wait up to 60 seconds for a checkpoint every 30 minutes (1800 seconds). The benefits are:

1) We increase the chance of getting a successful checkpoint. It does not make sense to keep issuing checkpoints that always fail.

2) The aggregated waiting time is still the same as before, 60 seconds per 30 minutes, so there is no increase in the time overhead of XDCR (see the quick arithmetic below).
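
(A quick check of that arithmetic, using only the numbers above:)

    # Aggregate checkpoint wait per 30 minutes under both settings.
    old_total_wait = (1800 / 300) * 10    # six 10-second waits per 30 min = 60s
    new_total_wait = (1800 / 1800) * 60   # one 60-second wait per 30 min  = 60s
    assert old_total_wait == new_total_wait == 60
    # Same total overhead, but each individual attempt now has 60s rather
    # than 10s to persist, so a successful checkpoint is far more likely.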


BTW, this should not be a blocker. The system is just doing what it is supposed to do; we have simply reached the limits of the current design in your testcase.


Comment by Farshid Ghods (Inactive) [ 18/Oct/12 ]
Junyi, Chiyoung, Damien, and Ketaki had a discussion about this earlier.

More from Chiyoung:


Pavel, Ketaki,

The toy build that uses the vbucket flushing prioritization for XDCR checkpointing is now available:

http://builds.hq.northscale.net/latestbuilds/couchbase-server-community_toy-chiyoung-x86_64_2.0.0-1501-toy.rpm

As we discussed, please vary the number of concurrent vbucket checkpointing processes across 8, 16, and 32. In addition, we may need to vary the checkpointing interval, because its default of 5 minutes might be too frequent and too expensive.

Thanks,
Chiyoung
Comment by Steve Yen [ 18/Oct/12 ]
Hi Pavel, so you have awareness of this.

Please also assign to Ketaki / work together on this.

Thanks,
Steve
Comment by Steve Yen [ 18/Oct/12 ]
latest news/update...

@pavelpaulau

Need a bi-directional performance test, 4-node to 4-node, of the above toy build with its default setting of 32 concurrent vbucket checkpoint processes (MAX_CONCURRENT_REPS_PER_DOC), with a front-end workload of 16K ops/sec per cluster, 50% mutations, and no views.
Comment by Steve Yen [ 19/Oct/12 ]
See the Yammer conversation and graphs here on the results of Pavel's experiment:

  https://www.yammer.com/couchbase.com/#/Threads/show?threadId=224449346
Comment by Steve Yen [ 19/Oct/12 ]
More explanation on the toy-build change from Junyi...

The only difference between the toy build and build 1858 is that the toy build uses a priority checkpoint to persist the XDCR checkpoint, while 1858 uses a normal checkpoint.

The issue with the normal checkpoint is that, given the current drain rate, it is almost impossible to persist a checkpoint within 10 seconds in a large-scale test. The negative results are: 1) it caused a bunch of "target commit error" failures at the source; 2) it made XDCR lose most checkpoints, paying 10 seconds each for nothing; 3) even worse, it delayed replication by at least 30 seconds per vb replicator, since the source side needs to restart the vb replicator, which in consequence may increase the replication backlog at the source.

From the logs on Pavel's clusters, with the new priority checkpoint (which is also used to improve rebalance with consistent views), I see XDCR is able to persist around 82% of all checkpoints issued, removing some of the delay and errors seen in the normal build (where most checkpoints failed). Apparently XDCR itself can benefit from the priority checkpoint without any concern. That is the reason we see a smaller XDCR backlog.
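
(For illustration, a simplified Python sketch of why the success rate matters, using the figures quoted in this thread; the 30-second restart penalty comes from the earlier comment, and the ~5% success rate for the normal build is an assumption standing in for "most checkpoints failed".)

    def expected_penalty(success_rate: float,
                         timeout: float = 10.0,         # wasted wait on a failed attempt
                         restart_penalty: float = 30.0  # vb replicator restart after failure
                         ) -> float:
        """Expected extra delay per checkpoint attempt in this simplified model."""
        return (1.0 - success_rate) * (timeout + restart_penalty)

    print(expected_penalty(success_rate=0.05))  # normal checkpoint, most failing -> 38.0s
    print(expected_penalty(success_rate=0.82))  # priority checkpoint (Pavel's run) -> 7.2s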

The question to be answered now is how this may impact other components, like rebalance and normal vb checkpoints, if XDCR has to issue 32 priority checkpoints every 5 minutes. It is not good to benefit XDCR at the cost of others. That is the reason Chiyoung suggested using a longer checkpoint interval and smaller concurrency. Personally I am OK with the former, but I would prefer not to reduce the parallelism because that may impact XDCR performance significantly. However, the fundamental solution is to improve the drain rate. That won't be easy, and the storage team will probably work on it post-2.0.
Comment by Farshid Ghods (Inactive) [ 19/Oct/12 ]
The results from the performance tests are very promising. Before we run system tests, we need to find the best value for how often we want to persist the checkpoint.
Ketaki,

What is your take on increasing the interval at which we persist checkpoints from 5 minutes to 30 or 60 minutes? The idea behind persisting checkpoints is that if replication is stopped or deleted by the user and then restarted, we don't restart everything from scratch.
Comment by Ketaki Gangal [ 19/Oct/12 ]
If bi-directional XDCR, minus the stop/start of replication, does not show the huge growing queue, then yes, it would be preferable to adjust the XDCR-specific checkpoint intervals rather than affect other components.

Can we have perf results with the changed parameters to see if this solves the issue as well:
XDCR_CHECKPOINT_INTERVAL: 300 * 6 (i.e. 1800)
XDCR_CAPI_CHECKPOINT_TIMEOUT: 10 * 6 (i.e. 60)

Based on the previous discussions of these values, it is possible that this may or may not resolve the issue here.
Comment by Steve Yen [ 19/Oct/12 ]
removing "observe" from summary so i don't confuse this with "observe"
Comment by Junyi Xie (Inactive) [ 19/Oct/12 ]
Ketaki,

Yes, we can test the parameters as you suggested to see how they work. Please note to use the normal build instead of the toy build.

On the toy build, XDCR_CAPI_CHECKPOINT_TIMEOUT is no longer valid and has no impact, since we switched to the priority checkpoint.
Comment by Junyi Xie (Inactive) [ 19/Oct/12 ]
I still see a bunch of checkpoint_commit_failure errors even after we increase XDCR_CAPI_CHECKPOINT_TIMEOUT to 60 seconds. This is because, as long as the drain rate is unable to catch up with the workload, we will eventually build a big disk write queue, and XDCR will then fail to persist any checkpoint even within 60 seconds (see the sketch below).

I think we need to 1) merge the priority checkpoint commit, and 2) increase the checkpoint interval if it is a concern for ep_engine.
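
(A sketch of why a longer timeout alone does not help when the drain rate lags the workload; the rates below are illustrative, chosen only to roughly match the queue growth reported in this ticket.)

    # If items enter the disk write queue faster than they drain, the backlog
    # grows without bound, and so does the time needed to persist a checkpoint.
    ingest_rate = 31_000   # items/sec entering the queue (illustrative)
    drain_rate = 30_000    # items/sec persisted (illustrative)
    backlog = 500_000      # starting queue depth, per the ticket description

    for hour in range(1, 4):
        backlog += (ingest_rate - drain_rate) * 3600
        print(f"hour {hour}: backlog ~{backlog / 1e6:.1f}M, "
              f"~{backlog / drain_rate:.0f}s to persist a checkpoint behind it")
    # Even a 60-second XDCR_CAPI_CHECKPOINT_TIMEOUT is exceeded within the first hour.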
Comment by Junyi Xie (Inactive) [ 22/Oct/12 ]
The fix is to use a priority checkpoint instead of a regular checkpoint.

Chiyoung and I created 3 commits to address the issue.

On the ns_server side:

http://review.couchbase.org/#/c/21730/
http://review.couchbase.org/#/c/21799/


On the ep_engine side, the change is tracked under a different bug:

http://review.couchbase.org/#/c/21857/

Comment by Ketaki Gangal [ 25/Oct/12 ]
Works as expected on build 1893.
Comment by kzeller [ 12/Nov/12 ]
Added to RN: The XDCR checkpoint interval has been increased from 5 minutes to 30 minutes. This increases the chance that a checkpoint will complete successfully rather than fail, and it reduces the overhead of frequently checking whether a checkpoint completed.