Details
-
Type:
Bug
-
Status:
Closed
-
Priority:
Critical
-
Resolution: Fixed
-
Affects Version/s: 2.0
-
Fix Version/s: 2.0
-
Component/s: cross-datacenter-replication
-
Security Level: Public
-
Labels: None
-
Environment: 2.0-1856
Bidirectional replication
1024 vbuckets
EC2 centos
Description
- Set up a bidirectional replication between two 8:8 clusters on bucket b1.
- Set up a small front end load on cluster1 and cluster2, 4K ops/sec and 6K ops/sec.
[Load contains creates, updates, deletes]
- For the first 40M items, the replication works as expected and the replication lag is small.
- Delete the replication from cluster2 to cluster1, then recreate the replication.
[ Expected behaviour - Stop/Start replication.]
We expect that XDC will stop/start replication with the above step.
The last committed checkpoint will be checked and replication will continue from the last committed checkpoint.
Observed: a huge number of gets (~30K ops/sec) and fewer sets (2-3K ops/sec) on the other cluster.
- The XDC queue is continuously growing, from < 500K to nearly 7M over a period of 2-3 hours.
- Seeing continuous checkpoint_failures on both the XDC queues.
The disk write queue on cluster1 is high (~2-3M). The drain rate, however, is fairly small (~30K).
Items are not drained fast enough, so the disk write queue fills up faster than it drains.
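For a rough sense of scale, the XDC queue figures above can be turned into an average net growth rate. This is only a back-of-the-envelope sketch: it assumes the queue started at exactly 500K and ended at exactly 7M (the report says "< 500k" and "nearly 7M"), and the helper name `net_growth_rate` is made up for illustration.

```python
def net_growth_rate(start_items, end_items, hours):
    """Average net queue growth in items per second over the given window."""
    return (end_items - start_items) / (hours * 3600)

# Assumed endpoints taken from the report: ~500K -> ~7M over 2-3 hours.
fast = net_growth_rate(500_000, 7_000_000, 2)  # growth if it took 2 hours
slow = net_growth_rate(500_000, 7_000_000, 3)  # growth if it took 3 hours

print(f"net XDC queue growth: {slow:.0f} to {fast:.0f} items/sec")
```

Under these assumptions the queue gains roughly 600 to 900 items per second more than replication removes.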
Adding screenshots from both the clusters.
The default values currently are -
XDCR_CHECKPOINT_INTERVAL:300
XDCR_CAPI_CHECKPOINT_TIMEOUT:10
@Junyi: I've stopped the front end load on both the clusters now and I have passed on the cluster access.
Let me know if you need additional information.
Cluster1: http://ec2-50-18-16-89.us-west-1.compute.amazonaws.com/