Details
-
Type: Bug
-
Status: Closed
-
Priority: Critical
-
Resolution: Fixed
-
Affects Version/s: 2.0
-
Fix Version/s: 2.0
-
Component/s: cross-datacenter-replication
-
Security Level: Public
-
Labels: None
-
Environment: 2.0-1856
Bidirectional replication
1024 vbuckets
EC2 centos
Description
- Set up bidirectional replication between two 8:8 clusters on bucket b1.
- Set up a small front-end load on cluster1 and cluster2: 4K ops/sec and 6K ops/sec.
[Load contains creates, updates, deletes]
- For the first 40M items, replication works as expected and the replication lag is small.
- Delete the replication from cluster2 to cluster1, then recreate it.
[Expected behaviour: stop/start replication.]
We expect XDC to stop and restart replication with the above step.
The last committed checkpoint will be checked and replication will continue from it.
Noticing a huge number of gets (~30K ops/sec) and far fewer sets (2-3K ops/sec) on the other cluster.
- The XDC queue is continuously growing, from < 500k to nearly 7M over a period of 2-3 hours.
- Seeing continuous checkpoint_failures on both the XDC queues.
The disk write queue on cluster1 is high (~2-3M). The drain rate, however, is fairly small (~30K).
The items are not drained fast enough and the disk-write-queue is getting filled up faster.
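To put the queue growth above in perspective, here is a rough back-of-the-envelope sketch of the average net growth rate implied by the reported figures (under 500k to nearly 7M over 2-3 hours). The function name and the rounded inputs are illustrative assumptions, not measured values from the clusters.

```python
def net_growth_per_sec(start_items, end_items, hours):
    """Average net queue growth in items per second over the window."""
    return (end_items - start_items) / (hours * 3600.0)

# Rounded figures taken from the report; both are approximations.
START_ITEMS = 500_000    # queue was "< 500k" at the start
END_ITEMS = 7_000_000    # queue reached "nearly 7M"

slow = net_growth_per_sec(START_ITEMS, END_ITEMS, 3)  # 3-hour window
fast = net_growth_per_sec(START_ITEMS, END_ITEMS, 2)  # 2-hour window
print(f"net growth: {slow:.0f} to {fast:.0f} items/sec")
```

With these inputs the XDC queue gains roughly 600 to 900 items per second more than it replicates, which is a sizeable share of the combined 10K ops/sec front-end load.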
Adding screenshots from both the clusters.
The default values currently are -
XDCR_CHECKPOINT_INTERVAL:300
XDCR_CAPI_CHECKPOINT_TIMEOUT:10
@Junyi: I've stopped the front-end load on both clusters now and have passed on the cluster access.
Let me know if you need additional information.
Activity
Junyi Xie
made changes -
| Field | Original Value | New Value |
|---|---|---|
| Summary | Delete/Recreate replication on Bidirectional setup, causes continously growing XDC queue and checkpoint commit failures. | observe growing XDC queue and checkpoint commit failures in bi-directional XDCR with front-end workload |
Junyi Xie
made changes -
| Priority | Blocker [ 1 ] | Critical [ 2 ] |
Steve Yen
made changes -
| Assignee | Junyi Xie [ junyi ] | Pavel Paulau [ pavelpaulau ] |
Steve Yen
made changes -
| Sprint Priority | 2.5 |
Pavel Paulau
made changes -
| Assignee | Pavel Paulau [ pavelpaulau ] | Ketaki Gangal [ ketaki ] |
Junyi Xie
made changes -
| Assignee | Ketaki Gangal [ ketaki ] | Junyi Xie [ junyi ] |
Steve Yen
made changes -
| Summary | observe growing XDC queue and checkpoint commit failures in bi-directional XDCR with front-end workload | XDC queue grows and checkpoint commit failures in bi-directional XDCR with front-end workload |
Junyi Xie
made changes -
| Status | Open [ 1 ] | Resolved [ 5 ] |
| Resolution | Fixed [ 1 ] |
Ketaki Gangal
made changes -
| Status | Resolved [ 5 ] | Closed [ 6 ] |
Cluster1: http://ec2-50-18-16-89.us-west-1.compute.amazonaws.com/