Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Cannot Reproduce
Affects Version/s: 2.0.1
Fix Version/s: 2.1.0
Component/s: cross-datacenter-replication
Security Level: Public
Labels: None
Environment: 2.0.1-160 Linux
Description
Hi Junyi,
When a node on the destination cluster is rebooted, do we expect replication to that node to restart?
XDCR re-replication will depend on whether there are any more open checkpoints to be replicated. If there are no incoming mutations and no incoming or outgoing XDCR traffic, we infer that the cluster is in steady state.
We are seeing some strange behavior:
Source: sending out some data in bursts
Destination: doing mainly gets
This is a unidirectional replication from Source->Destination. The cluster had no incoming load after replicating 33M items. Both source and destination clusters were in steady state.
We rebooted one node on the destination.
The source replication is now failing with the error "Target database out of sync. Try to increase max_dbs_open at the target's server."
What is the expected behaviour when a node is rebooted on either the source or the destination? Why do we see these bursts of XDCR traffic on the source?
Adding screenshots from the source and destination cluster.
Links:
Source: http://ec2-107-22-40-124.compute-1.amazonaws.com:8091/
Destination: http://ec2-54-235-229-199.compute-1.amazonaws.com:8091/
1) Reboot a source node: all replication originating from that node shuts down immediately. After the node restarts, each replicator on that node resumes from the LAST successful checkpoint, which in the worst case means rescanning all mutations replicated in the past 30 minutes.
2) Reboot a target node: replicators in the source cluster may find they are unable to talk to the rebooted node. The replicator will crash, and a new one will start 30 seconds later from the last successful checkpoint. If the replicator is in the middle of a checkpoint, it will fail with the error
"Target database out of sync. Try to increase max_dbs_open at the target's server"
and the replicator will shut itself down, restart 30 seconds later, and rescan from the last successful checkpoint.
In either case, we will see a burst of "mutation to replicate"; this is the number of mutations since the last checkpoint. It may drop very quickly, though, because we only need to rescan the data without actually replicating it.
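To make the burst easier to picture, here is a minimal toy model of the restart behaviour described above. It is only a sketch, not the actual XDCR code: the `Replicator` class, its fields, and `restart_after_crash` are hypothetical names, and mutations are modelled as plain sequence numbers. It shows that after a restart the replicator rescans everything since the last successful checkpoint, but sends nothing that the target already has.

```python
from dataclasses import dataclass, field
from typing import Set, Tuple


@dataclass
class Replicator:
    """Toy model of one replicator (hypothetical, not the real XDCR implementation)."""
    # Highest sequence number covered by the last successful checkpoint.
    last_checkpoint_seq: int = 0
    # Sequence numbers that already exist on the target cluster.
    target_seqs: Set[int] = field(default_factory=set)

    def replicate(self, source_high_seq: int) -> Tuple[int, int]:
        """Scan from the last checkpoint up to source_high_seq.

        Returns (scanned, sent): how many mutations were examined and how
        many actually had to be sent to the target.
        """
        scanned = 0
        sent = 0
        for seq in range(self.last_checkpoint_seq + 1, source_high_seq + 1):
            scanned += 1
            if seq not in self.target_seqs:
                self.target_seqs.add(seq)
                sent += 1
        return scanned, sent

    def checkpoint(self, source_high_seq: int) -> None:
        """Record that everything up to source_high_seq is safely replicated."""
        self.last_checkpoint_seq = source_high_seq


def restart_after_crash(rep: Replicator, source_high_seq: int) -> Tuple[int, int]:
    """Simulate a restart: the checkpoint survives, so the scan resumes from it."""
    return rep.replicate(source_high_seq)


rep = Replicator()
rep.replicate(1000)      # initial replication of 1000 mutations
rep.checkpoint(1000)     # successful checkpoint
rep.replicate(1500)      # 500 more mutations replicated, but no checkpoint yet
scanned, sent = restart_after_crash(rep, 1500)
print(scanned, sent)     # prints "500 0"
```

In this model the 500 scanned mutations correspond to the burst of "mutation to replicate" seen after the reboot, and the zero sent mutations correspond to the counter dropping quickly because the data is already on the target.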