High intra-cluster XDCR bandwidth usage

I’m seeing what appears to be unusually high bandwidth usage between nodes in a 3-node cluster that’s doing one-way XDCR to a single-node cluster.

The 3-node cluster (source) has each node in a separate AWS zone within the same region.
The 1-node cluster (destination) is in a different region from the first cluster.

On the 1-node destination instance, I’m seeing ~16KB/s for both IN and OUT transfer.
On each of the 3 source instances, I’m seeing closer to 1MB/s for both IN and OUT transfer.

So ~62 times more chatter between nodes in the source cluster. This seems really high to me. Is this expected or could something be wrong?

Both clusters are running 4.1.

I also tested to confirm that it’s XDCR by pausing replication (on 3 buckets). Traffic on the source nodes dropped to < 100 KB/s. When I un-paused replication, it jumped back up to ~1MB/s again.

Nick

Hi Nick, if I understand your test setup correctly, you have:
Cluster A - 3 nodes, 3 buckets
Cluster B - 1 node, 3 buckets
Uni-directional XDCR from Cluster A [3 buckets] => Cluster B [3 buckets]

Can you provide some additional information?

  • What’s the workload on the 3 buckets in Cluster A?
  • Did you look at the XDCR stats for data_replicated (size in bytes) and size_rep_queue?

That is correct.

For the workload, I’m seeing about 120 ops averaged over a week. 98% of the ops are reads. The average document size is in the 2k-3k range.

I’m not sure how to interpret the stats, so I hope it’s ok if I paste them here. I double-checked and the bandwidth usage is still high during this sample period.

data_replicated
{“samplesCount”:60,“isPersistent”:true,“lastTStamp”:1452664872802,“interval”:1000,“timestamp”“nodeStats”:{“node1:8091”:[24434738,24434738,24434738,24434738,24434738,24434738,24434738,24434738,24434738,24434738,24434738,24451894,24469051,24469051,24469051,24469051,24469051,24469051,24469051,24469051,24469051,24469051,24469051,24469051,24469051,24469051,24469051,24469051,24469051,24469051,24470150,24470150,24470150,24470150,24470150,24490571,24490571,24490571,24559443,24559443,24559443,24579864,24579864,24579864,24593927,24593927,24593927,24593927,24607990,24622053,24690925,24690925,24690925,24690925,24690925,24704988,24704988,24719051,24719051],“node2:8091”:[24228332,24290379,24290379,24290379,24290379,24312919,24312919,24312919,24312919,24312919,24312919,24312919,24312919,24312919,24312919,24312919,24437014,24437014,24437014,24485743,24486845,24486845,24513034,24513034,24513034,24539223,24539223,24561763,24561763,24561763,24562870,24562870,24562870,24562870,24589059,24589059,24615248,24615248,24667627,24667627,24667627,24667627,24667627,24712708,24712708,24712708,24712708,24712708,24712708,24712708,24712708,24712708,24712708,24712708,24712708,24712708,24712708,24712708,24712708],“node3:8091”:[34604708,34623403,34623403,34623403,34623403,34623403,34623403,34642098,34642098,34679489,34710439,34710439,34737377,34737377,34737377,34737377,34767002,34767002,34828903,34885466,34885466,34885466,34912404,34912404,34912404,34912404,34912404,34912404,34912404,34912404,34942029,34968967,34995905,35081702,35108641,35167892,35263948,35263948,35263948,35263948,35263948,35263948,35263948,35263948,35360004,35360004,35360004,35360004,35360004,35360004,35360004,35360004,35360004,35360004,35360004,35457160,35457160,35457160,35457160]}}

size_rep_queue
{“samplesCount”:60,“isPersistent”:true,“lastTStamp”:1452664974802,“interval”:1000,“timestamp”“nodeStats”:{“node1:8091”:[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],“node2:8091”:[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],“node3:8091”:[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}
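
One rough way to read the data_replicated samples above is sketched below. It assumes the stat is a cumulative byte counter (the values only ever increase) and that the window spans about 59 seconds (60 samples at the reported 1000 ms interval). The helper name `average_rate` and the window assumption are illustrative and not part of any Couchbase API.

```python
# First and last data_replicated samples per node, copied from the output above.
FIRST_LAST = {
    "node1:8091": (24434738, 24719051),
    "node2:8091": (24228332, 24712708),
    "node3:8091": (34604708, 35457160),
}

SAMPLES_COUNT = 60   # "samplesCount" in the stats output
INTERVAL_MS = 1000   # "interval" in the stats output


def average_rate(first, last, samples=SAMPLES_COUNT, interval_ms=INTERVAL_MS):
    """Average bytes/s, assuming a cumulative counter sampled at a fixed interval."""
    elapsed_s = (samples - 1) * interval_ms / 1000
    return (last - first) / elapsed_s


rates = {node: average_rate(*pair) for node, pair in FIRST_LAST.items()}
for node, rate in rates.items():
    print(f"{node}: {rate / 1024:.1f} KiB/s")
print(f"total: {sum(rates.values()) / 1024:.1f} KiB/s")
```

Under those assumptions, the sampled replication moves somewhere between roughly 5 and 15 KB/s per node, far below the ~1MB/s seen on the wire, and the all-zero size_rep_queue suggests nothing is backing up.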

iftop shows that the heavy bandwidth consumption is on port 11210, if that helps narrow down the issue.

Something else interesting that I just found: I paused replication for our two “busy” buckets and left our 3rd bucket replicating. Bandwidth consumption dropped to ~333KB/s. This bucket literally has 0 writes and only gets 0.0014 gets per second.

So it seems like the bandwidth that’s being consumed has relatively little to do with the number of documents or how busy the buckets are, but instead has a fairly static bandwidth footprint of 333KB/s per bucket.
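
As a quick sanity check of the figures quoted in this thread, the per-bucket footprint can be compared with the total. The numbers are the approximate readings reported above; treating 1MB/s as 1000KB/s is an assumption.

```python
# Approximate readings reported in this thread (KB/s).
per_bucket_kbps = 333     # source traffic with only the idle bucket replicating
bucket_count = 3          # buckets under XDCR
source_kbps = 1000        # ~1MB/s per source node, assuming 1MB = 1000KB
destination_kbps = 16     # ~16KB/s on the destination node

estimated_source_kbps = per_bucket_kbps * bucket_count
ratio = source_kbps / destination_kbps

print(f"3 buckets x 333KB/s = {estimated_source_kbps}KB/s")
print(f"source/destination ratio = {ratio:.1f}")
```

The product lands within rounding of the ~1MB/s observed with all three buckets replicating, which fits the reading of a fixed cost per bucket.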

Should I file a bug about this?

Nick

Thank you, Nick, for the further investigation and information.
Yes, please file an issue in our JIRA tracker that references this post, and attach the cbcollect_info output from the cluster.
Thank you
Anil Kumar