High intra-cluster XDCR bandwidth usage

I’m seeing what appears to be unusually high bandwidth usage between the nodes of a 3-node cluster that is doing one-way XDCR replication to a single-node cluster.

The 3-node cluster (source) has each node in a separate AWS availability zone within the same region.
The 1-node cluster (destination) is in a different region from the source cluster.

On the 1-node destination instance, I’m seeing ~16 KB/s for both IN and OUT transfer.
On each of the three source instances, I’m seeing closer to 1 MB/s for both IN and OUT transfer.

So the source nodes are seeing roughly 62 times more chatter between themselves than the destination sees in actual replication traffic. This seems really high to me. Is this expected, or could something be wrong?

Both clusters are running Couchbase Server 4.1.

I also confirmed that it’s XDCR by pausing replication on all 3 buckets: traffic on the source nodes dropped to under 100 KB/s, and when I un-paused replication it jumped back up to ~1 MB/s.
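For what it’s worth, I paused the replications through the web UI, but I believe the same thing can be scripted against the XDCR settings REST endpoint. Below is a rough Python sketch under that assumption; the host, credentials, and replication ID are placeholders, and the replication ID should be visible in /pools/default/tasks on the source cluster.

    # Hypothetical sketch: pause or resume an XDCR replication via the REST API.
    # Host, credentials, and replication ID below are placeholders.
    import base64
    import urllib.parse
    import urllib.request

    HOST = "http://node1:8091"
    USER, PASSWORD = "Administrator", "password"
    # Replication IDs look like "<remote-cluster-uuid>/<source-bucket>/<dest-bucket>".
    REPLICATION_ID = "<remote-cluster-uuid>/<source-bucket>/<dest-bucket>"

    def set_paused(paused):
        # The replication ID must be URL-encoded, including its slashes.
        url = HOST + "/settings/replications/" + urllib.parse.quote(REPLICATION_ID, safe="")
        data = urllib.parse.urlencode({"pauseRequested": "true" if paused else "false"}).encode()
        req = urllib.request.Request(url, data=data, method="POST")
        token = base64.b64encode((USER + ":" + PASSWORD).encode()).decode()
        req.add_header("Authorization", "Basic " + token)
        with urllib.request.urlopen(req) as resp:
            print(resp.status)

    # set_paused(True)   # pause the replication
    # set_paused(False)  # resume it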

Nick

Hi Nick, if I understand your test setup correctly, you have:
Cluster A - 3 nodes, 3 buckets
Cluster B - 1 node, 3 buckets
Uni-directional XDCR from Cluster A [3 buckets] => Cluster B [3 buckets]

Can you provide some additional information?

  • What’s the workload on the 3 buckets in Cluster A?
  • Did you look into the XDCR stats data_replicated (size in bytes) and size_rep_queue? One way to pull these via the REST API is sketched right after this list.
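For reference, those per-replication stats can be pulled from the source cluster’s REST API. Here is a rough Python sketch of one way to do it; the host, credentials, remote-cluster UUID, and bucket names are placeholders you would substitute with your own values.

    # Rough sketch: fetch the data_replicated XDCR stat for one replication.
    # Host, credentials, remote-cluster UUID, and bucket names are placeholders.
    import base64
    import json
    import urllib.request

    HOST = "http://node1:8091"
    USER, PASSWORD = "Administrator", "password"
    SOURCE_BUCKET = "<source-bucket>"
    DEST_BUCKET = "<destination-bucket>"
    REMOTE_UUID = "<remote-cluster-uuid>"   # listed under /pools/default/remoteClusters

    # Per-replication stats are addressed as a URL-encoded "replications/..." stat name.
    stat_name = "replications%2F{}%2F{}%2F{}%2Fdata_replicated".format(
        REMOTE_UUID, SOURCE_BUCKET, DEST_BUCKET)
    url = "{}/pools/default/buckets/{}/stats/{}".format(HOST, SOURCE_BUCKET, stat_name)

    req = urllib.request.Request(url)
    token = base64.b64encode((USER + ":" + PASSWORD).encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(req) as resp:
        stats = json.load(resp)

    # nodeStats maps each source node to one cumulative sample per timestamp.
    for node, samples in stats["nodeStats"].items():
        print(node, samples[-1], "bytes replicated so far")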

That is correct.

For the workload, I’m seeing about 120 ops averaged over a week. 98% of the ops are reads, and the average document size is in the 2-3 KB range.

I’m not sure how to interpret the stats, so I hope it’s OK if I paste them here. I double-checked, and the bandwidth usage is still high during this sample period.

data_replicated
{"samplesCount":60,"isPersistent":true,"lastTStamp":1452664872802,"interval":1000,"timestamp":[1452664814802,1452664815802,1452664816802,1452664817802,1452664818802,1452664819802,1452664820802,1452664821802,1452664822802,1452664823802,1452664824802,1452664825802,1452664826802,1452664827802,1452664828802,1452664829802,1452664830802,1452664831802,1452664832802,1452664833802,1452664834802,1452664835802,1452664836802,1452664837802,1452664838802,1452664839802,1452664840802,1452664841802,1452664842802,1452664843802,1452664844802,1452664845802,1452664846802,1452664847802,1452664848802,1452664849802,1452664850802,1452664851802,1452664852802,1452664853802,1452664854802,1452664855802,1452664856802,1452664857802,1452664858802,1452664859802,1452664860802,1452664861802,1452664862802,1452664863802,1452664864802,1452664865802,1452664866802,1452664867802,1452664868802,1452664869802,1452664870802,1452664871802,1452664872802],"nodeStats":{"node1:8091":[24434738,24434738,24434738,24434738,24434738,24434738,24434738,24434738,24434738,24434738,24434738,24451894,24469051,24469051,24469051,24469051,24469051,24469051,24469051,24469051,24469051,24469051,24469051,24469051,24469051,24469051,24469051,24469051,24469051,24469051,24470150,24470150,24470150,24470150,24470150,24490571,24490571,24490571,24559443,24559443,24559443,24579864,24579864,24579864,24593927,24593927,24593927,24593927,24607990,24622053,24690925,24690925,24690925,24690925,24690925,24704988,24704988,24719051,24719051],"node2:8091":[24228332,24290379,24290379,24290379,24290379,24312919,24312919,24312919,24312919,24312919,24312919,24312919,24312919,24312919,24312919,24312919,24437014,24437014,24437014,24485743,24486845,24486845,24513034,24513034,24513034,24539223,24539223,24561763,24561763,24561763,24562870,24562870,24562870,24562870,24589059,24589059,24615248,24615248,24667627,24667627,24667627,24667627,24667627,24712708,24712708,24712708,24712708,24712708,24712708,24712708,24712708,24712708,24712708,24712708,24712708,24712708,24712708,24712708,24712708],"node3:8091":[34604708,34623403,34623403,34623403,34623403,34623403,34623403,34642098,34642098,34679489,34710439,34710439,34737377,34737377,34737377,34737377,34767002,34767002,34828903,34885466,34885466,34885466,34912404,34912404,34912404,34912404,34912404,34912404,34912404,34912404,34942029,34968967,34995905,35081702,35108641,35167892,35263948,35263948,35263948,35263948,35263948,35263948,35263948,35263948,35360004,35360004,35360004,35360004,35360004,35360004,35360004,35360004,35360004,35360004,35360004,35457160,35457160,35457160,35457160]}}

size_rep_queue
{"samplesCount":60,"isPersistent":true,"lastTStamp":1452664974802,"interval":1000,"timestamp":[1452664916802,1452664917802,1452664918802,1452664919802,1452664920802,1452664921802,1452664922802,1452664923802,1452664924802,1452664925802,1452664926802,1452664927802,1452664928802,1452664929802,1452664930802,1452664931802,1452664932802,1452664933802,1452664934802,1452664935802,1452664936802,1452664937802,1452664938802,1452664939802,1452664940802,1452664941802,1452664942802,1452664943802,1452664944802,1452664945802,1452664946802,1452664947802,1452664948802,1452664949802,1452664950802,1452664951802,1452664952802,1452664953802,1452664954802,1452664955802,1452664956802,1452664957804,1452664958802,1452664959802,1452664960802,1452664961802,1452664962802,1452664963802,1452664964802,1452664965802,1452664966802,1452664967802,1452664968802,1452664969802,1452664970802,1452664971802,1452664972802,1452664973802,1452664974802],"nodeStats":{"node1:8091":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"node2:8091":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"node3:8091":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}
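If data_replicated is a cumulative per-node byte counter, which the monotonically increasing samples in the data_replicated dump above suggest, then the actual replicated throughput over this window is tiny compared with the ~1 MB/s I see on the network. A rough Python sketch of that calculation, assuming stats holds the parsed data_replicated JSON above:

    # Rough sketch: turn the cumulative data_replicated samples into bytes/sec,
    # assuming the stat is a cumulative per-node byte counter.
    def replication_rates(stats):
        timestamps = stats["timestamp"]                       # milliseconds
        window_s = (timestamps[-1] - timestamps[0]) / 1000.0  # ~58 s for the dump above
        return {node: (samples[-1] - samples[0]) / window_s
                for node, samples in stats["nodeStats"].items()}

    # Applied to the dump above, this comes out to roughly 4.9 KB/s (node1),
    # 8.4 KB/s (node2), and 14.7 KB/s (node3), i.e. about 28 KB/s in total.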

iftop indicates that the heavy bandwidth consumption happens on port 11210 (the Couchbase data/memcached port), if that helps narrow down the issue.

Something else interesting: I just paused replication for our two "busy" buckets and left the 3rd bucket replicating. That dropped bandwidth consumption to ~333 KB/s, even though this bucket has literally 0 writes and only sees 0.0014 gets per second.

So the bandwidth being consumed seems to have relatively little to do with the number of documents or how busy the buckets are; instead there appears to be a fairly static footprint of roughly 333 KB/s per replicated bucket, which for 3 buckets matches the ~1 MB/s I reported earlier.

Should I file a bug about this?

Nick

Thank you, Nick, for the further investigation and information.
Yes, please file an issue in our JIRA tracker, referencing this post and attaching the cbcollect_info output from the cluster.
Thank you
Anil Kumar