[MB-6919] Possibly incorrect OUTBOUND XDCR OPERATIONS seen on source cluster, very slow replication (very low number of sets) observed with number of items > 200M Created: 15/Oct/12  Updated: 10/Jan/13  Resolved: 18/Oct/12

Status: Closed
Project: Couchbase Server
Component/s: cross-datacenter-replication
Affects Version/s: 2.0
Fix Version/s: 2.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Abhinav Dangeti Assignee: Junyi Xie (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 64bit ec2-ubuntu-12.04 LTS
Build 1844

Attachments: Zip Archive ns-diag-20121015185325.txt.zip     PNG File Screen Shot 2012-10-15 at 11.49.00 AM.png     PNG File Screen Shot 2012-10-15 at 11.49.29 AM.png     PNG File Screen Shot 2012-10-16 at 2.23.33 PM.png     PNG File Screen Shot 2012-10-16 at 2.25.04 PM.png    

 Description   
Cluster set up:
c1 : c2 :: 10 : 10 (10 nodes in each cluster)

sbucket: c1 -> c2
default: c2 -> c1

>> Replication set up with continuous front-end load
>> Front-end load for default = ~10K ops per sec
>> Front-end load for sbucket = ~4-5K ops per sec
>> Average replication seen on c1 (for default): ~12-14K ops per sec
>> Average replication seen on c2 (for sbucket): ~15-18K ops per sec

At a particular snapshot, on C1:
{With the same amount of load (mixed), on bucket "sbucket"}

No. of items: 214M
No. of items in replication queue: 136M (way too high)
Secs in replicating = 0 (!?)
Secs in checkpointing = 385 (!?)
Checkpoints issued = 79 (!?)

These stats are for a cluster that has been up and running with continuous load and replication for ~65 hrs.

Also seen on the destination C2:
{on bucket "sbucket"}
Gets per sec: 19.2K
Sets per sec: 347 (seems very low)
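
For reference, these per-bucket numbers can also be pulled over the REST API rather than read off the UI. A minimal sketch in Python; the /pools/default/buckets/<bucket>/stats endpoint is standard, but the exact stat key names below are my assumption and may vary by build:

import base64
import json
import urllib.request

# Destination cluster c2; per-bucket stats are served on port 8091.
URL = ("http://ec2-54-251-5-97.ap-southeast-1.compute.amazonaws.com:8091"
       "/pools/default/buckets/sbucket/stats")

req = urllib.request.Request(URL)
req.add_header("Authorization",   # placeholder credentials
               "Basic " + base64.b64encode(b"Administrator:password").decode())

with urllib.request.urlopen(req) as resp:
    samples = json.load(resp)["op"]["samples"]

# Each stat maps to a window of recent per-second samples; print the latest.
for key in ("cmd_get", "cmd_set"):   # assumed keys for gets/sets per sec
    print(key, samples[key][-1])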

Also seeing a number of these errors on the XDCR tab on the source:

2012-10-15 19:17:50 - Error replicating vbucket 397: {http_request_failed, "POST", "http://Administrator:*****@ec2-175-41-177-173.ap-southeast-1.compute.amazonaws.com:8092/sbucket%2f397%3bc8731525718bcbdd0bf0382e420c453f/_revs_diff", {error,{error,timeout}}}

2012-10-15 19:17:50 - Error replicating vbucket 381: {http_request_failed, "POST", "http://Administrator:*****@ec2-175-41-177-173.ap-southeast-1.compute.amazonaws.com:8092/sbucket%2f381%3bc8731525718bcbdd0bf0382e420c453f/_revs_diff", {error,{error,timeout}}}

....

Load on sbucket with mcsoda:
 lib/perf_engines/mcsoda.py sbucket@ec2-50-18-140-172.us-west-1.compute.amazonaws.com:11211 vbuckets=1024 doc-gen=0 doc-cache=0 ratio-creates=1 ratio-sets=1 ratio-expirations=0.03 expiration=60 ratio-deletes=0.5 min-value-size=1000 threads=30 max-items=100000000 exit-after-creates=2 prefix=KEY1_ max-creates=100000000

Load on default with cbworkloadgen:
/opt/couchbase/bin/tools/cbworkloadgen -n ec2-54-251-5-97.ap-southeast-1.compute.amazonaws.com:8091 -r .7 -i 400000000 -s 128 -t 30 -p KEY3_

Killed the front-end load on both buckets just now; replication is seen catching up.
Live clusters:
c1: http://ec2-50-18-140-172.us-west-1.compute.amazonaws.com:8091/
c2: http://ec2-54-251-5-97.ap-southeast-1.compute.amazonaws.com:8091/

Attached are diags grabbed from the orchestrator on c1.

 Comments   
Comment by Junyi Xie (Inactive) [ 15/Oct/12 ]
1. The stat "docs to replicate" is not the number of items accumulated in the changes queue; it is a stat reported by CouchDB (109M docs in your screenshot).
The stat "docs in the queue" is the real number of docs in the queue, which is 1.33M. I think I explained this in my earlier email to xdcr-eng.

Today, to reduce the memory overhead of XDCR, the queue is limited to a maximum of 4K items and 400KB per active vbucket. By some simple math, you can compute that the number of items in the queue per cluster is about 4K items x 32 concurrent replications x 10 nodes ~ 1.3M items, which is consistent with your observation. This is expected behavior.
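
A quick back-of-the-envelope check of that math in Python (the per-replication cap, concurrency, and node count are all taken from this ticket):

# Queue cap per cluster = per-replication item cap
#   x concurrent replications x nodes.
ITEM_CAP_PER_REPLICATION = 4 * 1024   # max items queued per active vbucket replication
CONCURRENT_REPLICATIONS = 32          # default concurrent XDCR replications
NODES = 10                            # nodes in cluster c1

cap = ITEM_CAP_PER_REPLICATION * CONCURRENT_REPLICATIONS * NODES
print("max items in queue ~ %.2fM" % (cap / 1e6))   # ~1.31M, close to the observed 1.33M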

In your case, since you have large clusters of 10 nodes, you may try raising the number of concurrent replications from 32 to 100 to see if it makes any difference.
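
If it helps, this knob should be adjustable over REST without a restart; a sketch, assuming 2.0 builds expose it as xdcrMaxConcurrentReps under /internalSettings (please double-check the key name on your build):

import base64
import urllib.parse
import urllib.request

# Source cluster c1. Passing data= makes urllib issue a POST.
URL = "http://ec2-50-18-140-172.us-west-1.compute.amazonaws.com:8091/internalSettings"
body = urllib.parse.urlencode({"xdcrMaxConcurrentReps": 100}).encode()

req = urllib.request.Request(URL, data=body)
req.add_header("Authorization",   # placeholder credentials
               "Basic " + base64.b64encode(b"Administrator:password").decode())

with urllib.request.urlopen(req) as resp:
    print(resp.status)   # expect 200 on success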

2. You have expiring items in your workload with a very small expiration time of 60 seconds. Given the size of your workload, items will most likely expire before replicating to the remote cluster; that is why you see high XDCR ops but low sets/deletes. I strongly suggest you exclude expiring items from your test, since they just create a lot of confusion.
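
For example, the same mcsoda run with expirations turned off (ratio-expirations=0, expiration flag dropped) would take expiry out of the picture:

lib/perf_engines/mcsoda.py sbucket@ec2-50-18-140-172.us-west-1.compute.amazonaws.com:11211 vbuckets=1024 doc-gen=0 doc-cache=0 ratio-creates=1 ratio-sets=1 ratio-expirations=0 ratio-deletes=0.5 min-value-size=1000 threads=30 max-items=100000000 exit-after-creates=2 prefix=KEY1_ max-creates=100000000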



Without further information, this is pretty much all I can say. Please let me know how to log onto the EC2 node and how to access the logs.


Comment by Abhinav Dangeti [ 15/Oct/12 ]
Understood. To log into the EC2 nodes, for the west coast cluster you could use QAKey.pem as:
ssh -i QAKey.pem ubuntu@ec2-50-18-140-172.us-west-1.compute.amazonaws.com
and for the southeast cluster, using SingaporeQAKey.pem as:
ssh -i SingaporeQAKey.pem ubuntu@ec2-54-251-5-97.ap-southeast-1.compute.amazonaws.com

I shall email you both the keys.

Comment by Junyi Xie (Inactive) [ 15/Oct/12 ]
On the southeast cluster, where there is an ongoing XDCR with high gets but low sets/deletes, I saw a bunch of source mutations lose conflict resolution, consistent with the stats on the UI.

Please remove the expiring items from your workload; otherwise we cannot determine whether this is a bug or not.

[xdcr:debug,2012-10-15T22:16:07.067,ns_1@ec2-54-251-5-97.ap-southeast-1.compute.amazonaws.com:<0.24978.2227>:capi_replication:get_missing_revs:43]after conflict resolution for 500 docs, num of remote winners is 0 and number of local winners is 500.
[xdcr:debug,2012-10-15T22:16:07.071,ns_1@ec2-54-251-5-97.ap-southeast-1.compute.amazonaws.com:<0.25039.2227>:capi_replication:get_missing_revs:43]after conflict resolution for 500 docs, num of remote winners is 0 and number of local winners is 500.
[xdcr:debug,2012-10-15T22:16:07.074,ns_1@ec2-54-251-5-97.ap-southeast-1.compute.amazonaws.com:<0.24910.2227>:capi_replication:get_missing_revs:43]after conflict resolution for 500 docs, num of remote winners is 0 and number of local winners is 500.
[xdcr:debug,2012-10-15T22:16:07.078,ns_1@ec2-54-251-5-97.ap-southeast-1.compute.amazonaws.com:<0.24916.2227>:capi_replication:get_missing_revs:43]after conflict resolution for 500 docs, num of remote winners is 0 and number of local winners is 500.
[xdcr:debug,2012-10-15T22:16:07.081,ns_1@ec2-54-251-5-97.ap-southeast-1.compute.amazonaws.com:<0.25327.2227>:capi_replication:get_missing_revs:43]after conflict resolution for 500 docs, num of remote winners is 0 and number of local winners is 500.
[xdcr:debug,2012-10-15T22:16:07.225,ns_1@ec2-54-251-5-97.ap-southeast-1.compute.amazonaws.com:<0.24889.2227>:capi_replication:get_missing_revs:43]after conflict resolution for 500 docs, num of remote winners is 0 and number of local winners is 500.
[xdcr:debug,2012-10-15T22:16:07.230,ns_1@ec2-54-251-5-97.ap-southeast-1.compute.amazonaws.com:<0.25085.2227>:capi_replication:get_missing_revs:43]after conflict resolution for 500 docs, num of remote winners is 0 and number of local winners is 500.
[xdcr:debug,2012-10-15T22:16:14.582,ns_1@ec2-54-251-5-97.ap-southeast-1.compute.amazonaws.com:<0.25015.2227>:capi_replication:get_missing_revs:43]after conflict resolution for 500 docs, num of remote winners is 0 and number of local winners is 500.
[xdcr:debug,2012-10-15T22:16:22.882,ns_1@ec2-54-251-5-97.ap-southeast-1.compute.amazonaws.com:<0.25084.2227>:capi_replication:get_missing_revs:43]after conflict resolution for 500 docs, num of remote winners is 0 and number of local winners is 500.
[xdcr:debug,2012-10-15T22:16:29.451,ns_1@ec2-54-251-5-97.ap-southeast-1.compute.amazonaws.com:<0.24913.2227>:capi_replication:get_missing_revs:43]after conflict resolution for 500 docs, num of remote winners is 0 and number of local winners is 500.
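
To read these lines: the "winners" counts come from revision-based conflict resolution during _revs_diff. A simplified illustration in Python of how a winner could be picked (not the actual capi_replication code; the tie-breaker ordering beyond the revision count is my assumption):

# Each side is summarized as a tuple. rev_seqno counts mutations; the doc
# that has been mutated more times wins, with the remaining fields assumed
# to act as successive tie-breakers.
def remote_wins(remote_rev, local_rev):
    # remote_rev, local_rev: (rev_seqno, cas, expiration, flags)
    return remote_rev > local_rev   # lexicographic tuple comparison

# "num of local winners is 500" above means every incoming source doc in
# that batch compared lower than the copy already on this cluster, so none
# of those docs were reported missing and no sets followed for the batch.
print(remote_wins((3, 111, 0, 0), (5, 222, 0, 0)))   # False: the local copy wins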
Comment by Ketaki Gangal [ 16/Oct/12 ]
Seeing this with a bidirectional setup as well:

Num checkpoints = 0
Num seconds in checkpointing = 0

Adding screenshots from both clusters.
Comment by Junyi Xie (Inactive) [ 18/Oct/12 ]
http://review.couchbase.org/#/c/21613/