[MB-6041] XDC replication keeps on replicating even after replication document is removed Created: 27/Jul/12  Updated: 29/Aug/12  Resolved: 28/Aug/12

Status: Resolved
Project: Couchbase Server
Component/s: cross-datacenter-replication
Affects Version/s: None
Fix Version/s: 2.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Aliaksey Artamonau Assignee: Junyi Xie (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File ns-diag-20120727231728.txt.xz     File ns-diag-20120823192112.txt.bz2    

- create replication
- upload some data into the source bucket
- remove the replication (replication document is not present in _replicator/_all_docs anymore)
- observe that number of items in the destination bucket keeps growing

Seeing this on current HEAD.
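To make "keeps growing" measurable, the destination item count can be sampled periodically and the time until it stops changing computed afterwards. Below is a minimal Python sketch of that calculation. It is not part of Couchbase: the function name and the sample format are assumptions made for illustration, and collecting the samples (for example from bucket stats) is left to the caller.

```python
def seconds_until_stable(samples, removed_at, stable_for=3):
    """Return how long after `removed_at` the item count stopped changing.

    samples     -- list of (timestamp_seconds, item_count), sorted by time
                   (hypothetical format; how the counts are collected is up
                   to the caller)
    removed_at  -- timestamp at which the replication document was removed
    stable_for  -- number of consecutive equal samples treated as "stopped"

    Returns the offset in seconds from removal to the start of the first
    stable plateau, or None if the count never stabilizes in the samples.
    """
    plateau_start = None
    plateau_value = None
    plateau_len = 0
    for ts, count in samples:
        if ts < removed_at:
            continue  # ignore samples taken before the removal
        if count == plateau_value:
            plateau_len += 1
        else:
            # The count changed, so a new candidate plateau starts here.
            plateau_start, plateau_value, plateau_len = ts, count, 1
        if plateau_len >= stable_for:
            return plateau_start - removed_at
    return None
```

A result of a few seconds would match the expected notification delay; a result of several minutes would match the behaviour reported here.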

Comment by Junyi Xie (Inactive) [ 27/Jul/12 ]
There could be some delay between the time you remove the rep doc and the time the XDCR manager is notified and cancels all replications. Can you please provide the log from the source?
Comment by Aliaksey Artamonau [ 27/Jul/12 ]
It seems that it happens when there is more than one replication (possibly to the same cluster). I initially observed it when I had two replications between two clusters. Then I tried to reproduce it with only one replication, and it worked flawlessly. Then I tried it with two replications again and observed the bug again. Attaching a diag from the source.
Comment by Junyi Xie (Inactive) [ 01/Aug/12 ]
Cannot open xz file on MacOS. Can you upload a .gz or .tar package? Thanks
Comment by Peter Wansch (Inactive) [ 15/Aug/12 ]
Ketaki, can you try this one out before and after Damien's changes?
Comment by Abhinav Dangeti [ 22/Aug/12 ]
- Set up a 2:2 unidirectional replication on build 1623.
- Load on source, replication kicks off on destination.
- Deleted the replication on the source side:
  - Replication does not stop immediately on the destination.
  - I expected replication to stop once the destination item count reached the count the source had when I killed the replication.
  - However, the count passes that checkpoint and stops only much later, with the load on the source still going.
Comment by Junyi Xie (Inactive) [ 22/Aug/12 ]
Locally I created 1-1 clusters, each with two buckets, default and default2. I started two concurrent XDCR replications, for default and default2, and then deleted the two replication docs from the UI. Both replications stopped within several seconds after I deleted the replication docs. At least in local testing, I do not see any issue.

Aliaksey, can you please retry the latest code to see if the issue still exists? Thanks.

Comment by Aliaksey Artamonau [ 23/Aug/12 ]
I was able to reproduce it by creating two replications from the same bucket on the source to two different buckets on the destination. It is probably not a very realistic scenario, but it might uncover an important issue. Will attach a diag from the source cluster shortly.
Comment by Aliaksey Artamonau [ 23/Aug/12 ]
Replications finally stopped several minutes after I removed the corresponding replication documents.
Comment by Junyi Xie (Inactive) [ 28/Aug/12 ]
I tried the same setting as yours (1 -> 1 replication, default@node1 -> default@node2, and default@node1 -> default2@node2), and it seems there is nothing wrong.

From the log below, the XDCR replication manager was notified by ns_server immediately after I deleted the replication doc from the UI, and it shut down all ongoing bucket replication processes with no delay. All XDCR activity stopped at the source right after that. However, there could be some activity on the destination cluster even after XDCR stopped replicating on the source side, because it may take a while to persist all in-memory items to storage. I am not sure whether there is any delay between the UI stats and the real activity. Also, if both nodes in your test are on the local machine with 1024 vbuckets, it may take longer to finish. I think the delay should be much shorter if we use VMs to conduct the test.

At this time I am not sure what to fix. I merged some logs for timing purposes, and will ask Ketaki to run the same test on VMs. If it is really an issue, we will reopen this bug and investigate the logs from the VMs.

[couchdb:info,2012-08-28T14:43:47.255,n_0@<0.742.0>:couch_log:info:39] - - DELETE /_replicator/1d38c26cdc5c5bb0e6be126e8ae272be%2Fdefault%2Fdefault?rev=1-9ee1a1c9 200
[xdcr:debug,2012-08-28T14:43:47.257,n_0@]replication doc deleted (docId: <<"1d38c26cdc5c5bb0e6be126e8ae272be/default/default">>), stop all replications
[xdcr:debug,2012-08-28T14:43:47.258,n_0@]all replications for DocId <<"1d38c26cdc5c5bb0e6be126e8ae272be/default/default">> have been stopped

[ns_server:debug,2012-08-28T14:43:47.259,n_0@<0.2113.0>:ns_pubsub:do_subscribe_link:134]Parent process of subscription {ns_config_events,<0.2112.0>} exited with reason shutdown
[ns_server:debug,2012-08-28T14:43:47.260,n_0@<0.2113.0>:ns_pubsub:do_subscribe_link:149]Deleting {ns_config_events,<0.2112.0>} event handler: ok
[xdcr:debug,2012-08-28T14:43:47.296,n_0@<0.11655.0>:xdc_vbucket_rep_worker:find_missing:121]after conflict resolution at target ("http://Administrator:asdasd@\
f256233b9dffc119c2c32325a512/"), out of all 396 docs the number of docs we need to replicate is: 396
[couchdb:info,2012-08-28T14:43:47.304,n_0@<0.1858.0>:couch_log:info:39]checkpointing view update at seq 5 for _replicator _design/_replicator_info
[couchdb:info,2012-08-28T14:43:47.320,n_0@<0.1852.0>:couch_log:info:39] - - GET /_replicator/_design/_replicator_info/_view/infos?group_level=1&_=1346179427278 200
[ns_server:debug,2012-08-28T14:44:00.037,n_0@]Starting compaction for the following buckets:
[ns_server:info,2012-08-28T14:44:00.074,n_0@<0.13612.0>:compaction_daemon:try_to_cleanup_indexes:439]Cleaning up indexes for bucket `default`
[ns_server:info,2012-08-28T14:44:00.164,n_0@<0.13612.0>:compaction_daemon:spawn_bucket_compactor:404]Compacting bucket default with config:
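The delay between the DELETE request and the "have been stopped" message in the excerpt above can also be computed mechanically. The following is a minimal Python sketch, not part of ns_server. The function names are made up for illustration, and the prefix pattern is inferred only from the lines quoted above, so other log formats may not match.

```python
import re
from datetime import datetime, timedelta

# Matches the "[component:level,timestamp," prefix seen in the excerpt.
# Assumption: every line of interest starts with this prefix.
LOG_PREFIX = re.compile(r"^\[(?P<component>\w+):(?P<level>\w+),(?P<ts>[\dT:.\-]+),")


def parse_timestamp(line):
    """Return the timestamp of a log line, or None if the prefix is absent."""
    match = LOG_PREFIX.match(line)
    if match is None:
        return None
    return datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S.%f")


def stop_latency_ms(lines):
    """Milliseconds from the first replication-doc DELETE to the first
    "have been stopped" message, or None if either line is missing."""
    deleted_at = None
    stopped_at = None
    for line in lines:
        ts = parse_timestamp(line)
        if ts is None:
            continue  # continuation lines and blanks carry no timestamp
        if deleted_at is None and "DELETE /_replicator/" in line:
            deleted_at = ts
        elif stopped_at is None and "have been stopped" in line:
            stopped_at = ts
    if deleted_at is None or stopped_at is None:
        return None
    # timedelta // timedelta yields an exact integer number of milliseconds.
    return (stopped_at - deleted_at) // timedelta(milliseconds=1)
```

Applied to the excerpt above, this reports 3 ms (14:43:47.255 to 14:43:47.258), which covers only the source-side shutdown and says nothing about activity on the destination.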

Comment by Junyi Xie (Inactive) [ 28/Aug/12 ]
Comment by Thuan Nguyen [ 28/Aug/12 ]
Integrated in github-ns-server-2-0 #456 (See [http://qa.hq.northscale.net/job/github-ns-server-2-0/456/])
    MB-6041: add logs to time replication stop (Revision 1b1cf1f99f6e84b0baaa90a9ac2504b46e1d583a)

     Result = SUCCESS
Junyi Xie :
Files :
* src/xdc_rep_manager.erl
* src/xdc_replication_sup.erl
Generated at Mon Oct 20 05:38:04 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.