[MB-4366] ns_server is reusing tap names unsafely which causes data loss or inconsistency in replication when a node is removed and added back Created: 19/Oct/11  Updated: 09/Jan/13  Resolved: 12/Apr/12

Status: Closed
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 1.7.2, 1.8.0
Fix Version/s: 1.8.1
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Farshid Ghods (Inactive) Assignee: Aleksey Kondratenko
Resolution: Fixed Votes: 0
Labels: 1.8.1-release-notes
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2011-10-19 at 11.25.13 PM.png     PNG File Screen Shot 2011-10-19 at 5.19.37 PM.png     PNG File Screen Shot 2011-10-19 at 5.23.19 PM.png    
Issue Links:

screenshot attached

NOTE: we're converting this into the main 'named tap issues' ticket.

So what's not safe about reusing named taps as of 1.8.0?

If something happened to the destination node after the tap was disconnected, and that something affected data for the vbuckets replicated as part of the named tap, then subsequent reuse of the named tap will incorrectly assume that we can continue sending data instead of re-negotiating which data needs to be resent.
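
The policy that addresses this (per the commit message recorded later in this ticket, "only reuse tap name when changing vbucket filter") can be sketched roughly as follows. This is an illustration only, not the ns_server code, which is Erlang: `TapRequest`, `choose_tap`, the `reason` labels and the `replication_` name prefix are all hypothetical.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TapRequest:
    name: str     # tap name to register with the producer
    resume: bool  # True: continue the existing stream; False: start over


def choose_tap(previous_name: Optional[str], reason: str) -> TapRequest:
    """Decide whether an existing tap name may be reused.

    `reason` is a hypothetical label: "filter_change" means only the set of
    replicated vbuckets is changing; anything else (for example a node that
    was removed and added back) means the destination's data may have
    changed while the tap was disconnected.
    """
    if previous_name is not None and reason == "filter_change":
        # Destination state is assumed intact, so continuing is safe.
        return TapRequest(previous_name, resume=True)
    # Otherwise use a fresh name so nothing is assumed about what the
    # destination already holds and the data to send is re-negotiated.
    return TapRequest("replication_" + uuid.uuid4().hex, resume=False)
```

For example, `choose_tap("replication_abc", "filter_change")` keeps the old name, while `choose_tap("replication_abc", "node_readded")` returns a new one.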

Comment by Farshid Ghods (Inactive) [ 19/Oct/11 ]
another screenshot: 5 minutes after stopping the rebalance
Comment by Farshid Ghods (Inactive) [ 20/Oct/11 ]
tap stream only stops if there is no item added to the backlog
if the user keeps the load running this tap stream remains alive forever
Comment by Aleksey Kondratenko [ 22/Dec/11 ]
Farshid, I cannot make sense of these screenshots. Can you elaborate?
Comment by Farshid Ghods (Inactive) [ 22/Dec/11 ]
basically that means there is still one tap_rebalance stream open and running even after rebalance was stopped.

we seem to be stopping most of the streams except one
Comment by Farshid Ghods (Inactive) [ 22/Dec/11 ]
waiting 5 minutes will not work if there are ongoing mutations in the cluster, because this tap stream only times out after 5 minutes of inactivity
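
A minimal sketch of why the timeout never fires under load, assuming the stream's idle timer is reset whenever an item is added to its backlog. The class and method names are hypothetical and this is not ep-engine code; only the 5-minute inactivity timeout comes from the comment above.

```python
IDLE_TIMEOUT_SECONDS = 5 * 60  # 5 minutes of inactivity, per the comment above


class NamedTapStream:
    """Toy model of a named tap stream with an inactivity timeout."""

    def __init__(self, now: float) -> None:
        self.last_activity = now

    def on_backlog_item(self, now: float) -> None:
        # Assumption: every new backlog item counts as activity.
        self.last_activity = now

    def is_expired(self, now: float) -> bool:
        return now - self.last_activity >= IDLE_TIMEOUT_SECONDS


# One mutation per minute for an hour: the stream never reaches the timeout.
stream = NamedTapStream(now=0)
for t in range(60, 3600, 60):
    stream.on_backlog_item(t)
    assert not stream.is_expired(t)
```

Only once the load stops does the stream expire, 5 minutes after the last backlog item.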
Comment by Aleksey Kondratenko [ 22/Dec/11 ]
so it's an ep-engine issue then? I mean, we close tap streams as much as possible in ns_server. Named tap streams are kept alive by ep-engine. If there's anything ns_server can do to really stop those tap producers, I'll be happy to do that.
Comment by Steve Yen [ 30/Mar/12 ]
this is the main ticket for the named tap approach/fix
Comment by Steve Yen [ 30/Mar/12 ]
is this a blocker for 1.8.1?
Comment by Dipti Borkar [ 30/Mar/12 ]
Yes, because this may be causing data loss in some conditions.

Farshid, I believe there are a few other tickets where this is the underlying problem. Can you reference them here for completeness? Thanks
Comment by Aleksey Kondratenko [ 07/Apr/12 ]
http://review.couchbase.org/14555 fixes it on 1.8.1.

1.8 and master have slightly different code in this area, so this work still needs some forward-porting.
Comment by Steve Yen [ 09/Apr/12 ]
fix is in gerrit (but more work still needed to enable 1.8.2)
Comment by Aleksey Kondratenko [ 09/Apr/12 ]
let's keep this open for now. While adapting it for 1.8.2, I may have to change the 1.8.1 code to enable forward-compatibility with 1.8.2 and master
Comment by Dipti Borkar [ 11/Apr/12 ]
Aliaksey, code complete is Friday and we need to merge everything in by then.
What changes need to be made to ensure forward-compatibility?
Comment by Aleksey Kondratenko [ 11/Apr/12 ]
Minor. I'll do that tomorrow as first priority.
Comment by Aleksey Kondratenko [ 12/Apr/12 ]
I've found no further changes to 1.8.1 are needed. 1.8.2 implementation is here http://review.couchbase.org/14827
Comment by Thuan Nguyen [ 20/Apr/12 ]
Integrated in github-ns-server-2-0 #333 (See [http://qa.hq.northscale.net/job/github-ns-server-2-0/333/])
    only reuse tap name when changing vbucket filter. MB-4366 (Revision 61bf78355e64fff2e807939fea385862ca6919d5)
    reimplemented named tap fix for branch-18. MB-4366 (Revision e3b833480ceb5b7832e22131ed5d3fb532e6ea83)

     Result = SUCCESS
Aliaksey Artamonau :
Files :
* src/ns_server_cluster_sup.erl
* src/ebucketmigrator_srv.erl
* src/ns_vbm_sup.erl

Aliaksey Artamonau :
Files :
* src/ns_vbm_new_sup.erl
* src/ns_vbm_sup.erl
* src/ebucketmigrator_srv.erl
* src/ns_server_cluster_sup.erl
* src/cb_gen_vbm_sup.erl
Comment by Thuan Nguyen [ 25/Apr/12 ]
Integrated in github-ns-server-2-0 #337 (See [http://qa.hq.northscale.net/job/github-ns-server-2-0/337/])
    fixed typo in start_vbucket_filter_change. MB-4366 (Revision 5db3c35e8a5ff6a5885271df4466b30c5369fa38)

     Result = SUCCESS
Steve Yen :
Files :
* src/ebucketmigrator_srv.erl
Generated at Sun Apr 20 18:33:44 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.