[MB-5298] Rebalance failed with reason {case_clause,{{ok,replica},{ok,replica}}} when rebalancing out a node which was failed over due to network connectivity issues but re-appears during the rebalance Created: 14/May/12  Updated: 09/Jan/13  Resolved: 17/Sep/12

Status: Closed
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 1.8.1-release-candidate
Fix Version/s: 2.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Karan Kumar (Inactive) Assignee: Aleksey Kondratenko
Resolution: Fixed Votes: 0
Labels: 1.8.1-release-notes
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 1.8.1-815-rel

Attachments: GZip Archive c25ab549-07a1-4980-a8db-c2d0e5d293f1-10.1.3.50-diag.txt.gz     GZip Archive c25ab549-07a1-4980-a8db-c2d0e5d293f1-10.1.3.51-diag.txt.gz     GZip Archive c25ab549-07a1-4980-a8db-c2d0e5d293f1-10.1.3.52-diag.txt.gz     GZip Archive c25ab549-07a1-4980-a8db-c2d0e5d293f1-10.1.3.53-diag.txt.gz     GZip Archive c25ab549-07a1-4980-a8db-c2d0e5d293f1-10.1.3.54-diag.txt.gz     GZip Archive c25ab549-07a1-4980-a8db-c2d0e5d293f1-10.1.3.55-diag.txt.gz     GZip Archive c25ab549-07a1-4980-a8db-c2d0e5d293f1-10.1.3.56-diag.txt.gz     GZip Archive d7d06fcc-8859-47f3-b307-70cc957336a2-10.1.3.50-diag.gz     GZip Archive d7d06fcc-8859-47f3-b307-70cc957336a2-10.1.3.51-diag.gz     GZip Archive d7d06fcc-8859-47f3-b307-70cc957336a2-10.1.3.52-diag.gz     GZip Archive d7d06fcc-8859-47f3-b307-70cc957336a2-10.1.3.53-diag.gz     GZip Archive d7d06fcc-8859-47f3-b307-70cc957336a2-10.1.3.54-diag.gz     GZip Archive d7d06fcc-8859-47f3-b307-70cc957336a2-10.1.3.55-diag.gz     GZip Archive d7d06fcc-8859-47f3-b307-70cc957336a2-10.1.3.56-diag.gz    

 Description   
Failing testcase
failovertests.FailoverTests.test_failover_firewall,replica=3,load_ratio=10

[ns_server:info] [2012-05-14 13:26:46] [ns_1@10.1.3.55:<0.3846.2>:ns_janitor:wait_for_memcached:286] Waiting for "default" on ['ns_1@10.1.3.50','ns_1@10.1.3.51','ns_1@10.1.3.52',
                          'ns_1@10.1.3.54']
[ns_server:debug] [2012-05-14 13:26:46] [ns_1@10.1.3.55:ns_bucket_worker:ns_bucket_sup:update_childs:91] Stopping child for dead bucket: {{per_bucket_sup,"default"},
                                 <0.23773.0>,supervisor,
                                 [single_bucket_sup]}

[ns_server:debug] [2012-05-14 13:26:46] [ns_1@10.1.3.55:<0.23773.0>:single_bucket_sup:top_loop:28] Delegating exit {'EXIT',<0.23699.0>,shutdown} to child supervisor: <0.23774.0>


[error_logger:error] [2012-05-14 13:26:03] [ns_1@10.1.3.50:error_logger:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
  crasher:
    initial call: ns_vbucket_mover:init/1
    pid: <0.29733.2>
    registered_name: []
    exception exit: {case_clause,{{ok,replica},{ok,replica}}}
      in function gen_server:terminate/6
    ancestors: [<0.28176.2>]
    messages: [{'EXIT',<0.19829.3>,
                      {exited,
                          {'EXIT',<0.29733.2>,
                              {case_clause,{{ok,replica},{ok,replica}}}}}},
                  {'EXIT',<0.19792.3>,
                      {exited,
                          {'EXIT',<0.29733.2>,
                              {case_clause,{{ok,replica},{ok,replica}}}}}},
                  {'EXIT',<0.15598.3>,
                      {exited,
                          {'EXIT',<0.29733.2>,
                              {case_clause,{{ok,replica},{ok,replica}}}}}}]
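
For readers unfamiliar with Erlang error terms: a {case_clause, ...} exit means a case expression received a value that none of its clauses matched; here the unmatched value is {{ok,replica},{ok,replica}}, i.e. the vbucket being moved is reported as a replica on both nodes involved. A minimal hypothetical sketch of that failure shape (not the actual ns_vbucket_mover code) follows:

%% Hypothetical sketch only -- not the real ns_vbucket_mover source.
-module(case_clause_sketch).
-export([check_states/2]).

%% A case over the (old owner, new owner) vbucket states during a move.
%% There is no clause for {{ok,replica},{ok,replica}}, which is exactly
%% what a failed-over node re-appearing mid-rebalance can produce, so the
%% calling process exits with {case_clause,{{ok,replica},{ok,replica}}}.
check_states(OldState, NewState) ->
    case {OldState, NewState} of
        {{ok, active},  {ok, replica}} -> ok;
        {{ok, active},  {ok, pending}} -> ok;
        {{ok, dead},    {ok, replica}} -> ok
    end.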

 Comments   
Comment by Aleksey Kondratenko [ 14/May/12 ]
I've added more logging in the affected area(s). Please retest with the latest build.
Comment by Karan Kumar (Inactive) [ 22/May/12 ]
Adding the latest logs from 181-832-rel
Comment by Karan Kumar (Inactive) [ 22/May/12 ]
Failing test:
failovertests.FailoverTests.test_failover_firewall,replica=3,load_ratio=10
Comment by Aleksey Kondratenko [ 23/May/12 ]
Good find. Now we have evidence that one of the previously firewalled and failed-over nodes is being (seemingly gradually) un-firewalled in the middle of the rebalance, when the rest of the cluster has actually already ejected it. But that node only discovers it was ejected a minute after its replicator was able to push-replicate to one of the existing nodes. We're starting to really hit the limits of our naive cluster orchestration approach.
Comment by Aleksey Kondratenko [ 23/May/12 ]
I think a reasonably simple treatment (still partial and naive) is to never restart replication automatically, and only let the janitor restart it. It has some potential data-safety implications, though: the janitor, being really conservative, will in some cases not restart replications that previously would have been automagically restarted. So I'm not sure.
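As a rough illustration of that treatment (purely hypothetical, not ns_server code), the idea is that only an explicit janitor request may restart a stopped replication stream, while automatic restarts after a connection drop are suppressed:

%% Hypothetical sketch of the proposed treatment -- not actual ns_server code.
-module(replication_restart_sketch).
-export([maybe_restart/2]).

%% Only the janitor may restart a stopped replication stream; automatic
%% restarts (e.g. after a dropped connection) are suppressed and left for
%% the janitor to reconcile, which is the data-safety trade-off noted above.
maybe_restart(janitor, Stream)    -> {restarted, Stream};
maybe_restart(automatic, _Stream) -> ignored.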
Comment by Aleksey Kondratenko [ 23/May/12 ]
Pull-based replication, which is part of branch-18, would help here as well. Sadly we were not allowed to have it on branch-181.
Comment by Aleksey Kondratenko [ 23/May/12 ]
Need a PM decision here.
Comment by Aleksey Kondratenko [ 24/May/12 ]
May I ask for testing whether this is a regression? My understanding is that it's not.
Comment by Farshid Ghods (Inactive) [ 24/May/12 ]
Should we run this against 1.8.0?
Comment by Aleksey Kondratenko [ 24/May/12 ]
If you can, please do. I'm pretty sure you'll hit this in 1.8.0 because the failover & replication logic (hint: no quick failover was allowed for 1.8.1) is the same.
Comment by Karan Kumar (Inactive) [ 29/May/12 ]
We are hitting this pretty consistently on 1.8.1.

Comment by Aleksey Kondratenko [ 29/May/12 ]
Just stop this questionable practice of abusing the firewall.
Comment by Dipti Borkar [ 29/May/12 ]
What are our options to fix this on 1.8.1? Given that the likelihood of hitting this is higher, we should try to fix it.
Comment by Aleksey Kondratenko [ 29/May/12 ]
My understanding is that the probability of hitting this in practice approaches zero. We have had this issue since 1.6.0, yet nobody has reported this problem.
Comment by Farshid Ghods (Inactive) [ 29/May/12 ]
This is not about abusing the firewall. It's about a node coming back up or re-appearing after it has been failed over.

If this is purely due to the firewall, then it's OK to defer this.
Comment by Aleksey Kondratenko [ 29/May/12 ]
Your specific way of using the firewall makes this problem likely to be observed. That's my understanding.
Comment by Farshid Ghods (Inactive) [ 29/May/12 ]
The firewall is our way of simulating a node disappearing and re-appearing. We can also simulate that by shutting down the network interface or pulling the network cable, if that helps.
Comment by Aleksey Kondratenko [ 29/May/12 ]
I have evidence that you're re-enabling traffic through the firewall in a very specific way. In particular, memcached traffic is re-enabled first, and then _minutes_ later you re-enable Erlang traffic.
Comment by Dipti Borkar [ 30/May/12 ]
Will defer to a later release.
Comment by Aleksey Kondratenko [ 17/Sep/12 ]
Cannot happen with the replicator running on the destination.