[MB-4906] autofailover may fail over two nodes automatically within 1 minute if the master node is failed over and the old master node is elected as the master again Created: 17/Mar/12  Updated: 31/Jan/14  Resolved: 30/Mar/12

Status: Closed
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 1.8.0
Fix Version/s: 1.8.1, 2.0-beta
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Farshid Ghods (Inactive) Assignee: Aliaksey Artamonau
Resolution: Fixed Votes: 0
Labels: 1.8.1-release-notes
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

This issue was reported by a user whose cluster triggered autofailover twice instead of once.

The root cause is still under investigation by Aliaksey.

Comment by Aleksey Kondratenko [ 17/Mar/12 ]
So this is what happened.

* node .238, which is master and runs the autofailover service, starts having networking issues and is split from the rest of the cluster

* cluster elects .240 as new master

* autofailover on new master fails over .238

* network problems on .238 are somehow resolved and its connection to the rest of the cluster is restored

* now cluster has 2 masters briefly: .238 and .240. .240 surrenders mastership to .238

* now .238 is the only master and things are fine, except its autofailover service is not aware of the automatic failover that happened while .238 was disconnected

* when some other node later has problems, .238 fails it over automatically, which is the second automatic failover

So the fix is to make sure the autofailover service always uses the latest autofailover count stored in config.
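
The sequence and the fix can be illustrated with a minimal sketch. This is not the code from src/auto_failover.erl (which is Erlang); it is a Python model under stated assumptions. ClusterConfig, CachedAutoFailover, ConfigBackedAutoFailover, simulate, max_count and the node name "other-node" are all hypothetical names introduced for illustration, and the quota of one automatic failover is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterConfig:
    """Hypothetical stand-in for the config shared by all nodes."""
    auto_failover_count: int = 0
    max_count: int = 1  # assumed quota: one automatic failover


@dataclass
class CachedAutoFailover:
    """Models the bug: the service trusts a count it snapshotted at start."""
    config: ClusterConfig
    count: int = field(init=False)

    def __post_init__(self):
        # Snapshot taken once; never refreshed from config afterwards.
        self.count = self.config.auto_failover_count

    def try_failover(self, node):
        if self.count >= self.config.max_count:
            return False
        self.count += 1
        self.config.auto_failover_count = self.count
        return True


@dataclass
class ConfigBackedAutoFailover:
    """Models the fix: the count is read from config on every decision."""
    config: ClusterConfig

    def try_failover(self, node):
        if self.config.auto_failover_count >= self.config.max_count:
            return False
        self.config.auto_failover_count += 1
        return True


def simulate(service_cls):
    """Replay the timeline: .240 fails over .238, then .238 regains mastership."""
    config = ClusterConfig()
    svc_238 = service_cls(config)  # service on the original master
    svc_240 = service_cls(config)  # service on the newly elected master
    first = svc_240.try_failover(".238")        # happens while .238 is split off
    second = svc_238.try_failover("other-node")  # after .238 is master again
    return [first, second]


if __name__ == "__main__":
    print(simulate(CachedAutoFailover))
    print(simulate(ConfigBackedAutoFailover))
```

In this model the cached variant allows both failovers, because the service on .238 never sees the count written while it was disconnected. The config-backed variant refuses the second one.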

Comment by Aleksey Kondratenko [ 30/Mar/12 ]
Done http://review.couchbase.org/14411
Comment by Aleksey Kondratenko [ 05/Apr/12 ]
Somehow we didn't do it for 1.8.1 but for 1.8.2 instead. Will backport.
Comment by Thuan Nguyen [ 05/Apr/12 ]
Integrated in github-ns-server-2-0 #328 (See [http://qa.hq.northscale.net/job/github-ns-server-2-0/328/])
    MB-4906 Always fetch autofailover count from config. (Revision a7e289d8b9f4e25ef1b4bf06956ad2074f50f0ea)
bp: MB-4906 Always fetch autofailover count from config. (Revision 79739e41a10fda00c537501e1b547baf5ac2c5d6)

     Result = SUCCESS
Aliaksey Kandratsenka :
Files :
* src/auto_failover.erl

Aliaksey Kandratsenka :
Files :
* src/auto_failover.erl
Generated at Sun Sep 14 21:39:09 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.