Details
Description
Setup
1. Set up an 18-node cluster with 2 buckets: bucket1, bucket2
2. Enable auto-failover
3. Add a new node 126
4. Rebalance
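For anyone scripting steps 2 to 4, here is a minimal sketch of the REST calls involved. It assumes the standard Couchbase admin endpoints `/settings/autoFailover`, `/controller/addNode` and `/controller/rebalance`; the admin host, credentials and the 30-second timeout are placeholders that do not come from this report. Check the parameter names against the server version in use. The sketch only builds and prints the requests, it does not send them.

```python
from urllib.parse import urlencode

# Placeholders for illustration only (assumptions, not values confirmed by the report).
ADMIN_HOST = "10.3.2.104:8091"
NEW_NODE = "10.3.121.126"


def repro_requests(known_nodes, user="Administrator", password="password"):
    """Return (path, form-encoded body) pairs for steps 2-4, in order."""
    otp_new = "ns_1@" + NEW_NODE
    return [
        # Step 2: enable auto-failover (timeout value is an assumption).
        ("/settings/autoFailover",
         urlencode({"enabled": "true", "timeout": 30})),
        # Step 3: add the new node to the cluster.
        ("/controller/addNode",
         urlencode({"hostname": NEW_NODE, "user": user, "password": password})),
        # Step 4: rebalance, keeping all known nodes plus the new one.
        ("/controller/rebalance",
         urlencode({"knownNodes": ",".join(known_nodes + [otp_new]),
                    "ejectedNodes": ""})),
    ]


if __name__ == "__main__":
    for path, body in repro_requests(["ns_1@10.3.2.104"]):
        print("POST http://%s%s  %s" % (ADMIN_HOST, path, body))
```

In a real run, `known_nodes` would list all existing cluster nodes, matching the KeepNodes list in the rebalance log.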
Output
1. Rebalance works fine, but these log messages appear:
Could not automatically failover node 'ns_1@10.3.121.126' because I think rebalance is running auto_failover000 ns_1@10.3.2.104 19:32:12 - Sun Jun 17, 2012
Bucket "bucket1" loaded on node 'ns_1@10.3.121.126' in 0 seconds. ns_memcached001 ns_1@10.3.121.126 19:32:04 - Sun Jun 17, 2012
Started rebalancing bucket bucket2 ns_rebalancer000 ns_1@10.3.2.104 19:31:36 - Sun Jun 17, 2012
Starting rebalance, KeepNodes = ['ns_1@10.3.2.85','ns_1@10.3.2.86',
'ns_1@10.3.2.87','ns_1@10.3.2.88',
'ns_1@10.3.2.89','ns_1@10.3.2.104',
'ns_1@10.3.2.105','ns_1@10.3.2.106',
'ns_1@10.3.2.108','ns_1@10.3.2.109',
'ns_1@10.3.2.110','ns_1@10.3.2.111',
'ns_1@10.3.2.112','ns_1@10.3.2.113',
'ns_1@10.3.2.114','ns_1@10.3.2.115',
'ns_1@10.3.121.126'], EjectNodes = []
Attached are the web-logs and logs from master node-104.
https://s3.amazonaws.com/bugdb/jira/web-log-largeCluster/ns-diag-20120618095246.txt
https://s3.amazonaws.com/bugdb/jira/web-log-largeCluster/10.3.2.104-8091-diag.txt.gz
Other related conversation
I have enabled auto-failover on the large cluster, and every time I rebalance in a node I get an error message: "Could not automatically failover node 'ns_1@10.3.121.126' because I think rebalance is running".
Node 126 is newly added and a rebalance has been issued. Is this message displayed because the node is not yet ready to join the cluster?
The rebalance works fine, but I do not understand why auto-failover is attempted here. Any idea?
No. According to the logs, bucket1 was loaded at 19:32:04. Maybe some other buckets are still not ready on this node. May I have the logs?
This happens when the cluster has two or more buckets, auto-failover is enabled, and both buckets hold a significant amount of data.
After the first bucket has been rebalanced out, the auto-failover service incorrectly interprets the node as down, because the node no longer has all the buckets the service thinks (incorrectly) it should have.
Normally a running rebalance prevents auto-failover from actually doing anything, but if the rebalance is stopped, the partially rebalanced-out node will be automatically failed over.
Seen here: https://s3.amazonaws.com/bugdb/jira/web-log-largeCluster/10.3.2.104-8091-diag.txt.gz
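To make the diagnosis above concrete, here is a small model of the behaviour as described in this ticket. It is a sketch of the reported logic, not the actual ns_server code; the function names and the bucket-readiness representation are hypothetical.

```python
def node_looks_down(expected_buckets, ready_buckets):
    """Faulty health check as described above: a node counts as down
    unless every bucket configured in the cluster is ready on it."""
    return not set(expected_buckets) <= set(ready_buckets)


def auto_failover_tick(node, expected_buckets, ready_buckets, rebalance_running):
    """Return what the auto-failover service would do for one node."""
    if not node_looks_down(expected_buckets, ready_buckets):
        return "healthy"
    if rebalance_running:
        # Matches the message seen in the web log.
        return ("Could not automatically failover node '%s' "
                "because I think rebalance is running" % node)
    # Rebalance stopped while the node is only partially rebalanced.
    return "failover"


if __name__ == "__main__":
    node = "ns_1@10.3.121.126"
    buckets = ["bucket1", "bucket2"]
    # bucket1 is loaded on the node, bucket2 is not there yet.
    print(auto_failover_tick(node, buckets, ["bucket1"], rebalance_running=True))
    print(auto_failover_tick(node, buckets, ["bucket1"], rebalance_running=False))
```

The second call shows the dangerous case: once the rebalance is stopped, nothing blocks the failover of a node that is only missing a bucket because it is mid-rebalance.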