Various Issues with SG views/rebalance

SG Version: 2.1

CB Version: 6.0 CE

Number of documents: 33M

Number of servers: 5

RAM/Server: 15GB

Cores/Server: 4

Percentage Resident Documents: 23%

We are noticing that when a server fails, the rebalance causes all of the servers to rebuild the SG views, which maxes out the CPU on all remaining nodes. This almost always causes the rebalance to fail and often leads to more servers failing over. We see a similar pattern when running taxing N1QL queries: they often lead to high CPU usage and servers failing over. Furthermore, during a rebalance the disk usage on several of the nodes often increases so much that we get disk-full warnings followed by failures to write to disk.

We have restored a cluster from a backup and removed all SG views (leaving the CB indexes) so that we can perform some N1QL queries. Once the views were removed, disk usage dropped substantially, CPU usage decreased, and rebalances started completing quickly on the first attempt. N1QL queries were performant and did not cause failovers, the number of resident documents increased substantially, and everything in the cluster appeared to work much better.

We would be interested in switching SG to use indexes rather than views (use_views: false in the config), but we believe that in CE this would require setting index replicas to 0. We are concerned that the loss of a node would then require the indexes to be rebuilt, leaving us in a situation where CPU usage increases and performance degrades beyond an acceptable level.
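For reference, the config change we have in mind would look roughly like the sketch below. It only builds and prints the database section of an SG config. The database name "db" and bucket name "my_bucket" are placeholders, not our real names. use_views and num_index_replicas are the settings as we understand them from the SG docs, so please correct us if we have them wrong.

```python
import json


def sg_database_config(bucket, use_views=False, num_index_replicas=0):
    """Build the per-database section of a Sync Gateway config.

    Assumption: use_views and num_index_replicas are set per database.
    num_index_replicas=0 reflects our belief that CE cannot host
    index replicas.
    """
    return {
        "bucket": bucket,                          # placeholder bucket name
        "use_views": use_views,                    # False = use indexes instead of views
        "num_index_replicas": num_index_replicas,  # 0 = no index replicas
    }


# "db" and "my_bucket" are hypothetical names used only for illustration.
config = {"databases": {"db": sg_database_config("my_bucket")}}
print(json.dumps(config, indent=2))
```

With no replicas, each index would exist on a single node only, which is exactly why we are worried about what happens when that node is lost.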

Perhaps someone can answer the following questions to help us understand what is happening.

  1. Why do the views rebuild when a server is failed over? And is there a way to stop this?

  2. Why does the disk usage increase so much during rebalance/view building?

  3. What would happen in the case of a failover with regard to SG indexes (use_views: false)? Would they rebuild automatically and, if so, what impact would this have on serving requests for documents not yet indexed?

  4. Why are the servers failing over during periods of sustained high CPU usage? I cannot see anything relevant in the logs.

Any help understanding the above would be greatly appreciated.

Nathan