Couchbase node auto-failed-over and indices out of sync


Today we encountered an incident where one of the three nodes in a Couchbase cluster was auto-failed-over. After delta-recovery and rebalancing the cluster, we found that N1QL was acting strangely, i.e., returning a different number of results for the same query over past data when run repeatedly. We concluded that some of the GSI indexes were somehow corrupted. We tested this on one index by creating another index with the same definition and removing the old one, and that resolved the issue (i.e., the data is returned consistently now).
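For anyone hitting the same thing, the replace-then-drop workaround described above might look like this in N1QL (the bucket, index, and field names below are placeholders, not from the original cluster):

```sql
-- Build a fresh index with the same definition as the suspect one.
CREATE INDEX idx_type_new ON mybucket(type) USING GSI;

-- Once the new index is online, drop the suspect copy.
DROP INDEX mybucket.idx_type USING GSI;
```

Creating the replacement first means queries can keep being served while the old index is removed.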

Is that a known issue? May I know if there is a typical solution to prevent this?

Obviously this has been a while, but I too have experienced this phenomenon more than once.

My setup: a six-node cluster of CE 4.5 on EC2 instances - full ejection, extensive use of views and GSIs, heavy writes. All indexes are duplicated for redundancy across two nodes; in other words, no indexes are stored on the other four.
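For context, the "duplicated on two nodes" setup can be achieved in 4.x by creating two identically defined indexes pinned to specific index nodes via the `nodes` option (index names and host addresses below are placeholders):

```sql
-- Two copies of the same index definition, each pinned to one index node.
CREATE INDEX idx_status_a ON mybucket(status) USING GSI
  WITH {"nodes": ["index-node-1:8091"]};

CREATE INDEX idx_status_b ON mybucket(status) USING GSI
  WITH {"nodes": ["index-node-2:8091"]};
```

The query engine can then scan whichever copy is available if one index node goes down.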

Every now and then a node is auto-failed-over. Why is that, anyway? Sometimes one of the index nodes is the one auto-failed-over. Full recovery and rebalance will bring everything back, but the GSIs are messed up. A simple SELECT query with USE KEYS sometimes returns zero results and sometimes returns something.
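The kind of query that misbehaves is a plain key lookup like the following (bucket and key names are illustrative):

```sql
-- Fetch documents directly by document key.
SELECT * FROM mybucket USE KEYS ["order::1001", "order::1002"];
```

A query of this shape should be deterministic, which is what makes the varying result counts a red flag for the index/metadata state rather than the query itself.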

My conclusion: one copy of the GSIs is bad, but I don't know which copy. I drop one and, if I'm lucky, it's the bad one. Otherwise I have to drop both and recreate them.

I’ve learned to keep copies of the N1QL CREATE INDEX statements, so that if I lose an index node, I can recreate all the indexes in one scan with BUILD INDEX.
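The one-scan rebuild works by creating all the indexes with deferred builds and then building them together, so the bucket is scanned once instead of once per index (index and field names below are placeholders for the saved statements):

```sql
-- Recreate each saved index definition with the build deferred.
CREATE INDEX idx_type   ON mybucket(type)   USING GSI WITH {"defer_build": true};
CREATE INDEX idx_status ON mybucket(status) USING GSI WITH {"defer_build": true};

-- Build all deferred indexes in a single pass over the bucket.
BUILD INDEX ON mybucket(idx_type, idx_status) USING GSI;
```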

So, my questions - is this a known issue with CEs? Am I not utilizing CB cluster the right way? Is there a way to prevent this?

yehtutlwin, did you find a solution?

@mosabusan, we’ll need to take a look at the logs to see what’s gone wrong. Next time it happens, you can collect the logs from the UI and share them with us. There is no known issue with failover in CE 4.5.

Thanks @deepkaran.salooja. Will do.