Index-based queries failing after node removal

cole.s · August 25, 2021, 9:13pm

Hello,
I am running Couchbase Server CE 5.0.2 with the Go SDK client 2.2.2. I have two CB server nodes, both running the data/index/query services and both hosting the same indexes. I connect to the CB cluster via a DNS SRV record. I run into a problem when I replace one of the CB nodes with an identical one and rebalance the old node out: queries from the Go client that use an index silently fail about 50% of the time until the Go client is restarted. My procedure is as follows:

1. Bring up new Couchbase instance
2. Join it to the existing cluster
3. Add new node to the DNS SRV entry
4. Perform rebalance (new node is now fully in cluster)
5. Create indexes on new node (same index definitions as are on the other two nodes)
6. Remove old node from DNS SRV entry
7. Mark same old node for removal
8. Perform rebalance (old node is now gone from cluster)

Once the old node is gone from the cluster, the Go SDK queries silently fail ~50% of the time, as stated above. Restarting my services that use the Go SDK is the only solution I have found; waiting for a number of minutes has not fixed the problem. This isn’t really an acceptable solution in a production environment. It seems as though the Go SDK is somehow stuck using the old node’s indexes even though they are no longer around and there are perfectly serviceable replacement indexes available for it to use. Is there a configuration flag I am missing or something that would fix this behavior?

Thanks,
Cole

chvck · August 26, 2021, 7:53am

Hi @cole.s, first off I should say that 5.0.2 is old now and you should consider upgrading (it’s listed as compatible but not supported with v2 at Compatibility | Couchbase Docs).

That being said, this should still work. The SDK does not care about indexes, only the nodes themselves. A few questions which might help to track this down:

1. What do you mean by silently fail? If they fail they must be returning an error of some sort?
2. Do you have any SDK debug logs that you could share?
3. Are you setting AdHoc to true on the QueryOptions? (By default the SDK uses prepared statements, and I suspect that something odd could be going on here. A simple test for this, if you aren’t setting AdHoc, would be to set it and see if the problem goes away.)
4. Do the queries still work after the first rebalance when you rebalance the new node in but haven’t yet rebalanced the old node out?

The SDK uses a polling mechanism for fetching the cluster topology (2.5s default I think). When you rebalance any nodes in or out of the cluster and the SDK sees that this change has occurred, it will update its internal state to reflect this, removing nodes rebalanced out and adding nodes rebalanced in. Each time that a query is sent, the SDK internally looks at its current list of nodes that support the query service and selects one at random to send the query to.
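
To make the description above concrete, here is a minimal sketch of that behaviour using only the Go standard library. It is not the SDK's actual code: the `nodeList` type, its `update` and `pick` methods, and the node names are all hypothetical, and the polling loop itself is left out so that only the "replace the list, then pick at random" part is shown.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"sync"
)

// nodeList is a hypothetical stand-in for a client's internal view of
// which nodes currently run the query service. It is not an SDK type.
type nodeList struct {
	mu    sync.RWMutex
	nodes []string
}

// update replaces the list with the latest topology, as a poller would
// after noticing that a rebalance has changed the cluster.
func (l *nodeList) update(nodes []string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.nodes = append([]string(nil), nodes...) // copy so callers cannot mutate it
}

// pick returns one query node chosen uniformly at random.
func (l *nodeList) pick() (string, error) {
	l.mu.RLock()
	defer l.mu.RUnlock()
	if len(l.nodes) == 0 {
		return "", errors.New("no query nodes available")
	}
	return l.nodes[rand.Intn(len(l.nodes))], nil
}

func main() {
	l := &nodeList{}
	l.update([]string{"old-node", "node-b"})
	// A later poll sees that old-node was rebalanced out and new-node added.
	l.update([]string{"node-b", "new-node"})

	for i := 0; i < 5; i++ {
		n, err := l.pick()
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Println("query goes to", n)
	}
}
```

With two query nodes in the list, each one receives roughly half of the requests, which is the same proportion as the ~50% failure rate reported above. That would fit a problem tied to one of the two nodes, though the thread does not confirm this.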

cole.s · August 26, 2021, 1:10pm

Thanks for the reply! I appreciate the advice on 5.0.2 - this work I’m doing is actually part of our efforts to start down the upgrade path towards newer versions. In response to your questions:

1. By “silently fail”, I mean that I call Cluster.Query() and it returns a non-nil QueryResult and a nil error, but there are no items actually in the result (even though there definitely should have been). Interestingly, if I call QueryResult.MetaData(), Metrics.ErrorCount on that object is a non-zero value.
2. I do not currently have debug logs but I can attempt to get them.
3. I am not setting AdHoc on QueryOptions. I will set that and see if it changes anything.
4. Yes, the queries work after the new node has been balanced in. It is only once the old node is gone that they stop working.

cole.s · August 26, 2021, 1:13pm

A follow-on regarding setting AdHoc - are there any performance considerations in setting that?

cole.s · August 26, 2021, 4:40pm

Setting AdHoc to true fixed this for me. Assuming there are no significant negative performance impacts (this other forum post, “N1QL - Is it necessary to use parameterized queries with adhoc(false)”, seems to indicate it will be OK), I will proceed with this solution. Thank you!

Kevin.Cherkauer · September 2, 2021, 5:03am

@cole.s Glad you found a solution! Note that starting in 5.0, Couchbase supports indexes with replicas. This is a more integrated solution: the Index nodes understand that an index and all its replicas are one coherent object. With multiple separate but equivalent indexes, like you are using, the system does not understand that, so it may not transition as smoothly when equivalent indexes come and go.

There is a lot more detail on this in these two blog articles:

Diving Into Couchbase Index Replicas:

How to Transition from Equivalent Indexes to Index Replicas:

It is very easy to create replicated indexes as well as control which nodes the index and its replicas go on if desired. Hope you find this helpful.

cole.s · September 2, 2021, 12:46pm

Appreciate the information, @Kevin.Cherkauer. My understanding is that index replicas are an EE-only feature, and I am currently running CE.

Kevin.Cherkauer · September 2, 2021, 4:10pm

@cole.s Thank you for giving Couchbase a whirl! It is true that CE does not have all the features of EE. CE can do a lot for try-out and small workloads but is not really designed for enterprise-scale workloads. Hopefully your business will keep growing.
