Get from replica not working as expected

dgrizzanti · January 4, 2018, 4:15pm

Hi All,

Hoping to get some help with an issue we’re seeing when getting documents from a replica after a node goes down. Details are summarized below.

Thanks!

Setup:

5 node cluster
Number of replicas set to 3 on each bucket
Auto-failover set to 2 minutes
Confirmed auto-failover works when taking single node out of the cluster
Confirmed data is replicated by taking a node out and rebalancing on remaining nodes

Assumptions:

Our assumption is that we can keep functioning by staying connected to the 4 live nodes and pulling documents from a replica when needed (by first attempting a regular get, then recovering from the failure with a get from replica)
What we’re seeing is that only 4/5 of the data is available after 1 node goes down, despite using the getFromReplica methods provided in the Java client

What we’re trying to achieve:

Application can handle a single node going down without being affected
- Connection attempt failures to the downed node are fine, but ideally the application would recover from failed document lookups by getting from the replica on a live node
- Eventually the node would auto-failover
In the event two nodes go down and auto-failover does not occur, we could still run with the remaining 3 nodes by getting existing data from the replicas until someone intervenes to manually failover and rebalance

Snippet of code we’re using to get from replica

asyncBucket.get(id, classOf[RawJsonDocument]).onErrorResumeNext(async.getFromReplica(id, ReplicaMode.ALL, classOf[RawJsonDocument])).singleOption

k_reid · January 22, 2018, 2:39pm

@ingenthr @vsr1 @daschl Can any of you help with this?

ingenthr · January 23, 2018, 1:12am

I think your assumptions are valid @dgrizzanti and what you’re trying to achieve is reasonable. I don’t see a description of how you’re triggering the failure or what behavior you’re seeing though.

I’ll defer to @daschl, but it could be related to the .onErrorResumeNext() in that it depends on how things are failing. In the case of a fall-off-the-network down node, the TCP connection is still half open, so the failure mode would be TimeoutException for a while. The problem may be that your default timeout for the overall get is the same timeout value.

So, you may want to revisit how you’re creating the failure (the best approach is to either down the network interface or have a firewall drop packets) and make sure that’s triggering the error you expect before chaining in the what-to-do-next.

Do note that in addition to TimeoutException, you can also see a CancellationException. The difference is that on timeout, the SDK is indicating it doesn’t know what has actually happened with the operation, while on cancellation, it’s telling you that it wasn’t sent to the network.
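
To make the point about error types concrete, here is a minimal sketch of the fallback pattern using plain Scala futures. The `Fetch` type and the `fromActive`/`fromReplica` functions are hypothetical stand-ins, not the Couchbase SDK API. The idea is only that the replica read runs when the primary read actually fails with one of the two exceptions mentioned above, which is worth confirming in a test before wiring it into the real client.

```scala
import java.util.concurrent.{CancellationException, TimeoutException}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ReplicaFallbackSketch {
  // Hypothetical stand-in for an SDK read: document id => optional JSON body.
  type Fetch = String => Future[Option[String]]

  // Try the active copy first; fall back to the replica only on the
  // two failure types discussed above. Other errors propagate unchanged.
  def getWithReplicaFallback(id: String,
                             fromActive: Fetch,
                             fromReplica: Fetch): Future[Option[String]] =
    fromActive(id).recoverWith {
      case _: TimeoutException | _: CancellationException => fromReplica(id)
    }

  def main(args: Array[String]): Unit = {
    // Simulated "node is down": the active read fails with a timeout.
    val downNode: Fetch = _ => Future.failed(new TimeoutException("active node unreachable"))
    // Simulated replica read that succeeds.
    val replica: Fetch = id => Future.successful(Some(s"""{"id":"$id"}"""))

    val result = Await.result(getWithReplicaFallback("doc-1", downNode, replica), 1.second)
    println(result)
  }
}
```

If the simulated active read is changed to fail with a different exception type, the fallback does not run and the error surfaces to the caller, which is a quick way to see whether the failure you are injecting is the one the chain expects.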

dgrizzanti · January 23, 2018, 1:34pm

@ingenthr thanks for getting back to me. I should have given the sample failure scenario we tried in the original description, but will try to describe that now.

In order to test a scenario where a node failure occurs, we did the following:

Started with 3 active nodes, each bucket’s replica count set to 3
Auto-failover turned off
Created 1k documents while all 3 nodes were active
While no processes were running trying to access this data, we shut down the Couchbase process on node 1
Ran a script that uses getFromReplica to try to retrieve all 1k documents across the remaining 2 nodes

In the scenario above, when 1 node was down we would always get 2/3 of the documents returned. This is not our normal use case, but we wanted something as straightforward as possible to test out the getFromReplica option.

Thanks
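
One observation on the numbers, offered as an assumption rather than a diagnosis: if active vBuckets are spread evenly across the nodes, the share of documents whose active copy lives on a surviving node is simply (nodes − down) / nodes. That matches both the 4/5 and the 2/3 figures reported in this thread, which would be consistent with only the active reads succeeding. A small sketch of that arithmetic (the function name is made up for illustration):

```scala
object ActiveShareSketch {
  // Fraction of documents whose ACTIVE copy is on a node that is still up,
  // assuming active vBuckets are distributed evenly across all nodes.
  def activeShare(totalNodes: Int, downNodes: Int): Double =
    (totalNodes - downNodes).toDouble / totalNodes

  def main(args: Array[String]): Unit = {
    println(activeShare(5, 1)) // 5-node cluster from the original post, 1 node down
    println(activeShare(3, 1)) // 3-node test scenario, 1 node down
  }
}
```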
