Java client not recovering after failover (server shutdown)

yorugua · April 10, 2015, 6:44pm

I have 6 Couchbase nodes with 3 replicas and auto-failover enabled. If I stop one node with “service couchbase-server stop”, the failover runs OK and the Java client gets notified and removes the node from its internal map.
2015-04-10 15:38:50 c.c.client.core.node.Node [INFO] Disconnected from Node xxx

If, instead of stopping the couchbase-server service, I shut down the machine, the Java client never gets notified and keeps trying to hit the dead node on every request. The auto-failover happens, but the client does not update its internal node map.

Any ideas on this?

I am using Couchbase Server 3.0.1 on 64-bit Ubuntu and 3.0.3 Enterprise.
Couchbase Java Client 2.1.2

Thanks

ingenthr · April 10, 2015, 7:41pm

What’s the workload? If you’re shutting down the server with “shutdown -h” or the like, the behavior should be the same, but simulating a failure is rather different. You would not get a TCP RST message, so it may take a certain amount of ‘failed’ workload for the client to update its internal map. There’s a backstop as well that should eventually get the client to update.

The scenario you describe is like a whole section of our testing, so I’m pretty confident it’s correct. Our test has a basic workload. More info on the scenario would be appreciated.
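
To illustrate the point about the missing TCP RST: the sketch below uses only plain JDK sockets on loopback, not the Couchbase SDK. The class name `FailureDetectionDemo`, the method `observe` and the 300 ms read timeout are made up for this illustration. It approximates a graceful service stop by having the peer close its socket, and a hard shutdown by having the peer stay silent, and reports what the connected side sees in each case.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

class FailureDetectionDemo {

    /**
     * Opens a loopback connection and reports what the client side observes.
     * peerClosesSocket = true  approximates a graceful stop (the peer closes its socket).
     * peerClosesSocket = false approximates a hard shutdown (the peer just goes silent).
     */
    static String observe(boolean peerClosesSocket) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback);
             Socket client = new Socket(loopback, server.getLocalPort());
             Socket peer = server.accept()) {
            if (peerClosesSocket) {
                peer.close(); // the client is told about the close by the TCP stack
            }
            client.setSoTimeout(300); // hypothetical read timeout for the demo
            try {
                int b = client.getInputStream().read();
                return b == -1 ? "closed" : "data";
            } catch (SocketTimeoutException e) {
                // Nothing arrived: only the timeout reveals that something is wrong.
                return "timeout";
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("graceful stop: " + observe(true));
        System.out.println("silent peer:   " + observe(false));
    }
}
```

In the first case the client learns about the close right away. In the second it learns nothing until its own timeout expires, which is why a hard shutdown has to be detected through failed operations or some other backstop.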

yorugua · April 10, 2015, 7:54pm

I am shutting it down by stopping the EC2 instance through the AWS management console.
The workload that I am using in this particular case is low, just a few requests, because I was trying to test that particular scenario.

I will try with a heavy workload.

Thanks!

yorugua · April 10, 2015, 8:18pm

I tried with a heavier workload hitting couchbase with 800 get ops per second for 60 seconds without any success. The client does not update its configuration unless I restart it.

yorugua · April 13, 2015, 1:42pm

I think that it has something to do with the Carrier Publication. I disabled it and it worked as expected.
CouchbaseEnvironment environment = DefaultCouchbaseEnvironment
    .builder()
    .bootstrapCarrierEnabled(false)
    .bootstrapHttpDirectPort(8080)
    .build();

But this is a workaround.
When using the carrier publication the client never gets notified when a node shuts down in a “hard” fashion or if it has network issues.

Now I am testing it with a very simple scenario. One client and 2 couchbase nodes doing the failover manually.
Could you test the same scenario? An easy way to test it is by disconnecting one couchbase node from the network.

Thanks in advance.

yorugua · April 13, 2015, 2:35pm

Adding more information to the issue.
In the logs I see a keepalive request every 20-30 seconds, with neither errors nor responses.
2015-04-13 11:28:48 c.c.c.c.e.AbstractGenericHandler [DEBUG] [node-a/10.10.9.135:11210][KeyValueEndpoint]: KeepAlive fired.
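
The log line above shows keepalives being fired, but with no response and no error nothing ever flags the node as unhealthy. As a purely hypothetical sketch (this is not how the SDK is implemented; `KeepAliveWatchdog` and its threshold are invented for illustration), unanswered keepalives could be counted like this to decide when a node should be treated as suspect:

```java
class KeepAliveWatchdog {

    private final int maxUnanswered; // hypothetical threshold
    private int unanswered;

    KeepAliveWatchdog(int maxUnanswered) {
        this.maxUnanswered = maxUnanswered;
    }

    /** Call whenever a keepalive request is written to the socket. */
    void keepAliveSent() {
        unanswered++;
    }

    /** Call whenever any response arrives from the node. */
    void responseReceived() {
        unanswered = 0;
    }

    /** True once maxUnanswered keepalives in a row have gone unanswered. */
    boolean isSuspect() {
        return unanswered >= maxUnanswered;
    }
}
```

With a threshold of 3 and keepalives every 20-30 seconds, such a counter would flag the unplugged node after roughly a minute to a minute and a half of silence.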

daschl · April 14, 2015, 9:31am

@yorugua is it possible for you to share the code that you are using and the steps to reproduce? That would greatly help. Also, if you can share TRACE level logging that would be great.

If you don’t want to share it publicly you can also drop me an email.

yorugua · April 14, 2015, 3:02pm

Document with id A in Node 1
Document with id B in Node 2
Api client with JAVA sdk 2.1.2

Get key A (OK)
Get key B (OK)

Unplug Node 2 network cable

Get key A (OK)
Get key B (FAIL expected until failover)
java.lang.RuntimeException: java.util.concurrent.TimeoutException
at com.couchbase.client.java.util.Blocking.blockForSingle(Blocking.java:93) ~[java-client-2.1.2.jar:2.1.2]

Failover Node 2 (without doing the rebalance)

Get key A (OK)
Get Key B (FAIL not expected behavior)
java.lang.RuntimeException: java.util.concurrent.TimeoutException
at com.couchbase.client.java.util.Blocking.blockForSingle(Blocking.java:93) ~[java-client-2.1.2.jar:2.1.2]

The same happens in the cloud when stopping the EC2 instance of Node 2.
However, if I stop the couchbase-server service on Node 2 with “service couchbase-server stop”, it works as expected.

If I disable the carrier bootstrap, all scenarios work as expected. But that is a workaround, not the intended setup.
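
The steps above can be restated as a toy model, using plain Java collections and no Couchbase classes. Everything here (`ToyClusterClient`, `unplug`, `applyFailoverConfig`) is hypothetical and only mimics the observable behavior: a get succeeds when the node the client believes owns the key is reachable, and the client's view only changes when a new configuration is applied.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class ToyClusterClient {

    // The client's view of which node owns which key.
    private final Map<String, String> keyToNode = new HashMap<>();
    // Nodes that are actually reachable over the network.
    private final Set<String> reachable = new HashSet<>();

    ToyClusterClient() {
        keyToNode.put("A", "node1");
        keyToNode.put("B", "node2");
        reachable.add("node1");
        reachable.add("node2");
    }

    /** Simulates pulling the network cable of a node. */
    void unplug(String node) {
        reachable.remove(node);
    }

    /** Simulates the client receiving the post-failover configuration. */
    void applyFailoverConfig(String failedNode, String survivor) {
        keyToNode.replaceAll((key, node) -> node.equals(failedNode) ? survivor : node);
    }

    String get(String key) {
        return reachable.contains(keyToNode.get(key)) ? "OK" : "TIMEOUT";
    }
}
```

In this model, the failing step in the report corresponds to the server having failed over Node 2 while `applyFailoverConfig` was never called on the client, so key B keeps timing out.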

Code:

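import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.document.JsonDocument;
import com.couchbase.client.java.env.CouchbaseEnvironment;
import com.couchbase.client.java.env.DefaultCouchbaseEnvironment;

public class CouchbaseRepository {

    private Cluster cluster;
    private Bucket bucket;

    public CouchbaseRepository() {
        // Initialization of cluster and bucket
        CouchbaseEnvironment environment = DefaultCouchbaseEnvironment
                .builder().requestBufferSize(16384)
                .build();
        cluster = CouchbaseCluster.create(environment, "10.10.8.189,10.10.9.135");
        bucket = cluster.openBucket("default");
    }

    public JsonDocument getByKey(String key) {
        JsonDocument doc = bucket.get(key);
        return doc;
    }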

…

Thanks!

daschl · April 14, 2015, 3:25pm

Would it be possible for you to also share the logs (trace) as well?

Also btw, how are you executing the load? How many ops/s?

yorugua · April 14, 2015, 3:44pm

I run the load with Apache Benchmark (ab), so I tested it with different numbers of requests and threads, sometimes up to 5000 ops per second.

I store the objects with the following method

public MyObject store(MyObject object) throws Exception {
    // Use an atomic counter to generate the next document id
    JsonLongDocument doc = bucket.counter("mycounter::", 1, 1);
    Long id = doc.content();
    object.setId(id);
    ObjectMapper mapper = new ObjectMapper();

    // Serialize the object to JSON and convert it into a JsonObject
    String json = mapper.writeValueAsString(object);
    JsonTranscoder tr = new JsonTranscoder();
    JsonObject jsonObject = tr.stringToJsonObject(json);
    JsonDocument myDoc = JsonDocument.create(KEY_PREFIX + id, jsonObject);
    bucket.insert(myDoc);
    return object;
}

I’ll grab the logs and send them to you.

Thanks
