I’m running a 3-node cluster where I’m seeing lots of timeout exceptions:
java.lang.RuntimeException: java.util.concurrent.TimeoutException
at com.couchbase.client.java.util.Blocking.blockForSingle(Blocking.java:75)
at com.couchbase.client.java.CouchbaseBucket.get(CouchbaseBucket.java:128)
at com.couchbase.client.java.CouchbaseBucket.get(CouchbaseBucket.java:123)
While investigating the cause, I found two things that may be different from what they should be:
- There is one node whose disk write queue only seems to grow (currently 623K items) and never drains.
- On this node the projector process appears to be running, but nothing is listening on port 9999 (“netstat -ntpl | grep 9999” returns no result):
ps aux | grep projector
498 18784 0.0 0.0 474172 2880 ? Sl 2015 10:51 /opt/couchbase/bin/projector -kvaddrs=127.0.0.1:11210 -adminport=:9999 127.0.0.1:8091
We are using Couchbase 4.0.0-4051 and Java SDK 2.2.4.
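For reference, the reads that time out are plain blocking gets, roughly like the sketch below (the node addresses, bucket name, and key are placeholders, not our real ones). Raising the key/value timeout globally on the environment or per call only delays the exception rather than fixing the growing write queue, but I’m including it for completeness:

import java.util.concurrent.TimeUnit;
import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.document.JsonDocument;
import com.couchbase.client.java.env.CouchbaseEnvironment;
import com.couchbase.client.java.env.DefaultCouchbaseEnvironment;

public class TimeoutExample {
    public static void main(String[] args) {
        // Raise the default key/value timeout (2.5s) to 10s, in milliseconds.
        CouchbaseEnvironment env = DefaultCouchbaseEnvironment.builder()
                .kvTimeout(10000)
                .build();

        Cluster cluster = CouchbaseCluster.create(env, "node1", "node2", "node3");
        Bucket bucket = cluster.openBucket("myBucket");

        // Per-call override: wait up to 10 seconds for this single get.
        JsonDocument doc = bucket.get("some-key", 10, TimeUnit.SECONDS);

        cluster.disconnect();
        env.shutdown();
    }
}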
From what I have found so far, backoff should kick in once the disk write queue reaches 1M items. And if I understand correctly, removing or manually failing over the node would mean losing the items still in the disk write queue, since they are not replicated/persisted yet.
Does anyone have recommendations on how I can avoid losing the data in the disk write queue? I am considering removing the node (once I am sure no data will be lost), since I want to upgrade to 4.1 anyway.
If the problem is related to the projector not listening: is there a way to (re)start the projector without losing data and without removing the node?
Any help would be greatly appreciated.