Endpoint not writeable

Another error popped up after I migrated a service to Java SDK 3.0.
We make intensive use of atomic counters. I have a class that does get, getAndTouch and increment.
After the service has been running for a while (hundreds of requests per second), I get a timeout exception with reason Endpoint Not Writable.
The exact logs are:

[com.couchbase.tracing][OverThresholdRequestsRecordedEvent][10s] Requests over Threshold found: [{“top”:[{“operation_name”:“GetRequest”,“server_us”:0,“last_local_id”:“D6021CA500000001/0000000016FCC870”,“last_local_address”:“10.109.35.160:32284”,“last_remote_address”:“10.108.0.96:11207”,“last_dispatch_us”:815997,“last_operation_id”:“0x10870b3”,“total_us”:1926025},{“operation_name”:“GetRequest”,“server_us”:0,“last_local_id”:“D6021CA500000001/0000000016FCC870”,“last_local_address”:“10.109.35.160:32284”,“last_remote_address”:“10.108.0.96:11207”,“last_dispatch_us”:815991,“last_operation_id”:“0x10870b2”,“total_us”:1926022}

and then:

[com.couchbase.config][BucketConfigRefreshFailedEvent] Reason: INDIVIDUAL_REQUEST_FAILED, Type: KV, Cause: com.couchbase.client.core.error.RequestCanceledException: CarrierBucketConfigRequest {“cancelled”:true,“completed”:true,“coreId”:“0xd6021ca500000001”,“idempotent”:true,“reason”:“NO_MORE_RETRIES (ENDPOINT_NOT_WRITABLE)”,“requestId”:17455328,“requestType”:“CarrierBucketConfigRequest”,“retried”:0,“service”:{“bucket”:“ad-stats”,“collection”:“_default”,“opaque”:“0x10a439b”,“scope”:“_default”,“target”:“10.108.0.96”,“type”:“kv”},“timeoutMs”:2500} {“coreId”:“0xd6021ca500000001”}

From this point on, all requests fail.

Note that I tried several KV circuit breaker definitions, disabling the circuit breaker entirely, and several retry strategies (best effort and fail fast).
The previous code with SDK 2 worked fine.

My question is what can cause this issue and how can I avoid it?

Thanks,
Asher

By default, circuit breakers are disabled. From your previous post I saw that you are setting the FailFastStrategy. Are you setting it as the default on the environment? Note that, if so, it is no longer supported and is marked as internal. What is the motivation behind using it?

Separately, can you share debug logs (maybe via PM) from when the issues are happening?

I set the retry strategy on the incrementOptions/getOptions, not on the environment config, and I actually tested bestEffortStrategy, FailFastStrategy and also a custom one. All ended with the same results. I can’t share the logs, since it is a production machine. The logs you see above are from exactly the moment when the endpoint not writable errors start. I can’t understand why I get so many retries and why I even get the timeout. I also played with maxKvConnections (not sure what the number should be).
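
For anyone reading along, here is a rough sketch of why the same "endpoint not writable" condition can show up either as a timeout or as an immediate cancellation with NO_MORE_RETRIES, depending on the retry strategy. This is not the SDK's code: `RetrySketch`, `Strategy`, `Outcome` and `dispatch` are hypothetical names made up for illustration, the clock is simulated, and the backoff values (1 ms doubling up to 500 ms) are assumptions chosen only to make the behavior visible.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.function.BooleanSupplier;

// Illustrative model only; none of these types come from the Couchbase SDK.
final class RetrySketch {

    /** Hypothetical strategy: the delay before the next attempt, or empty to give up. */
    interface Strategy {
        Optional<Duration> nextDelay(int attempt);
    }

    /** Fail fast: never retry, so the first failed dispatch cancels the request. */
    static final Strategy FAIL_FAST = attempt -> Optional.empty();

    /** Best effort: always retry, with exponential backoff capped at 500 ms (assumed values). */
    static final Strategy BEST_EFFORT = attempt ->
            Optional.of(Duration.ofMillis(Math.min(500L, 1L << Math.min(attempt, 9))));

    enum Outcome { SUCCESS, CANCELLED_NO_MORE_RETRIES, TIMEOUT }

    /**
     * Tries to write the request to the endpoint. If the endpoint is not
     * writable, asks the strategy whether to retry; retries continue until
     * the simulated elapsed time reaches the request timeout.
     */
    static Outcome dispatch(BooleanSupplier endpointWritable, Strategy strategy, Duration timeout) {
        long elapsedMs = 0; // simulated clock, no real sleeping
        for (int attempt = 0; ; attempt++) {
            if (endpointWritable.getAsBoolean()) {
                return Outcome.SUCCESS;
            }
            Optional<Duration> delay = strategy.nextDelay(attempt);
            if (!delay.isPresent()) {
                return Outcome.CANCELLED_NO_MORE_RETRIES;
            }
            elapsedMs += delay.get().toMillis();
            if (elapsedMs >= timeout.toMillis()) {
                return Outcome.TIMEOUT;
            }
        }
    }
}
```

In this model, an endpoint that never becomes writable yields `TIMEOUT` under `BEST_EFFORT` and `CANCELLED_NO_MORE_RETRIES` under `FAIL_FAST`. Either way the underlying problem is the same, which would be consistent with switching strategies not changing the end result.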