Seeing com.couchbase.client.core.BackpressureException when querying Couchbase under load

I would like to understand in which scenarios we get a BackpressureException and how to handle it apart from retry logic.

We’re seeing this exception during a load test, and it happens even for a simple bucket.get(“key”) (though there will be hundreds of such calls at any given instant). My understanding is that the Couchbase server is unable to keep up with all the requests sent by the client (Java SDK). Or is it the client that is not able to keep up?
Apart from retry logic, I was hoping we could try tweaking “I/O Thread Pool Size” or “Request Ring Buffer Size” (or would this just delay the BackpressureException from occurring?).

Also, does it make sense to try increasing “Key/Value Endpoints per Node” and “Query Endpoints per Node”? I’d like to understand in which scenarios these help and what implications they have.

Other details

  • Couchbase server 4.5

  • Couchbase Java SDK 2.3.6

  • Using AsyncBucket connection

  • Just bucket.get() and N1QL queries. No document uploads.

Let me know if other details are required.

@sridharvadigi you’ll see BackpressureExceptions when the ringbuffer (which is bounded) is full while you are trying to write into the client. This simply means that you are pushing faster into the client than it can handle to “drain” the ringbuffer.
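
To illustrate the mechanism described above, here is a minimal sketch using only the JDK. It is not the SDK's actual ring buffer, just a bounded queue that rejects a write when it is full. The names `BoundedRequestBuffer` and `BufferFullException` are hypothetical stand-ins for the SDK's internal buffer and its BackpressureException.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical bounded buffer: accepts requests until capacity is reached. */
final class BoundedRequestBuffer {

    /** Hypothetical stand-in for the SDK's BackpressureException. */
    static final class BufferFullException extends RuntimeException {
        BufferFullException(String message) {
            super(message);
        }
    }

    private final BlockingQueue<String> slots;

    BoundedRequestBuffer(int capacity) {
        this.slots = new ArrayBlockingQueue<>(capacity);
    }

    /** Producer side: fails fast instead of blocking when no slot is free. */
    void publish(String request) {
        if (!slots.offer(request)) {
            throw new BufferFullException(
                "buffer full, " + slots.size() + " requests still pending");
        }
    }

    /** Consumer side: removes the oldest pending request, or returns null if empty. */
    String drainOne() {
        return slots.poll();
    }

    int pending() {
        return slots.size();
    }
}
```

With a capacity of 2, a third `publish` fails until `drainOne` frees a slot. In other words, the exception says the producer is outrunning the consumer; by itself it does not say which side is slow.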

Naturally, there are two solutions: 1) back off and retry if you get the exception, and 2) figure out why you are writing too fast. Is it an environmental issue or something else?
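
As a rough sketch of option 1, the helper below retries a call with an exponentially growing delay. It uses only the JDK and blocks the calling thread while waiting, so it is an illustration of the idea rather than something to drop into an async pipeline. `BackoffRetry` and `BufferFullException` are hypothetical names; in real code you would catch the SDK's BackpressureException instead.

```java
import java.util.function.Supplier;

/** Hypothetical helper: retry an operation with exponential backoff. */
final class BackoffRetry {

    /** Hypothetical stand-in for the SDK's BackpressureException. */
    static final class BufferFullException extends RuntimeException {
    }

    private BackoffRetry() {
    }

    /**
     * Calls op until it succeeds or maxAttempts is reached.
     * The delay starts at initialDelayMs and doubles after every rejected attempt.
     */
    static <T> T callWithBackoff(Supplier<T> op, int maxAttempts, long initialDelayMs) {
        long delayMs = initialDelayMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return op.get();
            } catch (BufferFullException e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up and surface the last rejection
                }
                sleep(delayMs);
                delayMs *= 2;
            }
        }
    }

    private static void sleep(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted during backoff", ie);
        }
    }
}
```

Only the "buffer full" case is retried; any other exception propagates immediately, which keeps real failures from being hidden behind retries.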

Can you shed more light on your server and client setup, your throughput/latency expectations, and what you are seeing right now? Once we have more info we can get into tuning, but I’d like to focus on the setup first.

The Couchbase cluster consists of 5 nodes: 3 for data and 2 for the index/query services.
50 GB of RAM is allocated to each node, and each node uses barely 3 GB.
While running the performance test I observed that all nodes are healthy. CPU utilization is also fine (<18%).
Throughput ranges from 600 to 1500 ops/sec. I believe this is not much of a load on the server. There are two main buckets.

On the client side it’s a Java-based REST application (with Akka in the background). We open the connection to each bucket only once, and this bucket connection is shared across all calls. The load test data is set up so that for 1 request we generate approx. 1100 keys and do bucket.get() calls on the first bucket; on the other bucket we fire 20 N1QL queries (almost concurrently), generate approx. 1200 keys, and do bucket.get() calls.

We observe the BackpressureException the moment we go beyond 5 or 6 users (concurrent requests). Since the Couchbase server stats (graphs) seem fine, I am inclined to fine-tune the client SDK.

@sridharvadigi thanks - you mentioned akka. Are you accessing the SDK in a sync or an async fashion? The default ringbuffer size is 16k so you need to have at least 16k open requests queued up before you get a BackpressureException. How do you control the “input path”?
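
One way to control the input path is to cap the number of requests in flight so the caller never has more outstanding than the buffer can hold. The sketch below does this with a JDK Semaphore around any call that returns a CompletableFuture. `InFlightLimiter` is a hypothetical name, and the sketch assumes the SDK's async result has been adapted to a CompletableFuture. Note that `submit` blocks the caller when the limit is reached, which may not suit actor code; a non-blocking variant would queue the work instead.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical limiter: at most maxInFlight async calls outstanding at once. */
final class InFlightLimiter {

    private final Semaphore permits;

    InFlightLimiter(int maxInFlight) {
        this.permits = new Semaphore(maxInFlight);
    }

    /**
     * Waits for a free permit, starts the async call, and returns the permit
     * when the call completes (successfully or with an error).
     */
    <T> CompletableFuture<T> submit(Supplier<CompletableFuture<T>> asyncCall) {
        permits.acquireUninterruptibly(); // blocks the caller while the limit is reached
        CompletableFuture<T> future;
        try {
            future = asyncCall.get();
        } catch (RuntimeException e) {
            permits.release(); // the call never started, so give the permit back
            throw e;
        }
        return future.whenComplete((value, error) -> permits.release());
    }

    int available() {
        return permits.availablePermits();
    }
}
```

A limit well below the ring buffer size (16k by default, as mentioned above) leaves headroom for other callers sharing the same client.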

Oh, one thing you can also try if you mix N1QL and KV is to use 2.4.3, which has better scaling for N1QL automatically. In addition, once upgraded, you can try setting kvEndpoints to a higher value like 4 or so if you have massively parallel KV fetches.
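
For reference, a configuration fragment showing where the kvEndpoints setting goes, assuming the 2.x Java SDK is on the classpath. The class name `TunedClusterFactory` and the `host` parameter are hypothetical, and the value 4 is just the example figure from above, not a recommendation for every workload.

```java
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.env.CouchbaseEnvironment;
import com.couchbase.client.java.env.DefaultCouchbaseEnvironment;

/** Hypothetical factory: builds a cluster handle with more KV sockets per node. */
final class TunedClusterFactory {

    private TunedClusterFactory() {
    }

    static Cluster connect(String host) {
        // The environment should be created once and shared, like the bucket handles.
        CouchbaseEnvironment env = DefaultCouchbaseEnvironment.builder()
            .kvEndpoints(4) // example value from the post above
            .build();
        return CouchbaseCluster.create(env, host);
    }
}
```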

@daschl We are accessing the SDK in an async manner. The input rate is based purely on the number of active threads/actors trying to access data from Couchbase.
Increasing kvEndpoints and the query endpoints has fixed the BackpressureExceptions. Is there a limit to these values?
Also, I thought io-pool-size would play some role here. Can you please explain what these parameters are, or point to some online documentation? From the Couchbase page I get a vague idea of these, but not the complete picture.

Also, we use microservices, all of which are deployed on a cloud platform where they share the underlying hardware resources (CPU/RAM etc.). Each microservice establishes its own connection to Couchbase. Do you see a problem if I apply these settings to each microservice?
For example, microservice A uses 4 kvEndpoints, microservice B uses 4, and so on. We have about 20 microservices. Lately I’m seeing a few issues in my microservices where N1QL queries get held up: they receive no response and time out. The same queries worked earlier, and if run through the workbench they return within 20 ms. I’m not sure whether this is caused by the kvEndpoints and query endpoint changes.