com.couchbase.client.core.RequestCancelledException: Could not dispatch request, cancelling instead of retrying

Hello,

We are seeing a lot of these exceptions during our load tests of 100 writes per second. I went through the existing posts on this topic, but they all seem to be environment-related rather than load-related. I’ve tried doubling the QueryServiceConfig max to 24 (since the queryEndpoints method is deprecated), but that did not help at all.
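For reference, the environment tweak looks roughly like this (a sketch; the QueryServiceConfig package and the min of 1 are my recollection of the 2.5.x API, not verified against this exact build):

```java
import com.couchbase.client.core.env.QueryServiceConfig;
import com.couchbase.client.java.env.CouchbaseEnvironment;
import com.couchbase.client.java.env.DefaultCouchbaseEnvironment;

public class EnvSetup {
    public static void main(String[] args) {
        // Replace the deprecated queryEndpoints(n) with a service config:
        // at most 24 query endpoints; the min of 1 is illustrative.
        CouchbaseEnvironment env = DefaultCouchbaseEnvironment.builder()
            .queryServiceConfig(QueryServiceConfig.create(1, 24))
            .build();
    }
}
```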

com.couchbase.client.core.RequestCancelledException: Could not dispatch request, cancelling instead of retrying.
at com.couchbase.client.core.retry.RetryHelper.retryOrCancel(RetryHelper.java:51)
at com.couchbase.client.core.service.PooledService.send(PooledService.java:331)
at com.couchbase.client.core.node.CouchbaseNode.send(CouchbaseNode.java:200)
at com.couchbase.client.core.node.locate.QueryLocator.locateAndDispatch(QueryLocator.java:72)
at com.couchbase.client.core.RequestHandler.dispatchRequest(RequestHandler.java:250)
at com.couchbase.client.core.RequestHandler.onEvent(RequestHandler.java:201)
at com.couchbase.client.core.RequestHandler.onEvent(RequestHandler.java:77)
at com.couchbase.client.deps.com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:129)
at com.couchbase.client.deps.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:745)

Couchbase server: Version: 4.5.0-2601 Community Edition (build-2601), 3 Nodes
java client: com.couchbase.client.java-client-2.5.0.jar

Much appreciated,

-K

Can you give a bit more info about the configuration and state of the system when you get these? Does your log show any connectivity issues?

Normally, a cancellation indicates either that the request was dispatched and the connection was then disrupted, or that the client couldn’t get it to an endpoint that can handle the request and you’ve configured it to fail fast. If it’s happening under load, it could be an issue in the client or a crash in the query service.

We introduced automatic endpoint pooling fairly recently, so it could be an as-yet-unidentified issue there too. See if there’s anything in the logs that points to this, if you could. For diagnosis, if it’s simple to do, you might also try 2.4.1 with manual endpoint pooling, since that’s the version before automatic pooling was introduced, as indicated in the release notes.
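If it helps, manually pinning the pool on a 2.4.x environment looks something like this (a sketch; `queryEndpoints` is the pre-pooling knob, and the value 12 is arbitrary, not a recommendation):

```java
import com.couchbase.client.java.env.CouchbaseEnvironment;
import com.couchbase.client.java.env.DefaultCouchbaseEnvironment;

public class ManualPooling {
    public static void main(String[] args) {
        // 2.4.x-style manual sizing: a fixed number of query endpoints
        // per node, rather than an automatically managed pool.
        CouchbaseEnvironment env = DefaultCouchbaseEnvironment.builder()
            .queryEndpoints(12) // 12 is an arbitrary illustrative value
            .build();
    }
}
```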

Hello,

I reran the load test using the 2.5.0 driver and placed the TRACE logs here: https://gist.github.com/khalidr/fd75840cc3803f4b133a187b59346ec2. These are the only configurations set on the DefaultCouchbaseEnvironment:
kvTimeout=2500
keepAliveInterval=30000

Thanks for the snippet. Can you give me some context on what’s happening in the app at that point? Everything appears to happen over a ~60ms interval, from bucket open to making the request to seeing the cancellation exception. But there are also a lot of other retries, possibly from another bucket object in use by other threads? The bucket appears to already be opened from the environment.

@daschl or @subhashni may have some other observations from the logs.

Hi ingenthr,

We are running a load test which does a query to get the counts and then updates fields on two documents for a given customer.

Here is the query:

SELECT meta().id as topic, array_count(m.someField) as count
FROM $CURRENT_BUCKET_IDENTIFIER m
WHERE meta(m).id in ["fooId"]

Then we update a document that looks like this:

{"field2":"value",
 "field1":"v2"
}

I use the AsyncBucket.mutateIn method to modify specific fields.

Here is a code snippet (in scala):

val values = partialFormats.writes(partial).as[JsObject].value

def modifyOperation[T](builder: AsyncMutateInBuilder): (String, T) => AsyncMutateInBuilder =
  if (modifyType == Upsert) builder.replace(_, _) else builder.upsert(_, _)

def update = {
  for {
    b <- bucket
    _ <- {
      val builder = b.mutateIn(id)
      values foreach { case (field, js) =>
        modifyOperation(builder)(field, fromJsValue(js))
      }
      builder.execute().toScala
    }
  } yield ()
}.future

if (values.isEmpty) Future.successful(()) else update

The second document looks like this:

{
  "id": "foo",
  "someField": [
    "some_string::123"
  ],
  "type": "Foo"
}

Here, I append values to the someField array if they do not already exist. The entry point is the insert method below, which in turn calls the append method:

// appends/adds the list of Foo to the array
override def insert(e: Foo) = {
  for {
    b <- bucket
    _ <- b.insert(e.asDocument()).toScala.onErrorResumeNext {
      case _: DocumentAlreadyExistsException => append(e)
      case t => Observable.error(t)
    }
  } yield ()
}.future

private def append(foo: Foo) = {
  for {
    b <- bucket
    _ <- {
      val builder = b.mutateIn(id.path)
      foo.someField.foreach { id =>
        builder.arrayAddUnique("someField", id.toString, new SubdocOptionsBuilder().createParents(true))
      }
      builder.execute().toScala
    }
  } yield ()
}

The customers are mutually exclusive, so we need not worry about multiple threads competing to update the same document.

Thanks,

-K

Hi ingenthr,

I’ve updated the above comment with another step that occurs when this error is observed.

Thanks,

-K

@k_reid looking at the logs, for some reason your N1QL queries can’t be dispatched to the target node and are therefore retried. The log is a bit short, and I think the root cause of the issue is happening before the log snippet starts. Could you collect a longer one, with more time before the issue starts happening?

Hi @daschl @ingenthr ,

I upgraded the java-client to 2.5.1, fixed an error that was causing an unnecessary query and the exception went away.

Thanks for your help,

K

Hello @daschl @ingenthr,

This error has resurfaced in our application during a load test at only 100 requests/second. The difference now is that we are doing a create rather than an update. The operations performed are to first insert a new document and then create/append to another document, exactly as stated above. Here are the steps.

1. Get counts

SELECT meta().id as topic, array_count(m.someField) as count
FROM $CURRENT_BUCKET_IDENTIFIER m
WHERE meta(m).id in ["fooId"]

2. Insert a document:

{"field2":"value",
 "field1":"v2",
 "field3":["/a/b/c"]
}

3. Insert/Append to another document:

The document looks like this.

{
  "id": "foo",
  "someField": ["some_string::123"],
  "type": "Foo"
}

Here, I append values to the someField array if they do not already exist. The entry point is the insert method below, which may in turn call the append method:

// appends/adds the list of Foo to the array
override def insert(e: Foo) = {
  for {
    b <- bucket
    _ <- b.insert(e.asDocument()).toScala.onErrorResumeNext {
      case _: DocumentAlreadyExistsException => append(e)
      case t => Observable.error(t)
    }
  } yield ()
}.future

private def append(foo: Foo) = {
  for {
    b <- bucket
    _ <- {
      val builder = b.mutateIn(id.path)
      foo.someField.foreach { id =>
        builder.arrayAddUnique("someField", id.toString, new SubdocOptionsBuilder().createParents(true))
      }
      builder.execute().toScala
    }
  } yield ()
}

I suspect that steps 1 and 3 are not the problem, since my update scenario does them and seems to work fine. I have the TRACE logs, but I do not want to post them here for company security reasons. Is there a way I can send them to someone directly?

Thanks,

-K

If you have an Enterprise Subscription, our support team has a way to receive the logs for analysis.

Otherwise, do you see any indication in the logs of a connection being dropped? Also, does the cluster side show any crashes in its various logs? It feels very much like something is going wrong with the connection, leading to the client’s inability to dispatch. A crash would cause that sort of thing.

Hi @ingenthr,

I think I’ve narrowed things down to being a problem with the query in step 1 above:

SELECT meta().id as topic, array_count(m.someField) as c
FROM $CURRENT_BUCKET_IDENTIFIER m
WHERE meta(m).id in ["fooId"]

When I remove this query, the RequestCancelledException goes away. I’ve also tried using a parameterized query and set adhoc=false like this:

        s"""
           |SELECT meta().id as topic, array_count(m.someField) as c
           |FROM $CURRENT_BUCKET_IDENTIFIER m
           |USE KEYS $$fields""".stripMargin

      N1qlQuery
        .parameterized(
          statement, JsonObject.create().put("fields", subscriptions.map(_.path).asJava),
          N1qlParams.build().consistency(ScanConsistency.REQUEST_PLUS).adhoc(false))

This made the RequestCancelledException go away, but now the HTTP connection to my REST service times out after 1 minute. It looks like the query is not returning in time. Any advice on how to optimize this query?

Thanks,

K

FYI: the bucket name can’t be parameterized because it needs to be an identifier; only values can be parameterized. Did you try providing the actual bucket name?

Hi @vsr1,

It’s a bit hard to read, but I am using Scala string interpolation, which also uses $. The only field being parameterized is fields:
JsonObject.create().put("fields", subscriptions.map(_.path).asJava)

Hello @ingenthr @daschl @vsr1 ,

Help with this issue would be much appreciated.

Thanks,

K

hey @k_reid, apologies for the delay. All of us were very busy last week with Couchbase Connect SV, our event in San Jose.

Based on the above, it looks like you’re not parameterizing the bucket name, which is good. I believe the client would only have a RequestCancelledException in the event that the connection is disrupted (right @daschl?).

Since you can reproduce this pretty much on demand, can you run it and send me the client logs directly at DEBUG/FINE level? It might also be good to get the query.log to correlate, and any indication of whether times are out of sync. I’ll message you a place to send them.

At the moment, it feels like the problem might be something causing the query engine to crash/restart, as the client’s role here is just to package and send simple requests and unpackage the responses. The RequestCancelledException is something you’d see from the client if it writes a request to the network and the other side closes the connection without responding. The logs will help us disprove that theory so we can move on to the next one. :slight_smile:

From the exception stack trace, the PooledService is not able to dispatch, which means that either all sockets are busy or none are available in the first place for N1QL, and the request has been circulating on retry.

Debug logs would be great, and you can also try increasing the query pool size in the environment config to get “a bigger pipe”. This of course only works if the connections work in the first place, which we can find out through the debug logs.

@k_reid shared the logs with me, and looking them over, I think @daschl’s suggestion of being out of connections may be the most likely. Connections look healthy. The thing is, he’s on client 2.5.1, so there should be pooling by default on queryEndpoints, right @daschl?

Note the queries are with request_plus and many of them run for ~1.5s.

A future enhancement, it might be nice to log at the “could not dispatch request” line that we have consumed X out of Y endpoints, or something like that.

@k_reid: to diagnose, add this to your CouchbaseEnvironment:
.queryServiceConfig(QueryServiceConfig.create(10,200))

You’ll want to calculate a realistic number, but if that helps, you just need a wider highway to the query service given the number of concurrent requests.

Ok thanks @ingenthr, I will try this. What is the best way to calculate a realistic number?

In effect, you’ll want each instance to have a minimum tied to the number of concurrent queries you plan to run, with the maximum set as a safety valve so you won’t overwhelm your query service if load goes up beyond what you expect.

For instance, if you had one query node and one application server with one instance of an SDK, something like a min of 10 and a max of 1000 is not bad. Idle connections will be shut down automatically.

But if you had the same parameters and forty application servers, you could go up to 40,000 connections, which may be more than your single query service could handle. That’s especially true if that query service is running on a system with limited resources.
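The worst-case arithmetic above can be sketched as a quick back-of-envelope check (`worstCaseConnections` is a hypothetical helper; the numbers are the ones from this example):

```java
public class PoolSizing {
    // Worst case: every app-server instance opens its configured maximum
    // number of query endpoints at the same time.
    static int worstCaseConnections(int appServers, int maxEndpointsPerInstance) {
        return appServers * maxEndpointsPerInstance;
    }

    public static void main(String[] args) {
        // One app server, max of 1000 -> at most 1000 connections.
        System.out.println(worstCaseConnections(1, 1000));
        // Forty app servers with the same setting -> 40000 connections,
        // likely too many for a single query node.
        System.out.println(worstCaseConnections(40, 1000));
    }
}
```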

Enterprise subscribers can get specific help for their needs from Couchbase directly.

Hello @ingenthr,

The error has not reared its ugly head since I increased the number of query endpoints. If I see it again, I will reach out to you.

Thanks for your help,

K