Numerous threads (cb-timer, cb-events, cb-tracing, cb-orphan, cb-comp, cb-io-kv) observed

Hello,

We work in a multinational company that produces diesel engines and is working on an IoT platform to analyze engine performance based on sensor data. We are using Flink for deploying analytics stream processing jobs. We recently integrated these jobs with Couchbase (serving as a Cache) and are monitoring the performance of these jobs in our test environment.

Couchbase Cluster
Two nodes (4 cpu, 16 GB, Amazon Linux 2, 25 GB Disk, Community Edition 6.5.1 build 6299)

Couchbase SDK
java-client (3.0.10)

Flink Cluster
Two Job Managers (2 cpu, 8 GB, Centos 7, 50 GB Disk, Apache Flink 1.9.0)
Six Task Managers (4 cpu, 16 GB, Centos 7, 50 GB Disk, Apache Flink 1.9.0)

We noticed that after re-deploying our analytics Flink jobs, high CPU usage alert is seen on Flink Job Manager.

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 1831 centos    20   0   11.3g   5.8g   6076 S 162.5 77.3  17181:30 java

In Job Manager log, following trace message was seen several times (357 times and increasing).

2021-01-27 07:05:27,911 ERROR reactor.core.publisher.Operators                              - Operator called default onErrorDropped
com.couchbase.client.core.error.UnambiguousTimeoutException: CarrierBucketConfigRequest, Reason: TIMEOUT {"cancelled":true,"completed":true,"coreId":"0xc7ed561a00000001","idempotent":true,"lastChannelId":"C7ED561A00000001/000000002D9FA607","lastDispatchedFrom":"10.21.151.206:53452","lastDispatchedTo":"couchbase1.dev.io:11210","reason":"TIMEOUT","requestId":1401296,"requestType":"CarrierBucketConfigRequest","retried":0,"service":{"bucket":"service-configuration","collection":"_default","opaque":"0x1561e5","scope":"_default","target":"couchbase1.dev.io","type":"kv"},"timeoutMs":2500,"timings":{"dispatchMicros":3928190}}
        at com.couchbase.client.core.msg.BaseRequest.cancel(BaseRequest.java:163)
        at com.couchbase.client.core.Timer.lambda$register$2(Timer.java:157)
        at com.couchbase.client.core.deps.io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:672)
        at com.couchbase.client.core.deps.io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:747)
        at com.couchbase.client.core.deps.io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:472)
        at com.couchbase.client.core.deps.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:748)
2021-01-27 07:05:30,010 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager  - State change: SUSPENDED

It was observed that the Job Manager process (1831) comprised of 60.9% (873 out of 1433) threads related to couchbase timers of some sort (e.g. cb-timer-1-1, cb-events, cb-tracing-1, cb-orphan-1 etc.). This may be the reason for high CPU usage.

$ ps -eT | grep 1831 | grep -i cb | wc -l
873

$ ps -eT | grep 1831 | wc -l
1433

$ ps -eT | grep 1831 | grep -i cb
 1831 11539 ?        01:45:52 cb-timer-1-1
 1831 11541 ?        00:11:53 cb-events
 1831 11542 ?        00:10:13 cb-tracing-1
 1831 11543 ?        00:10:08 cb-orphan-1
 1831 11545 ?        00:04:47 cb-comp-1
 1831 11546 ?        00:06:41 cb-comp-2
 1831 11547 ?        00:34:57 cb-io-kv-5-1
 1831 11549 ?        00:17:48 cb-io-kv-5-2
 1831 24911 ?        00:43:07 cb-timer-1-1
 1831 24912 ?        00:05:34 cb-events
 1831 24913 ?        00:04:02 cb-tracing-1
 1831 24914 ?        00:04:02 cb-orphan-1
 1831 24915 ?        00:02:35 cb-comp-1
 1831 24916 ?        00:04:17 cb-comp-2
 1831 24917 ?        00:23:41 cb-io-kv-5-1
 1831 24919 ?        00:13:08 cb-io-kv-5-2
 1831 24966 ?        00:44:37 cb-timer-1-1
 1831 24967 ?        00:05:43 cb-events
...

Couchbase SDK Java Client connection code snippet is provided below for reference.

ClusterEnvironment env = ClusterEnvironment.builder()
  .thresholdRequestTracerConfig(ThresholdRequestTracerConfig.emitInterval(Duration.ofHours(1)))
  .orphanReporterConfig(OrphanReporterConfig.emitInterval(Duration.ofHours(1)))
  .ioConfig(IoConfig.maxHttpConnections(maxHttpConnections))
  .build();
Cluster cluster = Cluster.connect(host, ClusterOptions.clusterOptions(user, pwd).environment(env));

Please help in understanding possible reasons for creation of numerous cb-* threads and UnambiguousTimeoutException from Couchbase SDK. If you have any other suggestions or commands to better troubleshoot this issue, those are welcome too.

Thank you.

@vinaya1 it is best practice to only ever open one Cluster instance (including one environment) and share it across the whole instance. Seeing this many threads usually points to the fact that not only one, but many of those instances are somehow created.

Hi Vinaya,

Out of curiosity, what are you using as the Flink source and sink? I ask because there’s an experimental Couchbase connector for Flink that might be useful if you’re using Couchbase as a source or sink.

Thanks,
David

Hi David,

Thank you for the information regarding Flink couchbase connector.
The Flink source and sink in our pipeline is Kafka. Couchbase is being used as a Cache.

Kind regards,
Vinaya

Hi @daschl

Thanks for your response !

We create only one Cluster instance per Analytical Job. Currently, we have ~20 jobs.

After restart of Flink Job Manager, high CPU is no longer seen and all jobs are operating as expected. Further, currently there are no couchbase related threads running on the Job Manager.

$ ps -eT | grep 1791 | grep cb | wc -l
0

$ ps -eT | grep 1791
 1791  1791 ?        00:00:00 java
 1791  1801 ?        00:00:04 java
 1791  1803 ?        00:00:10 java
 1791  1804 ?        00:00:09 java
 1791  1808 ?        00:02:23 VM Thread
 1791  1811 ?        00:00:00 Reference Handl
 1791  1812 ?        00:00:01 Finalizer
 1791  1817 ?        00:00:00 Signal Dispatch
 1791  1818 ?        00:01:51 C2 CompilerThre
 1791  1819 ?        00:00:28 C1 CompilerThre
 1791  1820 ?        00:00:00 Service Thread
 1791  1821 ?        00:33:25 VM Periodic Tas
 1791  2639 ?        02:53:30 flink-scheduler
 1791  2766 ?        00:35:48 flink-akka.remo
 1791  2767 ?        00:35:52 flink-akka.remo
 1791  2865 ?        00:03:13 New I/O worker
 1791  2866 ?        00:03:11 New I/O worker
 1791  2874 ?        00:03:15 New I/O boss #3
 1791  2876 ?        00:03:56 New I/O worker
 1791  2880 ?        00:04:13 New I/O worker
 1791  2881 ?        00:00:00 New I/O server
 1791  3106 ?        00:00:00 org.apache.flin
 1791  3540 ?        00:00:02 java-sdk-http-c
 1791  3654 ?        00:00:00 Curator-Connect
 1791  3674 ?        00:01:37 main-SendThread
 1791  3676 ?        00:00:07 main-EventThrea
 1791  3681 ?        00:00:00 Curator-Framewo
 1791  3684 ?        00:00:00 Timer-0
 1791  3686 ?        00:00:00 BLOB Server lis
 1791  3711 ?        00:00:11 server-timer
 1791  3714 ?        00:02:13 Thread-7
 1791  3719 ?        04:05:51 flink-metrics-s
 1791  3725 ?        02:14:07 flink-metrics-a
 1791  3726 ?        02:13:06 flink-metrics-a
 1791  3728 ?        00:03:17 New I/O worker
 1791  3732 ?        00:03:15 New I/O worker
 1791  3736 ?        00:03:15 New I/O boss #9
 1791  3742 ?        00:03:10 New I/O worker
 1791  3745 ?        00:03:10 New I/O worker
 1791  3746 ?        00:00:00 New I/O server
 1791  4171 ?        00:01:48 flink-rest-serv
 1791  4253 ?        00:04:36 jobmanager-futu
 1791  4255 ?        00:00:00 Curator-PathChi
 1791  4330 ?        00:12:36 Hashed wheel ti
 1791  6994 ?        00:00:37 pool-4-thread-1
 1791  7934 ?        00:00:37 pool-4-thread-2
 1791  7936 ?        00:00:38 pool-4-thread-3
 1791  7937 ?        00:00:37 pool-4-thread-4
 1791  7940 ?        00:00:39 pool-4-thread-5
 1791  7942 ?        00:02:22 flink-rest-serv
 1791  7943 ?        00:02:18 flink-rest-serv
 1791  7971 ?        00:01:49 flink-rest-serv
 1791  7972 ?        00:01:48 flink-rest-serv
 1791  7977 ?        00:00:02 Flink-Dispatche
 1791  7978 ?        00:00:02 Flink-Dispatche
 1791  7979 ?        00:00:02 Flink-Dispatche
 1791  7980 ?        00:13:54 Hashed wheel ti
 1791  7982 ?        00:00:02 Flink-Dispatche
 1791  8046 ?        00:04:31 jobmanager-futu
 1791 29171 ?        00:00:00 FlinkCompletabl
 1791 29176 ?        00:00:00 Checkpoint Time
 1791 29181 ?        00:00:00 Checkpoint Time
 1791 29185 ?        00:00:00 Checkpoint Time
 1791 29188 ?        00:00:00 Checkpoint Time
 1791 29190 ?        00:00:00 Checkpoint Time
 1791 29191 ?        00:00:00 Checkpoint Time
 1791 29204 ?        00:00:00 Checkpoint Time
 1791 29229 ?        00:00:00 Checkpoint Time
 1791 29230 ?        00:00:00 Checkpoint Time
 1791 29231 ?        00:00:00 Checkpoint Time
 1791 29232 ?        00:00:00 Checkpoint Time
 1791 29233 ?        00:00:00 Checkpoint Time
 1791 29234 ?        00:00:00 Checkpoint Time
 1791 29235 ?        00:00:00 Checkpoint Time
 1791 29236 ?        00:00:00 Checkpoint Time
 1791 29237 ?        00:00:00 Checkpoint Time
 1791 29238 ?        00:00:00 Checkpoint Time
 1791 29239 ?        00:00:00 Checkpoint Time
 1791 29240 ?        00:00:00 Checkpoint Time
 1791 29241 ?        00:00:00 Checkpoint Time
 1791 29242 ?        00:00:00 Checkpoint Time
 1791 29243 ?        00:00:00 Checkpoint Time
 1791 29244 ?        00:00:00 Checkpoint Time
 1791 29245 ?        00:00:00 Checkpoint Time
 1791 29246 ?        00:00:00 Checkpoint Time
 1791 29247 ?        00:00:00 Checkpoint Time
 1791 29248 ?        00:00:00 Checkpoint Time
 1791 29249 ?        00:00:00 Checkpoint Time
 1791 29250 ?        00:00:00 Checkpoint Time
 1791 29251 ?        00:00:00 Checkpoint Time
 1791 29252 ?        00:00:00 Checkpoint Time
 1791 29253 ?        00:00:00 Checkpoint Time
 1791 29254 ?        00:00:00 Checkpoint Time
 1791 29255 ?        00:00:00 Checkpoint Time
 1791 29256 ?        00:00:00 Checkpoint Time
 1791 29257 ?        00:00:00 Checkpoint Time
 1791 29258 ?        00:00:00 Checkpoint Time
 1791 29259 ?        00:00:00 Checkpoint Time
 1791 29260 ?        00:00:00 Checkpoint Time
 1791 29261 ?        00:00:00 Checkpoint Time
 1791 29262 ?        00:00:00 Checkpoint Time
 1791 29263 ?        00:00:00 Checkpoint Time
 1791 29264 ?        00:00:00 Checkpoint Time
 1791 29265 ?        00:00:00 Checkpoint Time
 1791 29266 ?        00:00:00 Checkpoint Time
 1791 29267 ?        00:00:00 Checkpoint Time
 1791 29271 ?        00:00:00 Checkpoint Time
 1791 29273 ?        00:00:00 Checkpoint Time
 1791 29275 ?        00:00:00 Checkpoint Time
 1791 29280 ?        00:00:00 Checkpoint Time
 1791 29283 ?        00:00:00 Checkpoint Time
 1791 29287 ?        00:00:00 Checkpoint Time
 1791 29289 ?        00:00:00 Checkpoint Time
 1791 29293 ?        00:06:13 cluster-io-thre
 1791 29296 ?        00:06:13 cluster-io-thre
 1791 26146 ?        00:00:00 flink-akka.acto
 1791 26451 ?        00:00:00 flink-akka.acto
 1791 26672 ?        00:00:00 flink-akka.acto
 1791 27086 ?        00:00:00 flink-akka.acto
 1791 27099 ?        00:00:00 flink-akka.acto
 1791 27623 ?        00:00:00 flink-akka.acto
 1791 27703 ?        00:00:00 flink-akka.acto
 1791 27779 ?        00:00:00 flink-akka.acto

However, we suspect that this may occur again after some days. Do we know under what exceptional scenarios cb-* threads may be created or UnambiguousTimeoutException be thrown from Couchbase SDK?

Kind regards,
Vinaya

Are all jobs running in a single JVM? Since then you would end up with 20 cluster instances, which would explain this. To be clear, the thread pools are pre-allocated during construction, so those would not just “show up” over time. I would recommend to create the Cluster in your JVM as a singleton and share it across jobs so you only ever end up with exactly one instance per JVM to avoid this situation.

The exception you mentioned is just a “timeout” - the specific operation timed out, and in this instance it could be because you create many pools on the fly and it exhausts your resources.