Hello,
We work at a multinational company that produces diesel engines and is building an IoT platform to analyze engine performance based on sensor data. We use Flink to deploy our stream-processing analytics jobs. We recently integrated these jobs with Couchbase (used as a cache) and are monitoring their performance in our test environment.
Couchbase Cluster
Two nodes (4 cpu, 16 GB, Amazon Linux 2, 25 GB Disk, Community Edition 6.5.1 build 6299)
Couchbase SDK
java-client (3.0.10)
Flink Cluster
Two Job Managers (2 cpu, 8 GB, Centos 7, 50 GB Disk, Apache Flink 1.9.0)
Six Task Managers (4 cpu, 16 GB, Centos 7, 50 GB Disk, Apache Flink 1.9.0)
We noticed that after re-deploying our analytics Flink jobs, a high CPU usage alert is raised on the Flink Job Manager.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1831 centos 20 0 11.3g 5.8g 6076 S 162.5 77.3 17181:30 java
In the Job Manager log, the following error and stack trace appeared repeatedly (357 times and counting).
2021-01-27 07:05:27,911 ERROR reactor.core.publisher.Operators - Operator called default onErrorDropped
com.couchbase.client.core.error.UnambiguousTimeoutException: CarrierBucketConfigRequest, Reason: TIMEOUT {"cancelled":true,"completed":true,"coreId":"0xc7ed561a00000001","idempotent":true,"lastChannelId":"C7ED561A00000001/000000002D9FA607","lastDispatchedFrom":"10.21.151.206:53452","lastDispatchedTo":"couchbase1.dev.io:11210","reason":"TIMEOUT","requestId":1401296,"requestType":"CarrierBucketConfigRequest","retried":0,"service":{"bucket":"service-configuration","collection":"_default","opaque":"0x1561e5","scope":"_default","target":"couchbase1.dev.io","type":"kv"},"timeoutMs":2500,"timings":{"dispatchMicros":3928190}}
at com.couchbase.client.core.msg.BaseRequest.cancel(BaseRequest.java:163)
at com.couchbase.client.core.Timer.lambda$register$2(Timer.java:157)
at com.couchbase.client.core.deps.io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:672)
at com.couchbase.client.core.deps.io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:747)
at com.couchbase.client.core.deps.io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:472)
at com.couchbase.client.core.deps.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
2021-01-27 07:05:30,010 INFO org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
We observed that 873 of the 1433 threads (60.9%) in the Job Manager process (PID 1831) are Couchbase SDK threads (e.g. cb-timer-1-1, cb-events, cb-tracing-1, cb-orphan-1). The same thread names appear many times over, which may be the reason for the high CPU usage.
$ ps -eT | grep 1831 | grep -i cb | wc -l
873
$ ps -eT | grep 1831 | wc -l
1433
$ ps -eT | grep 1831 | grep -i cb
1831 11539 ? 01:45:52 cb-timer-1-1
1831 11541 ? 00:11:53 cb-events
1831 11542 ? 00:10:13 cb-tracing-1
1831 11543 ? 00:10:08 cb-orphan-1
1831 11545 ? 00:04:47 cb-comp-1
1831 11546 ? 00:06:41 cb-comp-2
1831 11547 ? 00:34:57 cb-io-kv-5-1
1831 11549 ? 00:17:48 cb-io-kv-5-2
1831 24911 ? 00:43:07 cb-timer-1-1
1831 24912 ? 00:05:34 cb-events
1831 24913 ? 00:04:02 cb-tracing-1
1831 24914 ? 00:04:02 cb-orphan-1
1831 24915 ? 00:02:35 cb-comp-1
1831 24916 ? 00:04:17 cb-comp-2
1831 24917 ? 00:23:41 cb-io-kv-5-1
1831 24919 ? 00:13:08 cb-io-kv-5-2
1831 24966 ? 00:44:37 cb-timer-1-1
1831 24967 ? 00:05:43 cb-events
...
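To see which thread families are multiplying, the names in a listing like the one above could be grouped by prefix. The following is only a rough sketch in plain Java: the class name CbThreadCounter and its helper methods are hypothetical, not part of the Couchbase SDK or Flink, and it assumes that trailing numeric segments (such as the "-5-1" in cb-io-kv-5-1) are only counters.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical helper for grouping Couchbase SDK thread names by family.
class CbThreadCounter {

    // Strips trailing numeric segments, e.g. "cb-io-kv-5-1" -> "cb-io-kv".
    static String family(String threadName) {
        return threadName.replaceAll("(-\\d+)+$", "");
    }

    // Counts threads per family, keeping only names that start with "cb-".
    static Map<String, Long> countByPrefix(Collection<String> threadNames) {
        return threadNames.stream()
                .filter(name -> name.startsWith("cb-"))
                .collect(Collectors.groupingBy(
                        CbThreadCounter::family, TreeMap::new, Collectors.counting()));
    }

    // Names of all threads alive in the JVM this code runs in.
    static List<String> liveThreadNames() {
        return Thread.getAllStackTraces().keySet().stream()
                .map(Thread::getName)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Sample names copied from the ps output; a real run could pass
        // liveThreadNames() or names parsed from a thread dump instead.
        List<String> sample = Arrays.asList(
                "cb-timer-1-1", "cb-events", "cb-io-kv-5-1", "cb-io-kv-5-2",
                "cb-timer-1-1", "cb-events");
        countByPrefix(sample).forEach(
                (name, count) -> System.out.println(name + " " + count));
    }
}
```

If each family count turns out to be a multiple of the per-connection thread set, that would suggest several SDK environments are alive in the same JVM at once, although we have not confirmed this.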
The connection code we use with the Couchbase Java SDK is shown below for reference.
ClusterEnvironment env = ClusterEnvironment.builder()
    .thresholdRequestTracerConfig(ThresholdRequestTracerConfig.emitInterval(Duration.ofHours(1)))
    .orphanReporterConfig(OrphanReporterConfig.emitInterval(Duration.ofHours(1)))
    .ioConfig(IoConfig.maxHttpConnections(maxHttpConnections))
    .build();
Cluster cluster = Cluster.connect(host, ClusterOptions.clusterOptions(user, pwd).environment(env));
Please help us understand the possible reasons for the creation of so many cb-* threads and for the UnambiguousTimeoutException from the Couchbase SDK. If you have any other suggestions or commands to better troubleshoot this issue, those are welcome too.
Thank you.