Couchbase metrics and Java SDK

I have an issue with an HCL XPages application that performs rather slowly. I am not sure it is related to Couchbase at all, but I get metrics like this written to the console:

[012918:004947-00007FCE84CB3700] 16-04-2024 12:29:04   HTTP JVM: [cb-events] INFO com.couchbase.metrics - [com.couchbase.metrics][LatencyMetricsAggregatedEvent][600s] Aggregated Latency Metrics: {"operations":{"query":{"query":{"total_count":6363,"percentiles_us":{"50.0":287309.823,"90.0":692060.159,"99.0":1224736.767,"99.9":2105540.607,"100.0":3137339.391}}},"kv":{"get":{"total_count":3318514,"percentiles_us":{"50.0":210763.775,"90.0":532676.607,"99.0":721420.287,"99.9":801112.063,"100.0":2264924.159}},"upsert":{"total_count":4,"percentiles_us":{"50.0":528.383,"90.0":1122.303,"99.0":1122.303,"99.9":1122.303,"100.0":1122.303}}}},"meta":{"emit_interval_s":600}}
[012918:004955-00007FCE84CB3700] 16-04-2024 12:29:09   HTTP JVM: [cb-events] WARN com.couchbase.tracing - [com.couchbase.tracing][OverThresholdRequestsRecordedEvent][120s] Requests over Threshold found: {"query":{"top_requests":[{"operation_name":"query","last_dispatch_duration_us":903473,"last_remote_socket":"db1.xxxx.dk:8093","last_local_socket":"10.42.208.10:21620","total_dispatch_duration_us":903473,"timeout_ms":100000,"total_duration_us":1847052},{"operation_name":"query","last_dispatch_duration_us":1078288,"last_remote_socket":"db2.xxxx.dk:8093","last_local_socket":"10.42.208.10:4464","total_dispatch_duration_us":1078288,"timeout_ms":100000,"total_duration_us":1767941},{"operation_name":"query","last_dispatch_duration_us":584634,"last_remote_socket":"db2.xxxx.dk:8093","last_local_socket":"10.42.208.10:4?88","total_dispatch_duration_us":584634,"timeout_ms":100000,"total_duration_us":1523604},{"operation_name":"query","last_dispatch_duration_us":1494855,"last_remote_socket":"db1.xxxx.dk:8093","last_local_socket":"10.42.208.10:22270","total_dispatch_duration_us":1494855,"timeout_ms":100000,"total_duration_us":1494962},{"operation_name":"query","last_dispatch_duration_us":625511,"last_remote_socket":"db2.xxxx.dk:8093","last_local_socket":"10.42.208.10:4892","total_dispatch_duration_us":625511,"timeout_ms":100000,"total_duration_us":1429927},{"operation_name":"query","last_dispatch_duration_us":1408742,"last_remote_socket":"db1.xxxx.dk:8093","last_local_socket":"10.42.208.10:21430","total_dispatch_duration_us":1408742,"timeout_ms":100000,"total_duration_us":1408841},{"operation_name":"query","last_dispatch_duration_us":1362739,"last_remote_socket":"db2.xxxx.dk:8093","last_local_socket":"10.42.208.10:4824","total_dispatch_duration_us":1362739,"timeout_ms":100000,"total_duration_us":1362839},{"operation_name":"query","last_dispatch_duration_us":1060293,"last_remote_socket":"db2.xxxx.dk:8093","last_local_socket":"10.42.208.10:4554","total_dispatch_duration_us":1060293,"timeout_ms":100000,"total_duration_us":1060430},{"operation_name":"query","last_dispatch_duration_us":145915,"last_remote_socket":"db2.xxxx.dk:8093","last_local_socket":"10.42.208.10:4230","total_dispatch_duration_us":145915,"timeout_ms":100000,"total_duration_us":1055297}],"total_count":9}}
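The metrics event reports latencies in microseconds (the `_us` suffix), aggregated over the 600 s `emit_interval_s`. As a rough aid for reading it, here is a minimal plain-Java sketch (Java 8 compatible) that converts the logged query and KV get percentiles to milliseconds and derives an average rate, assuming `total_count` covers one 600 s interval. The values are copied by hand from the log, the class and method names are hypothetical, and nothing here calls the Couchbase SDK; it is only arithmetic on the logged numbers.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: works on numbers copied from LatencyMetricsAggregatedEvent; no SDK involved.
public class LatencySummary {

    // "emit_interval_s":600 from the "meta" section of the event
    static final double EMIT_INTERVAL_S = 600.0;

    // percentiles_us values are microseconds; divide by 1000 for milliseconds
    static double usToMs(double micros) {
        return micros / 1000.0;
    }

    // Average operations per second, assuming total_count covers one emit interval
    static double perSecond(long totalCount) {
        return totalCount / EMIT_INTERVAL_S;
    }

    // Keeps the percentiles in the order they appear in the log
    static Map<String, Double> percentiles(double p50, double p90, double p99, double p999, double p100) {
        Map<String, Double> m = new LinkedHashMap<>();
        m.put("p50", p50);
        m.put("p90", p90);
        m.put("p99", p99);
        m.put("p99.9", p999);
        m.put("p100", p100);
        return m;
    }

    static void print(String name, long totalCount, Map<String, Double> percentilesUs) {
        System.out.printf("%s: %d ops, about %.1f ops/s%n", name, totalCount, perSecond(totalCount));
        for (Map.Entry<String, Double> e : percentilesUs.entrySet()) {
            System.out.printf("  %-6s %10.1f ms%n", e.getKey(), usToMs(e.getValue()));
        }
    }

    public static void main(String[] args) {
        // "operations" -> "query" -> "query" block of the event
        print("query", 6363,
                percentiles(287309.823, 692060.159, 1224736.767, 2105540.607, 3137339.391));
        // "operations" -> "kv" -> "get" block of the event
        print("kv get", 3318514,
                percentiles(210763.775, 532676.607, 721420.287, 801112.063, 2264924.159));
    }
}
```

For example, the logged kv get p50 of 210763.775 microseconds is about 211 ms.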

If I connect to the two database servers directly, they seem “happy” with around 20% CPU usage and 80% memory usage. Queries are quick.

The application itself is slow, and I am wondering whether there is a bottleneck in the connection from the application server to the database servers. Normally everything runs fast; however, this morning we have more users than usual, and then everything seems to hang.

So far we have looked into the number of threads on the web server where the application runs. Could there be similar concerns related to the SDK?

I am on Couchbase Server 7.2.4 CE (on CentOS 7) and using Java SDK 3.5.3. The application server is running Java 1.8.

Thanks in advance for any insights!

/John

In the OverThreshold report there are 9 entries for the 120 s period being reported on. All of them have a total_duration_us below 2000000 (2 seconds). For four of them, total_dispatch_duration_us is almost equal to total_duration_us (i.e. little time is spent elsewhere in the call); for the other five, between about 0.7 and 0.95 seconds of the total falls outside the dispatch. In every entry, total_dispatch_duration_us is the same as last_dispatch_duration_us, so the requests are not being retried. So it looks good as far as retries and dispatch times go, assuming that last_dispatch_duration_us is correct for those queries. You can check system:completed_requests for details on the queries (see "Monitor Queries" in the Couchbase Docs).
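To make that reading easy to re-check, here is a minimal plain-Java sketch (Java 8 compatible, since the server runs Java 1.8) over the nine `top_requests` entries, with the values copied by hand from the log. For each entry it prints the total duration, the total dispatch, the part of the total not covered by dispatch, and whether total dispatch differs from the last dispatch, which is taken here, as in the answer, as the sign of a retry. The class and method names are hypothetical and it does not use the Couchbase SDK.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: re-checks the "top_requests" of the OverThresholdRequestsRecordedEvent
// using values copied by hand from the log; it does not use the Couchbase SDK.
public class ThresholdCheck {

    static final class TopRequest {
        final long lastDispatchUs;   // last_dispatch_duration_us
        final long totalDispatchUs;  // total_dispatch_duration_us
        final long totalDurationUs;  // total_duration_us

        TopRequest(long lastDispatchUs, long totalDispatchUs, long totalDurationUs) {
            this.lastDispatchUs = lastDispatchUs;
            this.totalDispatchUs = totalDispatchUs;
            this.totalDurationUs = totalDurationUs;
        }

        // Part of the request's lifetime not covered by dispatch, in microseconds
        long outsideDispatchUs() {
            return totalDurationUs - totalDispatchUs;
        }

        // Assumption: total dispatch sums all attempts, so a retry would make it exceed the last one
        boolean retried() {
            return totalDispatchUs != lastDispatchUs;
        }
    }

    // The nine entries reported for the 120 s period, in log order
    static final List<TopRequest> TOP_REQUESTS = Arrays.asList(
            new TopRequest(903473, 903473, 1847052),
            new TopRequest(1078288, 1078288, 1767941),
            new TopRequest(584634, 584634, 1523604),
            new TopRequest(1494855, 1494855, 1494962),
            new TopRequest(625511, 625511, 1429927),
            new TopRequest(1408742, 1408742, 1408841),
            new TopRequest(1362739, 1362739, 1362839),
            new TopRequest(1060293, 1060293, 1060430),
            new TopRequest(145915, 145915, 1055297));

    public static void main(String[] args) {
        for (TopRequest r : TOP_REQUESTS) {
            System.out.printf("total %.3f s, dispatch %.3f s, outside dispatch %.3f s, retried: %b%n",
                    r.totalDurationUs / 1e6, r.totalDispatchUs / 1e6,
                    r.outsideDispatchUs() / 1e6, r.retried());
        }
    }
}
```

Running it shows no retried entries and every total under 2 seconds, and it also makes visible which entries spent a large share of their time outside dispatch.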

Thanks! I’ll check the doc :+1: