High Latency in Couchbase Java SDK for Light KV LookupIn Workload Despite Healthy Server Performance

Hi,

I’m running a Spring Boot reactive application using Couchbase client-core v3.8.1 and java-client v3.7.9, and I’m observing unexpectedly high latencies at the SDK layer (typically >250ms per request), even though:
Couchbase server is healthy (low CPU, fast response times)
SDK logs show OverThresholdRequestsRecordedEvent warnings
Workload is light: only 100 RPS, each request performs 25 lightweight lookupIn() sub-document operations
I’m currently testing with a single pod (4 vCPU) and expected latencies to be under 100ms, but that’s not happening.
I’ve tuned the eventLoopGroup, kvConnections, and thread pools as per the documentation and Couchbase best practices (detailed in the configuration below), but I still see high latencies from the SDK.
My pod and the Couchbase server are in the same region, and the size of each returned lookup document is <1KB.

Can anyone point out what I might be missing or suggest advanced tuning tips for reactive SDK use with high-QPS KV subdoc workloads?

Thanks in advance!

Workload Overview

Environment: Java 17, Spring Boot, Linux (Epoll available)
Couchbase SDK version: 3.7.9
Pod spec: 4 vCPU, 4Gi memory
RPS: ~100 req/sec
Each request processes 25 products.
For each Product, a lookupIn() call is made → ~2,500 KV lookups/sec per pod
Using Reactive APIs with collection.lookupIn() (non-blocking)

private Scheduler scheduler;
private EventLoopGroup eventLoopGroup;

@PostConstruct
public void initialize() {
    // custom Reactor scheduler and Netty event loop group, shared with the Couchbase environment below
    scheduler = Schedulers.fromExecutor(Executors.newWorkStealingPool(8));
    if (Epoll.isAvailable()) {
        eventLoopGroup = new EpollEventLoopGroup(8);
    } else {
        eventLoopGroup = new NioEventLoopGroup(8);
    }
}

@Override
protected void configureEnvironment(final ClusterEnvironment.Builder builder) {
    builder
        .jsonSerializer(JacksonJsonSerializer.create(objectMapper))
        .meter(MicrometerMeter.wrap(meterRegistry))
        .timeoutConfig(config ->
            config.connectTimeout(Duration.ofMillis(1000))
                .kvTimeout(Duration.ofMillis(1000))
        )
        .ioConfig(config ->
            config.tcpKeepAliveTime(Duration.ofMillis(60000))
                .numKvConnections(8)
                .enableDnsSrv(false)
        )
        .publishOnScheduler(() -> scheduler)
        .ioEnvironment(IoEnvironment.builder().kvEventLoopGroup(eventLoopGroup));
    // no .build() here: Spring Data Couchbase builds the environment from this builder
}


     public Mono<ProductResult> fetchPriceCompetitiveness(final String product, final List<String> locations) {
        List<LookupInSpec> specs = buildLookupSpecs(locations);
        return collection.lookupIn(product, specs)
            .onErrorResume(DocumentNotFoundException.class, e -> Mono.empty());
    }

Can you start with a single request before load testing? If you set the kv threshold to 1ms, that will likely report every request. If a single request takes 500 ms (due to network or something else), then load testing will not get you to 100ms.
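
Something like this in your configureEnvironment would do it (a sketch, assuming java-client 3.x’s ThresholdLoggingTracerConfig; the values in the comments are the defaults):

// Sketch: lower the KV over-threshold so that effectively every KV op shows up
// in the OverThresholdRequestsRecordedEvent report.
@Override
protected void configureEnvironment(final ClusterEnvironment.Builder builder) {
    builder.thresholdLoggingTracerConfig(
        ThresholdLoggingTracerConfig.builder()
            .kvThreshold(Duration.ofMillis(1))    // default: 500 ms
            .sampleSize(64)                       // default: 10 samples per interval
            .emitInterval(Duration.ofSeconds(10)) // default: 10 s
    );
}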

How are you measuring latency?

Can you show the overthreshold reports?

Can you show the reactor flow (what is calling fetchPriceCompetitiveness)?

Is the client CPU saturated? The SDK can only send (and receive) requests as fast as the machine can.

How many specs in your lookupIn?

Have you tried it without micrometer?

Have you tried with the default scheduler and event loop group? If a customization doesn’t help, then revert the customization. Unhelpful customizations just mean more variables to rule out.
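
Concretely, the "no customizations" baseline would look like this (a sketch; everything not set falls back to the SDK defaults):

// Baseline sketch: keep only the serializer; default event loop group, default
// scheduler, default numKvConnections and timeouts, and no Micrometer meter.
@Override
protected void configureEnvironment(final ClusterEnvironment.Builder builder) {
    builder.jsonSerializer(JacksonJsonSerializer.create(objectMapper));
}

Then compare latencies against this baseline before re-adding any customization.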

hi @mreiche

Here are some more details to share,

How are you measuring latency? - Measuring in Splunk (Couchbase response times only) and also in New Relic (overall HTTP response times).

Can you show the overthreshold reports? - PFB the reports:

[com.couchbase.tracing][OverThresholdRequestsRecordedEvent][10s] Requests over Threshold found: {"kv":{"top_requests":[{"operation_name":"lookup_in","last_dispatch_duration_us":684223,"last_remote_socket":"10.118.123.13:11210","last_local_id":"64A4466800000001/000000008E5C88B8","last_local_socket":"10.124.50.192:46250","total_dispatch_duration_us":684223,"total_server_duration_us":19,"operation_id":"0x32547","timeout_ms":2000,"last_server_duration_us":19,"total_duration_us":684249},{"operation_name":"lookup_in","last_dispatch_duration_us":664550,"last_remote_socket":"10.118.123.13:11210","last_local_id":"64A4466800000001/000000008E5C88B8","last_local_socket":"10.124.50.192:46250","total_dispatch_duration_us":664550,"total_server_duration_us":11,"operation_id":"0x32557","timeout_ms":2000,"last_server_duration_us":11,"total_duration_us":664595},{"operation_name":"lookup_in","last_dispatch_duration_us":644152,"last_remote_socket":"10.118.123.13:11210","last_local_id":"64A4466800000001/000000008E5C88B8","last_local_socket":"10.124.50.192:46250","total_dispatch_duration_us":644152,"total_server_duration_us":11,"operation_id":"0x32579","timeout_ms":2000,"last_server_duration_us":11,"total_duration_us":644170},{"operation_name":"lookup_in","last_dispatch_duration_us":624161,"last_remote_socket":"10.118.123.13:11210","last_local_id":"64A4466800000001/000000008E5C88B8","last_local_socket":"10.124.50.192:46250","total_dispatch_duration_us":624161,"total_server_duration_us":11,"operation_id":"0x3258b","timeout_ms":2000,"last_server_duration_us":11,"total_duration_us":624196},{"operation_name":"lookup_in","last_dispatch_duration_us":504102,"last_remote_socket":"10.118.123.13:11210","last_local_id":"64A4466800000001/000000008E5C88B8","last_local_socket":"10.124.50.192:46250","total_dispatch_duration_us":504102,"total_server_duration_us":11,"operation_id":"0x32626","timeout_ms":2000,"last_server_duration_us":11,"total_duration_us":504126}],"total_count":5}}

Can you show the reactor flow (what is calling fetchPriceCompetitiveness)? - PFB the code:

    private Flux<PriceCompetitiveness> fetchPriceCompetitivenessWithClusters(final Map<String, List<String>> gtinMap) {
        return Flux.fromIterable(gtinMap.entrySet())
            .flatMap(entry -> Mono.defer(() -> getPriceCompetitiveness(entry.getKey(), entry.getValue())))
            .doOnError(error -> log.atError()
                .setMessage("Unexpected error fetching competitiveness score")
                .setCause(error)
                .log()
            )
            .retryWhen(couchBaseConfiguration.createRetry(log));
    }

Is the client CPU saturated? - No, we have underutilised CPU.

How many specs in your lookupIn? - 2 specs currently.
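
For illustration, a two-spec lookup has roughly this shape (the paths below are made up; the real ones come from buildLookupSpecs(locations)):

// Hypothetical illustration of a 2-spec sub-document lookup; "scores.<locationId>"
// is an assumed path, only to show the shape of the call.
List<LookupInSpec> specs = List.of(
    LookupInSpec.get("scores." + locations.get(0)),
    LookupInSpec.get("scores." + locations.get(1))
);
Mono<LookupInResult> result = collection.lookupIn(product, specs);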

Regards,
Venkat

Nothing sticks out.

Did you try without micrometer?

Can you show the code that calls fetchPriceCompetitivenessWithClusters? And the code that calls that?

And the metrics (or overthreshold) for a single request by itself?
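
One way to get that number independently of Splunk/New Relic is to time a single standalone call with Reactor’s elapsed(), outside the load test (a sketch; productId and specs are placeholders):

// Sketch: measure one lookupIn end to end on the client. elapsed() emits a
// Tuple2 of (milliseconds since subscription, the LookupInResult).
LookupInResult result = collection.lookupIn(productId, specs)
    .elapsed()
    .doOnNext(t -> log.info("single lookupIn took {} ms", t.getT1()))
    .map(Tuple2::getT2)
    .block();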

Hi @mreiche

Yeah, even without registering Micrometer in CouchbaseConfiguration, I didn’t see any improvement in latencies.

private Flux<PriceCompetitiveness> processGtinsWithClusters(final Map<String, List<String>> gtinMap) {
    List<Map<String, List<String>>> subMaps = regroupGtins(gtinMap.keySet(), parallelGroups).stream()
        .map(group -> group.stream()
            .collect(Collectors.toMap(Function.identity(), gtinMap::get)))
        .toList();

    return Flux.fromIterable(subMaps)
        .flatMap(this::fetchPriceCompetitivenessWithClusters, subMaps.size());
}

private List<Set<String>> regroupGtins(final Set<String> gtins, final int maxParallelism) {
    // regroup logic: split the gtins into an equal number of batches based on maxParallelism
}

private Mono<PriceCompetitivenessResponse> fetchProductCompetitiveness(final List<PriceCompetitivenessRequest> formattedRequests) {
    Map<String, List<String>> gtinClusters = groupByGtinToUniqueLocationClusterIds(formattedRequests);
    Timer timer = new Timer();
    return repository.processGtinsWithClusters(gtinClusters)
        .filter(competitiveness -> Objects.nonNull(competitiveness.getScores()))
        .collectList()
        .doOnSuccess(list -> log.atInfo().setMessage("Overall Time taken by Couchbase")
            .addKeyValue(TIME_TAKEN, timer.timeSpent()).log())
        .map(competitivenessData -> mapToResponse(formattedRequests, competitivenessData));
}

For a single request,

[com.couchbase.tracing][OverThresholdRequestsRecordedEvent][10s] Requests over Threshold found: {"kv":{"top_requests":[{"operation_name":"lookup_in","last_dispatch_duration_us":527175,"last_remote_socket":"10.118.123.11:11210","last_local_id":"B731713C00000001/000000000FA5381D","last_local_socket":"10.124.50.202:38600","total_dispatch_duration_us":527175,"total_server_duration_us":15,"operation_id":"0x9496f","timeout_ms":2000,"last_server_duration_us":15,"total_duration_us":527203}],"total_count":1}}

Regards,
Venkat

That was just one request executed by itself? (i.e. not one request recorded out of hundreds?)
That has over half a second of what appears to be network latency. If you want lower total latency, you’ll need to reduce the network latency, perhaps by putting your client closer to your cluster.
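
A quick sanity check is to time a bare TCP connect from the pod to the KV node that shows up in your overthreshold output (rough sketch; it only measures a connection handshake, but hundreds of milliseconds here would point at the network path rather than the SDK):

import java.net.InetSocketAddress;
import java.net.Socket;

public class KvConnectProbe {
    public static void main(String[] args) throws Exception {
        // 10.118.123.13:11210 is the remote KV socket from the report above
        try (Socket socket = new Socket()) {
            long start = System.nanoTime();
            socket.connect(new InetSocketAddress("10.118.123.13", 11210), 2000);
            System.out.println("TCP connect took " + (System.nanoTime() - start) / 1_000_000 + " ms");
        }
    }
}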