Hey people, I am using Couchbase as my database. My application is a Spring Boot application, and I am using the Spring Data Couchbase repository findAllById method to fetch Couchbase documents.
In a single call I have to fetch around 50 docs, each with almost 50 key-value pairs.
I have configured the KV_POOL threads to 8 since I have 8 cores, but I see Couchbase taking a lot of time to respond. Could my requests be getting queued on the IO KV threads?
I also read that deserialization can take time, so I tried migrating to the raw transcoder, but that did not help either.
At the Couchbase server end every resource is in check. I feel I am missing something at the application level while interacting with Couchbase, because of which I am not seeing good performance.
For reference, this is how I am accessing the documents.
What is your latency to the cluster, e.g. the time to do a ping or similar? That is often the dominating performance consideration. And what latencies are you seeing in your application?
Totally agree with @graham.pople. I would suggest measuring the client-server round-trip time with pillowfight, or GitHub - mikereiche/loaddriver, and using that as a baseline. If everything is as it should be, throughput will be limited by network bandwidth.
Spring Data Couchbase has its own de/serialization that cannot be bypassed. It will be significantly slower than using the Couchbase Java SDK directly. The good news is that the Couchbase Java SDK can be called directly via the couchbaseClientFactory of a Spring Data Couchbase repository.
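A minimal sketch of that, assuming a Spring-injected CouchbaseClientFactory and the default collection (the wrapper class name is hypothetical; the SDK calls are real but need a live cluster to run):

```java
import org.springframework.data.couchbase.CouchbaseClientFactory;
import com.couchbase.client.java.ReactiveCollection;
import com.couchbase.client.java.kv.GetResult;
import reactor.core.publisher.Mono;

// Sketch: go through the SDK directly, skipping Spring Data's entity conversion.
public class DirectKvReader {
    private final ReactiveCollection collection;

    public DirectKvReader(CouchbaseClientFactory clientFactory) {
        // clientFactory is the same factory Spring Data builds its repositories on
        this.collection = clientFactory.getDefaultCollection().reactive();
    }

    // Returns the stored bytes as-is; no Spring Data de/serialization involved.
    public Mono<byte[]> rawGet(String id) {
        return collection.get(id).map(GetResult::contentAsBytes);
    }
}
```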
collection.reactive().get(id).
That isn't findAllById of a Spring Data Couchbase repository.
stockJsonFormatter.deserializeData
And it looks like the data is indeed being deserialized.
Just as a rough point of reference - running loaddriver with both the client and the (single-node) server on my MacBook Pro I get the result below. I don't know the size of your documents (messages); I used 1024. You'll need to look in the code to verify how avg is computed, but it looks to me like with one thread, each batch is executed in avg x batchsize microseconds. For this particular run 474 x 50 = 23700 microseconds, which is 23.7 milliseconds.
One more thing @mreiche - I checked the round trip and it is fairly small. How can I minimise the deserialization time? I have kept deserialization on a separate thread, so it should be fine.
Also wanted to ask: I currently have the Couchbase IO threads set to 8. Is that too many, given I also have 8 cores? How many IO threads does Couchbase suggest keeping, basically?
You could remove the deserialization completely to see if it is indeed the deserialization that is taking the time. If it is, you could try to reduce it by trying a different serializer. Or delay it by programming more "reactively". For instance, if these 50 documents are to be displayed 5 per page, it is only necessary to deserialize the first 5 before displaying the first page.
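Delaying deserialization reactively might look like this sketch - `ids`, `reactiveCollection`, `stockJsonFormatter` and the `Stock` type are assumptions mirroring names in this thread, and note that `take(5)` takes the first 5 gets to *complete*, not the first 5 ids:

```java
import com.couchbase.client.java.kv.GetResult;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;
import java.util.List;

// Fetch in parallel, but only deserialize enough documents for the first page.
Mono<List<Stock>> firstPage = Flux.fromIterable(ids)   // 50 document ids
    .flatMap(reactiveCollection::get)                  // parallel KV gets
    .map(GetResult::contentAsBytes)                    // raw bytes, no SDK JSON work
    .take(5)                                           // first page only
    .map(stockJsonFormatter::deserializeData)          // hypothetical custom formatter
    .collectList();
```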
If only parts of the documents are needed, it is possible to fetch only those parts of the document.
If you posted the measurements you are getting, along with the size and number of the documents, it might be easier to figure out what can be done.
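Fetching parts of a document is what the sub-document lookup API is for; a sketch, assuming a document with `price` and `name` fields (the key and field names are made up):

```java
import static com.couchbase.client.java.kv.LookupInSpec.get;
import com.couchbase.client.java.Collection;
import com.couchbase.client.java.kv.LookupInResult;
import java.util.Arrays;

// Fetch only two fields instead of transferring the whole document body.
LookupInResult result = collection.lookupIn("stock::123",
        Arrays.asList(get("price"), get("name")));
double price = result.contentAs(0, Double.class);
String name  = result.contentAs(1, String.class);
```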
@vaibhav123 what is "fairly less" in cold hard numbers?
If it is still in the multiple milliseconds range then I would suggest focussing on reducing this as the priority, before starting to look at deserialization optimisations and particularly IO threads. Network latency will generally be the dominating factor, and there are techniques such as VPC peering you can use to minimise it.
I'm not saying deserialization is irrelevant - but personally I'd be looking at it only after getting the network as fast as possible. Not least because there may not be much you can do about it - things have to be deserialized.
One other thing you can experiment with is to increase the numKvConnections SDK parameter, which will allow more concurrent connections to KV. KV connections are already multiplexed, but it might help regardless.
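In a Spring Data Couchbase app that parameter can be set by overriding configureEnvironment() on your AbstractCouchbaseConfiguration subclass. A sketch, assuming a reasonably recent SDK 3.x (the consumer-style ioConfig overload); the connection details are elided:

```java
import org.springframework.context.annotation.Configuration;
import org.springframework.data.couchbase.config.AbstractCouchbaseConfiguration;
import com.couchbase.client.java.env.ClusterEnvironment;

@Configuration
public class CouchbaseConfig extends AbstractCouchbaseConfiguration {
    // ... getConnectionString()/getUserName()/getPassword()/getBucketName() elided ...

    @Override
    protected void configureEnvironment(ClusterEnvironment.Builder builder) {
        // More sockets per node. KV connections are already multiplexed,
        // so treat this as an experiment rather than a guaranteed win.
        builder.ioConfig(io -> io.numKvConnections(8));
    }
}
```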
So my network round-trip / ping tests to Couchbase from my server at that load generally take around 2 to 3 ms, but one interesting thing I see is that the Couchbase JSON parsing threads are mostly in the parked state.
You won't be using the Couchbase JSON deserializer at all, as you're using contentAsBytes(), and instead are using a custom stockJsonFormatter.
So there's nothing further you can do in terms of Couchbase to minimise deserialisation time - we will already be doing the bare minimum required to parse the protocol packets and give you the raw bytes directly. No JSON decoding will be happening inside the SDK. You'll need to look at this non-Couchbase stockJsonFormatter to see if it's possible to deserialize any faster. OOI - how large are your documents?
What times exactly are you seeing when fetching the 50 docs? If you are seeing round-trip-times of 2-3 milliseconds, then I'd expect your code to be also fetching all 50 in around 2-3 milliseconds, since it seems to be parallelising effectively with Flux.flatMap (should do 128 operations in parallel). If you take the stockJsonFormatter.deserializeData out of your code, how much faster is it?
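One way to measure that, as a sketch - `ids` and `reactiveCollection` are assumptions; this isolates network/SDK time from the custom formatter by skipping deserialization entirely:

```java
import com.couchbase.client.java.kv.GetResult;
import reactor.core.publisher.Flux;
import java.util.List;

// Time the 50-doc batch with deserialization removed.
long start = System.nanoTime();
List<byte[]> raw = Flux.fromIterable(ids)
        .flatMap(reactiveCollection::get)     // parallel KV gets
        .map(GetResult::contentAsBytes)       // raw bytes only, no JSON parsing
        .collectList()
        .block();
long elapsedMs = (System.nanoTime() - start) / 1_000_000;
System.out.println("fetched " + raw.size() + " docs in " + elapsedMs + " ms");
```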
So I see the p99 for my network call from my service to Couchbase (this is excluding deserialization) is 12 ms.
I fire 50 requests (one to get each doc), all asynchronously.
I have 2 IO KV threads and numKvConnections = 8.
Is 12 ms a high time? And what can be the reasons for it?
Your RTT from appserver to the cluster, measuring with ping, is 2ms.
If you do a single batch of 50 parallel KV gets, with no other traffic going on, you see around 12ms total for that batch, at p99. With numKvConnections==8 and IO KV threads == 2.
And if you do 5k continuous ops per second, the time for an individual KV get goes to around 55ms p99.
Is all that correct?
If so, then yes, it seems like you're bottlenecking somewhere. 50 parallel gets is very little, and per above I'd expect it to take not much more than the RTT (2ms).
5000 continuous parallel ops is also not all that much (when everything is sized appropriately for that) and I would not expect the KV latency to degrade to ~30 x RTT.
There's only so much we can do to help out with in-depth performance issues like this on the forums though. We've gone beyond the basic checks now (checking the application is parallelising well, checking network ping, trying numKvConnections, taking deserialization out of the equation), and next steps would involve checking things like cluster metrics (to make sure nothing is saturated there and the cluster is well sized) and application metrics. If you need help with that, I'd recommend contacting the experts at Couchbase professional services or technical support to discuss further.
Mike's also shared a loadgen app above that you could run, so you have another datapoint.
Edit: I ignored an important factor, which is compression. Requests from and responses to the SDK will be compressed if they are over a certain size and if the compression ratio exceeds a threshold. (loaddriver avoids compression by using data which does not compress well.)
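For experimenting with that factor, compression behaviour is tunable on the ClusterEnvironment via CompressionConfig - a sketch, assuming SDK 3.x; verify the defaults and the consumer-style overload against your SDK version:

```java
import com.couchbase.client.java.env.ClusterEnvironment;

// Sketch: tune (or disable) Snappy compression to see its effect on latency.
ClusterEnvironment env = ClusterEnvironment.builder()
        .compressionConfig(c -> c
                .enable(true)    // compression is on by default
                .minSize(32)     // only consider docs larger than 32 bytes
                .minRatio(0.83)) // only send compressed if it shrinks below 83%
        .build();
```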
I'm not exactly sure what that 12ms is. It would help if you showed how it is being measured. Since all the requests are initiated at (more-or-less) the same time, and 12 ms is the "high" (i.e. maximum), then the last request would complete in 12ms (?). So all the requests would complete in 12ms. That's over 4000 requests per second. I don't see where you mentioned how many data nodes are in your cluster, but in general, if there is a bottleneck and the client is not saturated and the network is not saturated, adding data nodes will improve performance.
The size of the data will affect the time. Even with 0ms latency, it still takes (size-in-bytes * 8) / bits-per-second seconds to transfer data over a network. So if you have 20MB documents, a 1Gbit network with 50% efficiency, and 50 documents, it will take (50 * 20,000,000 * 8) / 500,000,000 = 16 seconds to transfer all 50 documents (or 16/50 = 0.32 seconds per document).
Noticing that the number of documents (50) cancels out (as it should): if your documents are 200,000 bytes in size, then (200,000 * 8) / 500,000,000 = 0.0032 seconds (3.2 ms). So 12ms may be slow or may be fast depending on the size of the documents. And of course my formula only covers network time and ignores other things like processing on the cluster and processing on the client. So loaddriver (mentioned earlier) measures response times taking all aspects into account, and allows varying the number of threads, the document size, number of connections, the de/serialization and other things.
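The arithmetic above can be packaged as a quick sanity-check helper; a sketch using the same 50%-efficiency assumption:

```java
// Rough lower bound on network transfer time, ignoring latency, compression
// and any client/cluster processing.
public class TransferTime {
    // seconds = (docs * bytesPerDoc * 8 bits) / effective bits-per-second
    static double transferSeconds(int docs, long bytesPerDoc, double effectiveBitsPerSecond) {
        return docs * bytesPerDoc * 8.0 / effectiveBitsPerSecond;
    }

    public static void main(String[] args) {
        // 50 docs of 20 MB over a 1 Gbit link at 50% efficiency
        System.out.println(transferSeconds(50, 20_000_000L, 500_000_000.0)); // 16.0 s
        // a single 200 KB doc on the same link
        System.out.println(transferSeconds(1, 200_000L, 500_000_000.0));     // 0.0032 s
    }
}
```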
While moving deserialization to another thread will use another thread, threads still need a CPU core to execute on, and the number of cores is not unlimited.
Thank you, this was helpful. Just to add: we have 8 Couchbase nodes (we monitored them; all metrics are in check), and when I say 5k TPS - we have 20 app servers, so on one server the load is hardly 250 TPS.
The app servers are 8-core with 16GB RAM. Our documents are not more than 20~50KB each.
Also, regarding how I measure the network time, I have pasted the code snippet above.
Also, when we look at the overThresholdRequests to Couchbase, in most cases they show a high dispatch duration. Does that mean the client (our app server) is taking long to dispatch requests? If yes, we are failing to identify the bottleneck exactly, because as per the suggestion I have kept the IO KV threads very low, like 2, and even the TCP connections are 2, so each app server would be forming about 2 connections with each of the 8 Couchbase nodes. I tried increasing that number as well, but with no gain.
The Community Edition license provides the free deployment of Couchbase Community Edition for departmental-scale deployments of up to five node clusters.
For an 8 node cluster, you need to be on Enterprise Edition, and can then use the expertise of Couchbase technical support to work through any performance bottlenecks.
(nb 20~50KB documents are reasonably large - it will be worth mentioning that on your ticket, as it may be worth drilling further into Snappy compression and the custom stockJsonFormatter.)
Interesting. So you have 50 requests - let's say averaging (50kb+20kb)/2 = 35kb - fired off all at once, and the one that takes the longest completes in 12ms. So they all complete in 12ms. Now I'm wondering what you expected, and why (or how). Also - if the maximum response you are measuring is 12ms, then there will not be OverThreshold events, as the threshold is 500 ms. So something is not quite right there.
Ignoring compression, the data on the network would be 35kb * 50. And in 12ms that would be 145 Mbytes/second. At 8 bits/byte that's 1.2 Gbit/s. With 50% efficiency that would take all of a 2.4Gbit network. So maybe your network is saturated.
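The reverse calculation - implied network throughput from batch size and time - as a small sketch of the numbers above:

```java
// Implied network throughput from a batch of documents and the time it took.
public class ImpliedThroughput {
    // bits per second implied by moving `docs` docs of `bytesPerDoc` in `seconds`
    static double impliedBitsPerSecond(int docs, long bytesPerDoc, double seconds) {
        return docs * bytesPerDoc * 8.0 / seconds;
    }

    public static void main(String[] args) {
        // 50 docs averaging 35 KB, all completing within 12 ms
        double bps = impliedBitsPerSecond(50, 35_000L, 0.012);
        System.out.printf("%.2f Gbit/s implied%n", bps / 1_000_000_000.0);      // 1.17 Gbit/s
        // at 50% network efficiency, that needs a link about twice as fast
        System.out.printf("%.2f Gbit link needed%n", 2 * bps / 1_000_000_000.0); // 2.33
    }
}
```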