Hey people, I am using Couchbase as my database. My application is a Spring Boot application, and I am using the Spring Data Couchbase repository findAllById method to fetch Couchbase documents.
In a single call I have to fetch around 50 docs, each with almost 50 key-value pairs.
I have configured the KV_POOL threads to 8 since I have 8 cores, but I see Couchbase taking a lot of time to respond. Could my requests be getting queued on the IO KV threads?
I also read that deserialization can take time, so I tried migrating to the raw transcoder, but that did not help either.
At the Couchbase server end every resource is in check. I feel I am missing something at the application level while interacting with Couchbase, because of which I am not seeing good performance.
For reference, this is how I am accessing the documents.
What is your latency to the cluster, e.g. the time to do a ping or similar? That is often the dominating performance consideration. And what latencies are you seeing in your application?
Totally agree with @graham.pople. I would suggest measuring the client-server round-trip time with pillowfight, or GitHub - mikereiche/loaddriver, and using that as a baseline. If everything is as it should be, throughput will be limited by network bandwidth.
Spring Data Couchbase has its own de/serialization that cannot be bypassed. It will be significantly slower than using the Couchbase Java SDK directly. The good news is that the Couchbase Java SDK can be called directly via the couchbaseClientFactory of a Spring Data Couchbase repository.
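A minimal sketch of that, assuming a Spring-injected CouchbaseClientFactory and the default collection (the wrapper class name is hypothetical; the SDK calls are real but need a live cluster to run):

```java
import org.springframework.data.couchbase.CouchbaseClientFactory;
import com.couchbase.client.java.ReactiveCollection;
import com.couchbase.client.java.kv.GetResult;
import reactor.core.publisher.Mono;

// Sketch: go through the SDK directly, skipping Spring Data's entity conversion.
public class DirectKvReader {
    private final ReactiveCollection collection;

    public DirectKvReader(CouchbaseClientFactory clientFactory) {
        // clientFactory is the same factory Spring Data builds its repositories on
        this.collection = clientFactory.getDefaultCollection().reactive();
    }

    // Returns the stored bytes as-is; no Spring Data de/serialization involved.
    public Mono<byte[]> rawGet(String id) {
        return collection.get(id).map(GetResult::contentAsBytes);
    }
}
```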
collection.reactive().get(id).
That isn't findAllById of a Spring Data Couchbase repository.
stockJsonFormatter.deserializeData
And it looks like the data is indeed being deserialized.
Just as a rough point of reference - running loaddriver with both the client and the (single-node) server on my MacBook Pro I get the result below. I don't know the size of your documents (messages); I used 1024. You'll need to look in the code to verify how avg is computed, but it looks to me like with one thread, each batch is executed in avg x batchsize microseconds. For this particular run 474 x 50 = 23700 microseconds, which is 23.7 milliseconds.
One more thing @mreiche - I checked the round trip and it is fairly small. How can I minimise the deserialization time? I have kept deserialization on a separate thread, so it should be fine.
Also wanted to ask: I currently have the Couchbase IO threads set to 8. Is that too many, given I also have 8 cores? How many IO threads does Couchbase suggest keeping, basically?
You could remove the deserialization completely to see if it is indeed the deserialization that is taking the time. If it is, you could try to reduce it by trying a different serializer. Or delay it by programming more "reactively". For instance, if these 50 documents are to be displayed 5 per page, it is only necessary to deserialize the first 5 before displaying the first page.
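Delaying deserialization reactively might look like this sketch - `ids`, `reactiveCollection`, `stockJsonFormatter` and the `Stock` type are assumptions mirroring names in this thread, and note that `take(5)` takes the first 5 gets to *complete*, not the first 5 ids:

```java
import com.couchbase.client.java.kv.GetResult;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;
import java.util.List;

// Fetch in parallel, but only deserialize enough documents for the first page.
Mono<List<Stock>> firstPage = Flux.fromIterable(ids)   // 50 document ids
    .flatMap(reactiveCollection::get)                  // parallel KV gets
    .map(GetResult::contentAsBytes)                    // raw bytes, no SDK JSON work
    .take(5)                                           // first page only
    .map(stockJsonFormatter::deserializeData)          // hypothetical custom formatter
    .collectList();
```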
If only parts of the documents are needed, it is possible to fetch only those parts of the document.
If you posted the measurements you are getting, along with the size and number of the documents, it might be easier to figure out what can be done.
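Fetching parts of a document is what the sub-document lookup API is for; a sketch, assuming a document with `price` and `name` fields (the key and field names are made up):

```java
import static com.couchbase.client.java.kv.LookupInSpec.get;
import com.couchbase.client.java.Collection;
import com.couchbase.client.java.kv.LookupInResult;
import java.util.Arrays;

// Fetch only two fields instead of transferring the whole document body.
LookupInResult result = collection.lookupIn("stock::123",
        Arrays.asList(get("price"), get("name")));
double price = result.contentAs(0, Double.class);
String name  = result.contentAs(1, String.class);
```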
@vaibhav123 what is "fairly less" in cold hard numbers?
If it is still in the multiple milliseconds range then I would suggest focussing on reducing this as the priority, before starting to look at deserialization optimisations and particularly IO threads. Network latency will generally be the dominating factor, and there are techniques such as VPC peering you can use to minimise it.
I'm not saying deserialization is irrelevant - but personally I'd be looking at it only after getting the network as fast as possible. Not least because there may not be much you can do about it - things have to be deserialized.
One other thing you can experiment with is to increase the numKvConnections SDK parameter, which will allow more concurrent connections to KV. KV connections are already multiplexed, but it might help regardless.
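In a Spring Data Couchbase app that parameter can be set by overriding configureEnvironment() on your AbstractCouchbaseConfiguration subclass. A sketch, assuming a reasonably recent SDK 3.x (the consumer-style ioConfig overload); the connection details are elided:

```java
import org.springframework.context.annotation.Configuration;
import org.springframework.data.couchbase.config.AbstractCouchbaseConfiguration;
import com.couchbase.client.java.env.ClusterEnvironment;

@Configuration
public class CouchbaseConfig extends AbstractCouchbaseConfiguration {
    // ... getConnectionString()/getUserName()/getPassword()/getBucketName() elided ...

    @Override
    protected void configureEnvironment(ClusterEnvironment.Builder builder) {
        // More sockets per node. KV connections are already multiplexed,
        // so treat this as an experiment rather than a guaranteed win.
        builder.ioConfig(io -> io.numKvConnections(8));
    }
}
```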
So my network round-trip / ping tests to Couchbase from my server at that load generally take around 2 to 3 ms, but one interesting thing I see is that the Couchbase JSON parsing threads are mostly in the parked state.
You won't be using the Couchbase JSON deserializer at all, as you're using contentAsBytes(), and instead are using a custom stockJsonFormatter.
So there's nothing further you can do in terms of Couchbase to minimise deserialisation time - we will already be doing the bare minimum required to parse the protocol packets and give you the raw bytes directly. No JSON decoding will be happening inside the SDK. You'll need to look at this non-Couchbase stockJsonFormatter to see if it's possible to deserialize any faster. OOI - how large are your documents?
What times exactly are you seeing when fetching the 50 docs? If you are seeing round-trip-times of 2-3 milliseconds, then I'd expect your code to be also fetching all 50 in around 2-3 milliseconds, since it seems to be parallelising effectively with Flux.flatMap (should do 128 operations in parallel). If you take the stockJsonFormatter.deserializeData out of your code, how much faster is it?
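One way to measure that, as a sketch - `ids` and `reactiveCollection` are assumptions; this isolates network/SDK time from the custom formatter by skipping deserialization entirely:

```java
import com.couchbase.client.java.kv.GetResult;
import reactor.core.publisher.Flux;
import java.util.List;

// Time the 50-doc batch with deserialization removed.
long start = System.nanoTime();
List<byte[]> raw = Flux.fromIterable(ids)
        .flatMap(reactiveCollection::get)     // parallel KV gets
        .map(GetResult::contentAsBytes)       // raw bytes only, no JSON parsing
        .collectList()
        .block();
long elapsedMs = (System.nanoTime() - start) / 1_000_000;
System.out.println("fetched " + raw.size() + " docs in " + elapsedMs + " ms");
```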
So I see the p99 for my network call from my service to Couchbase (this is excluding deserialization) is 12 ms.
I fire 50 requests (one to get each doc), all asynchronously.
I have 2 IO KV threads and numKvConnections = 8.
Is 12 ms a high time? And what can be the reasons for it?
Your RTT from appserver to the cluster, measuring with ping, is 2ms.
If you do a single batch of 50 parallel KV gets, with no other traffic going on, you see around 12ms total for that batch, at p99. With numKvConnections==8 and IO KV threads == 2.
And if you do 5k continuous ops per second, the time for an individual KV get goes to around 55ms p99.
Is all that correct?
If so, then yes, it seems like you're bottlenecking somewhere. 50 parallel gets is very little, and per above I'd expect it to take not much more than the RTT (2ms).
5000 continuous parallel ops is also not all that much (when everything is sized appropriately for that) and I would not expect the KV latency to degrade to ~30 x RTT.
There's only so much we can do to help out with in-depth performance issues like this on the forums though. We've gone beyond the basic checks now (checking the application is parallelising well, checking network ping, trying numKvConnections, taking deserialization out of the equation), and next steps would involve checking things like cluster metrics (to make sure nothing is saturated there and the cluster is well sized) and application metrics. If you need help with that, I'd recommend contacting the experts at Couchbase professional services or technical support to discuss further.
Mike's also shared a loadgen app above that you could run, so you have another datapoint.
Edit: I ignored an important factor, which is compression. Requests from and responses to the SDK will be compressed if they are over a certain size and if the compression ratio exceeds a threshold. (loaddriver avoids compression by using data which does not compress well.)
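For experimenting with that factor, compression behaviour is tunable on the ClusterEnvironment via CompressionConfig - a sketch, assuming SDK 3.x; verify the defaults and the consumer-style overload against your SDK version:

```java
import com.couchbase.client.java.env.ClusterEnvironment;

// Sketch: tune (or disable) Snappy compression to see its effect on latency.
ClusterEnvironment env = ClusterEnvironment.builder()
        .compressionConfig(c -> c
                .enable(true)    // compression is on by default
                .minSize(32)     // only consider docs larger than 32 bytes
                .minRatio(0.83)) // only send compressed if it shrinks below 83%
        .build();
```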
I'm not exactly sure what that 12ms is. It would help if you showed how it is being measured. Since all the requests are initiated at (more-or-less) the same time, and 12 ms is the "high" (i.e. maximum), then the last request would complete in 12ms (?). So all the requests would complete in 12ms. That's over 4000 requests per second. I don't see where you mentioned how many data nodes are in your cluster, but in general, if there is a bottleneck and the client is not saturated and the network is not saturated, adding data nodes will improve performance.
The size of the data will affect the time. Even with 0ms latency, it still takes (size-in-bytes * 8) / bits-per-second seconds to transfer data over a network. So if you have 20MB documents, a 1Gbit network with 50% efficiency, and 50 documents, it will take (50 * 20,000,000 * 8) / 500,000,000 = 16 seconds to transfer all 50 documents (or 16/50 = 0.32 seconds per document).
Noticing that the number of documents (50) cancels out (as it should): if your documents are 200,000 bytes in size, then (200,000 * 8) / 500,000,000 = 0.0032 seconds (3.2 ms). So 12ms may be slow or may be fast depending on the size of the documents. And of course my formula only covers network time and ignores other things like processing on the cluster and processing on the client. So loaddriver (mentioned earlier) measures response times taking all aspects into account, and allows varying the number of threads, the document size, number of connections, the de/serialization and other things.
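The arithmetic above can be packaged as a quick sanity-check helper; a sketch using the same 50%-efficiency assumption:

```java
// Rough lower bound on network transfer time, ignoring latency, compression
// and any client/cluster processing.
public class TransferTime {
    // seconds = (docs * bytesPerDoc * 8 bits) / effective bits-per-second
    static double transferSeconds(int docs, long bytesPerDoc, double effectiveBitsPerSecond) {
        return docs * bytesPerDoc * 8.0 / effectiveBitsPerSecond;
    }

    public static void main(String[] args) {
        // 50 docs of 20 MB over a 1 Gbit link at 50% efficiency
        System.out.println(transferSeconds(50, 20_000_000L, 500_000_000.0)); // 16.0 s
        // a single 200 KB doc on the same link
        System.out.println(transferSeconds(1, 200_000L, 500_000_000.0));     // 0.0032 s
    }
}
```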
While moving deserialization to another thread will use another thread, threads still need a CPU core to execute on, and the number of cores is not unlimited.
Thank you, this was helpful. Just to add: we have 8 Couchbase nodes (we monitored them; all metrics are in check), and when I say 5k TPS - we have 20 app servers, so on one server the load is hardly 250 TPS.
The app servers are 8-core with 16GB RAM. Our documents are not more than 20~50KB each.
Also, regarding how I measure the network time, I have pasted the code snippet above.
Also, when we look at the overThresholdRequests to Couchbase, in most cases they show a high dispatch duration. Does that mean the client (our app server) is taking long to dispatch requests? If yes, we are failing to identify the bottleneck exactly, because as per the suggestion I have kept the IO KV threads very low, like 2, and even the TCP connections are 2, so each app server would be forming about 2 connections with each of the 8 Couchbase nodes. I tried increasing that number as well, but with no gain.
The Community Edition license provides the free deployment of Couchbase Community Edition for departmental-scale deployments of up to five node clusters.
For an 8 node cluster, you need to be on Enterprise Edition, and can then use the expertise of Couchbase technical support to work through any performance bottlenecks.
(nb 20~50KB documents are reasonably large - it will be worth mentioning that on your ticket, as it may be worth drilling further into Snappy compression and the custom stockJsonFormatter.)
Interesting. So you have 50 requests - let's say averaging (50kb+20kb)/2 = 35kb - fired off all at once, and the one that takes the longest completes in 12ms. So they all complete in 12ms. Now I'm wondering what you expected, and why (or how). Also - if the maximum response you are measuring is 12ms, then there will not be OverThreshold events, as the threshold is 500 ms. So something is not quite right there.
Ignoring compression, the data on the network would be 35kb * 50. And in 12ms that would be 145 Mbytes/second. At 8 bits/byte that's 1.2 Gbit/s. With 50% efficiency that would take all of a 2.4Gbit network. So maybe your network is saturated.
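The reverse calculation - implied network throughput from batch size and time - as a small sketch of the numbers above:

```java
// Implied network throughput from a batch of documents and the time it took.
public class ImpliedThroughput {
    // bits per second implied by moving `docs` docs of `bytesPerDoc` in `seconds`
    static double impliedBitsPerSecond(int docs, long bytesPerDoc, double seconds) {
        return docs * bytesPerDoc * 8.0 / seconds;
    }

    public static void main(String[] args) {
        // 50 docs averaging 35 KB, all completing within 12 ms
        double bps = impliedBitsPerSecond(50, 35_000L, 0.012);
        System.out.printf("%.2f Gbit/s implied%n", bps / 1_000_000_000.0);      // 1.17 Gbit/s
        // at 50% network efficiency, that needs a link about twice as fast
        System.out.printf("%.2f Gbit link needed%n", 2 * bps / 1_000_000_000.0); // 2.33
    }
}
```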