Have you guys benchmarked the performance of the new Java client vs. the old spymemcached one at all? I’m seeing a decent reduction in performance, and while it’s too early for me to say for sure that it’s slower (I need to dig into my new code and make sure I’m not the problem), it sure seems that way. So I'm just asking whether you think the performance is on par with that of the old client.
FWIW, I had ripped out the outermost layer of spymemcached (the one that produces blocking Java Futures) and replaced it with code that produces non-blocking Scala Futures (via Promises) instead. When I removed the blocking from gets on spymemcached I got excellent throughput, up to 100K ops per second in some cases. I’m seeing nowhere near this with the new client. I know spymemcached “pipelined” multiple requests in a single byte array to reduce I/O costs and got a lot of extra throughput from that. Is this something you guys are doing too?
@cbax007, in short: yes, we did. We are far from finished optimizing the new 2.0 series (since it's a brand-new codebase compared to the spymemcached internals), but it already performs very well from our perspective.
You should certainly be able to get it over 100K ops/s, but of course lots of factors play into this. We are actually now more efficient at packing data into a single TCP packet, thanks to the way the request RingBuffer picks data up.
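For anyone following along, the general idea behind that kind of request pipelining/coalescing is to append several already-encoded requests into one buffer and hand the socket a single write. This is just an illustrative sketch, not the SDK's actual RingBuffer code; the `coalesce` helper and the textual requests are made up:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class WriteBatcher {
    // Coalesce several already-encoded requests into one buffer so the
    // socket/channel sees a single write instead of one write per request.
    static byte[] coalesce(List<byte[]> encodedRequests) throws IOException {
        ByteArrayOutputStream batch = new ByteArrayOutputStream();
        for (byte[] request : encodedRequests) {
            batch.write(request); // append in memory; no per-request flush
        }
        return batch.toByteArray(); // hand this to one channel.write(...)
    }

    public static void main(String[] args) throws IOException {
        List<byte[]> requests = List.of(
                "GET a\r\n".getBytes(StandardCharsets.UTF_8),
                "GET b\r\n".getBytes(StandardCharsets.UTF_8),
                "GET c\r\n".getBytes(StandardCharsets.UTF_8));
        byte[] packet = coalesce(requests);
        System.out.println(packet.length); // 21: all three requests in one write
    }
}
```

Cutting the number of flushes/syscalls per request is usually where the extra throughput comes from.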
I can certainly go into much more detail here, but I’d love to hear what code you are benchmarking, in what environment, and what numbers you get/expect. The more info I have, the more I can help. But be assured that we (especially I) benchmark the client on a regular basis with different workloads, also because customers and users are asking and evaluating us in POCs. But again, numbers are relative. Let me know what you want/expect and we’ll get you there.
@daschl, thanks for getting back to me. Our load test environment is running CentOS 6.5 and is virtualized using OpenStack. My test code is running in a Java 7 runtime and using Scala 2.11 on top of that (and Akka 2.3.7). The code and the Couchbase server are on two separate VMs, but I believe those two VMs are on the same iron (which should help with the I/O). Each of those VMs is set up with 4 virtual CPUs. The load comes in via JMeter, which is making HTTP calls to a few of our REST endpoints using 125 total threads within JMeter. JMeter is sort of the throttling factor, as each thread won’t launch another request until its previous response has come back. The service code behind those endpoints is Akka, and a lot of that code hits Couchbase for our caching layer.
As I mentioned in my initial post, we were using a hybridized spymemcached impl where I took out the outer layer so I could get non-blocking Scala Futures returned. Using this setup, with the 125 threads kicking in the load, I was routinely seeing around 8-9K ops per second on my cache bucket, with 95% of the ops completing in under 21ms. Once I changed our code to use the new library, the ops per second in the Couchbase bucket graph dropped to around 4-5K and the 95th-percentile op time went up to 45ms or so.
Also, outside of JMeter, I wrote a little code-based test harness that went straight at our Couchbase client to try and maximize load. I ran that on my Mac Pro (8 cores, 16GB RAM) against a local Couchbase, and that’s where I was routinely seeing 100K ops per second. When I ran the same test with the new code, that number dropped quite a bit, down to about 25-35K ops per second.
I’m very open to this being my code that’s causing the problem. I’m using BinaryDocument for everything, and maybe I’m not being very efficient with how I create ByteBufs. Also, I wrapped RxJava with RxScala so I could get more Scala-friendly support on the Observables. All of these things could be a factor.
Let me know what else you need from me or if you have some optimization tips for me. I appreciate your help with this.
@daschl, I can share whatever code you are interested in seeing by putting it in a public repo on GitHub. Which code do you want to see, though? Do you want to see the new code where I’m interacting with Couchbase via the new API, or the old code where I was using the old spymemcached-based Couchbase client? Or are you looking to see what my load-test harness looks like? Just let me know what you are looking for and I’ll try to get it together in a Git repo to share.
@cbax007 awesome. Yes, I’m mostly interested in the new code that runs too slowly. I want to figure out what the bottleneck is and how far I can push it. We can take it from there if we need more info to follow up.
I think the code which is isolated is probably the easiest to start out with.
@daschl, I’ve been profiling since my earlier post, and one culprit I can see causing performance issues is my liberal use of the timeout combinator on Observables. In my model, for a get, for example, once I get the main Observable returned from a call to bucket.get, I layer in the call to timeout before subscribing. I actually layer in a few more combinators (flatMap, map, elementAtOrDefault) for various functionality, but as I removed these and profiled each time, they were not the problem. When I removed the timeout call, things started to perform better. I want my Observables to emit an error downstream if they don't complete before a certain time, and timeout seemed the best way to achieve this. Is there a better way to do this? Do I need to give timeout a different Scheduler or something?
In general I think placing a timeout at the very end, while keeping the other various error handlers as tight as possible, is a good idea. Of course sometimes you want more timeouts throughout the Observable, but in my experience, most of the time you want “do this whole thing, but make sure it takes no longer than N seconds”. Of course, if you want to control different flows more tightly and retry/go somewhere else, things are a little different.
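A minimal sketch of that “one timeout around the whole pipeline” shape, using plain `CompletableFuture` (Java 9+'s `orTimeout`) as a stand-in for the Observable chain, since the actual RxJava code isn't shown in this thread; `fetch` is a hypothetical async lookup standing in for `bucket.get(...)`:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class TimeoutPlacement {
    // Hypothetical async lookup standing in for a bucket.get(...) call.
    static CompletableFuture<String> fetch(String key) {
        return CompletableFuture.supplyAsync(() -> "value-for-" + key);
    }

    public static void main(String[] args) throws Exception {
        // Compose the whole pipeline first, then bound it with a single
        // timeout at the very end instead of one timeout per stage.
        String result = fetch("user:42")
                .thenApply(String::toUpperCase)        // stand-in for map(...)
                .orTimeout(500, TimeUnit.MILLISECONDS) // "whole thing in <= 500ms"
                .get();
        System.out.println(result); // VALUE-FOR-USER:42
    }
}
```

The design point is the same in RxJava: one timeout bounding the composed chain is cheaper and easier to reason about than a timeout layered into every intermediate step.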
@daschl, that helped out immensely in my little profiling sample. I pulled in RxJava 1.0.3, and that seems to have fixed the issue. I’m going to commit my changes, let my nightly load test run tonight, and verify the results tomorrow. I think this was a big part of why things were much slower. I bet I have a few more changes to make, but I’m hoping this uncorks things enough that we are good with the results. I’ll let you know how things look tomorrow. Thanks for the help with this.
No worries, good to know. Actually, with that info I’ll make the call to upgrade 2.0.3 (slated for January) to ship with RxJava 1.0.3 so that more people don't run into it. Note that it does not fix things for Java 6 (which we support), but it's a good first step.
Please let me know if you find bottlenecks in our code, I’m always looking to make it faster.
@Sam_K, one important thing when you are using BinaryDocument: you always need to manually release the content() ByteBuf. Otherwise it will leak in the Netty pool and lead to memory issues.
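The release discipline looks roughly like this. `RefCountedBuf` below is a tiny stand-in modeling Netty's reference counting so the sketch is self-contained; a real `ByteBuf` works the same way, with `release()` in a `finally` once you're done with the bytes:

```java
// Minimal stand-in for a reference-counted buffer. Netty's pooled ByteBuf
// behaves analogously: every buffer you obtain must be released exactly
// once, or the pool leaks memory.
class RefCountedBuf {
    private int refCnt = 1;

    int refCnt() { return refCnt; }

    void release() {
        if (refCnt <= 0) throw new IllegalStateException("already released");
        refCnt--;
    }
}

public class ReleaseDiscipline {
    public static void main(String[] args) {
        RefCountedBuf content = new RefCountedBuf(); // think: doc.content()
        try {
            // ... read the bytes you need from the buffer here ...
        } finally {
            content.release(); // always release, even if reading throws
        }
        System.out.println(content.refCnt()); // 0 -> nothing leaked
    }
}
```

The try/finally is the important part: if an exception can escape between obtaining the document and releasing its content, the buffer never goes back to the pool.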
Can you dump your heap on OOM and send it over, so we can look at the top offenders and see what is causing the pressure?