Vert.x Couchbase SDK performance


We are seeing an issue with the latest Couchbase SDK 2.7 release, where it is not able to scale up.

When I hit a plain Vert.x health URL that returns a JSON response, I get somewhere around 60k TPS. If I replace it with a reactive Couchbase SDK get() call, throughput comes down to 5k. The server machine the application runs on has 1 CPU core and 4 GB RAM.
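For concreteness, the two endpoints being compared are roughly along these lines (my own sketch, not the attached reproducer; it assumes Vert.x-Web with the RxJava-based Couchbase Java SDK 2.x, and the key name is just an example):

```java
// Baseline health endpoint: builds a small JSON body in memory (~60k TPS).
router.get("/health").handler(ctx ->
    ctx.response()
       .putHeader("content-type", "application/json")
       .end(new JsonObject().put("status", "UP").encode()));

// Couchbase endpoint: one reactive get() per request (~5k TPS observed).
router.get("/doc").handler(ctx ->
    bucket.async()
          .get("ARNG_CMP_ARR::IDENTIFIER::V1")
          .subscribe(
              doc -> ctx.response().end(doc.content().toString()),
              err -> ctx.fail(err)));
```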

The Couchbase cluster is a 3-node cluster with 100% of the data in memory, and there is no CPU increase on the Couchbase nodes during the test.

I can share a reproducer.

I would like to know how to get this resolved through this forum.

@himanshu.mps If you can log a bug here with the reproducer, that would be great.

Does this mean it performed differently with an earlier 2.7?

From my read of your test description, you were previously returning an in-memory structure directly, and now you're making a Couchbase get() call, which is obviously going to involve some network IO, even if on localhost. My guess is you're testing this with a simple loop, fetching as fast as you can?

If so, what you're seeing is probably expected. The additional latency of fetching the item from Couchbase drops your throughput. If you add more concurrency, your throughput will go up. There are more tuning options, but that's more about optimization than understanding why you see a difference.

See also Little's Law. Since the average wait time is relatively fixed (e.g., your network IO, whether localhost or actual, plus processing time), adding concurrency means turning up the arrival rate, which improves throughput until something deeper in the system becomes the tall pole.
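To make that concrete, here is a quick back-of-the-envelope using Little's Law (L = X × W, i.e., expected throughput X = concurrency L ÷ average latency W); the latency figure is illustrative:

```java
public class LittlesLaw {
    // Little's Law: concurrency (L) = throughput (X) * average latency (W),
    // so the achievable throughput is X = L / W.
    static double throughput(int concurrency, double avgLatencySeconds) {
        return concurrency / avgLatencySeconds;
    }

    public static void main(String[] args) {
        // 30 connections at ~4 ms average latency caps throughput near
        // 7,500 req/s, close to the ~7.6k observed with the Couchbase get().
        System.out.println(Math.round(throughput(30, 0.004)));   // 7500
        // 100 connections at the same per-request latency would allow ~25k
        // req/s, unless something else in the system becomes the bottleneck.
        System.out.println(Math.round(throughput(100, 0.004)));  // 25000
    }
}
```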

My apologies.

This is the same behavior I am seeing across SDKs; it has nothing to do with the current SDK version.

Attached is the code that can be used to test the performance.

The key I am using is in this format: “ARNG_CMP_ARR::IDENTIFIER::V1”.

I am not hitting multiple keys, as I don’t see the CPU increasing on the node that holds the vBucket containing the key.
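For context on why a single key always lands on a single node: Couchbase maps each key to one of the cluster's vBuckets by hashing it, so every request for this key hits the same vBucket and the same node. A minimal sketch of that mapping (my assumption: the standard CRC32-based scheme with 1024 vBuckets; check the docs for your deployment):

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class VBucketForKey {
    // Hash the key with CRC32, take the high 16 bits, and mask down to
    // the vBucket count (1024 on a standard Couchbase cluster).
    static int vbucketOf(String key, int numVBuckets) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        return (int) ((crc.getValue() >> 16) & (numVBuckets - 1));
    }

    public static void main(String[] args) {
        // Every request in this test targets the same vBucket (and node).
        System.out.println(vbucketOf("ARNG_CMP_ARR::IDENTIFIER::V1", 1024));
    }
}
```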

Test 1:

[hgupt51@test02 wrk-master]$ ./wrk -c30 -t2 -L -d30s
Running 30s test @
  2 threads and 30 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   637.45us    1.24ms  23.02ms   95.18%
    Req/Sec    33.48k     5.30k   38.84k    93.17%
  Latency Distribution
     50%  390.00us
     75%  464.00us
     90%  588.00us
     99%    7.31ms
  1998699 requests in 30.00s, 211.58MB read
Requests/sec:  66613.83
Transfer/sec:      7.05MB
[hgupt51@test02 wrk-master]$ ./wrk -c30 -t2 -L -d30s
Running 30s test @
  2 threads and 30 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     4.02ms    1.69ms  25.48ms   90.07%
    Req/Sec     3.84k   663.85     4.39k    91.83%
  Latency Distribution
     50%    3.58ms
     75%    4.23ms
     90%    5.65ms
     99%   10.94ms
  229948 requests in 30.11s, 0.90GB read
Requests/sec:   7638.01
Transfer/sec:     30.46MB

Test 2:

[hgupt51@test02 wrk-master]$ ./wrk -c100 -t2 -L -d30s
Running 30s test @
  2 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.12ms    0.96ms  24.88ms   96.25%
    Req/Sec    45.50k     4.59k   49.89k    91.67%
  Latency Distribution
     50%    0.99ms
     75%    1.11ms
     90%    1.28ms
     99%    4.22ms
  2717186 requests in 30.02s, 287.64MB read
Requests/sec:  90518.50
Transfer/sec:      9.58MB
[hgupt51@test02 wrk-master]$ ./wrk -c100 -t2 -L -d30s
Running 30s test @
  2 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    11.83ms    2.63ms  41.51ms   71.30%
    Req/Sec     4.25k   270.29     4.63k    89.00%
  Latency Distribution
     50%   11.72ms
     75%   13.32ms
     90%   14.88ms
     99%   19.58ms
  253805 requests in 30.02s, 0.99GB read
Requests/sec:   8454.29
Transfer/sec:     33.71MB

Test 3:

[hgupt51@test02 wrk-master]$ ./wrk -c300 -t2 -L -d30s
Running 30s test @
  2 threads and 300 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.44ms    3.57ms 217.01ms   95.92%
    Req/Sec    43.86k     5.33k   55.92k    86.62%
  Latency Distribution
     50%    2.72ms
     75%    4.03ms
     90%    5.46ms
     99%   12.09ms
  2623829 requests in 30.08s, 277.75MB read
Requests/sec:  87218.75
Transfer/sec:      9.23MB
[hgupt51@test02 wrk-master]$ ./wrk -c300 -t2 -L -d30s
Running 30s test @
  2 threads and 300 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    44.21ms   11.55ms 255.77ms   92.60%
    Req/Sec     3.42k   543.65     4.40k    84.14%
  Latency Distribution
     50%   42.11ms
     75%   45.78ms
     90%   50.57ms
     99%   91.65ms
  204391 requests in 30.05s, 815.17MB read
Requests/sec:   6802.72
Transfer/sec:     27.13MB

Popping this up a level, are you trying to implement a health check? If so, fetching a single document probably isn’t the best way to do this.

We have a couple of health check APIs, ping() and diagnostics(), that use noops and can fan out to the different services in the system. You may want to check those out in the docs.
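If a health check is the goal, a sketch with SDK 2.x might look like this (hedged: these are the 2.x API names as I recall them, and a reachable cluster is assumed; verify against the docs for your exact version):

```java
// Sketch only: assumes Couchbase Java SDK 2.x on the classpath.
Cluster cluster = CouchbaseCluster.create("127.0.0.1");
Bucket bucket = cluster.openBucket("default");

// ping() actively sends lightweight ops (noops) to each service.
PingReport ping = bucket.ping();
System.out.println(ping.exportToJson());

// diagnostics() passively reports the state of existing connections.
DiagnosticsReport diag = cluster.diagnostics();
System.out.println(diag.exportToJson());
```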

Also, if you’re hitting one key, you’re also forcing some overhead internal to the Data service to access that item. You may not see significant CPU usage because it’s all in memory and you may have a lot of cores, but guaranteed there is a hot lock with a busy loop like this.

Could you explain the difference between tests 1, 2, and 3? What is different, and what falls outside of what you expect to see?

The point I am trying to make is that when you increase the number of connections, Couchbase starts taking more time, and that hurts the overall metrics. The health check is not doing any processing beyond the JSON creation work, but it is still able to deliver performance.

Now, given that Vert.x and Couchbase both have reactive libraries and are meant for the cloud, is this the best we can get with the constrained hardware requirement?

The tests I am running are on a VM with 1 CPU core and 4 GB RAM, and the results shared above are from that VM. I get worse performance when I run the same code on OpenShift 3.11 with a pod of the same size as the VM.

Is there any optimization we can do (I am using JsonDocument and can still live with RawDocument), or any other considerations, so that we can achieve at least 10,000 TPS with, let’s say, 100 concurrent users and a 99th percentile within 10 ms? That way we can set the number of pods based on the number of connections the application is going to get.

I ran with multiple keys and I don’t see much difference.

@ingenthr Any insights?

Does this mean you ran with multiple concurrent requests? If so, how many?

I’m glad to try to help, of course. I don’t have the time at the moment to read the code and try to repro it for you. We can try to guide you in your investigation.

This sounds like it should be quite doable, yes. I don’t think you need to change the transcoder just yet, unless you have profiled and know you’re CPU bound. My suspicion is still that the difference between your two environments is that the latency goes up (which is not unreasonable), and that drops the throughput for a small number of tight loops. But with your 100 concurrent users, the kind of throughput you describe seems quite doable.
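Working that target backwards through the same concurrency/latency arithmetic (a rough feasibility check, not a guarantee):

```java
public class TargetLatency {
    // With L concurrent users and a target throughput of X requests/sec,
    // average latency must stay at or below W = L / X (Little's Law again).
    static double requiredLatencyMs(int users, int targetTps) {
        return 1000.0 * users / targetTps;
    }

    public static void main(String[] args) {
        // 100 users at 10,000 TPS needs an average latency of <= 10 ms.
        // Note this is consistent with Test 2 above: ~11.8 ms average at
        // 100 connections gave ~8.4k req/s, just short of the target.
        System.out.println(requiredLatencyMs(100, 10_000)); // 10.0
    }
}
```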

Hope that helps.