Couchbase Server Tuning

My single-node server is running Couchbase 3.0.3-1716 on CentOS 6.7 with 40 hardware threads (20 cores with 2 threads each), 32 GB RAM and 64 GB storage. It currently tops out at 200K ops/sec. With 1 client machine it reaches 120K ops/sec; with 2 client machines it reaches 200K ops/sec; but with 3 or more machines it never goes beyond 200K ops/sec.

Below are the two sysctl configurations I have tried.
Config 1:
vm.swappiness = 0
vm.dirty_background_ratio = 10
vm.dirty_ratio = 20
vm.dirty_writeback_centisecs = 100
vm.dirty_expire_centisecs = 200
net.ipv4.tcp_sack = 0
net.ipv4.tcp_fack = 1
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_moderate_rcvbuf = 1
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_intvl = 30
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.ipv4.tcp_mem = 134217728 134217728 134217728
net.ipv4.tcp_rmem = 4096 277750 134217728
net.ipv4.tcp_wmem = 4096 277750 134217728
net.core.somaxconn = 40000
net.core.netdev_max_backlog = 300000
net.ipv4.tcp_max_syn_backlog = 40000

Config 2:
net.ipv4.tcp_rfc1337 = 1
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_workaround_signed_windows=1
net.ipv4.tcp_sack = 1
net.ipv4.tcp_fack = 1
net.ipv4.tcp_low_latency = 1
net.ipv4.ip_no_pmtu_disc = 0
net.ipv4.tcp_mtu_probing = 1
fs.file-max = 50384
net.core.netdev_max_backlog = 300000
net.ipv4.tcp_moderate_rcvbuf = 1
net.core.somaxconn = 40000
net.ipv4.tcp_max_syn_backlog = 40000
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_fin_timeout = 15
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_tw_reuse = 1
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.core.optmem_max = 134217728
net.ipv4.tcp_mem = 134217728 134217728 134217728
net.ipv4.tcp_rmem = 4096 262144 134217728
net.ipv4.tcp_wmem = 4096 262144 134217728
vm.swappiness = 0
vm.dirty_bytes = 209715200
vm.dirty_background_bytes = 104857600
vm.zone_reclaim_mode = 0
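
For reference, a minimal sketch of how either config can be applied and spot-checked, assuming the values are kept in /etc/sysctl.conf:
# reload /etc/sysctl.conf and spot-check one of the values
sysctl -p
sysctl net.core.somaxconn   # expected: net.core.somaxconn = 40000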

My common settings are below:
ext4 filesystem with options rw,noatime,barrier=0,data=writeback
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
echo deadline > /sys/block/sda/queue/scheduler
echo 1024 > /sys/block/sda/queue/nr_requests
selinux disabled
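
The echo-based settings above do not survive a reboot; a rough sketch of one way to reapply them at boot, assuming the stock CentOS 6 /etc/rc.local is used:
# append the runtime tunings to /etc/rc.local so they are reapplied at boot
cat >> /etc/rc.local <<'EOF'
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
echo deadline > /sys/block/sda/queue/scheduler
echo 1024 > /sys/block/sda/queue/nr_requests
EOF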

ulimit -a
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) unlimited
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 999999
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) unlimited
real-time priority (-r) 0
stack size (kbytes, -s) 244
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
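
A sketch of how limits like the open-files value are typically raised, assuming /etc/security/limits.conf is used and the service runs as a user named couchbase (both are assumptions here):
# /etc/security/limits.conf entries (user name "couchbase" is assumed)
couchbase  soft  nofile  999999
couchbase  hard  nofile  999999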

Neither configuration has pushed throughput beyond 200K ops/sec. Total CPU usage never goes above 15%, and the load is evenly distributed, with no single core exceeding 40%. Total RAM usage stays under 10% and disk usage is around 5%.

What else am I missing? Why can't I get beyond that 200K ops/sec limit when none of my hardware limits is being reached yet? Please help, as we really want to fully utilise each machine in the cluster.

Thanks in advance!

What client application(s) and SDK are you using to drive the workload? I suspect the server simply isn't being pushed hard enough. Is your application synchronous or asynchronous?

If you use a benchmark program like cbc-pillowfight (part of the Couchbase C SDK tools) and run that against your server instead of your application (configured for a similar document count and size), what numbers do you see?
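
For example, something along these lines (a hypothetical invocation; adjust -I, -m and -M to match your document count and sizes, and the connection string to your own host and bucket):
# 1M documents of 512 bytes, driven by 38 client threads
cbc-pillowfight -U couchbase://yourhost/yourbucket -t 38 -I 1000000 -m 512 -M 512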

Thanks for the reply. We are using Node.js clients that we wrote ourselves, using asynchronous operations. I don't think the client is the bottleneck, because we are using separate machines. We run 38 instances of that Node.js client on each machine, pinned with CPU affinity. And as mentioned, 1 machine gives 120K ops/sec, 2 machines give 200K, and 3 or 4 machines still give 200K. Each instance across the test system accesses its own unique keys, so there are no race conditions anywhere in the test.

Maybe, but it's probably worth testing to determine for sure whether it is the client or the server.

But we have seen ops/sec scale as we add one client instance at a time; it scales almost linearly until we hit that 200K ops/sec ceiling.

OK, I have run a test with cbc-pillowfight using the command below:
cbc-pillowfight -t 38 -U couchbase://ourip/oubucket
One machine makes the Couchbase server reach 340K ops/sec. Running the same command on a second machine keeps throughput close to, but slightly below, 340K ops/sec. Running it on a third machine drops throughput a little further, to 320K ops/sec. Utilisation stays at about 15% CPU, 10% RAM and 5% disk (no disk activity at all, the same as with our own test app).
So I come back to the same question: why can't we push it any higher? cbc-pillowfight gets somewhat higher throughput, presumably because of its document size and the complexity of the operations; ours also uses atomic operations. Clearly we are hitting a server-side limitation that we cannot explain.
We have bonded two 10Gb Ethernet ports, connected through the same switch. Machine-to-machine transfers achieve far more bandwidth than this workload needs, so it should not be the network either.
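
As a sanity check on the bond itself (the interface name bond0 is an assumption), the bonding mode and per-slave link status can be inspected with:
cat /proc/net/bonding/bond0   # shows bonding mode and link state of each slave
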
Please help us identify what we can do to increase Couchbase's throughput and fully utilise the hardware without resorting to virtual machines. Thanks in advance!

For pillowfight, try tuning the batch size: essentially you need to account for the bandwidth-delay product between the client and the server.

Also, try pushing the thread count even higher. With 40 hardware threads you'll have 40 * 0.75 = 30 front-end threads on Couchbase (per server), so 38 client threads are unlikely to be able to saturate the server.
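
For example, something like the following (flag names as in the libcouchbase 2.x cbc-pillowfight; the thread count and batch size here are just starting points to tune):
# more client threads, larger operation batches per network round trip
cbc-pillowfight -U couchbase://ourip/oubucket -t 64 -B 250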

The server (Couchbase Server) is on one machine with 40 hardware threads. Client 1 (cbc-pillowfight) is on a separate machine with 40 threads, and so are client 2 and client 3. Even if client 1 on its own cannot saturate the server, client 2 and client 3 should at least push the server's throughput higher.