Disk write spikes & performance degradation

Hello,
We are seeing occasional performance issues with one of our buckets; maybe somebody can point me in the right direction?

4-node cluster, 256GB of RAM per server (physical), SSD disks, CentOS. Used as a pure key/value store (no indexes, no queries). Some kernel tuning (TCP stack, buffers, tuned profiles, disk alignment, etc.); other CB settings left at their defaults.

We have a few buckets on the cluster; the largest is ~10 TB, used to store static web page objects. Overall usage is quite low: we hover around 40K operations per second, mostly on a bucket used for short-term content caching (~20K operations per second on that bucket alone).

“Sometimes” (once or twice per month) we see response times jump from a few microseconds to tens of milliseconds. If I look at the statistics during that period, I see very high disk write queues on some (typically only one) of the servers in the cluster (e.g., “disk write queue” = 300-600K, staying at that level for an hour or so). During this time “BG wait time” on this server goes up to 40K, and the client application sees response times 1000x higher than normal (see screenshot below for other counters).
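For reference, a rough way to watch these counters outside the UI is to poll the per-bucket stats REST endpoint on any node. This is only a minimal sketch: the host, bucket name and credentials are placeholders, and the exact stat key names can vary between versions, so missing keys are simply skipped.

```python
# Minimal sketch: poll the per-bucket stats REST endpoint and print the latest
# sample of a few disk-related counters, so a spike can be caught as it happens.
import time
import requests

HOST = "http://cb-node1:8091"          # any node in the cluster (placeholder)
BUCKET = "static-objects"              # hypothetical bucket name
AUTH = ("Administrator", "password")   # admin credentials (placeholder)
STATS = ["disk_write_queue", "ep_bg_fetched", "ep_queue_size"]  # key names assumed

while True:
    r = requests.get(f"{HOST}/pools/default/buckets/{BUCKET}/stats", auth=AUTH)
    r.raise_for_status()
    samples = r.json()["op"]["samples"]
    latest = {name: samples[name][-1] for name in STATS if name in samples}
    print(time.strftime("%H:%M:%S"), latest)
    time.sleep(10)
```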

If I look at bucket usage, I do not see any burst in read or write activity coming from the clients. SAR statistics show the disks on this server becoming very busy during the incident (about 1M sectors/second read/written, 50 ms wait times, 100% disk utilization).
I had a look at the CB logs on this server, but nothing pops out, or maybe I don't know where to look.
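A simple way to line the OS-level view up with the Couchbase graphs afterwards is to log disk utilization at a fixed interval. A rough sketch, assuming sysstat's iostat is installed on the node; the device name and threshold are placeholders:

```python
# Sample extended disk stats and log a line whenever %util crosses a threshold.
import subprocess
import time

DEVICE = "sda"        # placeholder: the device backing the Couchbase data path
THRESHOLD = 90.0      # %util considered "busy"

while True:
    out = subprocess.run(["iostat", "-dx", DEVICE, "1", "2"],
                         capture_output=True, text=True).stdout
    # Two reports are printed; the last device line covers the 1-second interval.
    device_lines = [line for line in out.splitlines() if line.startswith(DEVICE)]
    if device_lines:
        util = float(device_lines[-1].split()[-1])   # %util is the last column
        if util >= THRESHOLD:
            print(time.strftime("%Y-%m-%d %H:%M:%S"), f"{DEVICE} %util={util:.1f}")
    time.sleep(60)
```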

Any idea what could be the cause of this disk activity?

Many thanks!

Which version are you running? Does it correlate to compaction?

Hi, thanks for replying.

This is on 4.5.0-2601 Community Edition (build-2601).

Compaction runs quite frequently on that cluster; some of the buckets have a relatively high write volume (it does not take long for a significant portion of the keys to be re-written). Over the hour of the incident I can see it kicking in a few times, but there is no warning of note in the log.
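For anyone checking the same thing, the cluster-wide auto-compaction settings (fragmentation thresholds, time windows, etc.) can be read over the REST API. A small sketch, with host and credentials as placeholders; per-bucket overrides, if configured, show up in the bucket definition instead:

```python
# Dump the cluster-wide auto-compaction settings as pretty-printed JSON.
import json
import requests

HOST = "http://cb-node1:8091"          # placeholder
AUTH = ("Administrator", "password")   # placeholder

resp = requests.get(f"{HOST}/settings/autoCompaction", auth=AUTH)
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))
```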

Anything specific I should look for?

There’s certainly nothing which should run at a period of once/twice a month by default.

The compactor (as Matt suggested) is a common cause of disk write spikes, and indeed it will cause the disk write queues to increase, as they are used to “hold” incoming writes while compaction is running.

Note the default number of parallel compactors was changed from 3 to 1 in 4.5.1 (see MB-18426) - you might want to experiment with the diag/eval mentioned there, but note that will reduce your compaction throughput (and hence potentially increase disk space requirements).
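For reference, a diag/eval is just an Erlang expression POSTed to the /diag/eval endpoint on port 8091 as the admin user. A hedged sketch of the call is below; the expression shown is only a placeholder, so take the exact one from MB-18426:

```python
# Sketch of issuing a diag/eval against a node. The expression is a placeholder;
# substitute the exact diag/eval given in MB-18426.
import requests

HOST = "http://cb-node1:8091"          # placeholder
AUTH = ("Administrator", "password")   # placeholder

EXPR = "ns_config:set(compaction_number_of_kv_workers, 1)"  # placeholder expression

resp = requests.post(f"{HOST}/diag/eval", auth=AUTH, data=EXPR)
resp.raise_for_status()
print(resp.status_code, resp.text)
```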

Thanks for the suggestion. I'll play with these parameters tomorrow. Given that it happens quite rarely, it might be a while before I can see the results, though.

Thanks again.

How does that work? I've run into the same problem: after changing the parameter, the SSD I/O utilization is still high (up to 98%), which still causes requests to queue and leads to long response times.

I upgraded the clusters to 4.5.1, but in my case this issue only pops up once or twice a month, so I haven't seen it happen again yet; it might be too early to celebrate.

What edition are you using: open source, Community, or Enterprise? @paolop

We’re upgrading them all to 4.5.1-2844 Community Edition (build-2844).

Do you compile Couchbase Server from source?

Nope, using the CentOS packages from CB's website.

Just to wrap this up, the issues we were seeing with large disk spikes have gone away after the upgrade to version 4.5.1.

Good to hear. Would you mind marking this topic as solved if that’s the case? - thanks.