Like clockwork, every 120 seconds our view performance tanks.
If I run a quick loop that sends view requests over and over, they return in 40ms to 150ms. Then, for about 10s (say from 1:02:12pm to 1:02:24pm), the server stops responding, after which it returns all the pending calls, each taking 10000ms or more.
I first noticed this while using the couchbase-client for Java, but I went ahead and wrote a quick Ruby script to hit the view REST API directly, using the same queries.
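For anyone who wants to reproduce the measurement, here is a minimal sketch of such a probe loop in Python (not the original Ruby script). It assumes the view is reachable over the view REST API on port 8092 at `/<bucket>/_design/<ddoc>/_view/<view>`; the helper names `view_url`, `http_fetch` and `probe` are made up for this illustration.

```python
import time
import urllib.request
from urllib.parse import quote, urlencode


def view_url(host, bucket, ddoc, view, **params):
    """Build a view query URL for the view REST API on port 8092."""
    path = "/{}/_design/{}/_view/{}".format(quote(bucket), quote(ddoc), quote(view))
    query = "?" + urlencode(params) if params else ""
    return "http://{}:8092{}{}".format(host, path, query)


def http_fetch(url, timeout=30):
    """Issue one GET and read the whole response body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()


def probe(fetch, count, pause=0.0, clock=time.monotonic, sleep=time.sleep):
    """Call fetch() `count` times, one after another.

    Returns a list of (wall_clock_start_s, latency_ms) samples, so slow
    requests can later be lined up against server logs by timestamp.
    """
    samples = []
    for _ in range(count):
        started_wall = time.time()   # wall-clock time, for correlating with logs
        t0 = clock()                 # monotonic time, for the latency itself
        fetch()
        samples.append((started_wall, (clock() - t0) * 1000.0))
        if pause:
            sleep(pause)
    return samples
```

A run against one node might look like `probe(lambda: http_fetch(view_url("node1", "mybucket", "myddoc", "myview", limit=1)), count=600, pause=0.5)`, where the host, bucket, design document and view names are placeholders. Because the loop is sequential, a stall shows up as a single very slow sample rather than many.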
When the latency spike happens, disk I/O climbs a bit and the beam.smp process maxes out a core of the VM.
I have searched through the logs and don’t see anything on a 120s schedule that obviously correlates with this.
Where could I look to find whatever is happening every 120s that is causing the CPU spike and most likely the view slowness?
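One way to pin down the exact period and get timestamps to compare against the logs is to group the slow samples into stall episodes and look at the spacing between them. This is only a sketch: it assumes samples are recorded as `(start_time_seconds, latency_ms)` pairs, and the function names and the threshold and gap values are arbitrary choices for illustration.

```python
def stall_starts(samples, threshold_ms=1000.0, gap_s=30.0):
    """Return the start time of each stall episode.

    samples: iterable of (start_time_s, latency_ms) pairs.
    A sample is "slow" when its latency is at least threshold_ms.
    Slow samples no more than gap_s seconds apart are treated as
    belonging to the same episode.
    """
    starts = []
    last_slow = None
    for t, latency_ms in sorted(samples):
        if latency_ms < threshold_ms:
            continue
        if last_slow is None or t - last_slow > gap_s:
            starts.append(t)
        last_slow = t
    return starts


def intervals(starts):
    """Seconds between consecutive stall episodes."""
    return [b - a for a, b in zip(starts, starts[1:])]
```

If the intervals come out consistently close to 120, the episode start times can then be searched for in the server logs around those exact moments.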
Looks like this might be related to the stats_archiver, per the thread "Connection timeouts during statistics".
There isn’t a response on that thread on how to diagnose and/or disable/tune the stats_archiver. But it sounds like something is making the server be “underpowered” even though there is barely any traffic.
Hi @misttar, a few questions for you:
- version of Couchbase Server
- HW config and cluster topology
- data size, item count and view details (#ddocs and #views)
- workload details - mutations vs reads vs queries /sec
Couchbase version: 2.5.1 Enterprise Edition (build-1083)
3 nodes, Amazon EC2 m3.medium, CentOS
Data size: ~50 MB of data, ~10k documents, spread across 3 data buckets.
~12 views in 4 design docs across the 3 data buckets
Workload: fewer than 5 reads per second; writes are even lower, a few an hour.
As you can see, our workload is almost non-existent right now, and the slowdown doesn’t correlate with any external access to the server (there is no spike in reads/writes, etc.).
We expect the workload to increase 100x from what it is as soon as we resolve this issue.
But we can’t do that while this behavior is happening.
Apologies for the delayed response. Can you also share the query parameters you are passing to port 8092? Is it a query on one specific view that is always delayed, or does a random query on any view experience the delay?
So we figured out what was causing it.
The stats_archiver runs every 120s to store historical information about the Couchbase nodes, servers, buckets, etc. The load that the stats_archiver generated was enough to max out the single vCPU that we had allocated per node in EC2 (m3.medium).
We resized our nodes to c3.xlarge (4 vCPUs). The extra vCPUs let the load be distributed better, so it no longer causes delays and long request times (our average response time is now 10-20ms).
Thanks @misttar, sorry we were not fast enough.