120% CPU utilization on memcached process, high IO wait, every 90ish days?

ToryMBlue · March 15, 2023, 4:48am

So I’m at a loss. I’ve got a 5-node CE cluster running data only, with 2 views. The configuration runs fine, but give it 90 days and the memcached process starts flailing; it feels like it’s stuck in a loop or something. This is on Amazon EC2 instances, large i3 and i4 nodes, running Amazon Linux 2. There are no network errors and disk tests look fine, but something is happening that forces us to swap in new hardware; 90-ish days seems to be as long as we get before the system starts timing out. There’s nothing in the logs that I can see. The affected node shows more data being written and read than normal, but no single object is being requested more than others on this server. It just seems to be having an issue, and I’m not finding the right log file or the right approach to figure it out.

In our monitoring we see IO Wait start climbing, with no changes to how we access the box, no increased traffic load.
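For anyone wanting to cross-check the monitoring graphs on the node itself, here is a minimal sketch that derives an IO wait percentage from two readings of the aggregate `cpu` line in `/proc/stat` on Linux. The function names and the 5-second interval are illustrative assumptions, not part of any Couchbase tooling.

```python
import time


def parse_cpu_line(line):
    """Return (iowait_ticks, total_ticks) from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Field order: user nice system idle iowait irq softirq steal
    values = [int(v) for v in fields[1:9]]
    return values[4], sum(values)


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two 'cpu' lines."""
    io0, total0 = parse_cpu_line(before)
    io1, total1 = parse_cpu_line(after)
    delta_total = total1 - total0
    if delta_total <= 0:
        return 0.0
    return 100.0 * (io1 - io0) / delta_total


def sample_iowait(interval_s=5.0, path="/proc/stat"):
    """Read /proc/stat twice, interval_s apart, and return the iowait percentage."""
    with open(path) as f:
        first = f.readline()
    time.sleep(interval_s)
    with open(path) as f:
        second = f.readline()
    return iowait_percent(first, second)
```

Calling `sample_iowait()` in a loop and logging the result with a timestamp would give a record to line up against the memcached logs when the climb starts.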

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
15933 couchba+ 20 0 153.9g 93.0g 0 S 114.9 75.0 12352:28 /opt/couchbase/bin/memcached -C /opt/couchbase/var/l+

Has anyone experienced something similar, and how exactly should I track this down? If I wait for 90 days, I will start seeing increased timeouts. It really seems that it starts churning without any real added traffic or requests, so what is it doing? XDCR is outbound only, so it’s not an influx of XDCR data.

Also note: if I shut the service down and allow it to flush its memory, it seems to be happy for a bit. But I can’t really do this, because the box is in bad shape and it takes forever for the process to stop; by that time I’ve got a failover happening and/or a ton of failed requests.

I’m just trying to get a better understanding of where I should be looking in the logs or processes to see what it’s doing at the time of chaos.
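One generic Linux angle on "what is it doing" is the kernel's per-process I/O counters in `/proc/<pid>/io`. The sketch below turns two snapshots of that file into read and write rates. It is a rough illustration only: the helper names are made up, it assumes the script runs as root or as the same user as the process, and it says nothing about Couchbase internals.

```python
import time


def parse_proc_io(text):
    """Parse the 'key: value' lines of /proc/<pid>/io into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            counters[key.strip()] = int(value)
    return counters


def io_rates(before, after, interval_s):
    """Bytes per second actually read from and written to storage between two snapshots."""
    return {
        key: (after[key] - before[key]) / interval_s
        for key in ("read_bytes", "write_bytes")
        if key in before and key in after
    }


def sample_process_io(pid, interval_s=5.0):
    """Snapshot /proc/<pid>/io twice, interval_s apart, and return the rates."""
    path = f"/proc/{pid}/io"
    with open(path) as f:
        before = parse_proc_io(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        after = parse_proc_io(f.read())
    return io_rates(before, after, interval_s)
```

For the process in the `top` output above, that would be `sample_process_io(15933)`. Comparing the numbers from a healthy node and a degraded one would at least show whether the extra disk traffic really comes from memcached.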

system · June 13, 2023, 4:48am

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.

perry · June 22, 2023, 11:20am

Hi @ToryMBlue, apologies for the delay in getting back to you here. Can you let us know what version of Couchbase you’re using? Given that this sounds like a regular, periodic issue, it may very well be something that we’ve fixed in later versions (or perhaps something we need to investigate on the latest).

For these types of issues, you may also consider opening a ticket on issues.couchbase.com where it will be easier to collect troubleshooting information and have our engineering team investigate for a possible bug.
