120% CPU utilization on memcached process, high IO wait, every 90-ish days?

So I’m at a loss. I’ve got a 5-node CE cluster running data only, with 2 views. The configuration runs fine, but give it 90 days and the memcached process starts flailing; it feels like it’s stuck in a loop or something. This is on Amazon EC2, large i3 and i4 instances, running Amazon Linux 2. No network errors, and disk tests come back fine, but something is happening that forces us to swap in new hardware. 90-ish days seems to be as long as we get before the system starts timing out. There is nothing in the logs that I can see. The affected node shows more data being written/read than normal, but no single object is being requested more than others on this server. It just seems to be having an issue, and I’m not finding the right log file or the right approach to figure it out.

In our monitoring we see IO wait start climbing, with no changes to how we access the box and no increase in traffic load.
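One way to confirm that trend independently of the monitoring stack is to sample the aggregate `cpu` line of `/proc/stat` on the affected node and compute the iowait share between two readings. The following is a minimal sketch, assuming a Linux host with the standard `/proc/stat` layout; the function names (`parse_cpu_line`, `iowait_percent`, `sample`) are made up for illustration and are not part of any Couchbase tooling.

```python
import time

def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of jiffy counters."""
    parts = line.split()
    if parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(x) for x in parts[1:]]

def iowait_percent(before, after):
    """Share of elapsed CPU time spent in iowait between two samples."""
    # Field order: user nice system idle iowait irq softirq steal [guest guest_nice].
    # Guest time is already counted inside user, so only the first 8 fields are summed.
    deltas = [a - b for a, b in zip(after[:8], before[:8])]
    total = sum(deltas)
    return 100.0 * deltas[4] / total if total > 0 else 0.0

def read_cpu(path="/proc/stat"):
    """Read the current aggregate CPU counters."""
    with open(path) as f:
        return parse_cpu_line(f.readline())

def sample(interval=5.0, count=12):
    """Print a timestamped iowait percentage every `interval` seconds, `count` times."""
    prev = read_cpu()
    for _ in range(count):
        time.sleep(interval)
        cur = read_cpu()
        print(f"{time.strftime('%H:%M:%S')} iowait={iowait_percent(prev, cur):.1f}%")
        prev = cur
```

Calling `sample()` on a healthy node and on the degraded one gives two short series to compare; logging the output over weeks would show whether the climb is gradual or sudden.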

15933 couchba+ 20 0 153.9g 93.0g 0 S 114.9 75.0 12352:28 /opt/couchbase/bin/memcached -C /opt/couchbase/var/l+

Has anyone experienced something similar, and/or how exactly am I to track this down? If I wait 90 days, I will start seeing increased timeouts. It really seems that it starts churning without any real added traffic or requests, so what is it doing? XDCR is outbound only, so it’s not an influx of XDCR data.

Has anyone seen something similar, or have an idea of how I can track it down? Also note, if I shut the service down and allow it to flush its memory, it seems to be happy for a bit. But I can’t really do this, because the box is in bad shape and it takes forever for the process to stop; by that time I’ve got a failover happening and/or a ton of failed requests.

I’m just trying to get a better understanding of where I should be looking in the logs or processes to see what it’s doing at the time of chaos.
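At the process level, one generic starting point is to find out which threads inside memcached are consuming the CPU, by diffing two snapshots of `/proc/<pid>/task/<tid>/stat`. This is a rough sketch, assuming Linux and permission to read the process's `/proc` entries; the helper names (`parse_task_stat`, `thread_cpu`, `busiest`) are hypothetical, and the thread names are simply whatever the kernel reports.

```python
import os

def parse_task_stat(text):
    """Return (thread_name, cpu_jiffies) from the contents of a task stat file."""
    # The name sits in parentheses and may itself contain spaces or ')',
    # so split on the last ')' rather than on whitespace.
    open_paren = text.index("(")
    close_paren = text.rindex(")")
    name = text[open_paren + 1:close_paren]
    rest = text[close_paren + 1:].split()
    # rest[0] is field 3 (state); utime and stime are fields 14 and 15.
    utime, stime = int(rest[11]), int(rest[12])
    return name, utime + stime

def thread_cpu(pid):
    """Map thread id -> (name, cumulative CPU jiffies) for one process."""
    result = {}
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/stat") as f:
                result[int(tid)] = parse_task_stat(f.read())
        except (FileNotFoundError, ProcessLookupError):
            continue  # thread exited between listing and reading
    return result

def busiest(before, after, n=5):
    """Threads with the largest CPU delta between two thread_cpu() snapshots."""
    deltas = [(after[t][1] - before[t][1], t, after[t][0])
              for t in after if t in before]
    return sorted(deltas, reverse=True)[:n]
```

Taking `thread_cpu(15933)` twice, a few seconds apart, and passing both results to `busiest` lists the threads that accumulated the most CPU time in between. It does not explain why they are busy, but it narrows down where to look next.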



Hi @ToryMBlue, apologies for the delay in getting back to you here. Can you let us know what version of Couchbase you’re using? Given that this sounds like a regular/periodic issue, it may very well be something that we’ve fixed in later versions (or perhaps something we need to investigate on the latest).

For these types of issues, you may also consider opening a ticket on issues.couchbase.com where it will be easier to collect troubleshooting information and have our engineering team investigate for a possible bug.