Why am I having to rebuild my nodes every 90 days? High IOWait, SSD backed

Second post, since the first netted lots of eyeballs but no suggestions.

Feels like an SSD issue to me: around 90 days in, IOWait starts to climb. If I test the disks they test fine, but Couchbase starts timing out and has trouble responding to queries in a timely fashion. So IOWait goes up, which pushes the CPU up; normally the box runs at maybe 2%, but when a server hits the 90-day mark it's running at 12-20% CPU.
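One thing worth checking: sequential disk benchmarks can pass while random small-read latency (much closer to Couchbase's access pattern) has degraded. A minimal Python sketch, not from the original post, that samples random 4 KB read latencies against a data file and reports the tail (the file path would be whatever lives under the Couchbase data directory; note that page-cache hits will make this optimistic unless the file is cold):

```python
import os
import random
import time

def sample_read_latencies(path, samples=200, block=4096):
    """Time random 4 KB reads against a file; return sorted latencies in ms."""
    size = os.path.getsize(path)
    latencies = []
    fd = os.open(path, os.O_RDONLY)
    try:
        for _ in range(samples):
            offset = random.randrange(0, max(1, size - block))
            start = time.perf_counter()
            os.pread(fd, block, offset)  # one small random read
            latencies.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
    return sorted(latencies)

def p99(latencies):
    """Rough 99th-percentile latency from a sorted sample."""
    return latencies[int(len(latencies) * 0.99) - 1] if latencies else 0.0
```

If the p99 here is healthy while memcached is struggling, that points away from the raw device and toward what the process itself is doing.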

So I’m at a loss. I’ve got a 5-node CE cluster running data only, with 2 views. The configuration runs fine, but give it 90 days and the memcached process starts flailing; it feels like it’s stuck in a loop or something. This is on Amazon EC2, large i3 and i4 instances, running Amazon Linux 2. No network errors, disk tests come back fine, but something happens that forces us to swap in new hardware; ~90 days seems to be as long as we get before the system starts timing out. Nothing in the logs that I can see. The node shows it’s writing/reading more data than normal, but no single object is being requested more than others on this server. It just seems to be having an issue, and I’m not finding the right log file or the right approach to figure it out.

In our monitoring we see IOWait start climbing, with no changes to how we access the box and no increase in traffic load.

15933 couchba+ 20 0 153.9g 93.0g 0 S 114.9 75.0 12352:28 /opt/couchbase/bin/memcached -C /opt/couchbase/var/l+
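To confirm the extra reads/writes really come from that memcached pid (15933 above) rather than the device as a whole, the per-process counters in `/proc/<pid>/io` can be sampled twice; the delta between samples is the process's own I/O rate. A small sketch, not from the original post (the pid would come from `pgrep memcached`):

```python
def parse_proc_io(text):
    """Parse the 'key: value' lines of /proc/<pid>/io into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if value.strip().isdigit():
            stats[key.strip()] = int(value)
    return stats

def io_delta(before, after):
    """Bytes the process actually read/wrote to storage between two samples.

    read_bytes/write_bytes count real block-layer traffic, unlike
    rchar/wchar which include cache hits.
    """
    return {k: after[k] - before[k] for k in ("read_bytes", "write_bytes")}
```

If `write_bytes` keeps climbing while client traffic is flat, that is consistent with internal work (compaction, view indexing) rather than request load.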

Anyone seen similar, or have an idea how I can track it down? Also note: if I shut the service down and let it flush its memory, it seems to be happy for a while. But I can’t really do that, because by that point the box is in bad shape and it takes forever for the process to stop; by the time it does, I’ve got a failover happening and/or a ton of failed requests.

Just trying to get a better understanding of where I should be looking in the logs or processes to see what it’s doing at the time of chaos.

High I/O wait means the CPU is sitting idle while it waits on outstanding I/O requests, but further investigation is needed to confirm the source and the effect. One common cause of high I/O wait is a bottleneck in the storage layer that makes the drive take longer to respond to I/O requests.

Yes, thank you. The issue is it’s only in the membase (memcached) process. Disk I/O tests are fine. I just had to rebuild another system after 90 days.

The only big difference on this cluster is the views. Are the views thrashing the SSD/disk subsystem so badly that it’s causing issues? But again, if I shut down membase, the disks test out great with a variety of disk test suites. This is 100% membase; I’m not sure if it’s a caching problem or something else, but as I pointed out, I truly believe it’s due to the views.
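One way to test the "views are thrashing the disk" theory without shutting membase down: sample `/proc/diskstats` for the data device while the node is live and compute average wait per request and %busy, the same numbers `iostat -x` reports as `await` and `%util`. A hedged sketch, not from the original thread (the device name `nvme0n1` is an assumption; yours may differ):

```python
def disk_fields(diskstats_text, device):
    """Return (ios, ms_waiting, ms_busy) for one device from /proc/diskstats."""
    for line in diskstats_text.splitlines():
        f = line.split()
        if len(f) >= 14 and f[2] == device:
            reads, ms_read = int(f[3]), int(f[6])
            writes, ms_write = int(f[7]), int(f[10])
            ms_busy = int(f[12])  # total time the device was doing I/O
            return reads + writes, ms_read + ms_write, ms_busy
    raise ValueError(f"device {device!r} not found")

def await_and_util(before, after, interval_ms):
    """Average wait per I/O (ms) and %busy between two samples."""
    ios = after[0] - before[0]
    wait = after[1] - before[1]
    busy = after[2] - before[2]
    return (wait / ios if ios else 0.0, 100.0 * busy / interval_ms)
```

If `await` climbs while memcached runs but drops to baseline the moment the service stops, the device itself is fine and the load pattern (view updates/compaction) is the thing to chase.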

Also, to stop that discussion before it starts: N1QL is not cost-effective and requires a crud ton more resources, so that’s a non-starter; we are stuck with views for the time being.