Bucket Hard Out-of-Memory Error - Eviction Bottleneck?


In the last few days I’ve been experiencing a strange problem.
I am using CB Server 4.0 with a full-eviction bucket.

For some reason, occasionally one server or another will run into the ‘Hard OOM’ error.
At these points, I’ve noticed that the memcached process is using all of the memory allocated to the data service.
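
For reference, a simplified model of how that state arises, as a Python sketch. The stat names (`mem_used`, `ep_max_size`) are real ep-engine fields reported by `cbstats`; the 85% high-water default and the sample values are assumptions for illustration:

```python
# Sketch: interpret ep-engine memory stats the way memcached does.
# mem_used and ep_max_size are real cbstats fields; the high-water
# percentage and the sample values below are assumptions.

def memory_state(mem_used, ep_max_size, high_wat_pct=0.85):
    """Classify bucket memory pressure.

    Above the high water mark the item pager should start evicting;
    at/above the full quota, mutations fail with a hard OOM error.
    """
    high_wat = ep_max_size * high_wat_pct
    if mem_used >= ep_max_size:
        return "hard_oom"   # writes rejected until memory is freed
    if mem_used >= high_wat:
        return "evicting"   # pager should be freeing memory
    return "ok"

# Hypothetical 3 GB quota with usage pinned at the quota:
print(memory_state(3_221_225_472, 3_221_225_472))  # -> hard_oom
```

The symptom described above is a node stuck in the `hard_oom` branch even though the pager should have brought it back below the high water mark.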

Once it reaches this state, it never leaves it, regardless of the system load.
The only way to fix it is to manually restart the memcached process (which does work).

Therefore, I suspected that we were writing new documents faster than we could evict them; however, disk throughput is very low (tens to hundreds of KB per second) compared to the potential disk throughput (striped AWS SSDs). Furthermore, the disk and replication queues are not significantly high (they tend to spike and return to 0 quite quickly).
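
The "spike and return to 0" observation is the key evidence here; a small Python sketch of that check, with hypothetical queue-length samples, shows the distinction between a healthy queue and one that is falling behind:

```python
# Sketch: decide whether a queue-length time series "spikes and drains"
# or grows without bound. The sample values are hypothetical.

def queue_drains(samples, drained_below=100):
    """Return True if the queue returns to (near) zero after its peak."""
    peak = max(samples)
    after_peak = samples[samples.index(peak):]
    return min(after_peak) < drained_below

# Spiky-but-healthy disk write queue vs. a steadily growing one:
print(queue_drains([0, 50, 4000, 1200, 30, 0]))    # True  (drains)
print(queue_drains([100, 500, 2000, 5000, 9000]))  # False (falling behind)
```

Since the observed queues match the first pattern, a pure write/eviction bottleneck looks unlikely.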

I’m wondering what could be the cause of this. Is there a known bug in 4.0 that can cause such an issue?

Also, as more evidence: we didn’t have this problem until we started building GSI indexes; however, the issue crops up on machines which DON’T contain an index.

Image below shows vbucket status at the time for the affected node.

Looking at your screenshot, you seem to be missing vBuckets. You need to have 1024 vBuckets on both the active and replica sides. Or are you running some special configuration in production?

Oh, this screenshot was for a single node; there are 3 nodes, so across the whole bucket there are 1024 vBuckets.

I’m wondering if, since this didn’t start occurring until after trying to build GSI indexes, it might be related to this:

The first time I tried to build a primary index, it required significantly more space than I expected and failed.
Now the cluster is in a strange state: on one node, a N1QL query on ‘system:indexes’ returns a single index in the ‘pending’ state (which doesn’t show up in the admin UI), but querying from the other nodes returns an empty result set. (I determined this by using cbq and changing the --engine option to point to different hosts in the cluster.)

Edit: I realized this symptom was due to caching in the cbq-engine; after restarting that process on each node, the inconsistency disappeared. So that seems unrelated to the issue above.
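
For anyone reproducing this, the cross-node comparison can be automated. A Python sketch of the diagnostic (node names and index lists are hypothetical; in practice each list would come from running `SELECT * FROM system:indexes` against that node’s query engine via `cbq --engine=http://<node>:8093/`):

```python
# Sketch: detect nodes whose cached view of system:indexes disagrees
# with the majority of the cluster. Node names and index lists are
# hypothetical sample data standing in for per-node query results.
from collections import Counter

def find_inconsistent_nodes(results_by_node):
    """Return nodes whose index set differs from the majority view."""
    views = {node: frozenset(idx) for node, idx in results_by_node.items()}
    majority, _ = Counter(views.values()).most_common(1)[0]
    return sorted(n for n, v in views.items() if v != majority)

results = {
    "node1": ["#primary"],   # reports a pending primary index
    "node2": [],             # empty result set
    "node3": [],             # empty result set
}
print(find_inconsistent_nodes(results))  # -> ['node1']
```

A non-empty result after restarting the cbq-engine processes would point to real metadata divergence rather than caching.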

I think the underlying reason was a system out of memory error.

The indexer process was using way more memory than I expected.
Despite only allocating 256 MB for the indexer, it was using ~3-3.5 GB of RAM.

In addition, the projectors were using about 1.2 GB of RAM.
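
Absent a sizing formula, the numbers above came from simply summing per-process resident memory. A Python sketch of that measurement (the `ps` invocation in the comment is standard, but the sample output lines are hypothetical):

```python
# Sketch: total resident memory of the GSI processes from ps output.
# In practice ps_output would come from `ps -eo rss,comm`; the sample
# lines below are hypothetical.

def rss_by_process(ps_output, names=("indexer", "projector")):
    """Sum RSS (converted from KB to MB) per process name."""
    totals = {n: 0.0 for n in names}
    for line in ps_output.strip().splitlines():
        rss_kb, comm = line.split(None, 1)
        if comm.strip() in totals:
            totals[comm.strip()] += int(rss_kb) / 1024.0
    return totals

sample = """\
3355443 indexer
1258291 projector
  51200 memcached
"""
print(rss_by_process(sample))
# roughly {'indexer': 3276.8, 'projector': 1228.8}
```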

Is there a good method for predicting how much RAM the projector and indexer processes actually use in practice?
Or is soak testing and profiling the only way?

I found another ticket which points to exactly the same issue occurring in the past:

However, it seemed that was resolved as part of this ticket: https://issues.couchbase.com/browse/MB-12451

Are there any other known memory leaks to look out for? Could it be due to some interaction between the projector and the memcached process (since this only started happening after enabling indexing)?