Node gets killed by OOM killer because indexer memory is exceeded


I have 4 Couchbase nodes running version 5.0.1-5003 Community Edition with the following configuration:

  • 32 GB memory
  • 3 buckets with 2 replicas
  • Data service: 19 GB
  • Index service: 6 GB

Since we started using N1QL, we have had a very serious issue: Couchbase nodes get killed by the OOM killer because the indexer and/or cbq go far beyond the configured memory size.

OOM Killer

Jun  7 07:46:50 node2 kernel: [243379.269217] Out of memory: Kill process 9185 (memcached) score 393 or sacrifice child
Jun  7 07:46:50 node2 kernel: [243379.270200] Killed process 9185 (memcached) total-vm:21640716kB, anon-rss:10934144kB, file-rss:0kB, shmem-rss:4kB
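
For anyone triaging similar kernel messages, here is a minimal sketch of how the "Killed process" line above could be parsed to see which process was killed and how much memory it held. The function name `parse_oom_kill` and the regular expression are illustrative assumptions, not part of any Couchbase or kernel tooling; the pattern only covers the message format shown in this log.

```python
import re

# Assumed pattern, matching the "Killed process" line format pasted above.
KILLED_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"total-vm:(?P<total_vm_kb>\d+)kB, anon-rss:(?P<anon_rss_kb>\d+)kB"
)

def parse_oom_kill(line):
    """Return pid, name and memory figures (kB) from an OOM kill line, or None."""
    m = KILLED_RE.search(line)
    if m is None:
        return None
    return {
        "pid": int(m.group("pid")),
        "name": m.group("name"),
        "total_vm_kb": int(m.group("total_vm_kb")),
        "anon_rss_kb": int(m.group("anon_rss_kb")),
    }

line = (
    "Jun  7 07:46:50 node2 kernel: [243379.270200] Killed process 9185 (memcached) "
    "total-vm:21640716kB, anon-rss:10934144kB, file-rss:0kB, shmem-rss:4kB"
)
info = parse_oom_kill(line)
# Convert kB to GiB for readability.
print(info["name"], round(info["anon_rss_kb"] / 1024**2, 1), "GiB resident")
```

Note that the kernel picks its victim by score, so the killed process is typically the largest one, not necessarily the one whose memory is growing.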

Node 2

1379536 19952 /opt/couchbase/lib/erlang/erts-
1440156 219200 /opt/couchbase/bin/projector
1828192 67160 /tools/dbadm/gdat/jre/Linux_x86_64/bin/java
2506496 659796 /opt/couchbase/lib/erlang/erts-
7344692 607204 /opt/couchbase/lib/erlang/erts-
12111440 4324436 /opt/couchbase/bin/indexer
20407828 15425296 /opt/couchbase/bin/memcached

Node 3

1380052 11216 /opt/couchbase/lib/erlang/erts-
1828192 70644 /tools/dbadm/gdat/jre/Linux_x86_64/bin/java
2310208 591900 /opt/couchbase/lib/erlang/erts-
2603532 475052 /opt/couchbase/bin/projector
7241612 418580 /opt/couchbase/lib/erlang/erts-
15923680 720376 /opt/couchbase/bin/cbq-engine
21214732 15679948 /opt/couchbase/bin/memcached
29247132 5763240 /opt/couchbase/bin/indexer
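
To compare these listings against the configured quotas, the rows can be summed per process. This is only a sketch: it assumes the two numeric columns are virtual size and resident set size in kB (as produced by something like `ps -eo vsz,rss,cmd`), which the listing itself does not state. The helper name `rss_by_process` is made up for illustration.

```python
# Node 3 listing, copied from above. Columns are ASSUMED to be VSZ and RSS in kB.
NODE3 = """
1380052 11216 /opt/couchbase/lib/erlang/erts-
1828192 70644 /tools/dbadm/gdat/jre/Linux_x86_64/bin/java
2310208 591900 /opt/couchbase/lib/erlang/erts-
2603532 475052 /opt/couchbase/bin/projector
7241612 418580 /opt/couchbase/lib/erlang/erts-
15923680 720376 /opt/couchbase/bin/cbq-engine
21214732 15679948 /opt/couchbase/bin/memcached
29247132 5763240 /opt/couchbase/bin/indexer
"""

def rss_by_process(listing):
    """Sum the second column (assumed RSS, kB) per command basename."""
    totals = {}
    for row in listing.strip().splitlines():
        _vsz_kb, rss_kb, cmd = row.split(None, 2)
        name = cmd.split("/")[-1]
        totals[name] = totals.get(name, 0) + int(rss_kb)
    return totals

totals = rss_by_process(NODE3)
# Print the largest consumers first, converted from kB to GiB.
for name, kb in sorted(totals.items(), key=lambda item: -item[1]):
    print(f"{name:12s} {kb / 1024**2:6.2f} GiB")
print(f"{'total':12s} {sum(totals.values()) / 1024**2:6.2f} GiB")
```

Under that column assumption, the first column (virtual size) of the indexer is far larger than its resident size, so both figures are worth tracking over time.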

We were using N1QL indexes before, but they were simple ones and not heavily used. The new index that seems to trigger the excessive memory consumption is the one in this post

crash logs (122.8 KB)

This seems to match issue MB-20178, but that was supposed to be fixed in version 4.5.1.

Hello @tchlyah,
Can you please give details about:
Number of documents?
Documents size (avg size)?
Working set residency required (e.g. 80% of data needs to be resident)?

Typically, if the cluster is undersized and under memory pressure, the OS will invoke the OOM killer, and in Couchbase’s case memcached is usually the biggest memory consumer, so it is the one that gets killed.

We have 3 buckets. The main bucket, where we run N1QL requests, has:

  • ~9 million documents
  • Average size: 4 KB
  • 50K documents of 100 KB

The other 2 each contain ~4 million documents with the same average size.

Every day we reload all the data from CB, so we need 100% of it resident in memory, which is what CB shows in production.

From my perspective, we do not have a very large database, and before the new N1QL requests we didn’t have any issues!

Thanks for giving details about the setup. I appreciate it.

Yes, the number of documents is not large in this case.

I see that you are using the N1QL/Query service. By any chance, do you have a primary index on your index nodes?
Is the Index service running separately on its own node, or is it shared with other services?
Are your current SLAs being met? And what are they?

We don’t recommend using primary indexes on production clusters.

No, we don’t have any primary index! All our requests use indexes created specifically for them.

No, unfortunately we do not have Enterprise Edition yet, and I can’t do anything about it for now. So we can’t separate the index/query services from the data service.

Until now we didn’t have any issues with Couchbase, our SLA is being met.

To me, it must be linked to the newly created index (an array covering index with UNNEST and a condition). Maybe it is too complicated?

I can’t go to production with these issues!

Glad to know that you don’t have a primary index. You can separate individual services: when adding a new server, pick the services you need on that node, then hit rebalance.

The same can be done on an already existing cluster, but in a rolling fashion. Remove a node and rebalance. Then re-add the node, and this time choose the services.

I would need more context for the newly created index.

I already tried that; it doesn’t work with Couchbase Community Edition. I can’t add a node without the kv service.

What context do you need? Everything is in the post I mentioned earlier.

Ok. Thanks for the details. It looks like assigning individual services to the nodes is an EE feature.

I see what is happening. Each node ends up running multiple services, and they are resource intensive even when they are not in use. memcached is getting killed because of that.

At this point, your options with CE are:

  1. Give more resources to the cluster, to offset for other services running on all the nodes.

With EE:

  1. You will be able to assign individual services to the nodes, thus providing resource isolation.
  2. Get better support, with our support org getting access to the logs and analyzing them in a timely manner.
  3. Get timely and thoroughly tested releases with a very quick cadence.

I do want to switch to EE, but this isn’t my decision, and this kind of serious bug doesn’t encourage the business to do so; it even pushes me to look at the competition…

I’ve doubled the RAM of the 4 nodes (64 GB each), and it doesn’t change anything: the hosts continue to swap like crazy, and after tweaking the Linux OOM killer to not kill the memcached process, it’s the indexer and cbq that are being killed.

This is clearly a memory leak! I understand that you offer support only for EE, but that doesn’t mean you should leave CE with serious bugs like this!

With a four node cluster, two replicas (three copies of data) is not optimal. The cluster will keep a working set of three copies of your data in RAM. I would go down to one replica, allocate a larger % of RAM to the cluster and increase RAM allocation to the indexer to see if this alleviates the OOM issue.
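
To make the replica arithmetic concrete, here is a rough sketch using the document counts and the 4 KB average size reported earlier in the thread. It is a back-of-the-envelope estimate only: it ignores per-document metadata, the 50K larger documents, and index memory, and the helper name `dataset_gb` is purely illustrative. The reported averages are approximate, so the ratio between the two cases is more meaningful than the absolute figures.

```python
def dataset_gb(doc_count, avg_doc_kb, replicas):
    """Approximate size of active plus replica copies in decimal GB (1 GB = 1e6 kB)."""
    copies = 1 + replicas  # one active copy plus each replica
    return doc_count * avg_doc_kb * copies / 1e6

# Figures quoted in the thread: one bucket of ~9M docs, two of ~4M docs, ~4 KB each.
BUCKET_DOC_COUNTS = [9_000_000, 4_000_000, 4_000_000]
AVG_DOC_KB = 4

estimates = {}
for replicas in (2, 1):
    estimates[replicas] = sum(
        dataset_gb(count, AVG_DOC_KB, replicas) for count in BUCKET_DOC_COUNTS
    )
    print(f"{replicas} replica(s): ~{estimates[replicas]:.0f} GB across the cluster")

saved = 1 - estimates[1] / estimates[2]
print(f"Going from 2 replicas to 1 keeps about {saved:.0%} less document data")
```

Dropping one replica removes one of three copies, so roughly a third of the document data no longer competes for RAM with the indexer and query services on the same nodes.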