Out of memory, ejection not taking place, cluster useless

Hello. I've given up trying to solve this issue myself, and I couldn't find it described anywhere else.

My use case:
I intended to use Couchbase as a very simple storage for a lot of data. It's my secondary storage; the real load is elsewhere. I have two buckets. One is big, with almost 1 billion documents that are stored once and never edited. The second one is smaller (under 100 million documents) and updates are possible. I have no secondary indices, so it's basically a key-value store, although the documents are structured and have approximately 15 attributes. My aim is a few dozen writes per second 24/7 and occasional reads (far fewer than the writes). The active dataset should be tiny (well under 1%). It's something like a time series, although not 100%. I don't mind cache misses on reads; I'm more concerned about stable writes. I need the cluster to be as maintenance-free as possible - I want to monitor the beast and add a new node when the time comes.
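
The access pattern itself is nothing exotic: plain key-value upserts and lookups by ID. A minimal sketch of what the writes and reads look like, assuming the Python SDK of that era (host, bucket name, key scheme and document fields are made up for illustration):

```python
# Minimal sketch of the access pattern (Python SDK 2.x style).
# Host, bucket name, key scheme and document fields are illustrative only.
from couchbase.bucket import Bucket

bucket = Bucket('couchbase://cb-node-1/big_bucket')

# Write-once documents with roughly 15 flat attributes.
doc_id = 'event::sensor-42::2016-04-12T08:15:00Z'
bucket.upsert(doc_id, {
    'source': 'sensor-42',
    'ts': '2016-04-12T08:15:00Z',
    'value': 17.3,
    # ... about a dozen more attributes
})

# Occasional read by ID; a cache miss here is acceptable.
result = bucket.get(doc_id, quiet=True)
if result.success:
    print(result.value)
```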

Cluster setup:
I have 4 identical VM nodes dedicated to Couchbase only:

  • 4 cores
  • 4 GB RAM
  • 1 TB HDD
  • Ubuntu 14.04.4 LTS
  • nothing unusual

Couchbase server 4.0.0 CE
Data RAM quota: 2048 MB
Index RAM quota: 1024 MB
The Data RAM quota is split evenly between the two buckets (if I understand this right).
Both buckets use full ejection and have 2 replicas.
Both data and index are on the same partition (I know it’s not recommended).

The problem
During the pre-production fill-up the data caused the cluster to become more fragmented than I expected, the cluster completely failed to send e-mail alerts (I believe this bug has been around since at least 2.0, but never mind that now), and the disks filled up. I added another node (I had three originally), ran a rebalance (OK) and a compaction (OK). But during the compaction the cluster began to use more and more RAM, and in the end it reported over 1 GB of overused RAM. I stopped all traffic to the cluster as soon as I started the rebalance. Writes didn't work well during the rebalance anyway (a lot of time-outs), and the cluster accepted virtually no writes during the compaction, but never mind that as well. Since then the cluster has been stuck in severe RAM overuse and it refuses all reads and writes, reporting an out-of-memory error. I tried a series of 1000 writes and 1000 reads separated by 1 second (a rough sketch of the probe is below) - just to nudge the cluster gently into some action. But no ejection took place and the cluster was still as good as dead.
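
For the record, the "nudge" was nothing more than a trivial probe loop along these lines (keys, host and bucket name are placeholders; written against the Python SDK of that era):

```python
# Rough sketch of the probe: 1000 writes and then 1000 reads, one per second,
# just to see whether ejection would kick in. Every attempt failed with an
# out-of-memory error. Keys, host and bucket name are illustrative only.
import time
from couchbase.bucket import Bucket
from couchbase.exceptions import CouchbaseError

bucket = Bucket('couchbase://cb-node-1/big_bucket')

for i in range(1000):
    try:
        bucket.upsert('probe::%d' % i, {'i': i, 'ts': time.time()})
    except CouchbaseError as e:
        print('write %d failed: %s' % (i, e))
    time.sleep(1)

for i in range(1000):
    try:
        print(bucket.get('probe::%d' % i).value)
    except CouchbaseError as e:
        print('read %d failed: %s' % (i, e))
    time.sleep(1)
```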

Something from the OS:

  • memcached process was taking about 2.1 to 2.6 GB of RAM (well over the quota)
  • beam process was taking about 0.5 GB RAM
  • no other significant memory consumers, no significant CPU usage

In the end I restarted the Couchbase service on one of the nodes; the cluster is rebalancing at the moment, and I expect that by restarting all nodes sequentially the RAM will eventually be freed and the cluster will be operational again. I think it's clear that having a completely dead cluster after each compaction is unacceptable. Am I doing something wrong? Is Couchbase able to fulfill my use case?

Advice would be dearly appreciated. I'd like to keep Couchbase because I've already invested a lot of work into it, but unless I solve this issue I will have to use another storage.

EDIT: all queues were empty, there was no traffic, and there were no leftovers of any sort that I would recognize or notice. Everything seemed perfectly calm and all right except for the memory overuse and the fact that the cluster didn't work at all. I have screenshots from the admin interface, but the forum does not let me upload more than one image, so ask away if you want more information.

Hi,

Thank you for using Couchbase and I’m sorry to see the issues you are encountering.
And thank you for the detailed problem description.

Based on what you provided, here are a few of my thoughts:

  1. Because you are not using index/query, I recommend enabling only the data service on all nodes. Correspondingly, the index RAM quota can be set to 0.
  2. It seems that this issue only happens during rebalance. It seems to me that it is caused by high memory pressure on memcached, which results in eviction malfunctioning. We have a sizing guide here: http://developer.couchbase.com/documentation/server/current/install/sizing-general.html (a rough sketch of that arithmetic follows this list). Even though 4 GB is the recommended minimum RAM quota, I would recommend increasing it based on what you observed in the cluster.
  3. When you say you have 2 replicas of data, do you mean 1 master + 1 replica or 1 master + 2 replicas? In most use cases, setting # of replicas = 1 is good enough. Setting # of replicas to 2 will increase RAM pressure.
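
To make the sizing point concrete, here is a rough sketch of the metadata arithmetic from the guide. All numbers below (document count, key size, the ~56-byte per-document overhead, high water mark) are placeholders and general assumptions, not measurements from your cluster:

```python
# Back-of-the-envelope RAM-sizing sketch (assumptions only).
# Every copy of a document (active + replicas) keeps per-document metadata
# plus its key in memory, and the quota is only usable up to the high water
# mark. With full ejection only resident items keep metadata in RAM, so the
# real requirement also depends on the size of the working set.
docs = 1.1e9            # placeholder: ~1B + ~100M documents across both buckets
avg_key_bytes = 30      # placeholder: average key length
meta_per_doc = 56       # approximate per-document metadata overhead
copies = 1 + 2          # 1 active + 2 replicas
high_water_mark = 0.85  # eviction starts around here, so size below it

metadata_bytes = docs * (meta_per_doc + avg_key_bytes) * copies
suggested_quota = metadata_bytes / high_water_mark

print('metadata alone:       %.0f GB' % (metadata_bytes / 1024 ** 3))
print('suggested data quota: %.0f GB' % (suggested_quota / 1024 ** 3))
```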

Thanks,
Qi

Thank you for the reply.

  1. Excellent point. I missed or didn't realise that I don't need the indexing service when documents only need to be looked up by ID. The Community Edition does not allow me to turn the indexing service off, though, and the minimum RAM quota is 256 MB, but it helps anyway.

  2. Well, the sizing guide tells me that I need over 60 GB just for the metadata :slight_smile: Not very useful for my use case. I can't use too many resources for this cluster, so I'll have to keep 4 GB per node.

  3. I was using the Couchbase way of counting the replicas - I meant 1 master + 2 replicas. I seriously considered using only one replica when I was designing the cluster, because it runs on VMs and I don't have to worry about hardware failures, but I chose more safety in the end.

I reconfigured the RAM quotas, lowered the number of replicas to 1, and ran the rebalance again. Hopefully I will have a healthy cluster sometime tomorrow.
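
For completeness, lowering the replica count is just a bucket-settings change; the equivalent REST call should look roughly like this (host, credentials and bucket name are placeholders, and I believe some server versions want other bucket settings re-sent in the same request):

```python
# Sketch: lower a bucket's replica count through the REST API.
# Host, credentials and bucket name are placeholders; a rebalance is still
# needed afterwards for the change to take effect.
import requests

resp = requests.post(
    'http://cb-node-1:8091/pools/default/buckets/big_bucket',
    auth=('Administrator', 'password'),
    data={'replicaNumber': 1},   # 1 active copy + 1 replica
)
resp.raise_for_status()
```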

So far the cluster seems to be all right. I'm wondering what is going to happen after the next compaction.

The cluster became completely stuck and unusable again during the next round of compaction and rebalance. I guess no more Couchbase for me.