Couchbase cluster instability

Hi everyone, we’re facing some issues when trying to deploy a distributed Couchbase cluster.
We want to migrate an old cluster (5.1.1) to a new, stable version (6.6.0). In the new setup we'll split the workload into data and index clusters, as the image below shows, using XDCR to replicate data between the clusters.

We use 4 VMs per cluster, each with:
  • 4GB RAM
  • 4 vCPUs
  • 1 × 150GB SSD for the OS
  • 1 × 100GB extreme SSD for /opt/couchbase/data (to avoid disk concurrency)
  • 1 × 100GB extreme SSD for /opt/couchbase/index (to avoid disk concurrency)

Both clusters (data and index) are in the same VPC region to improve internal connection throughput. We're using a fine-tuning shell script to improve Linux settings (raising the max open files limit, disabling swap, etc.).
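For reference, the tuning script does roughly the following (an illustrative sketch; the exact values are placeholders and vary by environment):

```bash
#!/bin/sh
# Illustrative excerpt of an OS fine-tuning script for Couchbase hosts.

# Disable swap; Couchbase recommends keeping swappiness at 0 or 1.
swapoff -a
sysctl -w vm.swappiness=1

# Disable transparent huge pages, which Couchbase advises against.
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

# Raise the open-file limit for the couchbase user.
cat >> /etc/security/limits.conf <<'EOF'
couchbase soft nofile 70000
couchbase hard nofile 70000
EOF
```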

First of all, we created the buckets on both clusters and started copying items using XDCR. With the data cluster everything looks good, but when we start to create indexes on the index cluster we see strange behavior: after an index is created, a lot of Couchbase nodes go down.
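For context, the bucket creation and replication setup look roughly like this with couchbase-cli (a sketch; hosts, credentials, bucket names and RAM sizes are placeholders):

```bash
# Create the bucket on each cluster.
couchbase-cli bucket-create -c data-node1:8091 -u Administrator -p password \
  --bucket mybucket --bucket-type couchbase --bucket-ramsize 1024

couchbase-cli bucket-create -c index-node1:8091 -u Administrator -p password \
  --bucket mybucket --bucket-type couchbase --bucket-ramsize 1024

# Register the remote cluster on the data cluster...
couchbase-cli xdcr-setup -c data-node1:8091 -u Administrator -p password \
  --create --xdcr-cluster-name index-cluster \
  --xdcr-hostname index-node1:8091 \
  --xdcr-username Administrator --xdcr-password password

# ...and start replicating the bucket to it.
couchbase-cli xdcr-replicate -c data-node1:8091 -u Administrator -p password \
  --create --xdcr-cluster-name index-cluster \
  --xdcr-from-bucket mybucket --xdcr-to-bucket mybucket
```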

Remember that the cluster has this problem without any users or queries hitting it. We tried growing the disks to improve throughput, but every time we grow a disk we see the same behavior.

We suspect that some process is consuming resources even when the cluster is not in use.

Best regards

Diego


[Screenshot: Couchbase topology]

[Screenshot: disk throughput]

Hi @diegopedroso ,

A bit of a tangent, but can you share some background information about why you want separate clusters for DATA and INDEX? I ask because there might be a better way to achieve your goal with a single cluster and Multi-Dimensional Scaling (MDS).

Thanks,
David

Hi David, thanks for your reply.

We use this approach to improve our cluster's reliability; we believe that splitting services between different clusters will improve availability. What further information would you like to know?

Best Regards

Hi Diego,

Disclaimer: I am not a Solutions Engineer, and I’m probably not the right person to address the specific instability issue you’re seeing. That said, here are some questions that might help us understand your design:

When you talk about splitting “services”, are you talking about your own application services, or Couchbase services (like Key/Value, Query, Index, Full-Text Search, etc.)?

Are you using bi-directional XDCR between the “data” and “index” clusters?

What kind of requests do you send to the “data” cluster, and what kind to the “index” cluster?

Are you an Enterprise Edition subscriber? One way to achieve high availability with cluster failover is to use the Multi-Cluster Aware (MCA) Java client available to Enterprise subscribers. (Not trying to up-sell you… just discussing all the possibilities I’m aware of).

Thanks,
David

Hi David,

To answer your first question: yes, we split it this way:
data cluster: data
index cluster: data, index, query and fts

For the second one: yes, we use bi-directional XDCR replication (to avoid high CPU consumption we set XDCR Maximum Processes = 1).
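We set that through the REST API, roughly like this (a sketch; goMaxProcs is the setting behind "XDCR Maximum Processes", but double-check the parameter name against your server version, and the host/credentials are placeholders):

```bash
# Apply goMaxProcs=1 as a global XDCR setting.
curl -s -u Administrator:password -X POST \
  http://data-node1:8091/settings/replications \
  -d goMaxProcs=1
```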

For the third question: basically, we receive queries, GETs and POSTs, through a REST API.

Finally, we don't use Enterprise Edition; the high cost makes it impossible for us, so we use Community Edition instead.

But what I find strange is that even without any requests the cluster behaves strangely; it often goes down while receiving no traffic at all.

Best Regards

Diego

As I said before, I’m not a Solutions Engineer, but I do think it’s worth seriously considering a single cluster with Multi-Dimensional Scaling (MDS). XDCR is great for replication between remote data centers, but it has drawbacks as well. For one thing, XDCR makes it difficult to have strong consistency (“read your own writes”) between clusters; even if you don’t need the consistency now, you might in the future. It’s also possible XDCR is related to the instability issues you’re facing; at the very least, it’s certainly a suspect.

In addition to the documentation link I shared earlier, check out this blog article about MDS. The article is old, but it does a great job of showing the benefits of MDS.

Since availability is your primary concern, one approach would be to take advantage of local replication. If a read fails because the node hosting the active partition is unavailable, your app can read [a potentially stale version] from a replica instead. You could also look into Auto-Failover, which can rapidly remove a failed node from the cluster and promote replica partitions to “active”.
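As a sketch (placeholder hostnames and values; note that changing the replica count requires a rebalance afterwards), that could look like:

```bash
# Keep one replica copy of each partition.
couchbase-cli bucket-edit -c node1:8091 -u Administrator -p password \
  --bucket mybucket --bucket-replica 1

# Automatically fail a node over after it has been unresponsive for 30 seconds.
couchbase-cli setting-autofailover -c node1:8091 -u Administrator -p password \
  --enable-auto-failover 1 --auto-failover-timeout 30

# Apply the replica change.
couchbase-cli rebalance -c node1:8091 -u Administrator -p password
```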

One final note, which might not be applicable to your current situation, but it's at the front of my mind so I'll share it anyway. There's an old deployment pattern that was popular with LDAP (and probably other systems) where you'd write to one server (the source of truth) and read from a set of replicas. This would let you scale up your read performance by adding more replicas. With Couchbase, sending reads and writes to separate servers/clusters is an anti-pattern; the source of truth and the replicas are all present on the same pool of nodes. To scale up your system (or add more replicas, up to the limit of 3), you just add more nodes.
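Scaling out is then just adding a node and rebalancing, something like this (hostnames are placeholders; choosing which services a node runs via --services is the Enterprise MDS feature):

```bash
# Add a new node to the cluster and rebalance partitions onto it.
couchbase-cli server-add -c node1:8091 -u Administrator -p password \
  --server-add node5:8091 \
  --server-add-username Administrator --server-add-password password \
  --services data

couchbase-cli rebalance -c node1:8091 -u Administrator -p password
```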

tldr;

  • Using a single cluster will give you fewer headaches and might even solve the instability problem you’re seeing.
  • Multi-Dimensional Scaling lets you decide what kind of resources to devote to Data Service vs Index/Query/FTS.
  • Local replication can give you high availability at low cost, without the drawbacks of XDCR.

I’ll step aside now and let the real experts take a crack at your problem :wink:

Thanks,
David

Hi David, thank you again for sharing your knowledge. I'll deploy another setup following your tips and let you know if it works.

Best Regards

Diego

Hi David, we're still facing issues when deploying the cluster. As I told you, we use separate disks to avoid read/write contention.
Disks:
1 for the OS, 1 for data and 1 for index.

I noticed on the Google Cloud VM dashboard that we're reaching disk limits: the OS disk handles many operations, while the other disks see little read and write activity.

[Screenshot: Google Cloud disk throughput]

I mounted the data disk at /opt/couchbase/var/lib/couchbase/data/, but I noticed other Couchbase files in the parent directory. Do you think that could be a problem?
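For what it's worth, we point Couchbase at the dedicated disks roughly like this (a sketch; node-init must run before the node joins the cluster, and the paths and credentials are examples):

```bash
# Put the data and index directories on their own mounted disks.
couchbase-cli node-init -c node1:8091 -u Administrator -p password \
  --node-init-data-path /opt/couchbase/data \
  --node-init-index-path /opt/couchbase/index
```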

Regards

Diego

Hi Diego,

I’m afraid I don’t have the expertise to help with this issue, and it’s been a while without somebody else chiming in. If you’re still struggling with this, a good way to get more eyes on it might be to start a new thread focusing on just the disk throughput issue.

Thanks,
David

Hi David, sorry about the delay.

I changed the setup and increased the VM size, using a single cluster as you suggested.
Now we have 4 nodes, each with 16GB RAM and 4 vCPUs, using SSD disks for data and index.
Each node runs the data, index, query and FTS services.

As it is not possible in Community Edition to separate the services onto dedicated nodes, we scaled all 4 nodes to the maximum to increase the performance of the 4 services.
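For reference, the first node of the new cluster is initialized with all four services roughly like this (a sketch; the host, credentials and RAM quotas are placeholders for a 16GB machine):

```bash
couchbase-cli cluster-init -c node1:8091 \
  --cluster-username Administrator --cluster-password password \
  --cluster-ramsize 8192 --cluster-index-ramsize 2048 \
  --cluster-fts-ramsize 1024 \
  --services data,index,query,fts
```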

So far we've had no problems in our tests; XDCR really was consuming a lot of resources. There are still some limitations, such as scaling each service on its own nodes, which is only available in Enterprise Edition.

The setup is well optimized for Community Edition; I believe that to increase performance further, only EE will do.

Thank you

Best Regards

Diego