Is Couchbase really highly available?

I have a problem with high availability.

Here is my setup:
I have 2 Linux servers (192.168.2.91 and 192.168.2.92). Each has the latest Couchbase Server and keepalived installed (keepalived is used for HAProxy, Squid, and Dante; VIP: 192.168.2.90).
Client: a C# application that puts key-value pairs into Couchbase. Its app config is set up to use 192.168.2.91 and 192.168.2.92.

Test 1.
The Couchbase cluster was created using the web wizard: 1 master + replica.

So far everything works: client and server run with no problems. Now, when I shut down server 192.168.2.91, I expect traffic to be routed to 192.168.2.92. It does not happen; the client times out. OK, I restart the client. The client takes about 30 sec to establish a connection (new CouchbaseClient()). Once CouchbaseClient() returns, all calls to put/get data automatically fail. I tried the auto-failover feature, setting it to 30 sec (the minimum allowed). Still no good. In production such a long failover does not make sense anyway. It data is replicated

So, there is no way to get HA there.

Test 2.
There are no replicas, only standalone Couchbase servers. I set up XDCR between the servers, from .91 to .92 and from .92 to .91.
I can put data to .91 and it appears on .92, but when I put to .92 nothing appears on .91.
Then I tried keepalived, to have one IP (.90) through which I connect to either server. That scenario did not work either. Data was not transferred to the other server.

So, where is the high availability here? Am I missing something in the setup?

Thank you,
Dima

Hi Dima, for #1 you can initiate the failover yourself. We enforce a 30s minimum for auto-failover to protect against network hiccups etc. However, your client app can trigger the failover through REST in a shorter amount of time.
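As a rough illustration of initiating failover over REST, here is a minimal Python sketch (the client in this thread is C#, so treat it as an outline of the HTTP call only). It assumes the admin port is 8091, that the failed node is registered under the name ns_1@192.168.2.91 (the actual name can be read from /pools/default on a surviving node), and that Administrator/password are placeholder credentials.

```python
import base64
import urllib.parse
import urllib.request


def build_failover_request(admin_host, failed_node_ip, user, password):
    """Build (but do not send) a failover request for one node."""
    url = f"http://{admin_host}:8091/controller/failOver"
    # Assumption: the node was added by IP, so its name is ns_1@<ip>.
    body = urllib.parse.urlencode({"otpNode": f"ns_1@{failed_node_ip}"}).encode()
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(url, data=body, method="POST")
    req.add_header("Authorization", f"Basic {token}")
    return req


def fail_over(admin_host, failed_node_ip, user, password, timeout=5.0):
    """Send the request to a surviving node and return the HTTP status."""
    req = build_failover_request(admin_host, failed_node_ip, user, password)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


if __name__ == "__main__":
    # Ask the surviving node (.92) to fail over the dead node (.91).
    req = build_failover_request("192.168.2.92", "192.168.2.91",
                                 "Administrator", "password")
    print(req.get_method(), req.full_url, req.data.decode())
```

The request has to go to a node that is still up, and the application needs its own way of deciding that .91 is really down before calling it.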
For #2, you need one replication defined from .91 to .92 and another defined from .92 to .91. Do you have both defined?
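To make "both defined" concrete, the sketch below lists the REST calls that a bidirectional XDCR setup would involve: on each node, one call to register the other cluster and one to create the replication. It only builds the plan and does not send anything. The bucket name "default", the reference names (to_192_168_2_92 and so on), and the credentials are assumptions made for illustration.

```python
def xdcr_requests(source_ip, target_ip, bucket, remote_user, remote_password):
    """Return the (url, form_fields) pairs for one replication direction."""
    # Hypothetical reference name; any unique label works.
    ref_name = "to_" + target_ip.replace(".", "_")
    base = f"http://{source_ip}:8091"
    return [
        # Step 1: tell the source cluster how to reach the remote cluster.
        (f"{base}/pools/default/remoteClusters", {
            "name": ref_name,
            "hostname": f"{target_ip}:8091",
            "username": remote_user,
            "password": remote_password,
        }),
        # Step 2: start a continuous replication of the bucket.
        (f"{base}/controller/createReplication", {
            "fromBucket": bucket,
            "toCluster": ref_name,
            "toBucket": bucket,
            "replicationType": "continuous",
        }),
    ]


def bidirectional_plan(node_a, node_b, bucket, user, password):
    """Both directions: a -> b, then b -> a."""
    return (xdcr_requests(node_a, node_b, bucket, user, password)
            + xdcr_requests(node_b, node_a, bucket, user, password))


if __name__ == "__main__":
    plan = bidirectional_plan("192.168.2.91", "192.168.2.92",
                              "default", "Administrator", "password")
    for url, fields in plan:
        shown = {k: v for k, v in fields.items() if k != "password"}
        print("POST", url, shown)
```

If only the first two calls exist, data written to .91 shows up on .92 but not the other way round, which matches the symptom described in the question.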

1 Answer

Hi, cihan!

Thank you for the reply!

For #1: auto-failover is set to 30 sec. The problem is that the client app uses the Couchbase .NET library and does not know that there is a timeout. No exception of any kind is thrown; the function just takes about 30 sec to return.
After I wait for 2 min (while auto-failover should kick in after 30 sec), I am still not able to connect to .92, even if I explicitly specify only 1 IP (192.168.2.92) for the Couchbase server in app.config. The connection just times out, and any other API call returns immediately.
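One generic way to make such a silent 30 sec wait visible to the application is to run the blocking call under an explicit deadline and raise when it is exceeded. The following is a minimal Python sketch of that idea, not the .NET library's API; call_with_deadline and slow_put are hypothetical names, and slow_put merely stands in for a client call that hangs.

```python
import concurrent.futures
import time

# Shared worker pool for calls that may block.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_deadline(fn, *args, deadline_s=2.0):
    """Run fn(*args) and raise TimeoutError if it takes longer than deadline_s."""
    future = _pool.submit(fn, *args)
    try:
        return future.result(timeout=deadline_s)
    except concurrent.futures.TimeoutError:
        # The worker thread keeps running; we only stop waiting for it.
        raise TimeoutError(
            f"{fn.__name__} did not return within {deadline_s}s") from None


def slow_put(key, value, delay_s):
    """Hypothetical stand-in for a store operation against a dead node."""
    time.sleep(delay_s)
    return True


if __name__ == "__main__":
    print(call_with_deadline(slow_put, "k", "v", 0.0, deadline_s=1.0))
    try:
        call_with_deadline(slow_put, "k", "v", 0.5, deadline_s=0.1)
    except TimeoutError as exc:
        print("failed fast:", exc)
```

The wrapper does not cancel the underlying call, so it only helps the application notice the problem early and react, for example by triggering a failover.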

For #2, as I said above, I set up XDCR between the servers, from .91 to .92 and from .92 to .91, so replication should work both ways. Replication worked once; then I rebooted one server (to simulate a failure), the whole thing broke, and replication was never restored. I tried starting/stopping the Couchbase services, rebooting the servers, and reconfiguring the XDCR settings, with no results.

Couchbase works fine as a single instance or across multiple servers with no failures, but as soon as I start testing HA with simulated node failures, the whole solution becomes unusable.