Handling Exception with getFromReplica and CPU Utilization

Hi~! All!

I’m evaluating HA on a 4-node cluster with 1 replica,
using a test application I wrote in Java.

I create a new client object with “FailureMode.Cancel”
(I believe it is safer than Redistribute or Retry when a specific node fails at this point).
The load is about 4,000 queries per second across 10 threads,
and I send only READ requests to the cluster.

Exception handling (see the sample code): if getting the data from the master fails, get it from the replica.

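try {
    // Normal read from the master (active) copy.
    client.get(userId);
} catch (Exception e) {
    try {
        // Master read failed: fall back to the replica copy.
        client.getFromReplica(userId);
    } catch (Exception e1) {
        // Neither master nor replica answered.
        addErrorCount();
    }
}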
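
For anyone who wants to reproduce the control flow without a cluster, the fallback above can be sketched in a self-contained form. This is only an illustration under stated assumptions: the `Reader` interface and the class name `ReplicaFallbackSketch` are hypothetical stand-ins for the two client calls, not part of the Couchbase SDK, and a failed read is modeled as a thrown exception.

```java
import java.util.concurrent.atomic.AtomicLong;

public class ReplicaFallbackSketch {

    /** Hypothetical stand-in for the two read calls; not the Couchbase SDK API. */
    interface Reader {
        Object get(String key) throws Exception;
        Object getFromReplica(String key) throws Exception;
    }

    private final Reader reader;
    private final AtomicLong errorCount = new AtomicLong();

    ReplicaFallbackSketch(Reader reader) {
        this.reader = reader;
    }

    /** Reads from the master first, then from the replica; returns null if both fail. */
    Object read(String key) {
        try {
            return reader.get(key);
        } catch (Exception masterFailure) {
            try {
                return reader.getFromReplica(key);
            } catch (Exception replicaFailure) {
                // Same role as addErrorCount() in the sample code.
                errorCount.incrementAndGet();
                return null;
            }
        }
    }

    long errors() {
        return errorCount.get();
    }

    public static void main(String[] args) {
        // Simulates the second scenario: master and replica are both unavailable.
        Reader bothDown = new Reader() {
            public Object get(String key) throws Exception {
                throw new Exception("master unavailable");
            }
            public Object getFromReplica(String key) throws Exception {
                throw new Exception("replica unavailable");
            }
        };
        ReplicaFallbackSketch sketch = new ReplicaFallbackSketch(bothDown);
        Object value = sketch.read("user-1");
        System.out.println("value=" + value + " errors=" + sketch.errors());
    }
}
```

With this shape, every failed master read costs one additional replica attempt, so when both copies of a key are unavailable each request pays for two failing calls before the error is counted.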

The HA test scenarios are as follows.

1) 1 node fails - a replica exists
   - force-stop "couchbase server" on one of the nodes.
2) 2 nodes fail - for part of the data, no replica exists
   - force-stop "couchbase server" on two of the nodes.
   - you can imagine this situation as
     2 nodes sitting on a separate network whose switch has a problem for 5~10 seconds

The first scenario (replica exists) shows no problem.
But the second one gives an abnormal result.
As soon as the 2 nodes stopped, CPU usage on the application server soared (see the sar output below).

CPU %user %nice %system %iowait %steal %idle
all 1.37 0.00 0.12 0.00 0.00 98.50
all 1.46 0.00 0.17 0.00 0.00 98.38
all 1.21 0.00 0.04 0.00 0.00 98.75
all 0.50 0.00 0.00 0.00 0.00 99.50
all 22.06 0.00 0.00 0.00 0.00 77.94 <= nodes stopped
all 41.67 0.00 0.00 0.00 0.00 58.33
all 41.67 0.00 0.04 0.00 0.00 58.28
all 41.97 0.00 0.12 0.00 0.00 57.91
all 40.88 0.00 0.00 0.00 0.00 59.12
all 41.67 0.00 0.00 0.00 0.00 58.33
all 33.93 0.00 0.21 0.00 0.00 65.86 <= nodes joined
all 2.25 0.00 0.83 0.00 0.00 96.92
all 2.12 0.00 0.83 0.00 0.00 97.05
all 2.00 0.00 1.21 0.00 0.00 96.79

I expected no problem even if 2 nodes failed, because I set "FailureMode.Cancel". (I thought "FailureMode.Cancel" also affects getFromReplica, so the SDK would not send getFromReplica requests to a failed node.) Of course, for part of the keys I cannot get anything from either the "master" or the "replica", but that doesn't matter.

If FailureMode.Cancel worked for the “getFromReplica” method the way it does for the “get” method, nothing would happen except an error result (and most of the CPU would not be wasted).

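I tested on a 24-core machine (roughly database-server spec) as the benchmark application server.
I think the roughly 41.67% user CPU comes from the number of threads I create (10 threads on 24 cores, and 10/24 is about 41.67%).
This machine has far more resources than a commonly used web application server.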

If the above result is normal, using the “getFromReplica” method could be dangerous whenever more nodes fail than the replica count.

Could you answer my question please?

This is another case related to the getFromReplica method.
Exception handling with getFromReplica makes the client server almost stop as soon as a failover is done in the web UI.
I created about 20 test processes on the same server (not a normal usage pattern), each reading from the replica if the “get” method failed. Each process has 1 client object.
I touched nothing on any of the nodes; I only failed over one of the nodes in the web admin UI.
As soon as I did this, some of the processes stopped doing any work, even after I rebalanced the cluster and the rebalance finished.

Data process

Time            Q/S  Set(ns)  cnt  Get(ns)  cnt  Error  Slow  UID
08/21 15:29:05  829  0        0    411      829  0      0     605666
08/21 15:29:05  0    0        0    0        0    4      0     605666  <= abnormal process
08/21 15:29:05  832  0        0    390      832  0      0     605666

CPU status immediately after failing over a specific node in the web UI:

CPU %user %nice %system %iowait %steal %idle
all 86.09 0.00 1.37 0.00 0.00 12.53
all 95.00 0.00 0.71 0.00 0.00 4.29
all 95.50 0.00 0.71 0.00 0.00 3.79
all 94.21 0.00 0.96 0.00 0.00 4.83
all 99.13 0.00 0.54 0.00 0.00 0.33
all 98.96 0.00 0.62 0.00 0.00 0.42

I think getFromReplica is not stable at this point.
Has anyone had a similar experience with this problem?

Hi,

I’m not sure FailureMode.CANCEL means what you think it does. It is actually only used for memcached buckets; if you use Couchbase buckets, you should not change the FailureMode setting.

Also, can you please raise a JCBC bug ticket so we can investigate and follow up more closely? You can find the bugtracker here: http://www.couchbase.com/issues/browse/JCBC

Thanks,
Michael