Handling Exception with getFromReplica and CPU Utilization

sdchan1 · August 21, 2013, 3:12am

Hi~! All!

I’m evaluating HA with 4 nodes and 1 replica,
using a test application I wrote in Java.

I create a new client object with “FailureMode.Cancel”.
(At this point it seems safer than Redistribute or Retry when a specific node fails.)
The load is about 4,000 queries per second across 10 threads.
I send only READ requests to the cluster.

Exception handling (see the sample code):

if getting the data from the master fails, get it from the replica.

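    try {
        // First try the active (master) copy.
        client.get(userId);
    } catch (Exception e) {
        try {
            // The master read failed, so fall back to a replica.
            client.getFromReplica(userId);
        } catch (Exception e1) {
            // Neither copy could be read; count it as an error.
            addErrorCount();
        }
    }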
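
The same fallback can be written as a small self-contained sketch that returns the value and counts double failures. KeyValueReader, ReplicaFallback, and stub are hypothetical names used here in place of the real client so that the example runs without a cluster; only get and getFromReplica mirror the calls in the snippet above.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical abstraction for illustration; not part of the Couchbase SDK.
interface KeyValueReader {
    Object get(String key) throws Exception;
    Object getFromReplica(String key) throws Exception;
}

final class ReplicaFallback {
    private final KeyValueReader reader;
    // Thread-safe counter, since the test application reads from several threads.
    private final AtomicLong errorCount = new AtomicLong();

    ReplicaFallback(KeyValueReader reader) {
        this.reader = reader;
    }

    /** Returns the active copy, else the replica copy, else null (and counts an error). */
    Object read(String key) {
        try {
            return reader.get(key);
        } catch (Exception masterFailure) {
            try {
                return reader.getFromReplica(key);
            } catch (Exception replicaFailure) {
                errorCount.incrementAndGet();
                return null;
            }
        }
    }

    long errors() {
        return errorCount.get();
    }

    /** Builds a fake reader whose active and replica copies can each be switched off. */
    static KeyValueReader stub(final boolean activeUp, final boolean replicaUp) {
        return new KeyValueReader() {
            @Override
            public Object get(String key) throws Exception {
                if (!activeUp) throw new Exception("active copy unavailable");
                return "active:" + key;
            }

            @Override
            public Object getFromReplica(String key) throws Exception {
                if (!replicaUp) throw new Exception("replica copy unavailable");
                return "replica:" + key;
            }
        };
    }

    public static void main(String[] args) {
        // Simulates a key whose active and replica nodes are both stopped.
        ReplicaFallback bothDown = new ReplicaFallback(stub(false, false));
        System.out.println("value  = " + bothDown.read("user-1"));
        System.out.println("errors = " + bothDown.errors());
    }
}
```

This only shows the control flow. It does not bound how long each call may block or how much CPU a failing call consumes, which is what the scenario below is about.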

The HA test scenarios are as follows.

1) 1 node fails - a replica exists
   - force-stop "couchbase server" on one of the nodes.
2) 2 nodes fail - the replica is partially missing
   - force-stop "couchbase server" on two of the nodes.
   - you can imagine this situation as
     2 nodes sitting on another network whose switch has a problem for 5~10 seconds
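
To make the "partially" in scenario 2 concrete, here is a rough counting sketch in Java. It assumes one replica per key and that every (active node, replica node) pairing is equally likely, which is a simplification and not necessarily how the cluster actually lays out its data; the class and method names are made up for illustration. Under that assumption, with 4 nodes and 2 stopped, about one sixth of the keys would have neither copy reachable, and with 1 node stopped, none.

```java
final class ReplicaCoverage {
    /**
     * Fraction of (active, replica) node pairings in which both copies
     * sit on stopped nodes. Assumes exactly one replica and uniform placement.
     */
    static double fractionUnreadable(int nodes, int failed) {
        int total = 0;
        int lost = 0;
        for (int active = 0; active < nodes; active++) {
            for (int replica = 0; replica < nodes; replica++) {
                if (active == replica) {
                    continue; // a replica never shares a node with its active copy
                }
                total++;
                // Treat nodes 0..failed-1 as the stopped ones.
                if (active < failed && replica < failed) {
                    lost++;
                }
            }
        }
        return (double) lost / total;
    }

    public static void main(String[] args) {
        System.out.println("1 of 4 nodes down: " + fractionUnreadable(4, 1));
        System.out.println("2 of 4 nodes down: " + fractionUnreadable(4, 2));
    }
}
```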

The first scenario (with a replica) has no problem.
But the second one gives an abnormal result.
As soon as the 2 nodes stopped, CPU usage on the application server soared (see the sar output below).


CPU     %user     %nice   %system   %iowait    %steal     %idle
all      1.37      0.00      0.12      0.00      0.00     98.50
all      1.46      0.00      0.17      0.00      0.00     98.38
all      1.21      0.00      0.04      0.00      0.00     98.75
all      0.50      0.00      0.00      0.00      0.00     99.50
all     22.06      0.00      0.00      0.00      0.00     77.94  <= nodes stopped
all     41.67      0.00      0.00      0.00      0.00     58.33
all     41.67      0.00      0.04      0.00      0.00     58.28
all     41.97      0.00      0.12      0.00      0.00     57.91
all     40.88      0.00      0.00      0.00      0.00     59.12
all     41.67      0.00      0.00      0.00      0.00     58.33
all     33.93      0.00      0.21      0.00      0.00     65.86  <= nodes joined
all      2.25      0.00      0.83      0.00      0.00     96.92
all      2.12      0.00      0.83      0.00      0.00     97.05
all      2.00      0.00      1.21      0.00      0.00     96.79

I expected no problem even if 2 nodes failed, because I set "FailureMode.Cancel". (I thought "FailureMode.Cancel" also applied to getFromReplica, so the SDK would not send getFromReplica requests to a failed node.) Of course, for some keys I cannot get anything from either the "master" or the "replica"; that doesn't matter.

If FailureMode.Cancel worked for the getFromReplica method as it does for the get method, nothing would happen except an error result, and most of the CPU would not be wasted.

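I ran the benchmark application on a 24-core server (roughly a DB-class spec).
I think the CPU figure comes from the number of threads I create: 10 fully busy threads on 24 cores is 10/24 ≈ 41.67%, which matches the %user plateau in the sar output.
This server has far more resources than a commonly used web application server.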

If the above result is normal, using the getFromReplica method could be dangerous when more nodes fail than the replica count.

Could you answer my question please?

And this is another case related to the getFromReplica method.
Handling the exception with getFromReplica makes the client server almost stop as soon as I fail over a node in the web UI.
I created about 20 test processes on the same server (not normal usage), each reading from the replica if the “get” method failed. Each process has 1 client object.
I touched nothing on any node; I just failed over one of the nodes in the web admin UI.
As soon as I did this, some of the processes did nothing, even after I rebalanced the cluster and the rebalance finished.

Data process

Time            Q/S  Set(ns)  cnt  Get(ns)  cnt  Error  Slow     UID
08/21 15:29:05  829        0    0      411  829      0     0  605666
08/21 15:29:05    0        0    0        0    0      4     0  605666  <= abnormal process
08/21 15:29:05  832        0    0      390  832      0     0  605666

CPU status as soon as I failed over a specific node in the web UI.

CPU     %user     %nice   %system   %iowait    %steal     %idle
all     86.09      0.00      1.37      0.00      0.00     12.53
all     95.00      0.00      0.71      0.00      0.00      4.29
all     95.50      0.00      0.71      0.00      0.00      3.79
all     94.21      0.00      0.96      0.00      0.00      4.83
all     99.13      0.00      0.54      0.00      0.00      0.33
all     98.96      0.00      0.62      0.00      0.00      0.42

I think getFromReplica is not stable at this point.
Has anyone had a similar experience with this problem?

sdchan1 · August 21, 2013, 6:45am

Handling the exception with getFromReplica makes the client server almost stop as soon as I fail over a node in the web UI.
I created about 20 test processes on the same server (not normal usage), each reading from the replica if the “get” method failed. Each process has 1 client object.
I touched nothing on any node; I just failed over one of the nodes in the web admin UI.
As soon as I did this, some of the processes did nothing, even after I rebalanced the cluster and the rebalance finished.

Data process

Time            Q/S  Set(ns)  cnt  Get(ns)  cnt  Error  Slow     UID
08/21 15:29:05  829        0    0      411  829      0     0  605666
08/21 15:29:05    0        0    0        0    0      4     0  605666  <= abnormal process
08/21 15:29:05  832        0    0      390  832      0     0  605666

CPU status as soon as I failed over a specific node in the web UI.

CPU     %user     %nice   %system   %iowait    %steal     %idle
all     86.09      0.00      1.37      0.00      0.00     12.53
all     95.00      0.00      0.71      0.00      0.00      4.29
all     95.50      0.00      0.71      0.00      0.00      3.79
all     94.21      0.00      0.96      0.00      0.00      4.83
all     99.13      0.00      0.54      0.00      0.00      0.33
all     98.96      0.00      0.62      0.00      0.00      0.42

I think getFromReplica is not stable at this point.
Has anyone had a similar experience with this problem?

daschl · October 8, 2013, 11:54am

Hi,

I’m not sure FailureMode.Cancel means what you think it does. It is actually only used for memcached buckets; if you use Couchbase buckets, you should not mess with the FailureMode setting.

Also, can you please raise a JCBC bug ticket so we can investigate and follow up more closely? You can find the bugtracker here: http://www.couchbase.com/issues/browse/JCBC

Thanks,
Michael
