[JCBC-197] High Couchbase clients apps' CPU after upgraded from Couchbase Client 1.0.1 and spy 2.8.0 to 1.0.3 and 2.8.2. Created: 21/Dec/12  Updated: 21/Dec/12  Resolved: 21/Dec/12

Status: Closed
Project: Couchbase Java Client
Component/s: Core
Affects Version/s: 1.0.3
Fix Version/s: 1.0.3
Security Level: Public

Type: Bug Priority: Major
Reporter: Saran Kumar Assignee: Saran Kumar
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
 Incident #2373

Please see issue we encountered with our latest version below.

This passed QA on stage but when going on to a live environment with full production load we see the behavior below. At first this occurred on a cluster we had just rebalanced. During that rebalance we saw rise in couchbase clients apps' CPU, which did not decrease after the rebalance was (successfully) done, until we restarted said clients. We suspected that the problem below is directly related to the issues we saw during rebalance so we also tested it on a different cluster that did not go any such rebalance. Results were the same. After searching all over to see what changed we realized that one change during this version was that we upgraded from Couchbase Client 1.0.1 and spy 2.8.0 to 1.0.3 and 2.8.2. We then took that exact build swapping the 1.0.3 with the 1.0.1 jars and everything started behaving fine.

The reason we MUST have 1.0.3 on production is the following from 1.0.3's release notes (http://www.couchbase.com/docs/couchbase-sdk-java-1.0/couchbase-sdk-java-rn_1-0-3.html):

It was found that in the dependent spymemcached client library that errors encountered in optimized set operations would not be handled correctly and thus application code would receive unexpected errors during a rebalance. This has been worked around in this release by disabling optimization. This may have a negilgable drop in throughput but shorter latencies.

We believe the issues mentioned above on the clients during the rebalance are exactly this.

1. Any ideas on reason for this?
2. How would you advise to proceed.

Cheers,
Ira

Hi

1. One server is putting data to a memcached bucket. TTL is about 30 minutes.
2. Another server tries to get this data but randomly fails (at about of 50% miss rate). We are getting nulls instead of real values. We are using asyncGet and then Future.get() with timeout of 5 seconds. We did not observe that timeout was reached.
Time period between (1) and (2) is less than a minute. We debugged (1) and saw that it is being written without errors.
No exceptions or errors.
Data cluster wasn't heavy loaded, other clients (1.0.1) were working at the same time with this bucket and operated properly.

Sergey

From: Ira Holtzer
Generated at Mon Sep 01 18:07:17 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.