High cpu load even when Java database client is idle
I am using the Java „Release 1.1 Developer Preview“ library to connect to Couchbase 2.0 preview 3. All is working fine, including view access. However, as soon as I connect to the database, one core on my quad-core Mac with Lion goes up to 100% CPU load.
This is enough to demonstrate it:
import net.spy.memcached.CouchbaseClient;
...
ArrayList<URI> baseURIs = new ArrayList<URI>();
baseURIs.add(new URI("http://localhost:8091/pools"));
couchbaseClient = new CouchbaseClient(baseURIs, "default", "");
Scanner scanner = new Scanner(System.in);
String input = scanner.next();
couchbaseClient.shutdown(10, TimeUnit.SECONDS);
"default" is the only bucket in the server and I am using just one node. After I enter some input, the client shuts down and the CPU load drops back to 0%.
I have added those jars to my build path:
commons-codec-1.6.jar
httpclient-4.1.2.jar
httpcore-4.1.4.jar
httpcore-nio-4.1.4.jar
jettison-1.3.1.jar
netty-3.2.7.Final.jar
spymemcached-2.8-preview3.jar
My problem looks similar to this one here:
http://www.couchbase.com/forums/thread/using-couchbaseclient-connect
Any solution for this?
Stefan
PS: A small addition, as this might be related. Suddenly the server also shows 100-150% CPU load (beam.smp) right from the start, even without any client access at all. Is there already a schedule for a new developer preview release?
Great, thank you. Do you already have a release schedule when the update will be available?
Is there also a fix for the server itself using 100% CPU of one core? This seems unrelated, as it also happens before any client connects to the server.
Stefan
From what I can tell from the bug report, this is unresolved, and I just hit it with one client LOL.
Apparently this is the offending stack trace, although I am only going by a quick look at the thread dump:
Thread 10100: (state = IN_JAVA)
- com.couchbase.client.ViewConnection.handleIO() @bci=33, line=142 (Compiled frame; information may be imprecise)
- com.couchbase.client.ViewConnection.run() @bci=15, line=253 (Compiled frame)
My main thread is trying to do an upsert: it gets a record and modifies it.
Line 302 is the following, where hs is the string value:
rv = client.cas(key, s.getCas(), hs);
and s was obtained as:
CASValue s = client.gets(key);
Thread 10093: (state = BLOCKED)
- sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame)
- java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object, long) @bci=20, line=226 (Interpreted frame)
- java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(int, long) @bci=122, line=1033 (Interpreted frame)
- java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(int, long) @bci=25, line=1326 (Interpreted frame)
- java.util.concurrent.CountDownLatch.await(long, java.util.concurrent.TimeUnit) @bci=10, line=282 (Interpreted frame)
- net.spy.memcached.internal.OperationFuture.get(long, java.util.concurrent.TimeUnit) @bci=6, line=87 (Interpreted frame)
- net.spy.memcached.MemcachedClient.cas(java.lang.String, long, int, java.lang.Object, net.spy.memcached.transcoders.Transcoder) @bci=19, line=559 (Interpreted frame)
- net.spy.memcached.MemcachedClient.cas(java.lang.String, long, java.lang.Object, net.spy.memcached.transcoders.Transcoder) @bci=8, line=539 (Interpreted frame)
- net.spy.memcached.MemcachedClient.cas(java.lang.String, long, java.lang.Object) @bci=9, line=583 (Interpreted frame)
- com.phluant.third.couch.Main.rmw(com.couchbase.client.CouchbaseClient, java.lang.String, java.util.Map, java.lang.String[], long[]) @bci=174, line=302 (Interpreted frame)
- com.phluant.third.couch.Main$1.run() @bci=78, line=243 (Interpreted frame)
- java.lang.Thread.run() @bci=11, line=722 (Interpreted frame)
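For readers unfamiliar with the gets/cas pattern mentioned above, here is a minimal sketch of a CAS-based upsert with retry. It is only an illustration: InMemoryStore, Versioned and upsert are hypothetical names, and the store is an in-memory stand-in written for this example, not the Couchbase or spymemcached API.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.UnaryOperator;

final class CasUpsertSketch {

    /** A value together with its version token (plays the role of a CAS id). */
    static final class Versioned {
        final long cas;
        final String value;
        Versioned(long cas, String value) { this.cas = cas; this.value = value; }
    }

    /** Hypothetical in-memory stand-in for the server; NOT the real client API. */
    static final class InMemoryStore {
        private final ConcurrentHashMap<String, Versioned> data = new ConcurrentHashMap<>();
        private final AtomicLong nextCas = new AtomicLong(1);

        /** Returns the current value and version, or null if the key is absent. */
        Versioned gets(String key) { return data.get(key); }

        /** Stores the value only if the key does not exist yet. */
        boolean add(String key, String value) {
            return data.putIfAbsent(key, new Versioned(nextCas.getAndIncrement(), value)) == null;
        }

        /** Replaces the value only if the stored version still equals expectedCas. */
        boolean cas(String key, long expectedCas, String value) {
            Versioned current = data.get(key);
            if (current == null || current.cas != expectedCas) return false;
            return data.replace(key, current, new Versioned(nextCas.getAndIncrement(), value));
        }
    }

    /** Read-modify-write: retries when another writer changed the record in between. */
    static String upsert(InMemoryStore store, String key, String initial,
                         UnaryOperator<String> modify, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            Versioned seen = store.gets(key);
            if (seen == null) {
                if (store.add(key, initial)) return initial;
                continue; // somebody else created the key first; re-read it
            }
            String updated = modify.apply(seen.value);
            if (store.cas(key, seen.cas, updated)) return updated;
            // version mismatch: loop around, re-read and try again
        }
        throw new IllegalStateException("gave up after " + maxAttempts + " attempts");
    }

    public static void main(String[] args) {
        InMemoryStore store = new InMemoryStore();
        UnaryOperator<String> increment = v -> String.valueOf(Long.parseLong(v) + 1);
        upsert(store, "hits", "1", increment, 10); // key absent: inserts the initial value
        upsert(store, "hits", "1", increment, 10); // key present: modifies via cas
        System.out.println("hits = " + store.gets("hits").value);
    }
}
```

With the real client the blocking happens inside the future returned by cas, which is what the BLOCKED frames in the dump above show; the retry logic itself would look similar.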
I just went ahead and patched the code. I have no idea how this is supposed to work, but it is not
hard to see how a tight loop could develop. The patch seems to work: I ran 10 threads OK, and the
transactions seem right, with no obvious loss of performance (50 ms for a CAS-based upsert to a
remote server) but a significant CPU reduction. I'm sure that if you have any idea what the code
does you can do better with yield or wait etc.; my change is just the sleep and the did_something flag, which turned out to be useless.
com/couchbase/client/ViewConnection.java
public void handleIO() {
  boolean did_something = false;
  for (ViewNode node : couchNodes) {
    node.doWrites();
    did_something = true;
  }
  for (ViewNode qa : nodesToShutdown) {
    nodesToShutdown.remove(qa);
    Collection notCompletedOperations = qa.destroyWriteQueue();
    try {
      qa.shutdown();
    } catch (IOException e) {
      getLogger().error("Error shutting down connection to "
          + qa.getSocketAddress());
    }
    redistributeOperations(notCompletedOperations);
    did_something = true;
  }
  //System.out.println(" mike couch");
  //if (!did_something)
  {
    try { Thread.sleep(1); } catch (Exception e) {}
  }
}
Sleeping when nothing happens on the view IO thread doesn't seem like the right thing to do to me. Your post did, however, point me to what I think is the real problem. When the operation queue is empty we should be blocking until an operation becomes available. I just submitted a change to fix the issue, but haven't tested it yet. The change is here if you're interested:
http://review.couchbase.org/#change,14959
Also, I appreciate the code attached to your post. It provided a great hint to the root of the problem.
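To illustrate the idea of blocking until an operation becomes available, here is a minimal sketch using a standard java.util.concurrent.BlockingQueue. It is not the change submitted for review; the class name, the thread name and the use of Runnable as the operation type are assumptions made for the example.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class BlockingIoLoopSketch {

    /** Starts a worker that parks in take() while idle; returns how many operations ran. */
    static int runDemo(int operations) {
        BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
        AtomicInteger handled = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(operations);

        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    // take() blocks while the queue is empty, so an idle loop burns no CPU
                    Runnable op = queue.take();
                    op.run();
                }
            } catch (InterruptedException e) {
                // interruption is used as the shutdown signal; restore the flag and exit
                Thread.currentThread().interrupt();
            }
        }, "view-io-sketch");
        worker.start();

        for (int i = 0; i < operations; i++) {
            queue.add(() -> {
                handled.incrementAndGet();
                done.countDown();
            });
        }

        try {
            done.await(5, TimeUnit.SECONDS); // wait until the worker has drained the queue
            worker.interrupt();              // ask the worker to stop
            worker.join(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return handled.get();
    }

    public static void main(String[] args) {
        System.out.println("handled " + runDemo(3) + " operations");
    }
}
```

Compared with the sleep-based patch, the worker wakes up only when there is work, so there is neither a busy loop nor an added delay per operation.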
Yeah, that sounds right, but I had no idea what to wait for and the sleep made it work for now. I guess I could have looked into the methods called from the loop, but I thought someone might know offhand.
No problem. I actually suspected something more complex was the issue, which is why I hadn't taken the time to look at it. Your code made the problem very obvious to me and I really appreciate it.
Thanks, I probably could have spent a few minutes trying to find that but glad it helped.
I guess there were only a few calls in that loop, but figuring out a wait/notify
would still have taken a lot of effort, being unfamiliar with the code.
It worked for me LOL.
btw, take() throws InterruptedException, and you probably do want to poll, although that
would require more code changes. I'm looking at threading models more generally, and my first case has many clients, so I may dig into this, although it is not high priority for us right now.
http://docs.oracle.com/javase/6/docs/api/java/util/concurrent/BlockingQueue.html#take()
Why wouldn't we want to use take() just because it throws an exception? In my code I caught the exception and logged that we were interrupted. We might want to do a little bit more than just log the exception, but I don't see how take() is not the best solution. If you have different thoughts on this I am interested in hearing them.
The exception wasn't the issue; take() seems fine if you have one thread per client. I was thinking more generally about a polling / selector scheme here, but again I'm not really sure what the load on this thread would be in real life. Still just looking.
Thanks.
That thread is just responsible for moving view requests to a per-server connection pool and taking care of topology changes in the cluster. I wrote all of this code a while ago, and I'm sure there are many ways we could improve the performance. If you see anything that seems like it can be improved or have any other questions, let me know. We also have a code review system set up at review.couchbase.com if you make any changes and want to contribute them.
Well, thanks, but it would be quite a substantial change, and without knowing how the IO and threads interact it would take some effort to design well. Generally, though, any blocking call would seem to be suspect.
I just happen to be thinking about the concept in general, as we have a netty and thread-per-page front end
on our server right now and are trying to find bottlenecks etc. For Couchbase, my interest was
doing fast upserts, and I would be more interested in extending the incr method to take a vector of
longs instead of just one value.
FYI,
There is a JIRA open for this at: http://www.couchbase.com/issues/browse/JCBC-26
I have attached a patch to the JIRA as well as a pull request on github (https://github.com/couchbase/couchbase-java-client/pull/1).
This fixed it for me.
Well, what do you do when it times out? One thread babysitting one client could, I guess, call take() and throw if interrupted. I used sleep because I was too lazy to dig down one layer. poll() in theory makes a lot of sense if you are using a thread shared with other stuff.
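As a rough illustration of the poll() alternative discussed here, the sketch below uses BlockingQueue.poll with a timeout so the thread can also do other work between operations. All names, the 50 ms timeout and the idle counter are assumptions for the example, not code from the client.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

final class PollingIoLoopSketch {
    final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
    final AtomicBoolean running = new AtomicBoolean(true);
    final AtomicInteger handled = new AtomicInteger();
    final AtomicInteger idleWakeups = new AtomicInteger();

    /** Loop body for a thread that is shared with other periodic work. */
    void loop() {
        while (running.get()) {
            try {
                // waits up to 50 ms for an operation, then returns null on timeout
                Runnable op = queue.poll(50, TimeUnit.MILLISECONDS);
                if (op == null) {
                    // timed out: nothing to do, so this is where housekeeping could run
                    idleWakeups.incrementAndGet();
                    continue;
                }
                op.run();
                handled.incrementAndGet();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt visible to callers
                return;
            }
        }
    }

    /** Submits the given number of no-op operations and returns how many were handled. */
    static int runDemo(int operations) {
        PollingIoLoopSketch sketch = new PollingIoLoopSketch();
        CountDownLatch done = new CountDownLatch(operations);
        Thread worker = new Thread(sketch::loop, "poll-sketch");
        worker.start();
        for (int i = 0; i < operations; i++) {
            sketch.queue.add(done::countDown);
        }
        try {
            done.await(5, TimeUnit.SECONDS);
            sketch.running.set(false); // the loop notices this after at most one timeout
            worker.join(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return sketch.handled.get();
    }

    public static void main(String[] args) {
        System.out.println("handled " + runDemo(2) + " operations");
    }
}
```

A timeout here is not an error: it simply means the queue was empty for that interval, so the loop goes around again. The trade-off against take() is a small, bounded number of idle wakeups in exchange for being able to check a shutdown flag or do other work without interrupting the thread.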
I've posted a possible fix for the 100% cpu issue here:
http://www.couchbase.com/issues/browse/JCBC-20
http://www.couchbase.com/issues/browse/JCBC-26
We're planning to get a DP2 later this week. I've tossed together a build with some of the latest features and these fixes (hasn't been fully tested, some code still in review) here:
http://dl.dropbox.com/u/1537838/CouchbaseJavaObserve.zip
If anyone wants to give it a shot and pass along feedback, that'd be greatly appreciated.
Is it possible to update DP2 into the maven repository? I'm experiencing 100% CPU using the 1.1-dp release and would like to test the 1.1-dp2. According to this - the issue should be resolved.
Our java developers are putting the final touches on a 1.2-dp release and are aiming to release it early this week so it could be finished as soon as Monday.
this is great news. looking forward. my Mac is constantly on 100% CPU - at this state I can never move to production...
The fix for this is in the 1.1-dp2, now posted: couchbase.com/develop/java/next
I hope that helps, and please give us more feedback!
Upgrading to 1.1-dp2 seems to solve the CPU problem. However, dp2 cannot find the views. After I upgraded to dp2, I can no longer use the existing views.
CouchbaseClient.getViews(); shows 0 views.
To be sure, I downgraded to 1.1-dp. The views are back, along with the good old 100% CPU utilization problem.
I suspect I know what the views issue is. Are you using a bucket with authentication by chance?
If so, the best thing to do would be to upgrade to build 1495 or later (see http://www.couchbase.com/downloads-all). Long story short, there were some authentication changes, and thus 1.1-dp2 must be used with build 1495 and later if you're using a bucket with authentication.
I can confirm that 1.1-dp2 solved the high CPU problem. In regards to views, I don't see any problem, but I'm using the default bucket.
Hello,
Yes, absolutely. I am using authentication.
Thank you for the solution. I installed build 1495 and retried, and the problem seems resolved, but unfortunately 1495 and 1554 are not stable enough ): (I know they are not meant to be stable (: )
I tried both of them and they failed (processes crashed) after I modified and re-published a view to production.
I am hoping for a somewhat more stable build.
Cheers,
First off, I assume you're using the Spymemcached 2.8 developer preview (probably preview 3). We are aware of this issue and are working on a fix for it. It will be available in the next Java developer preview release. Also, we have split Spymemcached into Spymemcached and Couchbase Client. Couchbase Client will contain all of the Couchbase code, and Spymemcached will be specifically for memcached servers. Couchbase Client 1.1-preview should contain the fix for this bug, so when we release it you should upgrade to that version.
You may track this issue here:
http://www.couchbase.com/issues/browse/SPY-64