problems during heavy load

Thu, 07/07/2011 - 11:14
a.reiter
Offline
Joined: 10/13/2010
Groups: None

Hi everybody,
we are using this version of spymemcached: 2.7-2-gbd6e366 http://www.couchbase.org/wiki/display/membase/prerelease+spymemcached+vb...
There are 3 servers in the cluster, version 1.7.0,
each with 16 GB RAM (the quota for Membase is 12 GB) and an AMD Opteron(tm) Processor 250.
Bucket type: Membase

Everything works fine, but under heavy load, i.e. about 140 ops per second, everything goes crazy...
The clients get masses of errors like:
- I'm not responsible for this vbucket (7)
- Exception waiting for value
- Operation canceled because authentication or reconnection and authentication has taken more than one second to complete
- Timed out waiting for operation - failing node: xyz
- Reconnecting due to exception on {QA sa=xyz:11210, #Rops=0, #Wops=0, #iq=0, topRop=null, topWop=null, toWrite=0, interested=4}
java.lang.ClassCastException: net.spy.memcached.MemcachedClient$6 cannot be cast to net.spy.memcached.ops.GetOperation$Callback
- WARNUNG: Closing, and reopening {QA sa=xyz:11210, #Rops=0, #Wops=0, #iq=0, topRop=null, topWop=null, toWrite=0, interested=4}, attempt 0.
- Discarding partially completed op: net.spy.memcached.protocol.binary.StoreOperationImpl@74a5eac1
- etc

That's very frustrating...
Any suggestions?

Thu, 07/07/2011 - 16:27
perry
Offline
Joined: 10/11/2010
Groups:

With all due respect, 140 ops/sec would not be considered high load ;-)

Obviously something is not working properly, let's try and figure out which side it's on (client or server).

When these exceptions start happening, do you see any log messages in the Membase logs indicating that something crashed or restarted?

__________________

Forum support is great for free but sometimes you need a guaranteed response time and dedicated resources for your questions or issues.
Consider purchasing enterprise-level support from Couchbase: http://www.couchbase.com/products-and-services/overview
Call or email "sales -at- couchbase-dot- com" today!

Thu, 07/07/2011 - 16:34
a.reiter
Offline
Joined: 10/13/2010
Groups: None

Hi Perry,
thanks a lot for the reply. Indeed, 140 ops/sec is not much; it's just a relatively high load compared to our earlier requests... :-)

Unfortunately there is nothing on the Membase server that I can see, no log entries...
BUT I have found these messages in the "dmesg" output on one of the three Membase servers.
We are running Linux version 2.6.32-5-amd64 (Debian 2.6.32-34squeeze1).

dmesg excerpt:

[654780.036028] Northbridge Error, node 0
[654780.036087] K8 ECC error.
[654780.036133] EDAC amd64 MC0: CE ERROR_ADDRESS= 0x1f6c69008
[654780.036195] EDAC MC0: CE page 0x1f6c69, offset 0x8, grain 0, syndrome 0x1, row 3, channel 1, label "": amd64_edac

There are a lot of such entries in dmesg, so it seems that one of the RAM modules or the northbridge is failing.
In any case there is a hardware error. I have to check this first; maybe that is the reason for our problems...

best regards
andre

Thu, 07/07/2011 - 16:54
perry
Offline
Joined: 10/11/2010
Groups:

Certainly wouldn't be helping things :-)

Now that it seems like the Membase servers themselves are working (at least the software is), we can focus on the client side a bit. Would you be able to gather a packet capture when you start seeing these errors? I'm not a Java expert, but one thing I do know is that it's possible to overload the heap on the client side...are you properly checking for return values and ensuring that the client is functioning properly?

Thanks

Perry
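
One way to make "checking return values" concrete: spymemcached's asynchronous operations hand back `java.util.concurrent.Future` objects, so the caller can bound the wait and treat a timeout, failure, or cancellation as a cache miss instead of letting requests pile up. Below is a minimal sketch using only JDK types. `SafeFuture` and `getOrDefault` are hypothetical names, not part of spymemcached, and the `Future` passed in stands in for whatever the client returns.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of spymemcached.
final class SafeFuture {
    private SafeFuture() {}

    /**
     * Waits up to timeoutMillis for the result. On timeout, failure,
     * cancellation, or interruption the future is cancelled and the
     * fallback is returned. A null result also yields the fallback.
     */
    static <T> T getOrDefault(Future<T> future, long timeoutMillis, T fallback) {
        try {
            T value = future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return value != null ? value : fallback;
        } catch (TimeoutException | ExecutionException | CancellationException e) {
            future.cancel(false); // do not leave the operation queued
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            future.cancel(false);
            return fallback;
        }
    }
}
```

Counting how often the fallback path is taken gives a simple signal of whether the client is keeping up.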

Thu, 07/07/2011 - 23:40
a.reiter
Offline
Joined: 10/13/2010
Groups: None

Usually one sees an exception if the heap is exhausted, something like "OutOfMemoryError: Java heap space",
but there are no such errors on our clients.
Just in case, I increased the max heap size with "-Xmx2000m"; I will post the results after the next load peak...

thanks
andre
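
To confirm from inside the client that the heap is not the limiting factor, the JVM's own counters can be logged next to the spymemcached warnings. Here is a small sketch using only `java.lang.Runtime`; `HeapSnapshot` is a hypothetical name chosen for illustration.

```java
// Hypothetical helper for logging heap headroom; uses only java.lang.Runtime.
final class HeapSnapshot {
    final long usedBytes;
    final long maxBytes;

    private HeapSnapshot(long usedBytes, long maxBytes) {
        this.usedBytes = usedBytes;
        this.maxBytes = maxBytes;
    }

    /** Captures current heap usage and the configured maximum (-Xmx). */
    static HeapSnapshot take() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return new HeapSnapshot(used, rt.maxMemory());
    }

    /** Fraction of the maximum heap currently in use. */
    double usedFraction() {
        return (double) usedBytes / (double) maxBytes;
    }

    @Override
    public String toString() {
        long mb = 1024L * 1024L;
        return String.format("heap used=%d MB, max=%d MB (%.0f%%)",
                usedBytes / mb, maxBytes / mb, usedFraction() * 100.0);
    }
}
```

Logging `HeapSnapshot.take()` whenever a reconnect warning appears shows whether usage is anywhere near the `-Xmx` limit at that moment.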

Fri, 07/08/2011 - 01:00
a.reiter
Offline
Joined: 10/13/2010
Groups: None

still getting warnings like

net.spy.memcached.protocol.binary.MultiGetOperationImpl finishedPayload
WARNUNG: Error on key xyz: I'm not responsible for this vbucket (7)

Just for my understanding...
Sometimes I try to get a key and get no value from the cache, and the warning is shown that the server is not responsible for the vbucket.

One minute later exactly the same code works: a value is returned without any warnings. WTF??? :-(

Why? There is not much load on the Membase cluster. Where do these messages come from? Why is there sometimes no value behind the key?
The Membase logs do not contain anything...

Fri, 07/08/2011 - 02:56
a.reiter
Offline
Joined: 10/13/2010
Groups: None

Indeed, it seems to be an issue on the client side.
Under "heavy load" (150 ops/sec, I know, it's not much) we see these errors on our client:

08.07.2011 11:14:52 net.spy.memcached.protocol.TCPMemcachedNodeImpl setupResend
WARNUNG: Discarding partially completed op: net.spy.memcached.protocol.binary.StoreOperationImpl@5b7a70ec
08.07.2011 11:14:52 net.spy.memcached.protocol.TCPMemcachedNodeImpl setupResend
WARNUNG: Discarding partially completed op: net.spy.memcached.protocol.binary.GetOperationImpl@438bc4df
08.07.2011 11:14:52 net.spy.memcached.protocol.TCPMemcachedNodeImpl setupResend
WARNUNG: Discarding partially completed op: net.spy.memcached.protocol.binary.GetOperationImpl@d2a2f1e
08.07.2011 11:14:52 net.spy.memcached.MemcachedConnection handleIO
INFO: Reconnecting due to exception on {QA sa=mb1/172.30.0.42:11210, #Rops=0, #Wops=0, #iq=0, topRop=null, topWop=null, toWrite=0, interested=4}
java.lang.ClassCastException: net.spy.memcached.MemcachedClient$6 cannot be cast to net.spy.memcached.ops.GetOperation$Callback
	at net.spy.memcached.protocol.ProxyCallback.addCallbacks(ProxyCallback.java:25)
	at net.spy.memcached.protocol.binary.OptimizedGetImpl.addOperation(OptimizedGetImpl.java:28)
	at net.spy.memcached.protocol.binary.OptimizedGetImpl.<init>(OptimizedGetImpl.java:21)
	at net.spy.memcached.protocol.binary.BinaryMemcachedNodeImpl.optimizeGets(BinaryMemcachedNodeImpl.java:46)
	at net.spy.memcached.protocol.binary.BinaryMemcachedNodeImpl.optimize(BinaryMemcachedNodeImpl.java:35)
	at net.spy.memcached.protocol.TCPMemcachedNodeImpl.fillWriteBuffer(TCPMemcachedNodeImpl.java:196)
	at net.spy.memcached.MemcachedConnection.handleWrites(MemcachedConnection.java:468)
	at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:430)
	at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:280)
	at net.spy.memcached.MemcachedClient.run(MemcachedClient.java:2063)

This triggers a reconnect, and during the reconnect all requests are cancelled.
This scenario repeats periodically, so we see a lot of errors on the client side...
How is it possible that there is a ClassCastException???
Does anybody have a suggestion?
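
On how a ClassCastException can arise here: the trace shows the cast happening in `ProxyCallback.addCallbacks`, reached from `OptimizedGetImpl`, i.e. while the client merges queued gets into one request. A cast like that fails at runtime whenever a queued operation carries a callback of a different type than the merging code expects. The sketch below reproduces that failure mode with hypothetical stand-in types (`GetCallback`, `CasGetCallback`, `CallbackMerge`). It illustrates the mechanism only and is not the spymemcached source or its actual fix.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for two unrelated callback interfaces.
interface GetCallback { void gotData(String key, byte[] data); }
interface CasGetCallback { void gotData(String key, long cas, byte[] data); }

final class CallbackMerge {
    private CallbackMerge() {}

    /** Assumes every queued callback is a GetCallback; throws ClassCastException otherwise. */
    static List<GetCallback> mergeUnchecked(List<Object> queued) {
        List<GetCallback> merged = new ArrayList<>();
        for (Object cb : queued) {
            merged.add((GetCallback) cb); // blind cast, fails for any other callback type
        }
        return merged;
    }

    /** Merges only callbacks of the expected type; everything else goes to leftOver. */
    static List<GetCallback> mergeChecked(List<Object> queued, List<Object> leftOver) {
        List<GetCallback> merged = new ArrayList<>();
        for (Object cb : queued) {
            if (cb instanceof GetCallback) {
                merged.add((GetCallback) cb);
            } else {
                leftOver.add(cb); // handle individually instead of merging
            }
        }
        return merged;
    }
}
```

Because the exception is thrown on the I/O thread while it fills the write buffer, it would surface as a connection-level error and a reconnect rather than as a failure of one call, which matches the log above.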

Fri, 07/08/2011 - 04:05
a.reiter
Offline
Joined: 10/13/2010
Groups: None

Strange...

The errors occurred while we were using the "default" bucket; now we have created a new one.
Using this new bucket seems to work very stably, without any errors... Is that strange?

Edit:
OK, the errors are still there, even with the new non-default bucket...

Fri, 07/08/2011 - 12:14
ingenthr
Offline
Joined: 03/16/2010
Groups:

Are you doing a getBulk by chance? A bug was recently found where getBulk selected the wrong nodes. We should have a fix shortly and can supply a build for you to verify.

Other questions:
- How are you initializing your client? Can you post the constructor in use?
- Is there any chance that you have high latency links between your clients and servers?
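
For context on the "not responsible for this vbucket" message: a Membase client hashes each key to a vbucket ID and looks that ID up in a vbucket-to-server map taken from the cluster configuration. A node that receives a request for a vbucket it does not own rejects it with this error. A rough sketch of that lookup follows. The CRC32-based formula is what libvbucket-style clients are commonly described as using, but treat the exact hash, the vbucket count, and the names `VBucketLocator`, `vbucketFor`, and `serverFor` as assumptions for illustration rather than the spymemcached implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Illustrative only: hash formula and map layout are assumptions.
final class VBucketLocator {
    private final String[] servers;
    private final int[] vbucketToServer; // index: vbucket id, value: index into servers

    VBucketLocator(String[] servers, int[] vbucketToServer) {
        // The vbucket count must be a power of two so it can be used as a bit mask.
        if (Integer.bitCount(vbucketToServer.length) != 1) {
            throw new IllegalArgumentException("vbucket count must be a power of two");
        }
        this.servers = servers.clone();
        this.vbucketToServer = vbucketToServer.clone();
    }

    /** Maps a key to a vbucket id in [0, vbucket count). */
    int vbucketFor(String key) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        long hash = (crc.getValue() >> 16) & 0x7fff;
        return (int) (hash & (vbucketToServer.length - 1));
    }

    /** Returns the server that owns the key's vbucket according to the local map. */
    String serverFor(String key) {
        return servers[vbucketToServer[vbucketFor(key)]];
    }
}
```

If the client's copy of the map is stale, or a request is routed with the wrong map entry, the key is sent to a node that does not own its vbucket, and that node answers with the error instead of a value.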

Fri, 07/08/2011 - 12:46
a.reiter
Offline
Joined: 10/13/2010
Groups: None

Hi ingenthr,
thanks for the reply.

Here is the code we use to construct the MemcachedClient:

ArrayList<URI> baseURIs = new ArrayList<URI>();
// these are the three servers of our Membase cluster
String[] membaseServer = new String[] {"http://membase1:8091/pools", "http://membase2:8091/pools", "http://membase3:8091/pools"};
for (String m : membaseServer) {
  baseURIs.add(new URI(m));
}
MemcachedClient mcConfig = new MemcachedClient(baseURIs, "bucket1", "");

We are not using the "getBulk" method.

The 3 servers membase(1-3) are also our three web servers behind a load balancer, so they are the memcached clients at the same time.

All servers are connected directly via a switch at 1000 Mbps full duplex.

Fri, 07/08/2011 - 13:02
ingenthr
Offline
Joined: 03/16/2010
Groups:

Hm. Can you perhaps describe which methods you do use, and the workload a bit? The specific fix for getBulk() can affect other operations.

Also, please confirm that the cluster is healthy in the admin Web UI.

One other thing just to confirm, grab the output of /pools/default/buckets/bucket1 from each of the three nodes, send it through a JSON formatter, and then diff them. They should be exactly the same in terms of configuration, but may have minor differences in stats at any given point in time. There's no reason for these to be different, but the client implicitly trusts the first node to respond and there could be an issue at the cluster level.
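
The comparison described above can also be scripted once each node's output has been saved and pretty-printed with one field per line. This is a minimal sketch using only the JDK. `ConfigDiff` and `differingLines` are hypothetical names, and the positional comparison assumes the formatted files list their fields in the same order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: positional, line-by-line comparison of two formatted files.
final class ConfigDiff {
    private ConfigDiff() {}

    /** Returns one description per line position where the two inputs differ. */
    static List<String> differingLines(List<String> a, List<String> b) {
        List<String> diffs = new ArrayList<>();
        int max = Math.max(a.size(), b.size());
        for (int i = 0; i < max; i++) {
            String left = i < a.size() ? a.get(i).trim() : "<missing>";
            String right = i < b.size() ? b.get(i).trim() : "<missing>";
            if (!left.equals(right)) {
                diffs.add("line " + (i + 1) + ": " + left + " | " + right);
            }
        }
        return diffs;
    }
}
```

The inputs can be read with `java.nio.file.Files.readAllLines`. Differences confined to stats fields are expected, while differences in the server list or vbucket map would point to a cluster-level problem.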

Fri, 07/08/2011 - 13:27
a.reiter
Offline
Joined: 10/13/2010
Groups: None

The methods we are using are get and set; additionally we use a CASMutator object for a CAS set.
Actually nothing special...
At about 2k per item, the amount of data is not too big...

The Web UI looks perfect, the logs do not contain any warnings or anything like that, and all servers are green.

Regarding the three files from /pools/default/buckets/bucket1:
the values of "cpu_utilization_rate" are a bit different, but everything else is the same.

So it seems the cluster is OK.
I'm pretty sure the client is the problem; a ClassCastException should actually never be seen...
There is an issue that is still not resolved: http://code.google.com/p/spymemcached/issues/detail?id=96&colspec=stars%...

Thu, 07/14/2011 - 00:57
a.reiter
Offline
Joined: 10/13/2010
Groups: None

The bug with the ClassCastException is finally fixed in version 2.7.1, see issue http://code.google.com/p/spymemcached/issues/detail?id=96
So far the clients are now running without reconnects or any other problems.

Thu, 07/14/2011 - 01:00
ingenthr
Offline
Joined: 03/16/2010
Groups:

Ah, you're Andre R! Yes, we've come up with a fix for that in the last couple of days. Thanks for letting us know. I'd not immediately spotted the same issue there. Glad things are working well for you now.
