Storing a value sometimes fails
We are using Windows all the way around, specifically Windows Server 2008 Web Edition, 64-bit. Our client uses Enyim.
Most of the time when we store a value using the Store method, the value stores successfully (true is returned from the method call). Occasionally, however, a value that should store successfully does not (false is returned).
I can set a breakpoint in my code to break when the value fails to store, then try again, and it fails, then try again, and it fails, and then perhaps on the 4th or 5th time it succeeds. After that I can let execution continue, and storing is successful for a while.
Sometimes (even more infrequently), storing the value will continually fail (when I try over and over by stepping through the code).
Questions:
Why is this happening?
Under what conditions would storing a value fail?
What is the best pattern to follow when storing a value fails? Should we have a retry loop of some kind that retries for up to a number of seconds, or something similar?
Thanks.
-Corey
I will try to gather some additional information. In the meantime, here is a bit more information.
There are 11 servers in the Cluster and the bucket is only 704 MB (64 MB per server).
The call essentially returns immediately. But like I said, while stepping through the code I can go back and execute the same statement a few more times, and eventually it will work.
This happens very infrequently. It can take 15-30 minutes before it happens again.
I've attached some screenshots to show what's going on when I look at the dashboard.
-Corey
[IMG]http://www.mozenda.com/resources/debugging/northscale/analytics-1.png[/IMG]
[IMG]http://www.mozenda.com/resources/debugging/northscale/analytics-2.png[/IMG]
[IMG]http://www.mozenda.com/resources/debugging/northscale/analytics-3.png[/IMG]
Thanks Corey, certainly doesn't look like any extreme load on the cluster, nor does anything look out of the ordinary.
Perry
Hi Perry,
Sorry for butting in halfway, but reading this I was curious what you'd consider a 'high' load on the cluster. In our production setup right now, we're running 4 single-core Amazon EC2 instances as cache servers with two buckets, one with 128 MB per server and the other with 512 MB per server. The load varies throughout the day but is usually between 500 and 900 ops/sec (roughly 4:1 gets to sets) in total across the two buckets, and CPU usage on the servers averages around 35%.
We do get the odd failure to store a cache item, but we have built in a mechanism to handle that: retry a number of times, and then re-instantiate the client if the problem persists. As I mentioned in my other thread ([url]http://forums.northscale.com/showthread.php?439-Issue-with-Enyim-client-version-2.4[/url]), re-instantiating the client solves the problem with the 2.0B2 version of the Enyim NorthScaleClient. As for the root cause of the failures, do you think we're over-utilizing the cache servers we've got? Overall, though, we're happy with the response time we get from the cache servers, but we would love to hear your thoughts on a sensible level of utilization.
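The "retry a number of times, then re-instantiate the client" mechanism described above could look roughly like the sketch below. It is written in Python for brevity and is not Enyim code: `CacheClient`, `SelfHealingStore`, `client_factory`, `retries_per_client`, and `max_rebuilds` are all hypothetical names, and the only assumption about the real client is that its store call returns true or false.

```python
from typing import Callable, Protocol


class CacheClient(Protocol):
    """Hypothetical minimal client interface: store() returns True on success."""

    def store(self, key: str, value: object) -> bool: ...


class SelfHealingStore:
    """Retries a failed store, then rebuilds the client if failures persist."""

    def __init__(
        self,
        client_factory: Callable[[], CacheClient],
        retries_per_client: int = 3,
        max_rebuilds: int = 1,
    ) -> None:
        self._factory = client_factory
        self._retries = retries_per_client
        self._max_rebuilds = max_rebuilds
        self._client = client_factory()

    def store(self, key: str, value: object) -> bool:
        for rebuild in range(self._max_rebuilds + 1):
            # Try the current client a fixed number of times.
            for _ in range(self._retries):
                if self._client.store(key, value):
                    return True
            # Still failing: replace the client, unless we are out of rebuilds.
            if rebuild < self._max_rebuilds:
                self._client = self._factory()
        return False
```

Passing a factory rather than a client instance is what lets the wrapper throw away a client that has stopped working and build a fresh one without the caller being involved.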
Thanks,
"High" load is very dependent on your own application and the power of your underlying systems. In general, though, 500-900 ops/sec should not be taxing for your setup. I don't want to be much more specific simply because it's a case-by-case analysis. I've seen much higher loads, but factors like key size, number of connections, etc. make a blanket statement hard to make.
Hope that helps. In terms of diagnosing any failures, it would be helpful for me to see a packet capture of a store failing. I'm a hardcore network guy, so looking at what is actually happening on the wire really helps me understand all the layers on top.
Thanks
Perry
Perry,
We solved the problem. I think it is a bug in NorthScaleClient, or at least ambiguous behavior.
One of the NorthScaleClient constructors takes two parameters: an INorthScaleClientConfiguration-derived class and a bucket name. Our INorthScaleClientConfiguration-derived class was returning an empty string for the Bucket property, because we were using the same instance of the class as the configuration for all client connections (to different buckets). We provided the bucket name as the second parameter to the NorthScaleClient constructor to identify the bucket to which we wanted the client to connect. We figured this is what the second parameter was for, so that we could share the same configuration across all client connections and all buckets.
It turns out we were not only seeing some items fail to store, but some items were not being stored to the correct bucket! In fact, they were stored to a seemingly random bucket.
We solved the problem by creating a separate instance of the INorthScaleClientConfiguration-derived class for each bucket, so that the Bucket property could return the proper bucket name (matching what we passed into the constructor). This entirely solved both problems.
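For anyone wanting to guard against the same mistake, here is a minimal sketch of the idea in Python. It does not use the real INorthScaleClientConfiguration or NorthScaleClient types: `ClientConfig`, `config_for_bucket`, and `check_bucket_matches` are hypothetical stand-ins, and the server address in the usage note is a placeholder. It shows one configuration copy per bucket, plus a check that fails fast when the configuration's bucket and the bucket name passed alongside it disagree.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ClientConfig:
    """Hypothetical stand-in for a per-client configuration object."""

    servers: tuple
    bucket: str


def config_for_bucket(base: ClientConfig, bucket_name: str) -> ClientConfig:
    """Return a separate copy of the shared settings with the bucket filled in."""
    if not bucket_name:
        raise ValueError("bucket_name must be a non-empty string")
    return replace(base, bucket=bucket_name)


def check_bucket_matches(config: ClientConfig, bucket_name: str) -> None:
    """Fail fast if the config's bucket and the requested bucket disagree."""
    if config.bucket != bucket_name:
        raise ValueError(
            f"config bucket {config.bucket!r} does not match "
            f"requested bucket {bucket_name!r}"
        )
```

Usage would be something like `cfg = config_for_bucket(ClientConfig(("cache-host:11211",), ""), "sessions")`, followed by `check_bucket_matches(cfg, "sessions")` before constructing the client for that bucket.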
I hope this helps someone else.
-Corey
Thanks very much for that Corey!
Perry
Corey, what happens if instead of returning an empty string you return null in your derived class?
Also, the custom INorthScaleClientConfiguration should return a new locator instance every time the `CreateNodeLocator` is called. Was your implementation doing this, or sharing the locator between clients?
Perry
Corey, the best course of action is obviously to find out why it is failing and do whatever possible to prevent that from happening (fixing a bug, etc). Aside from that, I think it is also warranted to have at least some retry logic that attempts to store the key a fixed number of times (whatever is acceptable for your application...balancing the absolute need to have that data stored with your performance needs). As far as I know, there is no reason that a value should not be stored (in memcached) under normal circumstances, but I will investigate for any specific failure scenarios that may cause this.
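As a rough illustration of retry logic with a fixed number of attempts, here is a minimal sketch in Python (not the Enyim API). `store_with_retry`, `max_attempts`, `initial_delay`, and the injectable `sleep` are hypothetical names; the doubling delay between attempts is an added assumption, not something the client requires. The only thing assumed about the real store call is that it returns true or false.

```python
import time
from typing import Callable


def store_with_retry(
    store: Callable[[], bool],
    max_attempts: int = 5,
    initial_delay: float = 0.05,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call store() until it returns True or max_attempts is used up."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        if store():
            return True
        # Pause before the next attempt, but not after the last one.
        if attempt < max_attempts:
            sleep(delay)
            delay *= 2  # back off so a struggling server is not hammered
    return False
```

The caller wraps the real store call in a zero-argument callable, for example `store_with_retry(lambda: client.store(key, value))`, and decides what to do when the function finally returns False (log it, fall back to the database, and so on).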
That being said, let's tackle the specific issue at hand. Can you use Wireshark or tcpdump to gather a packet capture when you encounter a value that fails to store several times in a row? Has it been verified whether the server itself is failing to store the value, or whether the failure is in the client logic? Do you know whether "false" is being returned immediately, or is the store timing out after some time?
Do you have a relatively simple code sample that can be used to reproduce this?
Thanks.
Perry