.NET SDK - Web application freezes
We have a web application that uses the Couchbase .NET SDK. It mostly works, but at times, possibly because of a crash in the .NET component (or something similar, such as a socket pool problem), the application freezes and we have to recycle it to bring it back.
The pattern is that Couchbase reads and writes drop to 0 while w3wp.exe shows very low CPU usage (0%-1%).
We have enabled logging for Enyim.Caching.Memcached and constantly see errors like these:
2012-12-09 00:54:56.9632|ERROR|Enyim.Caching.Memcached.MemcachedNode|System.IO.IOException: Failed to read from the socket '127.0.0.1:11210'. Error: ConnectionAborted
at Enyim.Caching.Memcached.PooledSocket.BasicNetworkStream.Read(Byte[] buffer, Int32 offset, Int32 count)
at System.IO.BufferedStream.Read(Byte[] array, Int32 offset, Int32 count)
at Enyim.Caching.Memcached.PooledSocket.Read(Byte[] buffer, Int32 offset, Int32 count)
at Enyim.Caching.Memcached.Protocol.Binary.BinaryResponse.Read(PooledSocket socket)
at Enyim.Caching.Memcached.Protocol.Binary.BinarySingleItemOperation.ReadResponse(PooledSocket socket)
at Enyim.Caching.Memcached.MemcachedNode.ExecuteOperation(IOperation op)
2012-12-09 00:54:57.3220|ERROR|Enyim.Caching.Memcached.MemcachedNode|System.IO.IOException: Failed to read from the socket '127.0.0.1:11210'. Error: ConnectionReset
at Enyim.Caching.Memcached.PooledSocket.BasicNetworkStream.Read(Byte[] buffer, Int32 offset, Int32 count)
at System.IO.BufferedStream.Read(Byte[] array, Int32 offset, Int32 count)
at Enyim.Caching.Memcached.PooledSocket.Read(Byte[] buffer, Int32 offset, Int32 count)
at Enyim.Caching.Memcached.Protocol.Binary.BinaryResponse.Read(PooledSocket socket)
at Enyim.Caching.Memcached.Protocol.Binary.BinarySingleItemOperation.ReadResponse(PooledSocket socket)
at Enyim.Caching.Memcached.MemcachedNode.ExecuteOperation(IOperation op)
Just before the application freezes we see an error like this:
2012-12-10 10:57:36.7376|ERROR|Enyim.Caching.Memcached.MemcachedNode|System.IO.IOException: Failed to read from the socket '127.0.0.1:11210'. Error: ?
at Enyim.Caching.Memcached.PooledSocket.BasicNetworkStream.Read(Byte[] buffer, Int32 offset, Int32 count)
at System.IO.BufferedStream.Read(Byte[] array, Int32 offset, Int32 count)
at Enyim.Caching.Memcached.PooledSocket.Read(Byte[] buffer, Int32 offset, Int32 count)
at Enyim.Caching.Memcached.Protocol.Binary.BinaryResponse.Read(PooledSocket socket)
at Enyim.Caching.Memcached.Protocol.Binary.BinarySingleItemOperation.ReadResponse(PooledSocket socket)
at Enyim.Caching.Memcached.MemcachedNode.ExecuteOperation(IOperation op)
What is the "?" in the error? We suspect it is an exceptional error that the .NET SDK cannot recover from.
Do you have any suggestions for what we can do to eliminate this behavior?
OS: Windows 2008 Server.
The current volume is 500-1000 ops per second. We removed some caching from Couchbase and moved it to ASP.NET caching because we suspected that too many ops per second made the problem appear faster. Before that we had about 2K-3K ops per second.
We use just one CouchbaseClient across the entire application, stored in an Application variable.
Yesterday we installed the New Relic service, and today I saw this exception before the problem occurred:
[SocketException: An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full 127.0.0.1:8091]
at System.Net.Sockets.Socket.DoConnect(EndPoint endPointSnapshot, SocketAddress socketAddress)
at System.Net.ServicePoint.ConnectSocketInternal(Boolean connectFailure, Socket s4, Socket s6, Socket& socket, IPAddress& address, ConnectSocketState state, IAsyncResult asyncResult, Exception& exception)
[WebException: Unable to connect to the remote server]
at System.Net.WebClient.DownloadDataInternal(Uri address, WebRequest& request)
at System.Net.WebClient.DownloadString(Uri address)
at Couchbase.ConfigHelper.DeserializeUri[T](WebClient client, Uri uri, IEnumerable`1 converters)
at Couchbase.ConfigHelper.ResolveBucket(WebClient client, Uri bootstrapUri, String name)
at Couchbase.BucketConfigListener.ResolveBucketUri(WebClientWithTimeout client, Uri root, String bucketName)
The question is why a SocketException brings down the client, and whether 500-1000 ops per second is too much (I think not).
Some new exceptions caught by New Relic regarding magic values:
System.InvalidOperationException: Expected magic value 129, received: 61
Stack trace
[InvalidOperationException: Expected magic value 129, received: 61]
at Enyim.Caching.Memcached.Protocol.Binary.BinaryResponse.DeserializeHeader(Byte[] header, Int32& dataLength, Int32& extraLength)
at Enyim.Caching.Memcached.Protocol.Binary.BinaryResponse.Read(PooledSocket socket)
at Enyim.Caching.Memcached.Protocol.Binary.BinarySingleItemOperation.ReadResponse(PooledSocket socket)
at Enyim.Caching.Memcached.MemcachedNode.ExecuteOperation(IOperation op)
at Enyim.Caching.Memcached.MemcachedNode.Enyim.Caching.Memcached.IMemcachedNode.Execute(IOperation op)
at Couchbase.CouchbaseClient.ExecuteWithRedirect(IMemcachedNode startNode, ISingleItemOperation op)
at Couchbase.CouchbaseClient.PerformTryGet(String key, UInt64& cas, Object& value)
at Enyim.Caching.MemcachedClient.ExecuteTryGet(String key, Object& value)
at Enyim.Caching.MemcachedClient.ExecuteGet(String key)
at ASP.default_aspx.GetDefaultSEOValuesCache(Int64 iType)
at ASP.default_aspx.ProcessInlineTag(String tag, Shop myShop)
at ASP.default_aspx.__Render__control1(HtmlTextWriter __w, Control parameterContainer)
at System.Web.UI.Control.RenderChildrenInternal(HtmlTextWriter writer, ICollection children)
at System.Web.UI.Page.Render(HtmlTextWriter writer)
at System.Web.UI.Control.RenderControlInternal(HtmlTextWriter writer, ControlAdapter adapter)
at System.Web.UI.Control.RenderControl(HtmlTextWriter writer)
at System.Web.UI.Page.ProcessRequestMain(Boolean includeStagesBeforeAsyncPoint, Boolean includeStagesAfterAsyncPoint)
[InvalidOperationException: Expected magic value 129, received: 60]
at Enyim.Caching.Memcached.Protocol.Binary.BinaryResponse.DeserializeHeader(Byte[] header, Int32& dataLength, Int32& extraLength)
at Enyim.Caching.Memcached.Protocol.Binary.BinaryResponse.Read(PooledSocket socket)
at Enyim.Caching.Memcached.Protocol.Binary.BinarySingleItemOperation.ReadResponse(PooledSocket socket)
at Enyim.Caching.Memcached.MemcachedNode.ExecuteOperation(IOperation op)
at Enyim.Caching.Memcached.MemcachedNode.Enyim.Caching.Memcached.IMemcachedNode.Execute(IOperation op)
at Couchbase.CouchbaseClient.ExecuteWithRedirect(IMemcachedNode startNode, ISingleItemOperation op)
at Couchbase.CouchbaseClient.PerformTryGet(String key, UInt64& cas, Object& value)
at ASP.searchshop2_aspx.GetShopProductMiscImagesCache(Int64 ProductId, Shop myShop)
at ASP.searchshop2_aspx.__Render__control1(HtmlTextWriter __w, Control parameterContainer)
at System.Web.UI.Control.RenderChildrenInternal(HtmlTextWriter writer, ICollection children)
at System.Web.UI.Page.Render(HtmlTextWriter writer)
at System.Web.UI.Control.RenderControlInternal(HtmlTextWriter writer, ControlAdapter adapter)
at System.Web.UI.Control.RenderControl(HtmlTextWriter writer)
at System.Web.UI.Page.ProcessRequestMain(Boolean includeStagesBeforeAsyncPoint, Boolean includeStagesAfterAsyncPoint)
Are you using the latest version of the Couchbase Client (1.2)? 1.2 Beta-3 and the GA release have fixes that prevent the client from crashing in certain situations.
The magic value error you're seeing is a reported issue with the Enyim.Caching dependency. We're working to get a patch into our fork and release that ASAP. That's issue NCBC-170 as noted in the release notes.
http://www.couchbase.com/docs/couchbase-sdk-net-1.2/couchbase-sdk-net-rn...
I installed the latest version (1.2) and things seem much better now.
I keep getting quite a few CONNECTIONABORTED and CONNECTIONRESET errors in the Enyim.Caching logs. Why is this happening on a local server with about 500-1000 ops per second?
I also noticed some crashes again, but not as many as before: 2-3 times a day now.
Note that I have implemented a kind of "write locking" in my app. Is the .NET client thread safe by design, or do I have to implement locking in my application? I noticed that things improved after I added locking on writes. Just an observation.
I will monitor things and report back in a few days.
The client should be thread safe by design, but there have been reports (such as yours) that suggest there is a bug somewhere that needs to be addressed. I'm currently focused on trying to track down these bugs. If you have some repro code you could share, I'm happy to take a look. Just send it to john at couchbase dot com. One approach that seemed to remedy these symptoms for one user was to drop the minPoolSize and maxPoolSize down to 1 and create a client per HttpApplication instance instead of a single static client.
john, what are the default pool sizes?
Can you set these in the web.config section or only in code like this?:
CouchbaseClientConfiguration config = new CouchbaseClientConfiguration();
config.SocketPool.MaxPoolSize = 1;
config.SocketPool.MinPoolSize = 1;
CouchbaseClient client = new CouchbaseClient(config);
In the web config (child of :
Defaults are: min="10" max="20"
john, I think you left something out of your post above.
Is there an XSD of the web.config section for couchbase? Or are all the options somewhere in the API documentation?
Oops - the XML in my comment was hidden in the HTML.
<couchbase>
<servers>
...
</servers>
<socketPool minPoolSize="10" maxPoolSize="20" />
</couchbase>
I'm currently working on updating the docs - should be out soon. There are some notes at http://www.couchbase.com/docs/couchbase-sdk-net-1.2/couchbase-sdk-net-co..., but it needs more details.
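As a sketch, the pool-size workaround suggested earlier in this thread (min and max both set to 1) might look like this in web.config. It assumes the same couchbase section layout as the snippet above; the servers element is left elided, as in the original post.

```xml
<couchbase>
  <servers>
    ...
  </servers>
  <!-- Assumption: one socket per node, per the workaround suggested above. Defaults are min="10" max="20". -->
  <socketPool minPoolSize="1" maxPoolSize="1" />
</couchbase>
```

This only covers the pool size; the other half of that suggestion (one client per HttpApplication instance) is a code change and is not shown here.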
Repro code is not easy to provide, as there is no consistent behavior.
Another exception that I think must be handled internally:
[NullReferenceException: Object reference not set to an instance of an object.]
at Couchbase.CouchbasePool.Enyim.Caching.Memcached.IServerPool.Locate(String key)
at Couchbase.CouchbaseClient.PerformTryGet(String key, UInt64& cas, Object& value)
at Enyim.Caching.MemcachedClient.ExecuteTryGet(String key, Object& value)
at Enyim.Caching.MemcachedClient.ExecuteGet(String key)
at ASP.default_aspx.GetDefaultSEOValuesCache(Int64 iType)
at ASP.default_aspx.ProcessInlineTag(String tag, Shop myShop)
at ASP.default_aspx.__Render__control1(HtmlTextWriter __w, Control parameterContainer)
at System.Web.UI.Control.RenderChildrenInternal(HtmlTextWriter writer, ICollection children)
at System.Web.UI.Page.Render(HtmlTextWriter writer)
at System.Web.UI.Control.RenderControlInternal(HtmlTextWriter writer, ControlAdapter adapter)
at System.Web.UI.Control.RenderControl(HtmlTextWriter writer)
at System.Web.UI.Page.ProcessRequestMain(Boolean includeStagesBeforeAsyncPoint, Boolean includeStagesAfterAsyncPoint)
I continue to have problems with the .NET component. At random times it crashes, and there is no workaround other than recycling the web app.
I tried creating a new instance of the component when errors occur, but even with a new object the problems persist.
I did the following:
1. Placed all Couchbase operations in try/catch blocks.
2. When I get an exception or start having problems (failed reads or writes), I create a new instance of the component (a fresh initialization).
3. Although a new object is created, the same problems continue: I can't read from or write to Couchbase (possibly due to socket pool problems).
4. The only workaround is to recycle the application pool. I have written a program that monitors for Couchbase errors (based on my web app's logging) and automatically recycles the application pool.
Some questions:
1. Why doesn't creating a new object fix the problem?
2. Can I do anything to initialize the object correctly and continue without recycling my app?
3. I think these kinds of problems are not acceptable in a release version and have to be addressed as soon as possible. I wish I could help more.
Regards
John
When you did step 2. from above, did you dispose of the existing instance of CouchbaseClient?
Also I noticed that it seems like your Couchbase server is on the same physical machine as your web app? Perhaps you are having issues with firewall/virus protection software.
I just set it to null. Do I have to call Dispose(), and for what reason?
There is no firewall for local connections and no antivirus software. If there were a firewall, nothing would work in the first place.
Well, just in general, if any .NET class implements IDisposable, you should call its Dispose() method. Setting it to "null" doesn't really do anything at all, except in some cases help the garbage collector if the object is on the large object heap (just a fun fact, not a factor in your case).
I'm guessing that this Dispose() method does things like clean up the connection pool etc. Might help your problem.
The null reference exception when trying to locate a node that is being thrown implies that something went wrong when the client tried to perform a handshake with the cluster. New object creation wouldn't fix the issue if there's a problem connecting to the server.
Additionally, there are resources maintained by the client that are not fully disposed unless you call dispose explicitly. Garbage collection will trigger the release of unmanaged resources via the destructor as well.
It does seem like the client can't reach the server. If it's all on the same box, with 5k-10k ops/second, you could certainly be saturating network resources. Specifically, a problem that often occurs with Windows and Couchbase under high load is there aren't enough ephemeral ports available. By default, Windows provides 5k or so. It would be useful to increase that number significantly to see what happens.
As I said previously, ops/sec are about 1000. Ephemeral ports are about 60k (as I see in the registry).
I suspect something goes wrong with the socket pool, but I can't understand why I never had similar problems with products like SQL Server, which can handle many more connections on the same box without any socket issues.
I also can't understand why an application pool recycle fixes the problem. Any thoughts?
If you have any sample code that you could share, that would be helpful. Are you able to reproduce the error on any server you set up the app on? Is this live production only?
Some other things to try would be to tweak the various connection timeouts described here - http://www.couchbase.com/docs/couchbase-sdk-net-1.2/couchbase-sdk-net-co....
Unfortunately I can't share any code, as this is happening on production machines with a lot of traffic.
Actually, I have used Couchbase on 4 different projects and I get the same errors on ALL of them.
The difference is the probability. As soon as I have more than 1000 ops/sec and a lot of traffic (so many connections from users to the web server), problems begin.
I had to switch part of my caching to internal ASP.NET caching in order to reduce the load on Couchbase and have fewer problems (about 6-7 times a day). I like out-of-process caching, but I can't stand so many problems.
I will try increasing the queueTimeout.
Is the default value of receiveTimeout 100ms or 10s? I am confused reading the docs:
receiveTimeout (00:00:10) The amount of time after which receiving data from the socket fails. The default is 100 msec.
Sorry, 10s is the correct value. I've sent a fix to our docs team.
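For illustration, the timeouts discussed here are attributes of the same socketPool element shown earlier in the thread. The fragment below is only a sketch: queueTimeout and receiveTimeout are the attribute names mentioned above, the hh:mm:ss format follows the (00:00:10) notation in the quoted docs, and the queueTimeout value is an arbitrary example rather than a recommendation.

```xml
<couchbase>
  <servers>
    ...
  </servers>
  <!-- Illustrative values only: queueTimeout raised to 2 seconds, receiveTimeout left at the documented 10 seconds. -->
  <socketPool minPoolSize="10" maxPoolSize="20" queueTimeout="00:00:02" receiveTimeout="00:00:10" />
</couchbase>
```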
After installing Couchbase 2.0, it seems that many of the problems have gone away.
Although we still get some .NET client errors (far fewer than with the previous version), we haven't seen an application lockup.
Unfortunately, we have application crashes again. Please fix this unhandled error:
An unhandled exception occurred and the process was terminated.
Application ID: /LM/W3SVC/3/ROOT
Process ID: 12420
Exception: System.NullReferenceException
Message: Object reference not set to an instance of an object.
StackTrace: at Hammock.RestClient.CompleteWithQuery(WebQuery query, RestRequest request, RestCallback callback, WebQueryAsyncResult result)
at Hammock.RestClient.<>c__DisplayClass18.b__15(Object sender, WebQueryResponseEventArgs args)
at System.EventHandler`1.Invoke(Object sender, TEventArgs e)
at Hammock.Web.WebQuery.OnQueryResponse(WebQueryResponseEventArgs args)
at Hammock.Web.WebQuery.HandleWebException(WebException exception)
at Hammock.Web.WebQuery.GetAsyncResponseCallback(IAsyncResult asyncResult)
at System.Net.LazyAsyncResult.Complete(IntPtr userToken)
at System.Threading.ExecutionContext.runTryCode(Object userData)
at System.Runtime.CompilerServices.RuntimeHelpers.ExecuteCodeWithGuaranteedCleanup(TryCode code, CleanupCode backoutCode, Object userData)
at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean ignoreSyncCtx)
at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
at System.Net.ContextAwareResult.Complete(IntPtr userToken)
at System.Net.HttpWebRequest.SetResponse(Exception E)
at System.Net.ConnectionReturnResult.SetResponses(ConnectionReturnResult returnResult)
at System.Net.Connection.CompleteConnectionWrapper(Object request, Object state)
at System.Net.PooledStream.ConnectionCallback(Object owningObject, Exception e, Socket socket, IPAddress address)
at System.Net.ServicePoint.ConnectSocketCallback(IAsyncResult asyncResult)
at System.Net.LazyAsyncResult.Complete(IntPtr userToken)
at System.Net.ContextAwareResult.Complete(IntPtr userToken)
at System.Net.Sockets.BaseOverlappedAsyncResult.CompletionPortCallback(UInt32 errorCode, UInt32 numBytes, NativeOverlapped* nativeOverlapped)
at System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32 errorCode, UInt32 numBytes, NativeOverlapped* pOVERLAP)
This is a known issue (noted here http://www.couchbase.com/docs/couchbase-sdk-net-1.2/couchbase-sdk-net-rn...). There's a fix in place with the code on GitHub. I've just attached a signed verification build to the Jira issue - http://www.couchbase.com/issues/browse/NCBC-172.
When do you anticipate the next release of the client library (the one that includes this bug fix)?
Also, did I see correctly that you are moving from Hammock to the RestSharp library for REST API access?
Thanks for all your hard work!
The next release (1.2.1) is planned for 2/5. You can see the roadmap here - http://www.couchbase.com/issues/browse/NCBC#selectedTab=com.atlassian.ji.... And yes, that will use RestSharp for view access instead of Hammock, though Hammock will still be supported as an alternative.
You're welcome!
What kind of volume do you see before these exceptions happen? How many times per day?
Are you reusing your CouchbaseClient across sessions, or creating and then disposing one on each use?