Summary
Our application froze completely whenever its application pool started or recycled under load. Some information about our application:
- Couchbase SDK: CouchbaseNetClient 2.4.8 (at the time)
- .NET framework: .NET 4.6.1 (64-bit)
- Web framework: ASP.NET Core
- OS: Windows Server 2012 R2
- Web server: IIS/Kestrel
The application is an identity and access management application which handles over a million user authentications daily and is continually growing. This results in over 500 requests per second at peak times. We use a load balancer to share the load across multiple IIS web servers which in turn pass requests to the Kestrel process.
To find the cause we analyzed our logs (we log all dependent services, such as Couchbase and SQL Server, using syslog messages) and captured process dumps, which we analyzed using WinDbg and can share with you on request. We also asked Microsoft for help, and one of their escalation engineers analyzed the same dumps using WinDbg. Below I describe our findings, what we did to work around the problem, and how we think it can be fixed permanently.
The problem
We use lazy initialization of the Couchbase Cluster and IBucket objects, which means that initialization happens upon first use instead of during application initialization. Upon first use a Cluster is created, which is stored and reused throughout the application lifetime, then OpenBucket is called. The resulting IBucket from the OpenBucket call is also stored and reused throughout the application lifetime. This logic is similar to the ClusterHelper logic.
When our application starts it immediately receives 100+ requests per second, most of which depend on Couchbase and therefore call OpenBucket (as the IBucket has not yet been created and stored for later use). In OpenBucket a lock statement ensures that bucket initialization only happens once, so the first thread obtains the lock and the other threads must wait until it is released. The first thread starts initializing the bucket, which eventually calls the UriExtensions.GetIpAddress method (as we use hostnames for our Couchbase nodes). Inside UriExtensions.GetIpAddress the asynchronous method Dns.GetHostEntryAsync is called and, because the calling method is not asynchronous, the Result property is used to block until the host entry is available. This is where it starts to go south.
Although Dns.GetHostEntryAsync is called from a synchronous context, the task it returns still needs a thread pool thread to complete. Because the application has only just started, the thread pool is still small and growing, and every thread in it is blocked waiting for the lock inside the OpenBucket method, so no thread is available to run the Dns.GetHostEntryAsync task. Each time the thread pool adds threads, Kestrel immediately uses them to execute other queued requests. Those threads also reach bucket initialization and end up waiting for the same lock inside the OpenBucket method. The result is an application that is completely frozen.
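To make the pattern concrete, here is a minimal sketch of lock-guarded lazy initialization that blocks on a task. It is not the actual CouchbaseNetClient code: the class name BlockingLazyResource and the createAsync factory are hypothetical stand-ins for the OpenBucket logic and the DNS lookup described above.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for the pattern described above; NOT the SDK's real code.
public sealed class BlockingLazyResource<T> where T : class
{
    private readonly object _sync = new object();
    private readonly Func<Task<T>> _createAsync; // assumed async factory (e.g. one that awaits a DNS lookup)
    private T _instance;

    public BlockingLazyResource(Func<Task<T>> createAsync)
    {
        _createAsync = createAsync;
    }

    public T Get()
    {
        // Every caller blocks here while the first caller initializes.
        lock (_sync)
        {
            if (_instance == null)
            {
                // Sync-over-async: the lock holder blocks on Result.
                // If the task needs a thread pool thread to finish and all
                // pool threads are blocked on the lock above, it never completes.
                _instance = _createAsync().Result;
            }
            return _instance;
        }
    }
}
```

With a handful of callers this works fine. The freeze only shows up when more callers are blocked on the lock than the thread pool has threads to spare, which is exactly the situation right after an application pool start under load.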
Possible fixes
We have made a temporary fix in a wrapper method around the OpenBucket method, which uses a SemaphoreSlim for locking during bucket initialization (the wrapper can also be shared with you if requested). Because our application is mostly asynchronous we can use the SemaphoreSlim.WaitAsync method which, when awaited in an asynchronous context, returns the waiting thread to the thread pool while waiting for the lock. Because these threads are returned to the thread pool, a thread is available to execute the task from the Dns.GetHostEntryAsync method.
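A sketch of what such a wrapper could look like is shown here. It is an illustration under assumptions, not our production code: the class name AsyncLazyResource and the create factory are hypothetical, and in our case the factory would be a call to the synchronous OpenBucket method.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical wrapper illustrating the SemaphoreSlim approach described above.
public sealed class AsyncLazyResource<T> where T : class
{
    private readonly SemaphoreSlim _gate = new SemaphoreSlim(1, 1);
    private readonly Func<T> _create; // assumed synchronous factory, e.g. a call to OpenBucket
    private T _instance;

    public AsyncLazyResource(Func<T> create)
    {
        _create = create;
    }

    public async Task<T> GetAsync()
    {
        // Fast path: already initialized, no locking needed.
        var existing = Volatile.Read(ref _instance);
        if (existing != null)
        {
            return existing;
        }

        // Waiters do not block a thread here; the thread goes back to the
        // pool and can run other work, such as the pending DNS task.
        await _gate.WaitAsync().ConfigureAwait(false);
        try
        {
            if (_instance == null)
            {
                // Only one caller at a time reaches the (still blocking) factory.
                Volatile.Write(ref _instance, _create());
            }
            return _instance;
        }
        finally
        {
            _gate.Release();
        }
    }
}
```

Request code would then await GetAsync() instead of calling OpenBucket directly. Only the one caller holding the semaphore still blocks inside the factory; all other callers give their threads back to the pool while they wait.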
As said, this is only a temporary fix. The real fix would be to use the synchronous Dns.GetHostEntry method, but within .NET Standard this method is only available in version 2.0 and above, so the library would have to change its target framework. Another approach would be to make bucket initialization asynchronous all the way by providing an OpenBucketAsync method. Pull request NCBC-1549 by @dlemstra, a colleague, is a first step towards asynchronous creation of buckets and contains essentially the same fix as the wrapper method mentioned earlier.
An OpenBucketAsync method would fix the problem in asynchronous contexts, but the synchronous OpenBucket method would still have to call Dns.GetHostEntryAsync. Calling asynchronous methods from a synchronous context seems to happen quite a lot in the CouchbaseNetClient library, but Dns.GetHostEntryAsync is the only one that has given us serious trouble (so far).
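For the synchronous path, a resolver based on Dns.GetHostEntry could look roughly like the sketch below. This assumes a target framework where the synchronous method is available (the full .NET Framework or .NET Standard 2.0 and above). The HostResolver class is hypothetical and is not the library's UriExtensions.GetIpAddress implementation.

```csharp
using System.Linq;
using System.Net;
using System.Net.Sockets;

// Hypothetical helper; a sketch of a fully synchronous hostname lookup.
public static class HostResolver
{
    public static IPAddress Resolve(string host)
    {
        // IP literals need no DNS lookup at all.
        IPAddress literal;
        if (IPAddress.TryParse(host, out literal))
        {
            return literal;
        }

        // Synchronous lookup: runs on the calling thread, so it does not
        // depend on a free thread pool thread to complete.
        IPHostEntry entry = Dns.GetHostEntry(host);

        // Prefer an IPv4 address, otherwise fall back to the first address (or null).
        return entry.AddressList.FirstOrDefault(a => a.AddressFamily == AddressFamily.InterNetwork)
            ?? entry.AddressList.FirstOrDefault();
    }
}
```

Because the lookup never leaves the calling thread, the thread holding the lock in OpenBucket can finish initialization even when every other thread pool thread is waiting for that lock.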