Runaway worker process when a cluster node shuts down - SessionState provider

Hi All,

I am experiencing a problem that has multiple symptoms and all in all seems very strange.

I have a single Windows Server 2012, IIS8 EC2 instance hosting an ASP.NET WebForms site using .NET Framework 4.0. This is configured to use a Couchbase 2.1.1 community edition (build-764-rel) cluster of 2 nodes.

The Client components are: CouchBaseNetClient 1.2.9 and CouchBaseAspNet 1.2.1 both installed from NuGet.

The Problem:

If I shut down one of the nodes (either one, same problem) to simulate failure, the following is observed:

The connected browser clients get redirected to a 404 page when they try to postback.

If I re-enter the home page URL in the browser and navigate there, the server does not respond to the GET request and the server's w3wp process jumps to 50% CPU and just sits there. If I browse to the site from another client, the w3wp processor usage jumps to 100%. Eventually these requests time out in the browser, but the processor use remains the same on the server.

Stopping and starting the app pool seems to have no effect; a full iisreset is required. When I then access the site again after an iisreset, any pages that use session state time out.

The system cannot be restored to working condition until the dropped Couchbase node is returned to the cluster.

This problem happens when exclusiveAccess is set to true or false.

The web.config entries are included below:

Web.config:

<sessionState customProvider="Couchbase" mode="Custom" timeout="10">
			<providers>
				<add name="Couchbase" type="Couchbase.AspNet.SessionState.CouchbaseSessionStateProvider, Couchbase.AspNet" 
					 exclusiveAccess="false" />
			</providers>
		</sessionState>
	</system.web>
	<couchbase>		
		<servers bucket="default" bucketPassword="passwd">
			<add uri="http://cache1.domain.com:8091/pools" />
			<add uri="http://cache2.domain.com:8091/pools" />
		</servers>
	</couchbase>

Hi mr_robd_lon -

Thanks for the very detailed description of your situation! Unfortunately I don't have an easy answer here because it seems specific to your deployment environment/architecture. I am going to create a Jira ticket so that one of our QE engineers can try to replicate this: https://www.couchbase.com/issues/browse/NCBC-314

Thanks,

Jeff

Hi, just wondering whether anybody had a chance to look into this issue. I need to make a call on whether to implement an alternative session state provider.

mr_robd_lon -

We are still looking into it - my guess is that the issue is more related to the CouchbaseClient than the actual session state provider (which uses the client).

Also, the DNS/elastic IP setup _may_ be playing a role here as well.

Any chance you have a stacktrace or any log data to go with the issue?

Excellent! Thanks for the feedback. If it's being actively investigated I'll hold for a bit longer. As a newcomer to Couchbase I've been so impressed by the ease of setup and administration that I'd like to be able to keep it as part of the solution.

It might not be the case that we absolutely need a fix, but if we can confirm that the cause is the network setup of the nodes then I can make the case for moving them into an Amazon VPC.

If it helps I can arrange access to the EC2 instances involved, not sure if you've been able to replicate it yet.

I have set up a test environment for this that is on my local network in the office to eliminate any Amazon related network weirdness.

I am now able to Failover and Remove a node without seeing the above problem. However, I can still replicate the exact problem if I stop the network interface on one of the nodes or shut the machine down. If I restart my application while 1 of the 2 nodes is down I also see similar behaviour.

I am not familiar enough with the code to determine the exact cause, but what I can say is that the line of code where the client hangs is:

Worker --> ProcessPool -->
MessageStreamListener. private void ReadMessages(Uri heartBeatUrl, Uri configUrl)
...
while ((line = reader.ReadLine()) != null)
{
  ...
}

It's just spinning its wheels in reader.ReadLine. It never gets inside the loop once the problem has started. It seems as though an additional check is necessary on the reader, something like reader.BaseStream.CanRead?

I think that because it is stuck listening for messages, the rest of the client isn't able to properly deal with the missing node.

I posted this SO question about the problem: http://stackoverflow.com/questions/20329075/handling-streamreader-readli...

The suggestion that there is no read timeout set on the stream seems quite reasonable.
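
To make the read-timeout idea concrete, here is a minimal sketch. It is not the Couchbase client's code: the class and method names (ReadTimeoutSketch, ReadLinesWithTimeout, RunDemo) are hypothetical, and it assumes a raw NetworkStream, whereas the real MessageStreamListener may read from an HTTP response stream (where HttpWebRequest.ReadWriteTimeout would be the analogous setting). A loopback server sends one line and then goes silent, roughly imitating a node that disappears without closing the connection. As far as I understand it, Stream.CanRead only says whether the stream supports reading, so it probably would not detect a silent peer; a timeout would.

```csharp
using System;
using System.IO;
using System.Net;
using System.Net.Sockets;
using System.Text;
using System.Threading;

static class ReadTimeoutSketch
{
    // Reads lines until the stream ends or no data arrives within timeoutMs.
    // Returns "eof" or "timeout" so the caller can decide whether to reconnect.
    public static string ReadLinesWithTimeout(NetworkStream stream, int timeoutMs, Action<string> onLine)
    {
        stream.ReadTimeout = timeoutMs; // the default is infinite, so Read can block forever
        var reader = new StreamReader(stream, Encoding.UTF8);
        try
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                onLine(line);
            }
            return "eof";
        }
        catch (IOException)
        {
            // NetworkStream reports an expired ReadTimeout as an IOException.
            return "timeout";
        }
    }

    // Hypothetical demo: the server sends one line, then keeps the socket open and silent.
    public static string RunDemo(out int linesSeen)
    {
        var listener = new TcpListener(IPAddress.Loopback, 0); // port 0 = any free port
        listener.Start();
        int port = ((IPEndPoint)listener.LocalEndpoint).Port;
        var release = new ManualResetEvent(false);

        var server = new Thread(() =>
        {
            using (var peer = listener.AcceptTcpClient())
            {
                byte[] hello = Encoding.UTF8.GetBytes("config-v1\n");
                peer.GetStream().Write(hello, 0, hello.Length);
                release.WaitOne(); // send nothing more until the client has given up
            }
        });
        server.IsBackground = true;
        server.Start();

        int count = 0;
        string outcome;
        using (var client = new TcpClient())
        {
            client.Connect(IPAddress.Loopback, port);
            outcome = ReadLinesWithTimeout(client.GetStream(), 500, _ => count++);
        }

        release.Set();
        server.Join();
        listener.Stop();
        linesSeen = count;
        return outcome;
    }

    static void Main()
    {
        int lines;
        string outcome = RunDemo(out lines);
        Console.WriteLine(outcome + " after " + lines + " line(s)");
    }
}
```

After a timeout the connection should probably be treated as dead and reopened rather than read from again, which is where the client could fall back to the other node's URI. Whether this actually cures the runaway w3wp process is untested here.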

mr_robd_lon -

Thanks for digging deeper into this, it looks like you isolated the cause and the section of code that is failing. I am not sure the readTimeout will solve the problem though...we'll try it out using the scenario described above.

The client is open source, so feel free to try it yourself and if it works as expected with no regressions, submit a pull request via github :)

-Jeff


Hi,

Thank you very much for prioritising this.

I'm not sure if this has any relevance, but just in case:

The set up is reliant on the configuration described in this article: http://alestic.com/2009/06/ec2-elastic-ip-internal

That is, the DNS name referred to in the web.config (cache1.domain.com) is a CNAME to the EC2 public DNS name, e.g. ec2-254-254-254-254-someregion-etc.com