[NCBC-227] intermittent failures during add-back rebalance Created: 13/Feb/13  Updated: 07/May/13  Resolved: 07/May/13

Status: Resolved
Project: Couchbase .NET client library
Component/s: library
Affects Version/s: 1.2.1
Fix Version/s: 1.2.5

Type: Bug Priority: Major
Reporter: Matt Ingenthron Assignee: Saakshi Manocha
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Our integration testing shows intermittent operation failures during tests where a node is failed over, then added back and rebalanced. This is not expected, as there should be no failures during rebalance.

Assigning to Saakshi to further fill out the description.

 Comments   
Comment by Saakshi Manocha [ 14/Feb/13 ]
- Reproduced the brun test lists again to include the newly added reAdd test.

- Ran the command:
python .\brun -C Sdkd.args -S dotnet-1.2-release -V 2.0.0-1976 -i cluster_config.ini -T HYBRID_readd-2
(This command will fail two nodes, add them back and then rebalance)

- cluster_config.ini comprises 4 nodes:
10.3.121.134 10.3.121.135 10.3.121.136 10.3.3.206

- Output is here:
http://sdk-testresults.couchbase.com.s3.amazonaws.com/sdkd/HWIN-335SPEPOCGT-IHYBRID_readd-2-Sdotnet-1.2-release-T2013-02-14-03.49.12-LV_CB_BASIC.txt

http://sdk-testresults.couchbase.com.s3.amazonaws.com/sdkd/HWIN-335SPEPOCGT-IHYBRID_readd-2-Sdotnet-1.2-release-T2013-02-14-03.49.12-LV_MC_BASIC.txt

http://sdk-testresults.couchbase.com.s3.amazonaws.com/sdkd/HWIN-335SPEPOCGT-IHYBRID_readd-2-Sdotnet-1.2-release-T2013-02-14-03.49.12-LV_HTTP_BASIC.txt

- Observations:
(a) The following errors occur continuously during the CHANGE phase, while the rebalance operation is in progress:
      [Enyim.Caching.Memcached.MemcachedNode|Error] System.IO.IOException: Failed to read from the socket '10.3.121.136:11210'. Error: SocketError value was Success, but 0 bytes were received
      [Enyim.Caching.Memcached.MemcachedNode.InternalPoolImpl|Error] Could not init pool. System.NullReferenceException Object reference not set to an instance of an object.
      [Sdkd.ViewQuery|Warn] Unrecognized error System.Net.WebException The operation has timed out

(b) No errors occur during the REBOUND phase, which is a good sign. This is the period after the rebalance operation is complete, when no more topology changes occur.
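
To compare error volume between the CHANGE and REBOUND phases, the log lines quoted in (a) can be tallied per logger and level. The following is only a minimal sketch: it assumes each line has the `[Logger|Level] message` shape seen above, and the names `LOG_LINE` and `tally_log_lines` are hypothetical helpers, not part of sdkd or the client library.

```python
import re
from collections import Counter

# Assumed line shape, taken from the excerpts above:
#   [Enyim.Caching.Memcached.MemcachedNode|Error] message text
LOG_LINE = re.compile(
    r"^\s*\[(?P<logger>[^|\]]+)\|(?P<level>[^\]]+)\]\s*(?P<message>.*)$"
)


def tally_log_lines(lines):
    """Count log lines per (logger, level) pair; non-matching lines are skipped."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            counts[(match.group("logger"), match.group("level"))] += 1
    return counts


if __name__ == "__main__":
    # Sample lines copied from observation (a).
    sample = [
        "[Enyim.Caching.Memcached.MemcachedNode|Error] System.IO.IOException: "
        "Failed to read from the socket '10.3.121.136:11210'.",
        "[Enyim.Caching.Memcached.MemcachedNode.InternalPoolImpl|Error] "
        "Could not init pool.",
        "[Sdkd.ViewQuery|Warn] Unrecognized error System.Net.WebException "
        "The operation has timed out",
    ]
    for (logger, level), count in tally_log_lines(sample).most_common():
        print(f"{count:5d}  {level:<6} {logger}")
```

Running the tally separately on the CHANGE and REBOUND portions of a log would show whether any logger still reports errors after the rebalance completes.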
Comment by Mark Nunberg [ 14/Feb/13 ]
Interesting to note that there are NOT_MY_VBUCKET errors well after the rebalance that follows the re-add.
Comment by Saakshi Manocha [ 01/Mar/13 ]
Ran a full suite of hybrid test scenarios using sdkd and the latest Enyim.Caching changes (made by John for issue CBSE-396).
The report is ready with comments and shared through Google docs:
sdkd-reports -> nosdk-nocluster-3d_AT-2013-02-24T22-21-32

The report has better grades than last month's report, which is good.
Comment by Saakshi Manocha [ 05/Mar/13 ]
The report (sdkd-reports -> nosdk-nocluster-3d_AT-2013-02-24T22-21-32) shows that the error messages occur in debug mode during rebalance, but the error rate does not increase. During and after the REBOUND phase the errors disappear and the cluster recovers fully.
As long as there are no errors after the rebalance operation is complete, the report is good.
Comment by Matt Ingenthron [ 06/Mar/13 ]
Note that we ran into this in a Java deployment today. There may be something odd happening here.

Is it possible to get a packet capture of ports 8091, 8092 and 11210 from the client system for this scenario, using 2.0.0 server on Linux? This would allow us to see if the cluster is behaving as expected.
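
One possible way to set up such a capture is sketched below. It only builds a `tcpdump` command line restricted to the three ports named above; it assumes `tcpdump` is available on the capturing host, and the interface name `any`, the output file `rebalance.pcap`, and the helper `build_capture_command` are illustrative assumptions rather than anything specified in this ticket.

```python
import shlex

# Ports named in the request above: 8091, 8092 and 11210.
COUCHBASE_PORTS = (8091, 8092, 11210)


def build_capture_command(interface="any", outfile="rebalance.pcap",
                          ports=COUCHBASE_PORTS):
    """Return a tcpdump argument list that captures only the given TCP ports."""
    # Capture filter, e.g. "tcp port 8091 or tcp port 8092 or tcp port 11210".
    capture_filter = " or ".join(f"tcp port {port}" for port in ports)
    # -i: interface, -s 0: do not truncate packets, -w: write raw packets to a file.
    return ["tcpdump", "-i", interface, "-s", "0", "-w", outfile, capture_filter]


if __name__ == "__main__":
    # Print the command so it can be reviewed and then run with suitable privileges.
    print(shlex.join(build_capture_command()))
```

The capture would need to be started before the failover and stopped after the REBOUND phase so that the whole topology change is recorded.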
Comment by Saakshi Manocha [ 07/May/13 ]
The changes required for this issue were already released with NCBC-228, so I'm closing this one out.
No further similar issues have been reported.
Generated at Thu Jul 10 22:49:27 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.