[NCBC-227] intermittent failures during add-back rebalance Created: 13/Feb/13 Updated: 07/May/13 Resolved: 07/May/13
|Project:||Couchbase .NET client library|
|Reporter:||Matt Ingenthron||Assignee:||Saakshi Manocha|
|Remaining Estimate:||Not Specified|
|Time Spent:||Not Specified|
|Original Estimate:||Not Specified|
Our integration testing is showing irregular operations failing during tests where a node is failed over, then added back and rebalanced. This is not expected, as there should be no failures during rebalance.
Assigning to Saakshi to further fill out the description.
|Comment by Saakshi Manocha [ 14/Feb/13 ]|
- Regenerated the brun test lists to include the newly added re-add test.
- Ran the command:
python .\brun -C Sdkd.args -S dotnet-1.2-release -V 2.0.0-1976 -i cluster_config.ini -T HYBRID_readd-2
(This command will fail two nodes, add them back, and then rebalance.)
- cluster_config.ini comprises 4 nodes:
10.3.121.134 10.3.121.135 10.3.121.136 10.3.3.206
- Output is here:
(a) The following errors occur continuously during the CHANGE phase, while the rebalance operation is in progress:
[Enyim.Caching.Memcached.MemcachedNode|Error] System.IO.IOException: Failed to read from the socket '10.3.121.136:11210'. Error: SocketError value was Success, but 0 bytes were received
[Enyim.Caching.Memcached.MemcachedNode.InternalPoolImpl|Error] Could not init pool. System.NullReferenceException Object reference not set to an instance of an object.
[Sdkd.ViewQuery|Warn] Unrecognized error System.Net.WebException The operation has timed out
(b) No errors occur during the REBOUND phase, which is a good sign. This is the period after the rebalance operation is complete, when no more topology changes occur.
|Comment by Mark Nunberg [ 14/Feb/13 ]|
Interesting to note that there are NOT_MY_VBUCKET errors well after the rebalance that follows the re-add.
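For context, NOT_MY_VBUCKET is the memcached binary-protocol status (0x0007) a node returns when it no longer owns the vbucket for a key, which is expected while topology changes; a smart client is supposed to refresh its vbucket map and retry. Seeing these errors well after rebalance suggests the client's map is stale. A minimal sketch of the expected retry loop (the `client` object and its method names here are hypothetical, for illustration only, not the actual Enyim API):

```python
# Status codes from the memcached binary protocol.
SUCCESS = 0x0000
NOT_MY_VBUCKET = 0x0007  # contacted node no longer owns this key's vbucket


def get_with_retry(client, key, max_retries=3):
    """Retry a GET, refreshing the vbucket map on NOT_MY_VBUCKET.

    `client` is a hypothetical smart-client object exposing
    get(key) -> (status, value) and refresh_vbucket_map().
    """
    for _ in range(max_retries):
        status, value = client.get(key)
        if status != NOT_MY_VBUCKET:
            return status, value
        # Ownership moved during rebalance: fetch the new cluster map
        # and retry against the (possibly different) owning node.
        client.refresh_vbucket_map()
    raise RuntimeError("vbucket map still stale after %d retries" % max_retries)
```

If the map refresh never happens (or keeps returning the old topology), the client would emit NOT_MY_VBUCKET indefinitely, which matches the symptom described above.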
|Comment by Saakshi Manocha [ 01/Mar/13 ]|
Ran a full suite of hybrid test scenarios using sdkd and the latest Enyim.Caching changes (made by John in relation to issue CBSE-396).
The report is ready with comments and shared through Google docs:
sdkd-reports -> nosdk-nocluster-3d_AT-2013-02-24T22-21-32
The report has better grades than last month's report, which is good.
|Comment by Saakshi Manocha [ 05/Mar/13 ]|
The report: sdkd-reports -> nosdk-nocluster-3d_AT-2013-02-24T22-21-32
shows that the error messages occur in debug mode during rebalance, but the error rate does not increase. During and after the REBOUND phase the errors disappear and the cluster fully recovers.
As long as there are no errors after the rebalance operation is complete, the report is good.
|Comment by Matt Ingenthron [ 06/Mar/13 ]|
Note that we ran into this in a Java deployment today. There may be something odd happening here.
Is it possible to take, using a 2.0.0 server on Linux, a packet capture of ports 8091, 8092 and 11210 from the client system? This would allow us to see whether the cluster is behaving as expected.
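A capture like that could be taken on the client machine with tcpdump; a sketch, where the interface name and output filename are placeholders:

```shell
# Capture management (8091), views (8092) and memcached (11210) traffic
# as seen from the client host. eth0 and the .pcap name are placeholders.
sudo tcpdump -i eth0 -s 0 -w rebalance-readd.pcap \
    'port 8091 or port 8092 or port 11210'
```

The resulting .pcap can then be opened in Wireshark to check the streaming topology updates on 8091 against the operations sent to 11210.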
|Comment by Saakshi Manocha [ 07/May/13 ]|
The required changes for this issue have already been released.
No further similar issues have been reported.