[NCBC-227] intermittent failures during add-back rebalance Created: 13/Feb/13 Updated: 07/May/13 Resolved: 07/May/13 |
|
| Status: | Resolved |
| Project: | Couchbase .NET client library |
| Component/s: | library |
| Affects Version/s: | 1.2.1 |
| Fix Version/s: | 1.2.5 |
| Type: | Bug | Priority: | Major |
| Reporter: | Matt Ingenthron | Assignee: | Saakshi Manocha |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Description |
|
Our integration testing shows intermittent operation failures during tests in which a node is failed over, then added back and rebalanced. This is not expected: there should be no failures during rebalance.
Assigning to Saakshi to further fill out the description. |
| Comments |
| Comment by Saakshi Manocha [ 14/Feb/13 ] |
|
- Reproduced the brun test lists again to include the newly added reAdd test.
- Ran the command: python .\brun -C Sdkd.args -S dotnet-1.2-release -V 2.0.0-1976 -i cluster_config.ini -T HYBRID_readd-2
  (This command fails two nodes, adds them back, and then rebalances.)
- cluster_config.ini comprises 4 nodes: 10.3.121.134, 10.3.121.135, 10.3.121.136, 10.3.3.206
- Output is here:
  http://sdk-testresults.couchbase.com.s3.amazonaws.com/sdkd/HWIN-335SPEPOCGT-IHYBRID_readd-2-Sdotnet-1.2-release-T2013-02-14-03.49.12-LV_CB_BASIC.txt
  http://sdk-testresults.couchbase.com.s3.amazonaws.com/sdkd/HWIN-335SPEPOCGT-IHYBRID_readd-2-Sdotnet-1.2-release-T2013-02-14-03.49.12-LV_MC_BASIC.txt
  http://sdk-testresults.couchbase.com.s3.amazonaws.com/sdkd/HWIN-335SPEPOCGT-IHYBRID_readd-2-Sdotnet-1.2-release-T2013-02-14-03.49.12-LV_HTTP_BASIC.txt
- Observations:
  (a) The following errors occur continuously during the CHANGE phase, while the rebalance operation is in progress:
      [Enyim.Caching.Memcached.MemcachedNode|Error] System.IO.IOException: Failed to read from the socket '10.3.121.136:11210'. Error: SocketError value was Success, but 0 bytes were received
      [Enyim.Caching.Memcached.MemcachedNode.InternalPoolImpl|Error] Could not init pool. System.NullReferenceException Object reference not set to an instance of an object.
      [Sdkd.ViewQuery|Warn] Unrecognized error System.Net.WebException The operation has timed out
  (b) No errors occur during the REBOUND phase, which is a good sign. This is the period when the rebalance operation is complete and no more topology changes occur. |
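For anyone re-checking the linked outputs, the error lines quoted in the observations can be tallied by logger and level. The following is a minimal Python sketch, not part of sdkd or brun; it assumes each log line starts with the "[Logger|Level]" prefix seen in the excerpts, and the `tally` helper and `sample` list are hypothetical names introduced only for illustration.

```python
import re
from collections import Counter

# Assumed shape, taken from the excerpts in this comment: "[Logger|Level] message".
LOG_LINE = re.compile(r"^\[(?P<logger>[^|\]]+)\|(?P<level>[^\]]+)\]\s*(?P<message>.*)$")


def tally(lines):
    """Count log lines per (logger, level); lines without the prefix are skipped."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match:
            counts[(match.group("logger"), match.group("level"))] += 1
    return counts


# The three messages quoted in the observations, plus one non-matching line.
sample = [
    "[Enyim.Caching.Memcached.MemcachedNode|Error] System.IO.IOException: "
    "Failed to read from the socket '10.3.121.136:11210'. "
    "Error: SocketError value was Success, but 0 bytes were received",
    "[Enyim.Caching.Memcached.MemcachedNode.InternalPoolImpl|Error] Could not init pool. "
    "System.NullReferenceException Object reference not set to an instance of an object.",
    "[Sdkd.ViewQuery|Warn] Unrecognized error System.Net.WebException The operation has timed out",
    "a line without the bracketed prefix",
]

counts = tally(sample)
for (logger, level), n in sorted(counts.items()):
    print(f"{level:5} {logger}: {n}")
```

Running the same tally separately over the CHANGE and REBOUND portions of a log would make the "errors only during CHANGE" observation easy to confirm, provided the phase boundaries can be identified in the output.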
| Comment by Mark Nunberg [ 14/Feb/13 ] |
| Interesting to note that there are NOT_MY_VBUCKET errors well after the rebalance that follows the re-add. |
| Comment by Saakshi Manocha [ 01/Mar/13 ] |
|
Ran a full suite of hybrid test scenarios using sdkd and the latest enyim.caching changes (made by John for issue CBSE-396). The report is ready with comments and is shared through Google Docs: sdkd-reports -> nosdk-nocluster-3d_AT-2013-02-24T22-21-32. The report has better grades than last month's report, which is good. |
| Comment by Saakshi Manocha [ 05/Mar/13 ] |
|
The report (sdkd-reports -> nosdk-nocluster-3d_AT-2013-02-24T22-21-32) shows that error messages occur in debug mode during rebalance, but the error rate does not increase. During and after the REBOUND phase, the errors disappear and the cluster recovers fully. As long as there are no errors after the rebalance operation is complete, the report is good. |
| Comment by Matt Ingenthron [ 06/Mar/13 ] |
|
Note that we ran into this in a Java deployment today. There may be something odd happening here. Is it possible, using the 2.0.0 server on Linux, to take a packet capture of ports 8091, 8092 and 11210 from the client system? This would allow us to see whether the cluster is behaving as expected. |
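One way to take the requested capture is with tcpdump on the client system. The following Python sketch only builds and prints the command line; it is an illustration under assumptions, not a procedure from this ticket. The interface name "any" (a Linux pseudo-interface) and the output file name rebalance.pcap are assumptions, as are the helper names `capture_filter` and `tcpdump_argv`.

```python
import shlex

# Ports named in the request above.
PORTS = (8091, 8092, 11210)


def capture_filter(ports):
    """Build a pcap filter expression that matches TCP traffic on any of the ports."""
    return " or ".join(f"tcp port {port}" for port in ports)


def tcpdump_argv(ports, interface="any", outfile="rebalance.pcap"):
    """Return a tcpdump argument list; interface and outfile are assumed values."""
    return [
        "tcpdump",
        "-i", interface,   # capture interface
        "-s", "0",         # full snapshot length, so packets are not truncated
        "-w", outfile,     # write raw packets to a file for later analysis
        capture_filter(ports),
    ]


argv = tcpdump_argv(PORTS)
# Print a shell-safe command line; running it usually requires root privileges.
print(shlex.join(argv))
```

Starting the capture before the failover and stopping it after the REBOUND phase would cover the whole topology change in a single file.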
| Comment by Saakshi Manocha [ 07/May/13 ] |
|
The required changes for this issue already got released with No further similar issue reported |