[JCBC-312] Reconfigure not triggered on Master -1 Created: 31/May/13  Updated: 07/Jun/13  Resolved: 07/Jun/13

Status: Resolved
Project: Couchbase Java Client
Component/s: Core
Affects Version/s: None
Fix Version/s: 1.1.7
Security Level: Public

Type: Bug Priority: Critical
Reporter: Michael Nitschinger Assignee: Michael Nitschinger
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
During some failover/rebalance scenarios, it can happen that no master is responsible for a document (the master index is -1). This should not occur, but it is observed when the client is still working from an outdated config.

This raises RuntimeExceptions, but a reconfigure is never actively triggered. In QE tests, this shows up as errors during the CHANGE and REBOUND phases.

How those -1 entries get in place should be investigated separately; checking for them and triggering a reconfigure is a safety net for running operations.
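
A minimal sketch of the safety net described above, not the actual patch. Only the name checkConfigUpdate() comes from this ticket; MasterGuardSketch, ConfigRefresher, MasterMap, locateMaster and the "partition" index are hypothetical names introduced for illustration. It assumes a lookup that returns -1 when no master is known and shows the check requesting a config refresh in that case.

```java
public class MasterGuardSketch {

    /** Hypothetical stand-in for whatever object exposes checkConfigUpdate(). */
    public interface ConfigRefresher {
        void checkConfigUpdate();
    }

    /** Hypothetical map: masters[partition] is a node index, or -1 when no master is known. */
    public static final class MasterMap {
        private final int[] masters;

        public MasterMap(int[] masters) {
            this.masters = masters.clone();
        }

        public int masterFor(int partition) {
            return masters[partition];
        }
    }

    /**
     * Looks up the master for a partition. If the (possibly outdated) map
     * reports no master, a config check is requested before returning, so
     * the stale state is not left in place. The caller still sees -1 and
     * has to decide whether to retry or fail the operation.
     */
    public static int locateMaster(MasterMap map, int partition, ConfigRefresher refresher) {
        int master = map.masterFor(partition);
        if (master < 0) {
            // Safety net: ask for a fresh config instead of only failing.
            refresher.checkConfigUpdate();
        }
        return master;
    }

    public static void main(String[] args) {
        // Partition 1 has no master, as in the outdated-config scenario.
        MasterMap stale = new MasterMap(new int[] {0, -1, 1});
        ConfigRefresher refresher = () -> System.out.println("reconfigure requested");
        for (int p = 0; p < 3; p++) {
            System.out.println("partition " + p + " -> master " + locateMaster(stale, p, refresher));
        }
    }
}
```

In this sketch the refresh is requested once per failed lookup; a real client would likely need to throttle or deduplicate such requests under load, which is left out here.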

 Comments   
Comment by Michael Nitschinger [ 31/May/13 ]
No fix applied:

Will show phase timings..
--------------------------
Phase statistics for RAMP
  OK/sec: 3619

  OK: 108596
  ERR: 0
Phase statistics for CHANGE
  OK/sec: 3204

  OK: 999917
  ERR: 141386
Phase statistics for REBOUND
  OK/sec: 3966

  OK: 357026
  ERR: 26898
---------------------

Fix applied:

--------------------------
Phase statistics for RAMP
  OK/sec: 3536

  OK: 106102
  ERR: 0
Phase statistics for CHANGE
  OK/sec: 1870

  OK: 471474
  ERR: 825
Phase statistics for REBOUND
  OK/sec: 4063

  OK: 365714
  ERR: 0
---------------------

stester run: /stester -C 127.0.0.1:8050 -i 20devcluster.ini -c failover.Once --vdsw_dvname ddoc/vquery --hdsw_http_threads 5 --grace_after 30 --ept 1 --ramp 30 --num_nodes 2 --hdsw_mc_threads 10 --workload dsw.Hybrid --action_delay 10 --hdsw_cb_threads 10 --action FO_REBALANCE --dsw_timeres 1 -d -o viewlog_3_f.out
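
To make the two runs above easier to compare, the sketch below turns the raw counts into an error share per phase. It assumes OK and ERR are disjoint counts, so the share is ERR / (OK + ERR); the stester output does not state this, and the class and method names are hypothetical.

```java
public class PhaseErrorShare {

    /** Share of failed operations, assuming OK and ERR are disjoint counts. */
    public static double errorShare(long ok, long err) {
        long total = ok + err;
        return total == 0 ? 0.0 : (double) err / total;
    }

    public static void main(String[] args) {
        // Counts copied from the CHANGE and REBOUND phases of the two runs above.
        System.out.printf("CHANGE  no fix: %.2f%%%n", 100 * errorShare(999917, 141386));
        System.out.printf("CHANGE  fix:    %.2f%%%n", 100 * errorShare(471474, 825));
        System.out.printf("REBOUND no fix: %.2f%%%n", 100 * errorShare(357026, 26898));
        System.out.printf("REBOUND fix:    %.2f%%%n", 100 * errorShare(365714, 0));
    }
}
```

Under that assumption, the CHANGE phase goes from roughly 12.4% errors to roughly 0.17%. The share does not capture that the fixed run also completed fewer operations during CHANGE (OK/sec 1870 versus 3204).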
Comment by Michael Nitschinger [ 31/May/13 ]
http://review.couchbase.org/#/c/26636/
Comment by Michael Nitschinger [ 31/May/13 ]
Note that before this change, the RuntimeException bubbled up to the user level and blocked everything there. More importantly, cf.checkConfigUpdate(); never got triggered!
Comment by Michael Nitschinger [ 31/May/13 ]
This also improves this scenario run:

Effective stester command line
    -C 127.0.0.1:8050 \
    -i 20devcluster.ini \
    -c failover.Once \
    --vdsw_dvname ddoc/vquery \
    --hdsw_http_threads 5 \
    --grace_after 30 \
    --ept 1 \
    --ramp 30 \
    --num_nodes 2 \
    --hdsw_mc_threads 10 \
    --workload dsw.Hybrid \
    --action_delay 10 \
    --hdsw_cb_threads 10 \
    --action FO_REBALANCE \
    --dsw_timeres 1 \
    -d \

--------------------------
Phase statistics for RAMP
  OK/sec: 3057

  OK: 91713
  ERR: 0
Phase statistics for CHANGE
  OK/sec: 3172

  OK: 957997
  ERR: 187000
Phase statistics for REBOUND
  OK/sec: 4108

  OK: 369758
  ERR: 63598
---------------------

After

Will show phase timings..
--------------------------
Phase statistics for RAMP
  OK/sec: 3453

  OK: 103594
  ERR: 0
Phase statistics for CHANGE
  OK/sec: 2346

  OK: 731968
  ERR: 549
Phase statistics for REBOUND
  OK/sec: 4064

  OK: 365817
  ERR: 0
---------------------
Generated at Thu Apr 17 00:51:25 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.