[MB-6490] Rebalance failed with reason "Partition 687 not in active nor passive set" in add in node rebalance Created: 30/Aug/12  Updated: 10/Jan/13  Resolved: 18/Oct/12

Status: Closed
Project: Couchbase Server
Component/s: ns_server, view-engine
Affects Version/s: None
Fix Version/s: 2.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Iryna Mironava Assignee: Aleksey Kondratenko
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 4-core VMs, CentOS 6.2 64-bit
build #1653, build 2.0.0-1781

Attachments: GZip Archive 10.3.121.104-8091-diag.txt.gz     GZip Archive 10.3.121.105-8091-diag.txt.gz     GZip Archive 10.3.121.110-8091-diag.txt.gz     GZip Archive 10.3.121.111-8091-diag.txt.gz     GZip Archive 10.3.121.120-8091-diag.txt.gz     GZip Archive 10.3.3.58-8091-diag.txt.gz     GZip Archive 10.3.3.64-8091-diag.txt.gz     GZip Archive 10.3.3.68-8091-diag.txt.gz     GZip Archive 10.3.3.71-8091-diag.txt.gz     GZip Archive 10.3.3.73-8091-diag.txt.gz     GZip Archive 4c82d8b6-9739-40f2-885f-e2335ddb0b54-10.3.3.58-diag.txt.gz     GZip Archive 4c82d8b6-9739-40f2-885f-e2335ddb0b54-10.3.3.64-diag.txt.gz     GZip Archive 4c82d8b6-9739-40f2-885f-e2335ddb0b54-10.3.3.68-diag.txt.gz     GZip Archive 4c82d8b6-9739-40f2-885f-e2335ddb0b54-10.3.3.71-diag.txt.gz     GZip Archive 4c82d8b6-9739-40f2-885f-e2335ddb0b54-10.3.3.73-diag.txt.gz     Text File narrowed.txt    

 Description   
Rebalance failed with error

Rebalance exited with reason {{{{badmatch,
{error,
{error,
<<"Partition 36 not in active nor passive set">>}}},
[{capi_set_view_manager,handle_call,3},
{gen_server,handle_msg,5},
{gen_server,init_it,6},
{proc_lib,init_p_do_apply,3}]},
{gen_server,call,
['capi_set_view_manager-saslbucket',
{wait_index_updated,36},
infinity]}},
{gen_server,call,
[{'janitor_agent-saslbucket','ns_1@10.6.2.44'},
{if_rebalance,<0.32719.854>,
{wait_index_updated,36}},
infinity]}}

with or without consistent views enabled.

** In the orange cluster with build 2.0.0-1781, consistent views are enabled by default and rebalance failed when adding 2 nodes to the cluster.

** In Iryna's cluster, consistent views are disabled. She saw rebalance fail with the same error, as described below:

 index_aware_rebalance_disabled set to false, 5 ddocs, 500K items
4-node cluster; remove 2 nodes and add 1 node, then start rebalance

Rebalance exited with reason {{error,
                                  <<"Partition 687 not in active nor passive set">>},
                              {gen_server,call,
                                  [{'janitor_agent-bucket-0',
                                       'ns_1@10.3.121.120'},
                                   {if_rebalance,<0.14888.6>,
                                       {wait_index_updated,953}},
                                   infinity]}}

 Comments   
Comment by Aleksey Kondratenko [ 30/Aug/12 ]
Thanks. That's the bug I was seeing too. Diags should help me a lot.
Comment by Aleksey Kondratenko [ 31/Aug/12 ]
Ok. That's a simple race.

We start monitoring the indexing of a vbucket as soon as we've waited for it to become 'ready' inside ep-engine.

The problem is there's a gap between ep-engine having the data in RAM and the same data being ready on disk. So we need to wait until the vbucket data actually reaches disk.
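The race is easiest to see as a sketch (Python, purely illustrative; the real code is Erlang in ns_server, and all names here are hypothetical):

```python
class VBucket:
    """Toy model of a vbucket: data lands in RAM first, on disk later."""
    def __init__(self):
        self.in_ram = False
        self.on_disk = False

    def receive_data(self):
        self.in_ram = True    # ep-engine reports the vbucket 'ready' here

    def persist(self):
        self.on_disk = True   # asynchronous flush completes some time later


def wait_index_updated(vb):
    # The view engine only indexes what is on disk; monitoring before
    # persistence means the partition is not yet in the active/passive set.
    if not vb.on_disk:
        raise RuntimeError("Partition not in active nor passive set")
    return "updated"


def rebalance_step_buggy(vb):
    vb.receive_data()
    return wait_index_updated(vb)   # races ahead of persistence -> crash


def rebalance_step_fixed(vb):
    vb.receive_data()
    vb.persist()                    # the fix: wait for the data to hit disk
    return wait_index_updated(vb)
```

The buggy flow fails exactly like the rebalance error above; the fixed flow waits for persistence before monitoring the index update.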
Comment by Aleksey Kondratenko [ 06/Sep/12 ]
Done
Comment by Thuan Nguyen [ 07/Sep/12 ]
Integrated in github-ns-server-2-0 #461 (See [http://qa.hq.northscale.net/job/github-ns-server-2-0/461/])
    MB-6490: killed needless nesting in per-bucket supervisor (Revision b1ead4cf929528b6a590810de7d35704f6ff45d2)
MB-6490: untangled ddoc replication from cb_generic_replication (Revision 91261dfe63c67ef3036da4ffa81b259b42dd24e8)
MB-6490: untangled xdcr rdoc from cb_generic_replication_srv (Revision d0463831630262f78ce24caa31e44218c6814946)
MB-6490: removed unused cb_generic_replication_srv (Revision 520ecd2708cab22796c5c32e43c7b5556617322e)
MB-6490: replicate ddocs in capi_set_view_manager (Revision 0395d10743d743a5faef748885cd9c9f9565cab2)
MB-6490: moved waiting for index updates to capi_set_view_manager (Revision 62802038b5864f6c26bbc007f5fa09d91806b880)

     Result = SUCCESS
pwansch :
Files :
* src/ns_memcached_sup.erl
* src/single_bucket_sup.erl

pwansch :
Files :
* src/capi_ddoc_replication_srv.erl

pwansch :
Files :
* src/xdc_rdoc_replication_srv.erl

pwansch :
Files :
* src/cb_generic_replication_srv.erl

pwansch :
Files :
* src/single_bucket_sup.erl
* src/capi_ddoc_replication_srv.erl
* src/capi_set_view_manager.erl

pwansch :
Files :
* src/capi_set_view_manager.erl
* src/janitor_agent.erl
Comment by Iryna Mironava [ 10/Sep/12 ]
verified
Comment by Iryna Mironava [ 12/Sep/12 ]
reproduced in 1707
Comment by Farshid Ghods (Inactive) [ 12/Sep/12 ]
promoting this to blocker since this happens more frequently now and it's easy to reproduce
Comment by Aleksey Kondratenko [ 12/Sep/12 ]
Narrowed down the last phase of things.

capi_set_view_manager seemingly correctly added 842 to the passive state in all indexes.

Then we start monitoring the index update and crash, which tells us 842 is neither active nor passive.

Could be related to the ongoing cleanup of 842 and the fact that passivation of 842 is still pending.
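Conceptually, the monitoring precondition can be sketched in Python (hypothetical names; the actual check lives in the Erlang view engine). The later fixes ("Allow unindexable partitions in the pending transition") amount to also accepting partitions whose state transition is still pending:

```python
def check_partition(partition, active, passive, pending=frozenset()):
    """Toy version of the view engine's monitoring precondition.

    Originally only members of the active/passive sets were accepted,
    so a partition like 842 whose passivation was still pending failed
    the check. Accepting the pending-transition set closes that window.
    """
    if partition in active or partition in passive:
        return "ok"
    if partition in pending:
        return "ok (pending transition)"
    raise ValueError(
        "Partition %d not in active nor passive set" % partition)
```

With this relaxation, a partition mid-transition no longer aborts the rebalance with the error seen above.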
Comment by Aleksey Kondratenko [ 12/Sep/12 ]
Farshid, you mentioned it's a blocker, but it's not one according to the ticket.

Please update it and pass it to Filipe. I need his attention here; from the logs it appears capi_set_view_manager is doing it right.
Comment by Farshid Ghods (Inactive) [ 12/Sep/12 ]
Yes, this is a 2.0 blocker, not a 2.0 beta blocker.
Will assign this to Filipe.
Thanks for the triage.
Comment by Farshid Ghods (Inactive) [ 16/Sep/12 ]
Per Alk comments
Comment by Thuan Nguyen [ 01/Oct/12 ]
Hit this bug when adding 2 nodes in a system test with build 2.0.0-1781 with consistent views enabled. I will get collect_info from all nodes and update this bug.

Rebalance exited with reason {{{{badmatch,
{error,
{error,
<<"Partition 145 not in active nor passive set">>}}},
[{capi_set_view_manager,handle_call,3},
{gen_server,handle_msg,5},
{gen_server,init_it,6},
{proc_lib,init_p_do_apply,3}]},
{gen_server,call,
['capi_set_view_manager-saslbucket',
{wait_index_updated,145},
infinity]}},
{gen_server,call,
[{'janitor_agent-saslbucket','ns_1@10.6.2.38'},
{if_rebalance,<0.1357.852>,
{wait_index_updated,145}},
infinity]}}
Comment by Filipe Manana [ 01/Oct/12 ]
Thanks Thuan.

I'm aware of the problem that remains after ns_server's fix. It's a different problem (and component) but the same final error.
I already started working on it last week.

There's no need to keep testing for this or posting new results - the old logs are clear enough to understand the problem.
Don't bother investing more time here before my change is finished and merged. Thanks.
Comment by Farshid Ghods (Inactive) [ 01/Oct/12 ]
Tony,
can you please rephrase the bug description to reflect the use case better.
I was confused because the title says it happens only when consistent views are disabled; if it does not happen with consistent views, then the priority is different. So please be more specific.

Also, as Filipe mentioned, let's not file separate bugs for this exact error.
Comment by Thuan Nguyen [ 03/Oct/12 ]
Integrated in github-couchdb-preview #509 (See [http://qa.hq.northscale.net/job/github-couchdb-preview/509/])
    MB-6490 Fix race condition in test 20-debug-params.t (Revision 80cbce15112b2b60f2c3463673c81139c3731f0d)
MB-6490 Allow unindexable partitions in the pending transition (Revision 63a94ebe8da325c89972dedac0db41c5a7a36aed)
MB-6490 Don't error when monitoring partitions in pending transition (Revision 780f5c88c84c6c9319c8f12638cc8946b8b842f5)

     Result = SUCCESS
pwansch :
Files :
* src/couch_set_view/test/20-debug-params.t

pwansch :
Files :
* src/couch_set_view/src/couch_set_view_group.erl
* src/couch_set_view/src/couch_set_view_updater.erl
* src/couch_set_view/include/couch_set_view.hrl
* src/couch_set_view/test/16-pending-transition.t
* src/couch_set_view/src/couch_set_view_util.erl

pwansch :
Files :
* src/couch_set_view/test/16-pending-transition.t
* src/couch_set_view/src/couch_set_view_group.erl
* src/couch_set_view/src/couch_set_view_util.erl
* src/couch_set_view/src/couch_db_set.erl
Comment by Thuan Nguyen [ 05/Oct/12 ]
Integrated in github-couchdb-preview #510 (See [http://qa.hq.northscale.net/job/github-couchdb-preview/510/])
    MB-6490 Add missing checks to state transition requests (Revision bf5c23b6af2f31656dcd96f9892fc9c2c66b5b48)

     Result = SUCCESS
Farshid Ghods :
Files :
* src/couch_set_view/src/couch_set_view_group.erl
* src/couch_set_view/test/16-pending-transition.t
Comment by Iryna Mironava [ 11/Oct/12 ]
reproduced in 1820:
manifest:
<manifest><remote name="couchbase" fetch="git://10.1.1.210/"/><remote name="membase" fetch="git://10.1.1.210/"/><remote name="apache" fetch="git://github.com/apache/"/><remote name="erlang" fetch="git://github.com/erlang/"/><default remote="couchbase" revision="master"/><project name="tlm" path="tlm" revision="ab70f6d42f46621ec576889e57cb37ac2d64a84b"><copyfile dest="Makefile" src="Makefile.top"/></project><project name="bucket_engine" path="bucket_engine" revision="70b3624abc697b7d18bf3d57f331b7674544e1e7"/><project name="ep-engine" path="ep-engine" revision="3d545832ed84650e480855cf3abae6fef9fccf9d"/><project name="libconflate" path="libconflate" revision="3cf7107eaa5b52b34cc9f887cf0e2edb3465988e"/><project name="libmemcached" path="libmemcached" revision="ca739a890349ac36dc79447e37da7caa9ae819f5" remote="membase"/><project name="libvbucket" path="libvbucket" revision="00d3763593c116e8e5d97aa0b646c42885727398"/><project name="membase-cli" path="membase-cli" revision="0bc659c78e1f2d822e658778f857c8dacc7a01e5" remote="membase"/><project name="memcached" path="memcached" revision="858731183b08cd6b72fa6e68c1fb4208cb87570d" remote="membase"/><project name="moxi" path="moxi" revision="52a5fa887bfff0bf719c4ee5f29634dd8707500e"/><project name="ns_server" path="ns_server" revision="a4fd05a0fa64f090800baccc887bbd416b9f8f27"/><project name="portsigar" path="portsigar" revision="1bc865e1622fb93a3fe0d1a4cdf18eb97ed9d600"/><project name="sigar" path="sigar" revision="63a3cd1b316d2d4aa6dd31ce8fc66101b983e0b0"/><project name="couchbase-examples" path="couchbase-examples" revision="21e6161a1d064979b5c6aa99cd34ccc41c9d7aca"/><project name="couchbase-python-client" path="couchbase-python-client" revision="86b398e4fbc1f2e38d356e14df0c1bb4e3d2427b"/><project name="couchdb" path="couchdb" revision="6b9fa5f115e675ba345bf5ffa17e57423efd86ba"/><project name="couchdbx-app" path="couchdbx-app" revision="d196377b5b1ba3ce25f1b92066e2741898b01a1e"/><project name="couchstore" path="couchstore" 
revision="29579bd47f7c916c43116722b8f4962b4ea9fff0"/><project name="geocouch" path="geocouch" revision="7782df1a53104e9c8bb9ef941a9b499bbc7cd61e"/><project name="mccouch" path="mccouch" revision="88701cc326bc3dde4ed072bb8441be83adcfb2a5"/><project name="testrunner" path="testrunner" revision="bc501cfa4c3453f9c2a7b8cf48ac81da3dca053c"/><project name="otp" path="otp" revision="b6dc1a844eab061d0a7153d46e7e68296f15a504" remote="erlang"/><project name="icu4c" path="icu4c" revision="26359393672c378f41f2103a8699c4357c894be7" remote="couchbase"/><project name="snappy" path="snappy" revision="5681dde156e9d07adbeeab79666c9a9d7a10ec95" remote="couchbase"/><project name="v8" path="v8" revision="447decb75060a106131ab4de934bcc374648e7f2" remote="couchbase"/><project name="gperftools" path="gperftools" revision="8f60ba949fb8576c530ef4be148bff97106ddc59" remote="couchbase"/><project name="pysqlite" path="pysqlite" revision="0ff6e32ea05037fddef1eb41a648f2a2141009ea" remote="couchbase"/></manifest>

attaching new logs
Comment by Iryna Mironava [ 11/Oct/12 ]
logs from build 1820
Comment by Iryna Mironava [ 11/Oct/12 ]
reproduced also on build 1827
Comment by Filipe Manana [ 11/Oct/12 ]
There's limited information in the logs, due to rotation.

But looking at file 4c82d8b6-9739-40f2-885f-e2335ddb0b54-10.3.3.58-diag.txt,

Looking at the last occurrence of the error, at line 2416042, the error seems valid from the view engine's point of view. Going up above that line, none of the indexes has vbucket 624 in the active or passive state.

Above that line, I also see that ns_server marks vbucket 624 for cleanup in several indexes, but doesn't mark it as active/passive afterwards. Example at line 2410127:

[views:info,2012-10-09T14:46:46.117,ns_1@10.3.3.58:'capi_set_view_manager-default':capi_set_view_manager:apply_index_states:472]
couch_set_view:set_partition_states([<<"default">>,

From what I can see, the error is valid; it might be bad coordination from ns_server.

It would also help here if ns_server logged the name of the respective index (design doc) when such an error happens. That makes it easier to troubleshoot when there are many indexes.
Comment by Aleksey Kondratenko [ 11/Oct/12 ]
Appears to be a problem in waiting for the persisted checkpoint, which causes us to assume a vbucket is 'ready' too soon.
Comment by Aleksey Kondratenko [ 11/Oct/12 ]
http://review.couchbase.org/#/c/21552/ and http://review.couchbase.org/#/c/21553/
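In other words, readiness was likely being judged from an in-memory checkpoint rather than a persisted one. A hedged Python sketch (stat names are hypothetical, modeled loosely on the distinction between ep-engine's open and persisted checkpoint ids):

```python
def ready_buggy(stats, target):
    # The open checkpoint advances as soon as data is in RAM, so this
    # declares the vbucket ready before the view engine can see
    # anything on disk.
    return stats["open_checkpoint"] >= target


def ready_fixed(stats, target):
    # Only true once the checkpoint has been flushed to disk,
    # i.e. once the data is actually indexable.
    return stats["persisted_checkpoint"] >= target
```

With the buggy predicate, a vbucket whose flush is still in flight looks ready, and the subsequent wait_index_updated call fails exactly as in the tracebacks above.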
Comment by Iryna Mironava [ 18/Oct/12 ]
reproduced in 1850
<manifest><remote name="couchbase" fetch="git://10.1.1.210/"/><remote name="membase" fetch="git://10.1.1.210/"/><remote name="apache" fetch="git://github.com/apache/"/><remote name="erlang" fetch="git://github.com/erlang/"/><default remote="couchbase" revision="master"/><project name="tlm" path="tlm" revision="ab70f6d42f46621ec576889e57cb37ac2d64a84b"><copyfile src="Makefile.top" dest="Makefile"/></project><project name="bucket_engine" path="bucket_engine" revision="70b3624abc697b7d18bf3d57f331b7674544e1e7"/><project name="ep-engine" path="ep-engine" revision="25b403263ccd67ffe3205a474d8f93a21f2936d0"/><project name="libconflate" path="libconflate" revision="2cc8eff8e77d497d9f03a30fafaecb85280535d6"/><project name="libmemcached" path="libmemcached" revision="ca739a890349ac36dc79447e37da7caa9ae819f5" remote="membase"/><project name="libvbucket" path="libvbucket" revision="00d3763593c116e8e5d97aa0b646c42885727398"/><project name="membase-cli" path="membase-cli" revision="c82db287eab652d25116b042d4627a6931722a8e" remote="membase"/><project name="memcached" path="memcached" revision="858731183b08cd6b72fa6e68c1fb4208cb87570d" remote="membase"/><project name="moxi" path="moxi" revision="52a5fa887bfff0bf719c4ee5f29634dd8707500e"/><project name="ns_server" path="ns_server" revision="65e7ebe2d45904e82e1226ddeca257a2cd9d5075"/><project name="portsigar" path="portsigar" revision="1bc865e1622fb93a3fe0d1a4cdf18eb97ed9d600"/><project name="sigar" path="sigar" revision="63a3cd1b316d2d4aa6dd31ce8fc66101b983e0b0"/><project name="couchbase-examples" path="couchbase-examples" revision="21e6161a1d064979b5c6aa99cd34ccc41c9d7aca"/><project name="couchbase-python-client" path="couchbase-python-client" revision="86b398e4fbc1f2e38d356e14df0c1bb4e3d2427b"/><project name="couchdb" path="couchdb" revision="23cec9997b38ac82cab310b7560d01db529c1ae2"/><project name="couchdbx-app" path="couchdbx-app" revision="d196377b5b1ba3ce25f1b92066e2741898b01a1e"/><project name="couchstore" path="couchstore" 
revision="29579bd47f7c916c43116722b8f4962b4ea9fff0"/><project name="geocouch" path="geocouch" revision="b0bd742551639c52030c070e5bf9390edbb536ba"/><project name="mccouch" path="mccouch" revision="88701cc326bc3dde4ed072bb8441be83adcfb2a5"/><project name="testrunner" path="testrunner" revision="48fc95d4e1009d0f40a2c4e2e59448dc3e4fcad3"/><project name="otp" path="otp" revision="b6dc1a844eab061d0a7153d46e7e68296f15a504" remote="erlang"/><project name="icu4c" path="icu4c" revision="26359393672c378f41f2103a8699c4357c894be7" remote="couchbase"/><project name="snappy" path="snappy" revision="5681dde156e9d07adbeeab79666c9a9d7a10ec95" remote="couchbase"/><project name="v8" path="v8" revision="447decb75060a106131ab4de934bcc374648e7f2" remote="couchbase"/><project name="gperftools" path="gperftools" revision="8f60ba949fb8576c530ef4be148bff97106ddc59" remote="couchbase"/><project name="pysqlite" path="pysqlite" revision="0ff6e32ea05037fddef1eb41a648f2a2141009ea" remote="couchbase"/></manifest>


logs:
http://qa.hq.northscale.net/job/centos-64-2.0-view-query-tests/516/artifact/logs/testrunner-12-Oct-16_14-53-26/d1387940-7fbd-4e43-91ef-460736b2e37d-10.3.3.114-diag.txt.gz
http://qa.hq.northscale.net/job/centos-64-2.0-view-query-tests/516/artifact/logs/testrunner-12-Oct-16_14-53-26/d1387940-7fbd-4e43-91ef-460736b2e37d-10.3.3.115-diag.txt.gz
http://qa.hq.northscale.net/job/centos-64-2.0-view-query-tests/516/artifact/logs/testrunner-12-Oct-16_14-53-26/d1387940-7fbd-4e43-91ef-460736b2e37d-10.3.3.121-diag.txt.gz
http://qa.hq.northscale.net/job/centos-64-2.0-view-query-tests/516/artifact/logs/testrunner-12-Oct-16_14-53-26/d1387940-7fbd-4e43-91ef-460736b2e37d-10.3.3.122-diag.txt.gz

2012-10-17 01:31:46.990 ns_orchestrator:2:info:message(ns_1@10.3.3.115) - Rebalance exited with reason {{{{badmatch,
                                 {error,
                                  {error,
                                   <<"Partition 672 not in active nor passive set">>}}},
                                [{capi_set_view_manager,handle_call,3},
                                 {gen_server,handle_msg,5},
                                 {gen_server,init_it,6},
                                 {proc_lib,init_p_do_apply,3}]},
                               {gen_server,call,
                                ['capi_set_view_manager-default',
                                 {wait_index_updated,672},
                                 infinity]}},
                              {gen_server,call,
                               [{'janitor_agent-default','ns_1@10.3.3.115'},
                                {if_rebalance,<0.7393.100>,
                                 {wait_index_updated,672}},
                                infinity]}}
Comment by Aleksey Kondratenko [ 18/Oct/12 ]
Thanks for the report. I managed to understand what happened by looking at MB-6955. A fix is coming soon.
Generated at Sat Oct 25 13:29:35 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.