[MB-6490] Rebalance failed with reason "Partition 687 not in active nor passive set" in add in node rebalance Created: 30/Aug/12 Updated: 10/Jan/13 Resolved: 18/Oct/12 |
|
| Status: | Closed |
| Project: | Couchbase Server |
| Component/s: | ns_server, view-engine |
| Affects Version/s: | None |
| Fix Version/s: | 2.0 |
| Security Level: | Public |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Iryna Mironava | Assignee: | Aleksey Kondratenko |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
4-core VMs, CentOS 6.2 64-bit
build #1653, build 2.0.0-1781 |
||
| Attachments: |
|
| Description |
|
Rebalance failed with the error below, with or without consistent views enabled.
Rebalance exited with reason {{{{badmatch, {error, {error, <<"Partition 36 not in active nor passive set">>}}}, [{capi_set_view_manager,handle_call,3}, {gen_server,handle_msg,5}, {gen_server,init_it,6}, {proc_lib,init_p_do_apply,3}]}, {gen_server,call, ['capi_set_view_manager-saslbucket', {wait_index_updated,36}, infinity]}}, {gen_server,call, [{'janitor_agent-saslbucket','ns_1@10.6.2.44'}, {if_rebalance,<0.32719.854>, {wait_index_updated,36}}, infinity]}}
** In the orange cluster with build 2.0.0-1781, consistent views are enabled by default, and rebalance failed when adding 2 nodes to the cluster.
** In Iryna's cluster, consistent views are disabled. Her rebalance failed with the same error, as she described in the following:
index_aware_rebalance_disabled set false, 5 ddocs, 500K items
4-node cluster, remove 2 nodes and add 1 node, start rebalance
Rebalance exited with reason {{error, <<"Partition 687 not in active nor passive set">>}, {gen_server,call, [{'janitor_agent-bucket-0', 'ns_1@10.3.121.120'}, {if_rebalance,<0.14888.6>, {wait_index_updated,953}}, infinity]}} |
| Comments |
| Comment by Aleksey Kondratenko [ 30/Aug/12 ] |
| Thanks. That's the bug I was seeing too. Diags should help me a lot. |
| Comment by Aleksey Kondratenko [ 31/Aug/12 ] |
|
OK. That's a simple race. We start monitoring indexing of a vbucket once we've waited for it to be 'ready' inside ep-engine. The problem is that there's a gap between ep-engine getting the data in RAM and that same data being ready on disk. So we need to wait until the vbucket actually gets to disk. |
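The race described above can be illustrated with a minimal Python sketch. This is not ns_server or ep-engine code: `ToyVBucket`, `mem_seqno`, `disk_seqno`, and `wait_until_persisted` are hypothetical names invented for illustration. The idea shown is only that a caller should block until the on-disk position has caught up with the in-memory position before it starts monitoring the index.

```python
import threading
import time


class ToyVBucket:
    """Hypothetical stand-in for a vbucket: data lands in RAM first, on disk later."""

    def __init__(self):
        self._lock = threading.Lock()
        self.mem_seqno = 0   # highest sequence number held in memory
        self.disk_seqno = 0  # highest sequence number flushed to disk

    def write(self, count):
        # New items are visible in memory immediately.
        with self._lock:
            self.mem_seqno += count

    def flush(self):
        # Simulates the flusher catching up with what is in memory.
        with self._lock:
            self.disk_seqno = self.mem_seqno

    def snapshot(self):
        with self._lock:
            return self.mem_seqno, self.disk_seqno


def wait_until_persisted(vb, target_seqno, timeout=5.0, poll_interval=0.01):
    """Block until disk_seqno >= target_seqno; return False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        _, disk = vb.snapshot()
        if disk >= target_seqno:
            return True
        time.sleep(poll_interval)
    return False


def demo():
    vb = ToyVBucket()
    vb.write(100)
    target, disk_before = vb.snapshot()  # in RAM, but not yet on disk
    flusher = threading.Timer(0.05, vb.flush)  # the flush happens a bit later
    flusher.start()
    ok = wait_until_persisted(vb, target)  # only now is it safe to monitor indexing
    flusher.join()
    return disk_before, ok, vb.snapshot()[1]
```

In this toy model, treating the vbucket as 'ready' right after `write` would observe `disk_seqno == 0`, which corresponds to the gap described in the comment above.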
| Comment by Aleksey Kondratenko [ 06/Sep/12 ] |
| Done |
| Comment by Thuan Nguyen [ 07/Sep/12 ] |
|
Integrated in github-ns-server-2-0 #461 (See [http://qa.hq.northscale.net/job/github-ns-server-2-0/461/])
Result = SUCCESS
pwansch : Files :
* src/ns_memcached_sup.erl
* src/single_bucket_sup.erl
pwansch : Files :
* src/capi_ddoc_replication_srv.erl
pwansch : Files :
* src/xdc_rdoc_replication_srv.erl
pwansch : Files :
* src/cb_generic_replication_srv.erl
pwansch : Files :
* src/single_bucket_sup.erl
* src/capi_ddoc_replication_srv.erl
* src/capi_set_view_manager.erl
pwansch : Files :
* src/capi_set_view_manager.erl
* src/janitor_agent.erl |
| Comment by Iryna Mironava [ 10/Sep/12 ] |
| verified |
| Comment by Iryna Mironava [ 12/Sep/12 ] |
| reproduced in 1707 |
| Comment by Farshid Ghods [ 12/Sep/12 ] |
| Promoting this to blocker since this happens more frequently now and it's easy to reproduce. |
| Comment by Aleksey Kondratenko [ 12/Sep/12 ] |
|
Narrowed down the last phase of things. capi_set_view_manager seemingly correctly added 842 to the passive state in all indexes. Then we start monitoring the index update and crash, which tells us 842 is neither active nor passive. Could be related to the ongoing cleanup of 842 and the fact that 842's passivation is pending. |
| Comment by Aleksey Kondratenko [ 12/Sep/12 ] |
|
Farshid, you mentioned it's a blocker, but according to the ticket it's not. Please update it and pass it to Filipe. I need his attention here; from the logs it appears capi_set_view_manager is doing it right. |
| Comment by Farshid Ghods [ 12/Sep/12 ] |
|
Yes, this is a 2.0 blocker, not a 2.0 beta blocker.
Will assign this to Filipe. Thanks for the triage. |
| Comment by Farshid Ghods [ 16/Sep/12 ] |
| Per Alk comments |
| Comment by Thuan Nguyen [ 01/Oct/12 ] |
|
Hit this bug when adding 2 nodes in a system test with build 2.0.0-1781 with consistent views enabled. I will get collect_info from all nodes and update this bug.
Rebalance exited with reason {{{{badmatch, {error, {error, <<"Partition 145 not in active nor passive set">>}}}, [{capi_set_view_manager,handle_call,3}, {gen_server,handle_msg,5}, {gen_server,init_it,6}, {proc_lib,init_p_do_apply,3}]}, {gen_server,call, ['capi_set_view_manager-saslbucket', {wait_index_updated,145}, infinity]}}, {gen_server,call, [{'janitor_agent-saslbucket','ns_1@10.6.2.38'}, {if_rebalance,<0.1357.852>, {wait_index_updated,145}}, infinity]}} |
| Comment by Filipe Manana [ 01/Oct/12 ] |
|
Thanks Thuan.
I'm aware of the problem after ns_server's fix. It's a different problem (and component) but the same final error. I already started working on it last week. There's no need to keep testing for this or posting new results; the old logs are clear enough to understand the problem. Don't bother investing more time here before I finish my change and it gets merged. Thanks. |
| Comment by Farshid Ghods [ 01/Oct/12 ] |
|
Tony,
Can you please rephrase the bug description to reflect the use case better? I was confused, as the title says it happens only when consistent views is disabled, and if it does not happen with consistent views then the priority is different. So please be more specific. Also, as Filipe mentioned, for this exact error let's not file separate bugs. |
| Comment by Thuan Nguyen [ 03/Oct/12 ] |
|
Integrated in github-couchdb-preview #509 (See [http://qa.hq.northscale.net/job/github-couchdb-preview/509/])
Result = SUCCESS
pwansch : Files :
* src/couch_set_view/test/20-debug-params.t
pwansch : Files :
* src/couch_set_view/src/couch_set_view_group.erl
* src/couch_set_view/src/couch_set_view_updater.erl
* src/couch_set_view/include/couch_set_view.hrl
* src/couch_set_view/test/16-pending-transition.t
* src/couch_set_view/src/couch_set_view_util.erl
pwansch : Files :
* src/couch_set_view/test/16-pending-transition.t
* src/couch_set_view/src/couch_set_view_group.erl
* src/couch_set_view/src/couch_set_view_util.erl
* src/couch_set_view/src/couch_db_set.erl |
| Comment by Thuan Nguyen [ 05/Oct/12 ] |
|
Integrated in github-couchdb-preview #510 (See [http://qa.hq.northscale.net/job/github-couchdb-preview/510/])
Result = SUCCESS
Farshid Ghods : Files :
* src/couch_set_view/src/couch_set_view_group.erl
* src/couch_set_view/test/16-pending-transition.t |
| Comment by Iryna Mironava [ 11/Oct/12 ] |
|
reproduced in 1820: manifest: <manifest><remote name="couchbase" fetch="git://10.1.1.210/"/><remote name="membase" fetch="git://10.1.1.210/"/><remote name="apache" fetch="git://github.com/apache/"/><remote name="erlang" fetch="git://github.com/erlang/"/><default remote="couchbase" revision="master"/><project name="tlm" path="tlm" revision="ab70f6d42f46621ec576889e57cb37ac2d64a84b"><copyfile dest="Makefile" src="Makefile.top"/></project><project name="bucket_engine" path="bucket_engine" revision="70b3624abc697b7d18bf3d57f331b7674544e1e7"/><project name="ep-engine" path="ep-engine" revision="3d545832ed84650e480855cf3abae6fef9fccf9d"/><project name="libconflate" path="libconflate" revision="3cf7107eaa5b52b34cc9f887cf0e2edb3465988e"/><project name="libmemcached" path="libmemcached" revision="ca739a890349ac36dc79447e37da7caa9ae819f5" remote="membase"/><project name="libvbucket" path="libvbucket" revision="00d3763593c116e8e5d97aa0b646c42885727398"/><project name="membase-cli" path="membase-cli" revision="0bc659c78e1f2d822e658778f857c8dacc7a01e5" remote="membase"/><project name="memcached" path="memcached" revision="858731183b08cd6b72fa6e68c1fb4208cb87570d" remote="membase"/><project name="moxi" path="moxi" revision="52a5fa887bfff0bf719c4ee5f29634dd8707500e"/><project name="ns_server" path="ns_server" revision="a4fd05a0fa64f090800baccc887bbd416b9f8f27"/><project name="portsigar" path="portsigar" revision="1bc865e1622fb93a3fe0d1a4cdf18eb97ed9d600"/><project name="sigar" path="sigar" revision="63a3cd1b316d2d4aa6dd31ce8fc66101b983e0b0"/><project name="couchbase-examples" path="couchbase-examples" revision="21e6161a1d064979b5c6aa99cd34ccc41c9d7aca"/><project name="couchbase-python-client" path="couchbase-python-client" revision="86b398e4fbc1f2e38d356e14df0c1bb4e3d2427b"/><project name="couchdb" path="couchdb" revision="6b9fa5f115e675ba345bf5ffa17e57423efd86ba"/><project name="couchdbx-app" path="couchdbx-app" revision="d196377b5b1ba3ce25f1b92066e2741898b01a1e"/><project 
name="couchstore" path="couchstore" revision="29579bd47f7c916c43116722b8f4962b4ea9fff0"/><project name="geocouch" path="geocouch" revision="7782df1a53104e9c8bb9ef941a9b499bbc7cd61e"/><project name="mccouch" path="mccouch" revision="88701cc326bc3dde4ed072bb8441be83adcfb2a5"/><project name="testrunner" path="testrunner" revision="bc501cfa4c3453f9c2a7b8cf48ac81da3dca053c"/><project name="otp" path="otp" revision="b6dc1a844eab061d0a7153d46e7e68296f15a504" remote="erlang"/><project name="icu4c" path="icu4c" revision="26359393672c378f41f2103a8699c4357c894be7" remote="couchbase"/><project name="snappy" path="snappy" revision="5681dde156e9d07adbeeab79666c9a9d7a10ec95" remote="couchbase"/><project name="v8" path="v8" revision="447decb75060a106131ab4de934bcc374648e7f2" remote="couchbase"/><project name="gperftools" path="gperftools" revision="8f60ba949fb8576c530ef4be148bff97106ddc59" remote="couchbase"/><project name="pysqlite" path="pysqlite" revision="0ff6e32ea05037fddef1eb41a648f2a2141009ea" remote="couchbase"/></manifest> attaching new logs |
| Comment by Iryna Mironava [ 11/Oct/12 ] |
| logs from build 1820 |
| Comment by Iryna Mironava [ 11/Oct/12 ] |
| reproduced also on build 1827 |
| Comment by Filipe Manana [ 11/Oct/12 ] |
|
There's limited information in the logs, due to rotation. But looking at file 4c82d8b6-9739-40f2-885f-e2335ddb0b54-10.3.3.58-diag.txt, at the last occurrence of the error (line 2416042), the error seems valid from the view engine's point of view. Going up above that line, none of the indexes has vbucket 624 in the active or passive state. Above that line, I also see that ns_server marks vbucket 624 for cleanup in several indexes, but doesn't mark it as active/passive afterwards. Example in line 2410127:
[views:info,2012-10-09T14:46:46.117,ns_1@10.3.3.58:'capi_set_view_manager-default':capi_set_view_manager:apply_index_states:472] couch_set_view:set_partition_states([<<"default">>,
From what I can see, the error is valid and might be due to bad coordination from ns_server. It would also help here if ns_server logged the name of the respective index (design doc) when such an error happens. That makes it easier to troubleshoot when there are many indexes. |
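The suggestion above, reporting which index (design doc) is missing the partition, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `index_states` layout, the design doc names, and both function names are hypothetical and do not reflect the actual ns_server or couch_set_view data structures.

```python
def find_indexes_missing_partition(index_states, partition):
    """Return the design docs whose index has `partition` in neither set.

    `index_states` is a hypothetical mapping of
    design doc name -> {"active": set_of_ints, "passive": set_of_ints}.
    """
    missing = []
    for ddoc_name, states in index_states.items():
        if partition not in states["active"] and partition not in states["passive"]:
            missing.append(ddoc_name)
    return sorted(missing)


def describe_missing_partition(index_states, partition):
    """Build an error message that names the affected design docs, or None."""
    missing = find_indexes_missing_partition(index_states, partition)
    if not missing:
        return None
    return "Partition {} not in active nor passive set (design docs: {})".format(
        partition, ", ".join(missing)
    )


# Made-up example state: vbucket 624 was cleaned up in one index only.
EXAMPLE_STATES = {
    "_design/ddoc_a": {"active": {1, 2, 624}, "passive": set()},
    "_design/ddoc_b": {"active": {1, 2}, "passive": {3}},
}
```

With the made-up state above, `describe_missing_partition(EXAMPLE_STATES, 624)` names only `_design/ddoc_b`, which is the kind of detail that would narrow the search when many indexes exist.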
| Comment by Aleksey Kondratenko [ 11/Oct/12 ] |
| Appears to be a problem in waiting for the persisted checkpoint, which is causing us to assume the vbucket is 'ready' too soon. |
| Comment by Aleksey Kondratenko [ 11/Oct/12 ] |
| http://review.couchbase.org/#/c/21552/ and http://review.couchbase.org/#/c/21553/ |
| Comment by Iryna Mironava [ 18/Oct/12 ] |
|
reproduced in 1850 <manifest><remote name="couchbase" fetch="git://10.1.1.210/"/><remote name="membase" fetch="git://10.1.1.210/"/><remote name="apache" fetch="git://github.com/apache/"/><remote name="erlang" fetch="git://github.com/erlang/"/><default remote="couchbase" revision="master"/><project name="tlm" path="tlm" revision="ab70f6d42f46621ec576889e57cb37ac2d64a84b"><copyfile src="Makefile.top" dest="Makefile"/></project><project name="bucket_engine" path="bucket_engine" revision="70b3624abc697b7d18bf3d57f331b7674544e1e7"/><project name="ep-engine" path="ep-engine" revision="25b403263ccd67ffe3205a474d8f93a21f2936d0"/><project name="libconflate" path="libconflate" revision="2cc8eff8e77d497d9f03a30fafaecb85280535d6"/><project name="libmemcached" path="libmemcached" revision="ca739a890349ac36dc79447e37da7caa9ae819f5" remote="membase"/><project name="libvbucket" path="libvbucket" revision="00d3763593c116e8e5d97aa0b646c42885727398"/><project name="membase-cli" path="membase-cli" revision="c82db287eab652d25116b042d4627a6931722a8e" remote="membase"/><project name="memcached" path="memcached" revision="858731183b08cd6b72fa6e68c1fb4208cb87570d" remote="membase"/><project name="moxi" path="moxi" revision="52a5fa887bfff0bf719c4ee5f29634dd8707500e"/><project name="ns_server" path="ns_server" revision="65e7ebe2d45904e82e1226ddeca257a2cd9d5075"/><project name="portsigar" path="portsigar" revision="1bc865e1622fb93a3fe0d1a4cdf18eb97ed9d600"/><project name="sigar" path="sigar" revision="63a3cd1b316d2d4aa6dd31ce8fc66101b983e0b0"/><project name="couchbase-examples" path="couchbase-examples" revision="21e6161a1d064979b5c6aa99cd34ccc41c9d7aca"/><project name="couchbase-python-client" path="couchbase-python-client" revision="86b398e4fbc1f2e38d356e14df0c1bb4e3d2427b"/><project name="couchdb" path="couchdb" revision="23cec9997b38ac82cab310b7560d01db529c1ae2"/><project name="couchdbx-app" path="couchdbx-app" revision="d196377b5b1ba3ce25f1b92066e2741898b01a1e"/><project 
name="couchstore" path="couchstore" revision="29579bd47f7c916c43116722b8f4962b4ea9fff0"/><project name="geocouch" path="geocouch" revision="b0bd742551639c52030c070e5bf9390edbb536ba"/><project name="mccouch" path="mccouch" revision="88701cc326bc3dde4ed072bb8441be83adcfb2a5"/><project name="testrunner" path="testrunner" revision="48fc95d4e1009d0f40a2c4e2e59448dc3e4fcad3"/><project name="otp" path="otp" revision="b6dc1a844eab061d0a7153d46e7e68296f15a504" remote="erlang"/><project name="icu4c" path="icu4c" revision="26359393672c378f41f2103a8699c4357c894be7" remote="couchbase"/><project name="snappy" path="snappy" revision="5681dde156e9d07adbeeab79666c9a9d7a10ec95" remote="couchbase"/><project name="v8" path="v8" revision="447decb75060a106131ab4de934bcc374648e7f2" remote="couchbase"/><project name="gperftools" path="gperftools" revision="8f60ba949fb8576c530ef4be148bff97106ddc59" remote="couchbase"/><project name="pysqlite" path="pysqlite" revision="0ff6e32ea05037fddef1eb41a648f2a2141009ea" remote="couchbase"/></manifest> logs: http://qa.hq.northscale.net/job/centos-64-2.0-view-query-tests/516/artifact/logs/testrunner-12-Oct-16_14-53-26/d1387940-7fbd-4e43-91ef-460736b2e37d-10.3.3.114-diag.txt.gz http://qa.hq.northscale.net/job/centos-64-2.0-view-query-tests/516/artifact/logs/testrunner-12-Oct-16_14-53-26/d1387940-7fbd-4e43-91ef-460736b2e37d-10.3.3.115-diag.txt.gz http://qa.hq.northscale.net/job/centos-64-2.0-view-query-tests/516/artifact/logs/testrunner-12-Oct-16_14-53-26/d1387940-7fbd-4e43-91ef-460736b2e37d-10.3.3.121-diag.txt.gz http://qa.hq.northscale.net/job/centos-64-2.0-view-query-tests/516/artifact/logs/testrunner-12-Oct-16_14-53-26/d1387940-7fbd-4e43-91ef-460736b2e37d-10.3.3.122-diag.txt.gz 2012-10-17 01:31:46.990 ns_orchestrator:2:info:message(ns_1@10.3.3.115) - Rebalance exited with reason {{{{badmatch, {error, {error, <<"Partition 672 not in active nor passive set">>}}}, [{capi_set_view_manager,handle_call,3}, {gen_server,handle_msg,5}, {gen_server,init_it,6}, 
{proc_lib,init_p_do_apply,3}]}, {gen_server,call, ['capi_set_view_manager-default', {wait_index_updated,672}, infinity]}}, {gen_server,call, [{'janitor_agent-default','ns_1@10.3.3.115'}, {if_rebalance,<0.7393.100>, {wait_index_updated,672}}, infinity]}} |
| Comment by Aleksey Kondratenko [ 18/Oct/12 ] |
|
Thanks for report. I managed to understand what happened by looking at |