[MB-7057] [system test] Rebalance failed with reason "Partition x not in active nor passive set" Created: 30/Oct/12  Updated: 16/Nov/12  Resolved: 13/Nov/12

Status: Closed
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.0
Fix Version/s: 2.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Thuan Nguyen Assignee: Chiyoung Seo
Resolution: Fixed Votes: 0
Labels: system-test
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: centos 6.2 64bit build 2.0.0-1908


 Description   
Cluster information:
- 8 CentOS 6.2 64-bit servers, each with a 4-core CPU
- Each server has 32 GB RAM and a 400 GB SSD disk.
- 24.8 GB RAM allocated to Couchbase Server on each node
- SSD disk formatted ext4, mounted on /data
- Each server has its own SSD drive; no disk sharing with other servers.
- Created a cluster of 6 nodes running Couchbase Server 2.0.0-1908
- Link to manifest file http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_x86_64_2.0.0-1908-rel.rpm.manifest.xml
- Cluster has 2 buckets: default (12 GB with 2 replicas) and saslbucket (12 GB with 1 replica).
- Each bucket has one design doc with 2 views per doc (default: d1, saslbucket: d11)

10.6.2.37
10.6.2.38
10.6.2.44
10.6.2.45
10.6.2.42
10.6.2.43

* Load 16 million items into the default bucket and 20 million items into saslbucket. Each item is 512 to 1024 bytes.
* After loading completes, wait for initial indexing. Disable view compaction.
* After initial indexing is done, mutate all items to sizes from 1024 to 1512 bytes.
* Query all 4 views from the 2 design docs.
* Do a swap rebalance: remove nodes 39 and 40, add nodes 44 and 45.
* At the end of rebalancing saslbucket, rebalance exited with a timeout on node 43.
* Then saw a lot of connection resets to mccouch. Updated bug MB-7046.
* Killed all load generators pointing at this cluster. Node 43 did not return to a stable state.
* Erlang on node 43 appeared to hang. Killed beam.smp and restarted Couchbase Server on node 43 (will update this bug soon).
* Waited for node 43 to come completely up, with disk write queue and drain rate down to 0.
* Started queries and item mutations on both buckets. Waited a few minutes, then started the swap rebalance.
* Rebalance failed with error:

2012-10-30 17:52:37.698 ns_rebalancer:0:info:message(ns_1@10.6.2.37) - Started rebalancing bucket saslbucket
2012-10-30 17:52:38.722 ns_vbucket_mover:0:info:message(ns_1@10.6.2.37) - Bucket "saslbucket" rebalance appears to be swap rebalance
2012-10-30 17:52:38.968 ns_rebalancer:0:info:message(ns_1@10.6.2.37) - Started rebalancing bucket default
2012-10-30 17:52:40.592 ns_vbucket_mover:0:info:message(ns_1@10.6.2.37) - Bucket "default" rebalance does not seem to be swap rebalance
2012-10-30 17:52:48.027 ns_vbucket_mover:0:critical:message(ns_1@10.6.2.37) - <0.4774.771> exited with {noproc,
                          {gen_server,call,
                           [{'janitor_agent-default','ns_1@10.6.2.37'},
                            {if_rebalance,<0.4746.771>,
                             {update_vbucket_state,61,active,paused,
                              undefined}},
                            infinity]}}

Looking at the diag log of node 37, we see the error: "Partition 857 not in active nor passive set"

[ns_server:error,2012-10-30T17:52:45.760,ns_1@10.6.2.37:<0.20186.0>:ns_memcached:verify_report_long_call:297]call topkeys took too long: 589040 us
[ns_server:error,2012-10-30T17:52:46.835,ns_1@10.6.2.37:<0.16711.0>:ns_memcached:verify_report_long_call:297]call topkeys took too long: 963763 us
[rebalance:info,2012-10-30T17:52:47.098,ns_1@10.6.2.37:<0.4781.771>:janitor_agent:set_vbucket_state:419]Doing vbucket 530 state change: {'ns_1@10.6.2.40',active,paused,undefined}
[rebalance:info,2012-10-30T17:52:47.153,ns_1@10.6.2.37:<0.4781.771>:janitor_agent:get_replication_persistence_checkpoint_id:470]default: Doing get_replication_persistence_checkpoint_id call for vbucket 530 on ns_1@10.6.2.40
[rebalance:debug,2012-10-30T17:52:47.162,ns_1@10.6.2.37:<0.6121.771>:janitor_agent:handle_call:651]Going to wait for persistence of checkpoint 44 in vbucket 530
[ns_server:debug,2012-10-30T17:52:47.203,ns_1@10.6.2.37:janitor_agent-default<0.16716.0>:janitor_agent:handle_info:682]Got done message from subprocess: <0.6121.771> (ok)
[rebalance:info,2012-10-30T17:52:47.206,ns_1@10.6.2.37:<0.6125.771>:janitor_agent:wait_index_updated:459]default: Doing wait_index_updated call for ns_1@10.6.2.37 (vbucket 530)
[ns_server:debug,2012-10-30T17:52:47.225,ns_1@10.6.2.37:<0.6127.771>:capi_set_view_manager:do_wait_index_updated:608]References to wait: [#Ref<0.0.3553.75720>] ("default", 530)
[ns_server:info,2012-10-30T17:52:47.482,ns_1@10.6.2.37:ns_config_rep<0.758.0>:ns_config_rep:do_pull:341]Pulling config from: 'ns_1@10.6.2.40'

[rebalance:info,2012-10-30T17:52:47.626,ns_1@10.6.2.37:<0.4779.771>:janitor_agent:set_vbucket_state:419]Doing vbucket 857 state change: {'ns_1@10.6.2.43',active,paused,undefined}
[rebalance:info,2012-10-30T17:52:47.709,ns_1@10.6.2.37:<0.4779.771>:janitor_agent:get_replication_persistence_checkpoint_id:470]default: Doing get_replication_persistence_checkpoint_id call for vbucket 857 on ns_1@10.6.2.43
[rebalance:debug,2012-10-30T17:52:47.739,ns_1@10.6.2.37:<0.6284.771>:janitor_agent:handle_call:651]Going to wait for persistence of checkpoint 42 in vbucket 857
[ns_server:debug,2012-10-30T17:52:47.753,ns_1@10.6.2.37:janitor_agent-default<0.16716.0>:janitor_agent:handle_info:682]Got done message from subprocess: <0.6284.771> (ok)
[rebalance:info,2012-10-30T17:52:47.758,ns_1@10.6.2.37:<0.6286.771>:janitor_agent:wait_index_updated:459]default: Doing wait_index_updated call for ns_1@10.6.2.37 (vbucket 857)
[ns_server:debug,2012-10-30T17:52:47.772,ns_1@10.6.2.37:<0.16695.0>:ns_pubsub:do_subscribe_link:132]Parent process of subscription {ns_config_events,<0.16694.0>} exited with reason {{badmatch,
                                                                                   {error,
                                                                                    {error,
                                                                                     <<"Partition 857 not in active nor passive set">>}}},
                                                                                  [{capi_set_view_manager,
                                                                                    handle_call,
                                                                                    3},
                                                                                   {gen_server,
                                                                                    handle_msg,
                                                                                    5},
                                                                                   {gen_server,
                                                                                    init_it,
                                                                                    6},
                                                                                   {proc_lib,
                                                                                    init_p_do_apply,
                                                                                    3}]}
[ns_server:debug,2012-10-30T17:52:47.772,ns_1@10.6.2.37:<0.6127.771>:capi_set_view_manager:do_wait_index_updated:621]Parent died: {'DOWN',#Ref<0.0.3553.75791>,process,<0.16694.0>,
                 {{badmatch,
                      {error,
                          {error,
                              <<"Partition 857 not in active nor passive set">>}}},
                  [{capi_set_view_manager,handle_call,3},
                   {gen_server,handle_msg,5},
                   {gen_server,init_it,6},
                   {proc_lib,init_p_do_apply,3}]}}
[error_logger:error,2012-10-30T17:52:47.775,ns_1@10.6.2.37:error_logger<0.5.0>:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
  crasher:
    initial call: erlang:apply/2
    pid: <0.6297.771>
    registered_name: []
    exception throw: {error,<<"Partition 857 not in active nor passive set">>}
      in function couch_set_view:monitor_partition_update/3
      in call from capi_set_view_manager:'-do_wait_index_updated/4-fun-0-'/4
      in call from lists:foldl/3
      in call from capi_set_view_manager:do_wait_index_updated/4
    ancestors: ['capi_set_view_manager-default','single_bucket_sup-default',
                  <0.16686.0>]
    messages: []
    links: [<0.16694.0>]
    dictionary: []
    trap_exit: false
    status: running
    heap_size: 6765
    stack_size: 24
    reductions: 229
  neighbours:

[ns_server:error,2012-10-30T17:52:47.778,ns_1@10.6.2.37:<0.4781.771>:ns_single_vbucket_mover:spawn_and_wait:81]Got unexpected exit signal {'EXIT',<0.6125.771>,
                            {{{{badmatch,
                                {error,
                                 {error,
                                  <<"Partition 857 not in active nor passive set">>}}},
                               [{capi_set_view_manager,handle_call,3},
                                {gen_server,handle_msg,5},
                                {gen_server,init_it,6},
                                {proc_lib,init_p_do_apply,3}]},
                              {gen_server,call,
                               ['capi_set_view_manager-default',
                                {wait_index_updated,857},
                                infinity]}},
                             {gen_server,call,
                              [{'janitor_agent-default','ns_1@10.6.2.37'},
                               {if_rebalance,<0.4746.771>,
                                {wait_index_updated,530}},
                               infinity]}}}
[ns_server:debug,2012-10-30T17:52:47.804,ns_1@10.6.2.37:<0.16705.0>:ns_pubsub:do_subscribe_link:132]Parent process of subscription {mc_couch_events,<0.16694.0>} exited with reason {{badmatch,
                                                                                  {error,
                                                                                   {error,
                                                                                    <<"Partition 857 not in active nor passive set">>}}},
                                                                                 [{capi_set_view_manager,
                                                                                   handle_call,
                                                                                   3},
                                                                                  {gen_server,
                                                                                   handle_msg,
                                                                                   5},
                                                                                  {gen_server,
                                                                                   init_it,
                                                                                   6},
                                                                                  {proc_lib,
                                                                                   init_p_do_apply,
                                                                                   3}]}
[ns_server:debug,2012-10-30T17:52:47.804,ns_1@10.6.2.37:<0.16704.0>:ns_pubsub:do_subscribe_link:132]Parent process of subscription {ns_node_disco_events,<0.16694.0>} exited with reason {{badmatch,
                                                                                       {error,
                                                                                        {error,
                                                                                         <<"Partition 857 not in active nor passive set">>}}},
                                                                                      [{capi_set_view_manager,
                                                                                        handle_call,
                                                                                        3},
                                                                                       {gen_server,
                                                                                        handle_msg,
                                                                                        5},
                                                                                       {gen_server,
                                                                                        init_it,
                                                                                        6},
                                                                                       {proc_lib,
                                                                                        init_p_do_apply,
                                                                                        3}]}
[ns_server:error,2012-10-30T17:52:47.808,ns_1@10.6.2.37:<0.4779.771>:ns_single_vbucket_mover:spawn_and_wait:81]Got unexpected exit signal {'EXIT',<0.6286.771>,
                            {{{{badmatch,
                                {error,
                                 {error,
                                  <<"Partition 857 not in active nor passive set">>}}},
                               [{capi_set_view_manager,handle_call,3},
                                {gen_server,handle_msg,5},
                                {gen_server,init_it,6},
                                {proc_lib,init_p_do_apply,3}]},
                              {gen_server,call,
                               ['capi_set_view_manager-default',
                                {wait_index_updated,857},
                                infinity]}},
                             {gen_server,call,
                              [{'janitor_agent-default','ns_1@10.6.2.37'},
                               {if_rebalance,<0.4746.771>,
                                {wait_index_updated,857}},
                               infinity]}}}
[rebalance:info,2012-10-30T17:52:47.822,ns_1@10.6.2.37:<0.5479.771>:ebucketmigrator_srv:do_confirm_sent_messages:671]Got close ack!

[rebalance:info,2012-10-30T17:52:47.827,ns_1@10.6.2.37:<0.5314.771>:ebucketmigrator_srv:do_confirm_sent_messages:671]Got close ack!

[rebalance:info,2012-10-30T17:52:47.835,ns_1@10.6.2.37:<0.4774.771>:janitor_agent:set_vbucket_state:419]Doing vbucket 61 state change: {'ns_1@10.6.2.37',active,paused,undefined}
[error_logger:error,2012-10-30T17:52:47.811,ns_1@10.6.2.37:error_logger<0.5.0>:ale_error_logger_handler:log_msg:76]** Generic server <0.16694.0> terminating
** Last message in was {wait_index_updated,857}
** When Server state == {state,"default",'capi_ddoc_replication_srv-default',
                               ['ns_1@10.6.2.38','ns_1@10.6.2.42',
                                'ns_1@10.6.2.43','ns_1@10.6.2.44',
                                'ns_1@10.6.2.45','ns_1@10.6.2.39',
                                'ns_1@10.6.2.40'],
                               [{doc,<<"_design/d1">>,
                                     {1,<<56,131,187,118>>},
                                     <<"{\"views\":{\"v1\":{\"map\":\"function(doc){\\nemit(doc.key,doc.key_num);\\n}\"},\"v2\":{\"map\":\"function(doc,meta){\\nemit(meta.id,doc.key);\\n}\"}}}">>,
                                     0,false,[]}],
                               1024,true,undefined,
                               [replica,missing,missing,missing,missing,
                                missing,missing,missing,missing,missing,
                                missing,missing,missing,missing,missing,
                                missing,missing,missing,missing,missing,
                                missing,missing,missing,missing,missing,
                                missing,missing,missing,missing,missing,
                                missing,missing,missing,missing,missing,
                                missing,replica,replica,replica,replica,
                                replica,replica,replica,replica,replica,
                                replica,replica,replica,replica,replica,
                                replica,replica,replica,replica,replica,
                                replica,replica,replica,replica,replica,
                                replica,active,active,active,active,active,
                                active,active,active,active,active,active,
                                active,active,active,active,active,active,
                                active,active,active,active,active,active,
                                active,active,active,active,active,active,
                                active,active,active,active,active,active,
*
* This bug is similar to bug MB-6955

 Comments   
Comment by Thuan Nguyen [ 30/Oct/12 ]
Link to collect info from all nodes https://s3.amazonaws.com/bugdb/jira/MB-7057/orange-ci-1908-reb-exited-noproc-update-vbucket-state-20121030.tgz
Comment by Aleksey Kondratenko [ 31/Oct/12 ]
Something weird happens with mccouch notifications.

Either there's some odd path in ns_server where the mc_couch_event subscription of capi_set_view_manager died without leaving evidence, which seems impossible. I'll add more logging to be doubly sure.

Or ep-engine stopped notifying us about vbucket state changes. The last vbucket state update notification we've seen is:

[views:debug,2012-10-30T7:51:23.336,ns_1@10.6.2.37:mc_couch_events<0.816.0>:capi_set_view_manager:handle_mc_couch_event:529]Got set_vbucket event for default/60. Updated state: replica (34)

That is many hours before the incident.

Comment by Chiyoung Seo [ 31/Oct/12 ]
This is another issue related to the notification path from couch-notifier to the Erlang mccouch.

I don't think it's a blocker for the 2.0 release, but, as we discussed, we should add more logging and stats to couch-notifier.
Comment by Dipti Borkar [ 31/Oct/12 ]
Is there a workaround?
Comment by Chiyoung Seo [ 31/Oct/12 ]
I saw lots of reconnections to mccouch on 10.6.2.43:

Tue Oct 30 11:15:16.140022 PDT 3: Trying to connect to mccouch: "localhost:11213"
Tue Oct 30 11:15:16.149731 PDT 3: Connected to mccouch: "localhost:11213"
Tue Oct 30 11:15:16.203999 PDT 3: Trying to connect to mccouch: "localhost:11213"
Tue Oct 30 11:15:16.204868 PDT 3: Connected to mccouch: "localhost:11213"
Tue Oct 30 11:18:16.331955 PDT 3: Trying to connect to mccouch: "localhost:11213"
Tue Oct 30 11:18:16.333388 PDT 3: Connected to mccouch: "localhost:11213"
Tue Oct 30 11:18:16.385129 PDT 3: Trying to connect to mccouch: "localhost:11213"
Tue Oct 30 11:18:16.385895 PDT 3: Connected to mccouch: "localhost:11213"
Tue Oct 30 11:21:16.516052 PDT 3: Trying to connect to mccouch: "localhost:11213"
Tue Oct 30 11:21:16.516758 PDT 3: Connected to mccouch: "localhost:11213"
Tue Oct 30 11:21:16.574155 PDT 3: Trying to connect to mccouch: "localhost:11213"
Tue Oct 30 11:21:16.575017 PDT 3: Connected to mccouch: "localhost:11213"
Tue Oct 30 11:29:05.108827 PDT 3: Trying to connect to mccouch: "localhost:11213"
Tue Oct 30 11:29:05.124885 PDT 3: Connected to mccouch: "localhost:11213"
Tue Oct 30 11:29:06.881861 PDT 3: Trying to connect to mccouch: "localhost:11213"
Tue Oct 30 11:29:06.882492 PDT 3: Connected to mccouch: "localhost:11213"
Tue Oct 30 11:29:07.715606 PDT 3: 72 Closing connection due to read error: Connection reset by peer
Tue Oct 30 11:29:07.722856 PDT 3: 55 Closing connection due to read error: Connection reset by peer
Tue Oct 30 11:29:08.312372 PDT 3: 54 Closing connection due to read error: Connection reset by peer
Tue Oct 30 11:32:07.940149 PDT 3: Trying to connect to mccouch: "localhost:11213"
Tue Oct 30 11:32:07.940722 PDT 3: Connected to mccouch: "localhost:11213"
Tue Oct 30 11:32:08.051819 PDT 3: Trying to connect to mccouch: "localhost:11213"
Tue Oct 30 11:32:08.052390 PDT 3: Connected to mccouch: "localhost:11213"
Tue Oct 30 11:35:08.146900 PDT 3: Trying to connect to mccouch: "localhost:11213"
Tue Oct 30 11:35:08.147425 PDT 3: Connected to mccouch: "localhost:11213"
Tue Oct 30 11:35:08.245869 PDT 3: Trying to connect to mccouch: "localhost:11213"
Tue Oct 30 11:35:08.246268 PDT 3: Connected to mccouch: "localhost:11213"
Tue Oct 30 11:38:08.342985 PDT 3: Trying to connect to mccouch: "localhost:11213"
Tue Oct 30 11:38:08.344139 PDT 3: Connected to mccouch: "localhost:11213"
Tue Oct 30 11:38:08.459484 PDT 3: Trying to connect to mccouch: "localhost:11213"
Tue Oct 30 11:38:08.460013 PDT 3: Connected to mccouch: "localhost:11213"
Tue Oct 30 11:41:08.543926 PDT 3: Trying to connect to mccouch: "localhost:11213"
Tue Oct 30 11:41:08.545034 PDT 3: Connected to mccouch: "localhost:11213"
Tue Oct 30 11:41:08.655689 PDT 3: Trying to connect to mccouch: "localhost:11213"
Tue Oct 30 11:41:08.657062 PDT 3: Connected to mccouch: "localhost:11213"
Tue Oct 30 11:44:08.739862 PDT 3: Trying to connect to mccouch: "localhost:11213"
Tue Oct 30 11:44:08.740471 PDT 3: Connected to mccouch: "localhost:11213"
Tue Oct 30 11:44:08.850699 PDT 3: Trying to connect to mccouch: "localhost:11213"
Tue Oct 30 11:44:08.851271 PDT 3: Connected to mccouch: "localhost:11213"
Tue Oct 30 11:47:08.923967 PDT 3: Trying to connect to mccouch: "localhost:11213"
Tue Oct 30 11:47:08.924588 PDT 3: Connected to mccouch: "localhost:11213"
Tue Oct 30 11:47:09.074499 PDT 3: Trying to connect to mccouch: "localhost:11213"
Tue Oct 30 11:47:09.075136 PDT 3: Connected to mccouch: "localhost:11213"
Tue Oct 30 11:50:09.114905 PDT 3: Trying to connect to mccouch: "localhost:11213"
Tue Oct 30 11:50:09.115417 PDT 3: Connected to mccouch: "localhost:11213"
...

We're going to add more logs to couch-notifier in ep-engine.
Comment by Farshid Ghods (Inactive) [ 31/Oct/12 ]
Chiyoung,

Not sure if this helps, but those messages are related to node .43, which crashed and was later restarted before rebalancing (for that crash we have http://www.couchbase.com/issues/browse/MB-7059).

Once the node was restarted and looked healthy, we started a new rebalance. So is the rebalance failure at 17:52 related to the errors from .43 from before the crash?
Comment by Steve Yen [ 01/Nov/12 ]
From the bug-scrub meeting: keeping this in 2.0 for now, to allow the ep-engine team to gather more diagnosis results.

see also MB-7046
Comment by Jin Lim [ 01/Nov/12 ]
Progress as of 11/1/12:

* Currently running the 4-2 node rebalance test adapted from testrunner's perf/reb-1-out.conf
* No issues so far
* System configuration: a 5-node CentOS cluster with a 4 GB memory quota
* 30 concurrent clients accessing data across 7 million items
* 10.5.2.[30...34]
Comment by Aleksey Kondratenko [ 02/Nov/12 ]
Another case of this seems to be MB-7080.

I'm seeing this:

[ns_server:info,2012-11-01T12:03:13.859,ns_1@10.3.3.115:<0.17056.36>:mc_connection:do_delete_vbucket:134]Deleting vbucket: default/677
[ns_server:info,2012-11-01T12:03:13.869,ns_1@10.3.3.115:<0.30102.39>:compaction_daemon:spawn_view_index_compactor:727]Finished compacting indexes for default/_design/test_view-8f45202/replica
[ns_server:debug,2012-11-01T12:03:13.869,ns_1@10.3.3.115:<0.30114.39>:compaction_daemon:file_needs_compaction:831]Estimated size for `default/_design/test_view-8df150f/main`: data 5733629, file 11915770
[ns_server:debug,2012-11-01T12:03:13.870,ns_1@10.3.3.115:<0.30114.39>:compaction_daemon:file_needs_compaction:844]Estimated fragmentation for `default/_design/test_view-8df150f/main`: 6182141 bytes/52%
[ns_server:info,2012-11-01T12:03:13.870,ns_1@10.3.3.115:<0.30114.39>:compaction_daemon:spawn_view_index_compactor:711]Compacting indexes for default/_design/test_view-8df150f/main
[ns_server:debug,2012-11-01T12:03:14.093,ns_1@10.3.3.115:janitor_agent-default<0.17064.36>:janitor_agent:handle_info:682]Got done message from subprocess: <0.30004.39> (ok)
[rebalance:info,2012-11-01T12:03:14.093,ns_1@10.3.3.115:<0.29783.39>:janitor_agent:set_vbucket_state:419]Doing vbucket 677 state change: {'ns_1@10.3.3.121',active,paused,undefined}
[rebalance:info,2012-11-01T12:03:14.202,ns_1@10.3.3.115:<0.29783.39>:janitor_agent:get_replication_persistence_checkpoint_id:470]default: Doing get_replication_persistence_checkpoint_id call for vbucket 677 on ns_1@10.3.3.121
[rebalance:debug,2012-11-01T12:03:14.209,ns_1@10.3.3.115:<0.30131.39>:janitor_agent:handle_call:651]Going to wait for persistence of checkpoint 2 in vbucket 677
[ns_server:debug,2012-11-01T12:03:14.216,ns_1@10.3.3.115:janitor_agent-default<0.17064.36>:janitor_agent:handle_info:682]Got done message from subprocess: <0.30131.39> (ok)
[rebalance:info,2012-11-01T12:03:14.216,ns_1@10.3.3.115:<0.30133.39>:janitor_agent:wait_index_updated:459]default: Doing wait_index_updated call for ns_1@10.3.3.115 (vbucket 677)
[error_logger:error,2012-11-01T12:03:14.217,ns_1@10.3.3.115:error_logger<0.5.0>:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
  crasher:
    initial call: erlang:apply/2
    pid: <0.30135.39>
    registered_name: []
    exception throw: {error,<<"Partition 677 not in active nor passive set">>}
      in function couch_set_view:monitor_partition_update/3
      in call from capi_set_view_manager:'-do_wait_index_updated/4-fun-0-'/4
      in call from lists:foldl/3
      in call from capi_set_view_manager:do_wait_index_updated/4
    ancestors: ['capi_set_view_manager-default','single_bucket_sup-default',
                  <0.17038.36>]
    messages: []
    links: [<0.17040.36>]
    dictionary: []
    trap_exit: false
    status: running
    heap_size: 377
    stack_size: 24
    reductions: 214
  neighbours:


I.e., we're observing a delete-vbucket request for this vbucket from ep-engine via mccouch.

Less than 1 second later we send CMD_CHECKPOINT_PERSISTENCE for this vbucket and receive a successful reply.

At this point we assume the vbucket exists on disk and that its existence has already been signalled to mccouch. This always happens via a vbucket state change notification, or at least it used to.

We can see in these logs that this assumption doesn't hold for some reason: we receive the reply without first receiving the mccouch notification, which we always log.

That missing mccouch vbucket state change notification causes the subsequent view request to fail.

Another, perhaps unrelated, issue is that the deletion of the vbucket (which we assume is caused by an incoming backfill) is unexpected at this stage. ns_server explicitly does _two_ checkpoint waits so that vbucket deletion by an incoming backfill can only happen during the first wait, and cannot affect and negate the wait for the index update done during the second.
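The ordering assumption described above can be sketched as a toy model (illustrative Python only, not Couchbase code; all class and method names here are made up, loosely mirroring couch_set_view's active/passive partition sets):

```python
# Toy model of the invariant this bug violates: before ns_server waits
# for an index update on a vbucket, the view engine must have learned
# (via a vbucket state change notification through mccouch) that the
# partition is active or passive. If that notification is lost, the
# wait throws the error seen in the logs.

class SetView:
    """Minimal stand-in for the view engine's per-partition state."""
    def __init__(self):
        self.active = set()
        self.passive = set()

    def handle_state_change(self, vbucket, state):
        # Normally driven by an mccouch set_vbucket notification.
        self.active.discard(vbucket)
        self.passive.discard(vbucket)
        if state == "active":
            self.active.add(vbucket)
        elif state in ("replica", "pending"):
            self.passive.add(vbucket)
        # any other state (e.g. deletion) leaves it in neither set

    def monitor_partition_update(self, vbucket):
        if vbucket not in self.active and vbucket not in self.passive:
            raise RuntimeError(
                f"Partition {vbucket} not in active nor passive set")
        return "monitoring"


view = SetView()

# Happy path: the state-change notification arrives before the wait.
view.handle_state_change(857, "active")
assert view.monitor_partition_update(857) == "monitoring"

# Failure path from this bug: the vbucket is deleted (removing it from
# both sets), the checkpoint-persistence reply still succeeds, but no
# fresh state-change notification is delivered before the index wait.
view.handle_state_change(857, "deleted")
try:
    view.monitor_partition_update(857)
except RuntimeError as e:
    print(e)  # Partition 857 not in active nor passive set
```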
Comment by Steve Yen [ 02/Nov/12 ]
Raising to blocker, per the bug-scrub meeting, as QE is apparently seeing this more often.
Comment by Jin Lim [ 03/Nov/12 ]
The latest debugging update can be found on MB-7046
Comment by Steve Yen [ 05/Nov/12 ]
chiyoung diagnosing
Comment by Chiyoung Seo [ 05/Nov/12 ]
Found a potential race between vbucket creation and deletion and pushed a fix:

http://review.couchbase.org/#/c/22286/

I will ask the QE team to test it after the change is merged.
Comment by Thuan Nguyen [ 05/Nov/12 ]
Integrated in github-ep-engine-2-0 #457 (See [http://qa.hq.northscale.net/job/github-ep-engine-2-0/457/])
    MB-7057 Schedule VB deletion first and snapshotVBs for resetVB() (Revision 5d1023d9479d0169c97fb3c8f60551bf66569231)

     Result = SUCCESS
Chiyoung Seo :
Files :
* src/ep.cc
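The commit message above ("Schedule VB deletion first and snapshotVBs for resetVB()") suggests the fix is an ordering change. A rough, purely illustrative sketch of why such ordering can matter (invented function names, not ep-engine code):

```python
# If the on-disk vbucket deletion is queued *after* the state snapshot,
# it can race with and wipe out the freshly persisted state; queuing
# the deletion first and snapshotting afterwards leaves the post-reset
# state on disk. Modeled here as sequential operations on a dict
# standing in for persistent storage.

def run(ops, disk):
    """Apply queued operations to the 'disk' in order."""
    for op in ops:
        op(disk)
    return disk

def delete_vbucket(disk):
    disk.pop("vb_857_state", None)   # drop the persisted state record

def snapshot_state(disk):
    disk["vb_857_state"] = "replica" # persist the post-reset state

# Buggy ordering: snapshot first, deletion second -> state is lost.
assert "vb_857_state" not in run([snapshot_state, delete_vbucket], {})

# Fixed ordering: deletion first, snapshot second -> state survives.
assert run([delete_vbucket, snapshot_state], {})["vb_857_state"] == "replica"
```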
Comment by Chiyoung Seo [ 08/Nov/12 ]
I think I finally figured out this issue and pushed the fix into gerrit for review:

http://review.couchbase.org/#/c/22357/
Comment by Thuan Nguyen [ 09/Nov/12 ]
Integrated in github-ep-engine-2-0 #458 (See [http://qa.hq.northscale.net/job/github-ep-engine-2-0/458/])
    MB-7057 Notify resetVBucket to the mccouch as VB_STATE_CHANGE (Revision 45deae0f6e1564006b9c42b3ac7d2feaa891ae9c)

     Result = SUCCESS
Chiyoung Seo :
Files :
* src/couch-kvstore/couch-kvstore.hh
Comment by Farshid Ghods (Inactive) [ 09/Nov/12 ]
1946+ builds have this fix
Comment by Thuan Nguyen [ 12/Nov/12 ]
Hit this bug again in build 2.0.0-1946, which had change Ib8e15218: MB-7057 Notify resetVBucket to the mccouch as VB_STATE_CHANGE.

I will upload the collected info from all nodes to S3 and update the link soon.
Comment by Thuan Nguyen [ 12/Nov/12 ]
Created a 6-node cluster:
10.6.2.37
10.6.2.38
10.6.2.44
10.6.2.45
10.6.2.42
10.6.2.43

Loaded 40 million items into the default bucket, pushing the resident ratio down to around 62% (as in Pavel's workload).

Added 2 nodes and rebalanced.
Rebalance failed (as expected) because the network on node 43 was disconnected.

Restarted the network on node 43 and rebalanced again.
Rebalance failed with error:

** Reason for termination ==
** {{badmatch,{error,{error,<<"Partition 534 not in active nor passive set">>}}},
    [{capi_set_view_manager,handle_call,3},
     {gen_server,handle_msg,5},
     {gen_server,init_it,6},
     {proc_lib,init_p_do_apply,3}]}

[error_logger:error,2012-11-10T8:44:42.090,ns_1@10.6.2.37:error_logger<0.5.0>:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
  crasher:
    initial call: capi_set_view_manager:init/1
    pid: <0.20742.0>
    registered_name: []
    exception exit: {{badmatch,
                         {error,
                             {error,
                                 <<"Partition 534 not in active nor passive set">>}}},
                     [{capi_set_view_manager,handle_call,3},
                      {gen_server,handle_msg,5},
                      {gen_server,init_it,6},
                      {proc_lib,init_p_do_apply,3}]}
      in function gen_server:init_it/6
    ancestors: ['single_bucket_sup-default',<0.20726.0>]
    messages: [{'EXIT',<0.20744.0>,killed}]
    links: [<0.20753.0>,<0.20754.0>,<0.20743.0>,<0.20727.0>]
    dictionary: [{ddoc_replication_proxy,<0.20744.0>}]
    trap_exit: true
    status: running
    heap_size: 6765
    stack_size: 24
    reductions: 37587846
  neighbours:

[error_logger:error,2012-11-10T8:44:42.092,ns_1@10.6.2.37:error_logger<0.5.0>:ale_error_logger_handler:log_msg:76]Supervisor received unexpected message: {ack,<0.20742.0>,
                                         {error,
                                          {{badmatch,
                                            {error,
                                             {error,
                                              <<"Partition 534 not in active nor passive set">>}}},
                                           [{capi_set_view_manager,
                                             handle_call,3},
                                            {gen_server,handle_msg,5},
                                            {gen_server,init_it,6},
                                            {proc_lib,init_p_do_apply,3}]}}}

[error_logger:error,2012-11-10T8:44:42.093,ns_1@10.6.2.37:error_logger<0.5.0>:ale_error_logger_handler:log_report:72]
=========================SUPERVISOR REPORT=========================
     Supervisor: {local,'single_bucket_sup-default'}
     Context: child_terminated
     Reason: {{badmatch,
                      {error,
                          {error,
                              <<"Partition 534 not in active nor passive set">>}}},
                  [{capi_set_view_manager,handle_call,3},
                   {gen_server,handle_msg,5},
                   {gen_server,init_it,6},
                   {proc_lib,init_p_do_apply,3}]}
     Offender: [{pid,<0.20742.0>},
                  {name,{capi_set_view_manager,"default"}},
                  {mfargs,{capi_set_view_manager,start_link,["default"]}},
                  {restart_type,permanent},
                  {shutdown,1000},
                  {child_type,worker}]


[error_logger:error,2012-11-10T8:44:42.094,ns_1@10.6.2.37:error_logger<0.5.0>:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
  crasher:
    initial call: janitor_agent:-spawn_rebalance_subprocess/3-fun-0-/0
    pid: <0.199.981>
    registered_name: []
    exception exit: {{{badmatch,
                          {error,
                              {error,
                                  <<"Partition 534 not in active nor passive set">>}}},
                      [{capi_set_view_manager,handle_call,3},
                       {gen_server,handle_msg,5},
                       {gen_server,init_it,6},
                       {proc_lib,init_p_do_apply,3}]},
                     {gen_server,call,
                         ['capi_set_view_manager-default',
                          {wait_index_updated,534},
                          infinity]}}
      in function gen_server:call/3
      in call from janitor_agent:'-spawn_rebalance_subprocess/3-fun-0-'/2
    ancestors: ['janitor_agent-default','single_bucket_sup-default',
                  <0.20726.0>]
    messages: []
    links: [<0.20764.0>]
    dictionary: []
    trap_exit: false
    status: running
    heap_size: 233
    stack_size: 24
    reductions: 93
  neighbours:
    neighbour: [{pid,<0.20764.0>},
                  {registered_name,'janitor_agent-default'},
                  {initial_call,{janitor_agent,init,['Argument__1']}},
                  {current_function,{gen_server,loop,6}},
                  {ancestors,['single_bucket_sup-default',<0.20726.0>]},
                  {messages,[]},
                  {links,[<0.20727.0>,<0.199.981>]},
                  {dictionary,[]},
                  {trap_exit,false},
                  {status,waiting},
                  {heap_size,17711},
                  {stack_size,9},
                  {reductions,124638767}]

[error_logger:error,2012-11-10T8:44:42.095,ns_1@10.6.2.37:error_logger<0.5.0>:ale_error_logger_handler:log_report:72]
=========================SUPERVISOR REPORT=========================
     Supervisor: {local,'single_bucket_sup-default'}
     Context: child_terminated
     Reason: {{{badmatch,
                       {error,
                           {error,
                               <<"Partition 534 not in active nor passive set">>}}},
                   [{capi_set_view_manager,handle_call,3},
                    {gen_server,handle_msg,5},
                    {gen_server,init_it,6},
                    {proc_lib,init_p_do_apply,3}]},
                  {gen_server,call,
                      ['capi_set_view_manager-default',
                       {wait_index_updated,534},
                       infinity]}}
     Offender: [{pid,<0.20764.0>},
                  {name,{janitor_agent,"default"}},
                  {mfargs,{janitor_agent,start_link,["default"]}},
                  {restart_type,permanent},
                  {shutdown,brutal_kill},
                  {child_type,worker}]


[rebalance:error,2012-11-10T8:44:42.720,ns_1@10.6.2.37:<0.28383.980>:ns_vbucket_mover:handle_info:252]<0.28409.980> exited with {badmatch,wrong_rebalancer_pid}
[ns_server:error,2012-11-10T8:44:42.721,ns_1@10.6.2.37:<0.28407.980>:ns_single_vbucket_mover:spawn_and_wait:81]Got unexpected exit signal {'EXIT',<0.28383.980>,
                                   {badmatch,wrong_rebalancer_pid}}
[ns_server:error,2012-11-10T8:44:42.721,ns_1@10.6.2.37:<0.28411.980>:ns_single_vbucket_mover:spawn_and_wait:81]Got unexpected exit signal {'EXIT',<0.28383.980>,
                                   {badmatch,wrong_rebalancer_pid}}
[error_logger:error,2012-11-10T8:44:42.722,ns_1@10.6.2.37:error_logger<0.5.0>:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
  crasher:
    initial call: ns_single_vbucket_mover:mover/6
    pid: <0.28409.980>
    registered_name: []
    exception error: no match of right hand side value wrong_rebalancer_pid
      in function janitor_agent:set_vbucket_state/7
      in call from ns_single_vbucket_mover:mover_inner/6
      in call from misc:try_with_maybe_ignorant_after/2
      in call from ns_single_vbucket_mover:mover/6
    ancestors: [<0.28383.980>,<0.26867.980>]
    messages: []
    links: [<0.28383.980>]
    dictionary: [{cleanup_list,[<0.28746.980>]}]
    trap_exit: true
    status: running
    heap_size: 987
    stack_size: 24
    reductions: 4645
  neighbours:


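The `{badmatch, {error, ...}}` exceptions in the reports above follow the common Erlang idiom of asserting success by pattern-matching a call's result against `ok`: when the set-view layer instead returns `{error, <<"Partition 534 not in active nor passive set">>}`, the match fails and the gen_server process crashes, which the supervisor then reports. A minimal illustrative sketch of that failure mode (hypothetical names; this is not the actual `capi_set_view_manager` source):

```erlang
%% Hypothetical sketch of the crash pattern, not the real ns_server code.
handle_call({wait_index_updated, VBucket}, _From, State) ->
    %% If VBucket is in neither the active nor the passive set of the
    %% set-view group, wait_for_index/2 (an assumed helper) returns
    %% {error, Reason}. The `ok = ...` match then raises badmatch,
    %% taking the process down and triggering the supervisor report.
    ok = wait_for_index(State, VBucket),
    {reply, ok, State}.
```

The subsequent `{badmatch,wrong_rebalancer_pid}` crashes are the downstream effect: once the vbucket mover's peer process dies, `janitor_agent:set_vbucket_state/7` no longer sees the expected rebalancer pid and fails the same way.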
Re-running the rebalance failed again with the same error "Partition x not in active nor passive set".

Link to manifest file of build 2.0.0-1946 http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_x86_64_2.0.0-1947-rel.rpm.manifest.xml

Link to collect info of all nodes https://s3.amazonaws.com/packages.couchbase/collect_info/orange/2_0_0/201211/8nodes-ci-1946-reb-failed-Partition-not-in-active-nor-passive-set-20121112-115625.tgz

Comment by Dipti Borkar [ 12/Nov/12 ]
Chiyoung, it seems we reproduced this with the patch. Can you please take a look? Thanks
Comment by Chiyoung Seo [ 12/Nov/12 ]
The one-line fix was just pushed for review:

http://review.couchbase.org/#/c/22468/

Comment by Chiyoung Seo [ 13/Nov/12 ]
The fix was reviewed and merged.
Comment by Thuan Nguyen [ 13/Nov/12 ]
Integrated in github-ep-engine-2-0 #459 (See [http://qa.hq.northscale.net/job/github-ep-engine-2-0/459/])
    MB-7057 Notify vbucket reset event to mccouch (Revision edcb70c52b9af84561fb1a9ed7bb85611fadc8b7)
MB-7057 Fix to the bug in setting vb_state_change message type (Revision a8db69fa759d333246e690c52ef9497c858012ef)

     Result = SUCCESS
Chiyoung Seo :
Files :
* src/couch-kvstore/couch-kvstore.hh

Chiyoung Seo :
Files :
* src/couch-kvstore/couch-kvstore.cc
Comment by Farshid Ghods (Inactive) [ 13/Nov/12 ]
Build #1949 has this fix.
Comment by Iryna Mironava [ 16/Nov/12 ]
Tested with build 1950.
Generated at Mon Sep 01 07:21:35 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.