[RN 2.0.1]][longevity] something unknown is causing severe timeouts in ns_server. Particularly under views building and/or compaction. Which causes rebalance to fail and other types of badness. (MB-6595)

[MB-7225] [system test] rebalance failed with erorr "wait_checkpoint_persisted_failed" due to sync_shutdown_many_i_am_trapping_exits Created: 20/Nov/12  Updated: 14/Aug/13  Resolved: 14/Aug/13

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.0
Fix Version/s: 3.0
Security Level: Public

Type: Technical task Priority: Major
Reporter: Thuan Nguyen Assignee: Jin Lim
Resolution: Fixed Votes: 0
Labels: system-test
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: windows 2008 R2 64bit with build 2.0.0-1952


 Description   
Environment:
8 windows server 2008 R2 64 bit in ec2. Each server has 7.5 GB RAM and 45GB EBS disk to store data.
Create a 6 nodes windows cluster in ec2

Loading phase:
* Create one default bucket and load 24 million items to it. Each item has size from 128 to 512 bytes.

Access phase:
* Mutate items with ratio creates/updates/gets/deletes/expirations = 10/60/20/5/5

* Doing rebalance add node, remove node, failover, failover and add back. The last relalance failed when doing remove a node and failover another node at the same time. Then rebalance.
* Rebalance failed with error "sync_shutdown_many_i_am_trapping_exits"

rebalance:debug,2012-11-20T5:02:22.275,ns_1@10.110.206.19:<0.22405.6>:janitor_agent:do_wait_checkpoint_persisted:828]Got etmpfail waiting for checkpoint persistence. Will try again
[ns_server:info,2012-11-20T5:02:22.478,ns_1@10.110.206.19:ns_port_memcached<0.2135.0>:ns_port_server:log:171]memcached<0.2135.0>: Tue Nov 20 05:02:22.284363 GMT Standard Time 3: Notified the timeout on checkpoint persistence for vbucket 881, cookie 0000000005A16B00

[ns_server:info,2012-11-20T5:02:23.585,ns_1@10.110.206.19:<0.22552.6>:compaction_daemon:spawn_vbucket_compactor:639]Compacting <<"default/345">>
[ns_server:info,2012-11-20T5:02:31.370,ns_1@10.110.206.19:<0.10210.2>:ns_orchestrator:handle_info:282]Skipping janitor in state rebalancing: {rebalancing_state,<0.12789.2>,
                                        {dict,7,16,16,8,80,48,
                                         {[],[],[],[],[],[],[],[],[],[],[],[],
                                          [],[],[],[]},
                                         {{[],[],[],[],[],
                                           [['ns_1@10.111.66.215'|1.0],
                                            ['ns_1@10.214.194.95'|
                                             0.9067357512953368]],
                                           [['ns_1@10.96.107.176'|
                                             0.6649484536082475],
                                            ['ns_1@10.110.222.56'|
                                             0.37058823529411766]],
                                           [],
                                           [['ns_1@10.46.181.58'|
                                             0.34705882352941175]],
                                           [['ns_1@10.2.63.249'|
                                             0.9642857142857143],
                                            ['ns_1@10.110.206.19'|1.0]],
                                           [],[],[],[],[],[]}}},
                                        ['ns_1@10.2.63.249',
                                         'ns_1@10.96.107.176',
                                         'ns_1@10.110.206.19',
                                         'ns_1@10.111.66.215',
                                         'ns_1@10.214.194.95',
                                         'ns_1@10.110.222.56'],
                                        ['ns_1@10.46.181.58'],
                                        []}
[ns_server:debug,2012-11-20T5:02:34.037,ns_1@10.110.206.19:janitor_agent-default<0.2184.0>:janitor_agent:handle_info:682]Got done message from subprocess: <0.22405.6> (ok)
[ns_server:info,2012-11-20T5:02:34.256,ns_1@10.110.206.19:ns_port_memcached<0.2135.0>:ns_port_server:log:171]memcached<0.2135.0>: Tue Nov 20 05:02:34.015714 GMT Standard Time 3: Notified the completion of checkpoint persistence for vbucket 881, cookie 0000000005A16B00

[ns_server:error,2012-11-20T5:02:35.847,ns_1@10.110.206.19:ns_doctor<0.2076.0>:ns_doctor:update_status:205]The following buckets became not ready on node 'ns_1@10.111.66.215': ["default"], those of them are active []
[ns_server:debug,2012-11-20T5:02:36.003,ns_1@10.110.206.19:capi_set_view_manager-default<0.2157.0>:capi_set_view_manager:handle_info:349]doing replicate_newnodes_docs
[ns_server:debug,2012-11-20T5:02:36.034,ns_1@10.110.206.19:xdc_rdoc_replication_srv<0.2167.0>:xdc_rdoc_replication_srv:handle_info:132]doing replicate_newnodes_docs
[ns_server:debug,2012-11-20T5:02:36.034,ns_1@10.110.206.19:xdc_rdoc_replication_srv<0.2167.0>:xdc_rdoc_replication_srv:replicate_change_to_node:160]Sending _design/_replicator_info to ns_1@10.111.66.215
[ns_server:error,2012-11-20T5:02:38.749,ns_1@10.110.206.19:<0.20550.6>:misc:sync_shutdown_many_i_am_trapping_exits:1408]Shutdown of the following failed: [{<15693.31507.5>,killed}]
[ns_server:info,2012-11-20T5:02:38.749,ns_1@10.110.206.19:<0.20550.6>:ns_replicas_builder_utils:kill_a_bunch_of_tap_names:59]Killed the following tap names on 'ns_1@10.214.194.95': [<<"replication_building_916_'ns_1@10.96.107.176'">>,
                                                         <<"replication_building_916_'ns_1@10.111.66.215'">>]
[error_logger:error,2012-11-20T5:02:38.749,ns_1@10.110.206.19:error_logger<0.5.0>:ale_error_logger_handler:log_msg:76]** Generic server <0.20550.6> terminating
** Last message in was {'EXIT',<0.20543.6>,shutdown}
** When Server state == {state,"default",916,'ns_1@10.214.194.95',
                               [{'ns_1@10.96.107.176',<15680.9900.6>},
                                {'ns_1@10.111.66.215',<15693.31507.5>}]}
** Reason for termination ==
** {{badmatch,[{<15693.31507.5>,killed}]},
    [{misc,sync_shutdown_many_i_am_trapping_exits,1},
     {misc,try_with_maybe_ignorant_after,2},
     {gen_server,terminate,6},
     {proc_lib,init_p_do_apply,3}]}

[ns_server:error,2012-11-20T5:02:38.749,ns_1@10.110.206.19:<0.20543.6>:misc:sync_shutdown_many_i_am_trapping_exits:1408]Shutdown of the following failed: [{<0.20550.6>,
                                    {{badmatch,[{<15693.31507.5>,killed}]},
                                     [{misc,
                                       sync_shutdown_many_i_am_trapping_exits,
                                       1},
                                      {misc,try_with_maybe_ignorant_after,2},
                                      {gen_server,terminate,6},
                                      {proc_lib,init_p_do_apply,3}]}},
                                   {<0.22253.6>,
                                    {{wait_checkpoint_persisted_failed,
                                      "default",916,249,
                                      [{'ns_1@10.111.66.215',
                                        {'EXIT',
                                         {noproc,
                                          {gen_server,call,
                                           [{'janitor_agent-default',
                                             'ns_1@10.111.66.215'},
                                            {if_rebalance,<0.12877.2>,
                                             {wait_checkpoint_persisted,916,
                                              249}},
                                            infinity]}}}}]},
                                     [{ns_single_vbucket_mover,
                                       '-wait_checkpoint_persisted_many/5-fun-1-',
                                       5}]}}]
[ns_server:error,2012-11-20T5:02:38.749,ns_1@10.110.206.19:<0.20543.6>:misc:try_with_maybe_ignorant_after:1444]Eating exception from ignorant after-block:
{error,
    {badmatch,
        [{<0.20550.6>,
          {{badmatch,[{<15693.31507.5>,killed}]},
           [{misc,sync_shutdown_many_i_am_trapping_exits,1},
            {misc,try_with_maybe_ignorant_after,2},
            {gen_server,terminate,6},
            {proc_lib,init_p_do_apply,3}]}},
         {<0.22253.6>,
          {{wait_checkpoint_persisted_failed,"default",916,249,
               [{'ns_1@10.111.66.215',
                    {'EXIT',
                        {noproc,
                            {gen_server,call,
                                [{'janitor_agent-default',
                                     'ns_1@10.111.66.215'},
                                 {if_rebalance,<0.12877.2>,
                                     {wait_checkpoint_persisted,916,249}},
                                 infinity]}}}}]},
           [{ns_single_vbucket_mover,
                '-wait_checkpoint_persisted_many/5-fun-1-',5}]}}]},
    [{misc,sync_shutdown_many_i_am_trapping_exits,1},
     {misc,try_with_maybe_ignorant_after,2},
     {ns_single_vbucket_mover,mover,6},
     {proc_lib,init_p_do_apply,3}]}
[rebalance:error,2012-11-20T5:02:38.749,ns_1@10.110.206.19:<0.12877.2>:ns_vbucket_mover:handle_info:252]<0.20543.6> exited with {unexpected_exit,
                         {'EXIT',<0.22253.6>,
                          {{wait_checkpoint_persisted_failed,"default",916,
                            249,
                            [{'ns_1@10.111.66.215',
                              {'EXIT',
                               {noproc,
                                {gen_server,call,
                                 [{'janitor_agent-default',
                                   'ns_1@10.111.66.215'},
                                  {if_rebalance,<0.12877.2>,
                                   {wait_checkpoint_persisted,916,249}},
                                  infinity]}}}}]},
                           [{ns_single_vbucket_mover,
                             '-wait_checkpoint_persisted_many/5-fun-1-',5}]}}}
[ns_server:error,2012-11-20T5:02:38.749,ns_1@10.110.206.19:<0.22383.6>:ns_single_vbucket_mover:spawn_and_wait:81]Got unexpected exit signal {'EXIT',<0.12877.2>,
                            {unexpected_exit,
                             {'EXIT',<0.22253.6>,
                              {{wait_checkpoint_persisted_failed,"default",
                                916,249,
                                [{'ns_1@10.111.66.215',
                                  {'EXIT',
                                   {noproc,
                                    {gen_server,call,
                                     [{'janitor_agent-default',
                                       'ns_1@10.111.66.215'},
                                      {if_rebalance,<0.12877.2>,
                                       {wait_checkpoint_persisted,916,249}},
                                      infinity]}}}}]},
                               [{ns_single_vbucket_mover,
                                 '-wait_checkpoint_persisted_many/5-fun-1-',
                                 5}]}}}}
[ns_server:debug,2012-11-20T5:02:38.749,ns_1@10.110.206.19:<0.22388.6>:ebucketmigrator_srv:confirm_sent_messages:710]Going to wait for reception of opaque message ack
[ns_server:debug,2012-11-20T5:02:38.749,ns_1@10.110.206.19:<0.22664.6>:ebucketmigrator_srv:confirm_sent_messages:705]Sending opaque message to confirm downstream reception
[ns_server:debug,2012-11-20T5:02:38.749,ns_1@10.110.206.19:<0.22664.6>:ebucketmigrator_srv:confirm_sent_messages:707]Sent fine
[ns_server:error,2012-11-20T5:02:38.749,ns_1@10.110.206.19:<0.22351.6>:ns_single_vbucket_mover:spawn_and_wait:81]Got unexpected exit signal {'EXIT',<0.12877.2>,
                            {unexpected_exit,
                             {'EXIT',<0.22253.6>,
                              {{wait_checkpoint_persisted_failed,"default",
                                916,249,
                                [{'ns_1@10.111.66.215',
                                  {'EXIT',
                                   {noproc,
                                    {gen_server,call,
                                     [{'janitor_agent-default',
                                       'ns_1@10.111.66.215'},
                                      {if_rebalance,<0.12877.2>,
                                       {wait_checkpoint_persisted,916,249}},
                                      infinity]}}}}]},
                               [{ns_single_vbucket_mover,
                                 '-wait_checkpoint_persisted_many/5-fun-1-',
                                 5}]}}}}
[ns_server:error,2012-11-20T5:02:38.749,ns_1@10.110.206.19:<0.22475.6>:ns_single_vbucket_mover:spawn_and_wait:81]Got unexpected exit signal {'EXIT',<0.12877.2>,
                            {unexpected_exit,
                             {'EXIT',<0.22253.6>,
                              {{wait_checkpoint_persisted_failed,"default",
                                916,249,
                                [{'ns_1@10.111.66.215',
                                  {'EXIT',
                                   {noproc,
                                    {gen_server,call,
                                     [{'janitor_agent-default',
                                       'ns_1@10.111.66.215'},
                                      {if_rebalance,<0.12877.2>,
                                       {wait_checkpoint_persisted,916,249}},
                                      infinity]}}}}]},
                               [{ns_single_vbucket_mover,
                                 '-wait_checkpoint_persisted_many/5-fun-1-',
                                 5}]}}}}
[rebalance:info,2012-11-20T5:02:38.764,ns_1@10.110.206.19:<0.22388.6>:ebucketmigrator_srv:do_confirm_sent_messages:684]Got close ack!

[error_logger:error,2012-11-20T5:02:38.764,ns_1@10.110.206.19:error_logger<0.5.0>:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
  crasher:
    initial call: new_ns_replicas_builder:init/1
    pid: <0.20550.6>
    registered_name: []
    exception exit: {{badmatch,[{<15693.31507.5>,killed}]},
                     [{misc,sync_shutdown_many_i_am_trapping_exits,1},
                      {misc,try_with_maybe_ignorant_after,2},
                      {gen_server,terminate,6},
                      {proc_lib,init_p_do_apply,3}]}
      in function gen_server:terminate/6
    ancestors: [<0.20543.6>,<0.12877.2>,<0.12789.2>]
    messages: []
    links: [<0.20543.6>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 317811
    stack_size: 24
    reductions: 32264
  neighbours:

[error_logger:error,2012-11-20T5:02:38.764,ns_1@10.110.206.19:error_logger<0.5.0>:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
  crasher:
    initial call: ns_single_vbucket_mover:mover/6
    pid: <0.20543.6>
    registered_name: []
    exception exit: {unexpected_exit,
                     {'EXIT',<0.22253.6>,
                      {{wait_checkpoint_persisted_failed,"default",916,249,
                        [{'ns_1@10.111.66.215',
                          {'EXIT',
                           {noproc,
                            {gen_server,call,
                             [{'janitor_agent-default','ns_1@10.111.66.215'},
                              {if_rebalance,<0.12877.2>,
                               {wait_checkpoint_persisted,916,249}},
                              infinity]}}}}]},
                       [{ns_single_vbucket_mover,
                         '-wait_checkpoint_persisted_many/5-fun-1-',5}]}}}
      in function ns_single_vbucket_mover:spawn_and_wait/1
      in call from ns_single_vbucket_mover:mover_inner/6
      in call from misc:try_with_maybe_ignorant_after/2
      in call from ns_single_vbucket_mover:mover/6
    ancestors: [<0.12877.2>,<0.12789.2>]
    messages: []
    links: [<0.12877.2>]
    dictionary: [{cleanup_list,[<0.20550.6>,<0.22253.6>]}]
    trap_exit: true
    status: running
    heap_size: 4181
    stack_size: 24
    reductions: 12536
  neighbours:

[ns_server:info,2012-11-20T5:02:38.780,ns_1@10.110.206.19:<0.22387.6>:ns_replicas_builder_utils:kill_a_bunch_of_tap_names:59]Killed the following tap names on 'ns_1@10.2.63.249': [<<"replication_building_881_'ns_1@10.96.107.176'">>,
                                                       <<"replication_building_881_'ns_1@10.110.206.19'">>]
[error_logger:error,2012-11-20T5:02:38.780,ns_1@10.110.206.19:error_logger<0.5.0>:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
  crasher:
    initial call: ns_single_vbucket_mover:mover/6
    pid: <0.22383.6>
    registered_name: []
    exception exit: {unexpected_exit,
                     {'EXIT',<0.12877.2>,
                      {unexpected_exit,
                       {'EXIT',<0.22253.6>,
                        {{wait_checkpoint_persisted_failed,"default",916,249,
                          [{'ns_1@10.111.66.215',
                            {'EXIT',
                             {noproc,
                              {gen_server,call,
                               [{'janitor_agent-default','ns_1@10.111.66.215'},
                                {if_rebalance,<0.12877.2>,
                                 {wait_checkpoint_persisted,916,249}},
                                infinity]}}}}]},
                         [{ns_single_vbucket_mover,
                           '-wait_checkpoint_persisted_many/5-fun-1-',5}]}}}}}
      in function ns_single_vbucket_mover:spawn_and_wait/1
      in call from ns_single_vbucket_mover:mover_inner/6
      in call from misc:try_with_maybe_ignorant_after/2
      in call from ns_single_vbucket_mover:mover/6
    ancestors: [<0.12877.2>,<0.12789.2>]
    messages: [{'EXIT',<0.12877.2>,
                   {unexpected_exit,
                    {'EXIT',<0.22253.6>,
                     {{wait_checkpoint_persisted_failed,"default",916,249,
                       [{'ns_1@10.111.66.215',
                         {'EXIT',
                          {noproc,
                           {gen_server,call,
                            [{'janitor_agent-default','ns_1@10.111.66.215'},
                             {if_rebalance,<0.12877.2>,
                              {wait_checkpoint_persisted,916,249}},
                             infinity]}}}}]},
                      [{ns_single_vbucket_mover,
                        '-wait_checkpoint_persisted_many/5-fun-1-',5}]}}}}]
    links: [<0.12877.2>]
    dictionary: [{cleanup_list,[<0.22387.6>,<0.22401.6>]}]
    trap_exit: true
    status: running
    heap_size: 377
    stack_size: 24
    reductions: 5554
  neighbours:

[ns_server:info,2012-11-20T5:02:38.795,ns_1@10.110.206.19:<0.22478.6>:ns_replicas_builder_utils:kill_a_bunch_of_tap_names:59]Killed the following tap names on 'ns_1@10.96.107.176': [<<"replication_building_990_'ns_1@10.110.222.56'">>]
[error_logger:error,2012-11-20T5:02:38.795,ns_1@10.110.206.19:error_logger<0.5.0>:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
  crasher:
    initial call: ns_single_vbucket_mover:mover/6
    pid: <0.22475.6>
    registered_name: []
    exception exit: {unexpected_exit,
                     {'EXIT',<0.12877.2>,
                      {unexpected_exit,
                       {'EXIT',<0.22253.6>,
                        {{wait_checkpoint_persisted_failed,"default",916,249,
                          [{'ns_1@10.111.66.215',
                            {'EXIT',
                             {noproc,
                              {gen_server,call,
                               [{'janitor_agent-default','ns_1@10.111.66.215'},
                                {if_rebalance,<0.12877.2>,
                                 {wait_checkpoint_persisted,916,249}},
                                infinity]}}}}]},
                         [{ns_single_vbucket_mover,
                           '-wait_checkpoint_persisted_many/5-fun-1-',5}]}}}}}
      in function ns_single_vbucket_mover:spawn_and_wait/1
      in call from ns_single_vbucket_mover:mover_inner/6
      in call from misc:try_with_maybe_ignorant_after/2
      in call from ns_single_vbucket_mover:mover/6
    ancestors: [<0.12877.2>,<0.12789.2>]
    messages: [{'EXIT',<0.12877.2>,
                   {unexpected_exit,
                    {'EXIT',<0.22253.6>,
                     {{wait_checkpoint_persisted_failed,"default",916,249,
                       [{'ns_1@10.111.66.215',
                         {'EXIT',
                          {noproc,
                           {gen_server,call,
                            [{'janitor_agent-default','ns_1@10.111.66.215'},
                             {if_rebalance,<0.12877.2>,
                              {wait_checkpoint_persisted,916,249}},
                             infinity]}}}}]},
                      [{ns_single_vbucket_mover,
                        '-wait_checkpoint_persisted_many/5-fun-1-',5}]}}}}]
    links: [<0.12877.2>,<0.22663.6>]
    dictionary: [{cleanup_list,[<0.22478.6>,<0.22493.6>]}]
    trap_exit: true
    status: running
    heap_size: 4181
    stack_size: 24
    reductions: 4822
  neighbours:

[ns_server:info,2012-11-20T5:02:38.827,ns_1@10.110.206.19:<0.22355.6>:ns_replicas_builder_utils:kill_a_bunch_of_tap_names:59]Killed the following tap names on 'ns_1@10.46.181.58': [<<"replication_building_679_'ns_1@10.214.194.95'">>,
                                                        <<"replication_building_679_'ns_1@10.96.107.176'">>]
[error_logger:error,2012-11-20T5:02:38.827,ns_1@10.110.206.19:error_logger<0.5.0>:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
  crasher:
    initial call: ns_single_vbucket_mover:mover/6
    pid: <0.22351.6>
    registered_name: []
    exception exit: {unexpected_exit,
                     {'EXIT',<0.12877.2>,
                      {unexpected_exit,
                       {'EXIT',<0.22253.6>,
                        {{wait_checkpoint_persisted_failed,"default",916,249,
                          [{'ns_1@10.111.66.215',
                            {'EXIT',
                             {noproc,
                              {gen_server,call,
                               [{'janitor_agent-default','ns_1@10.111.66.215'},
                                {if_rebalance,<0.12877.2>,
                                 {wait_checkpoint_persisted,916,249}},
                                infinity]}}}}]},
                         [{ns_single_vbucket_mover,
                           '-wait_checkpoint_persisted_many/5-fun-1-',5}]}}}}}
      in function ns_single_vbucket_mover:spawn_and_wait/1
      in call from ns_single_vbucket_mover:mover_inner/6
      in call from misc:try_with_maybe_ignorant_after/2
      in call from ns_single_vbucket_mover:mover/6
    ancestors: [<0.12877.2>,<0.12789.2>]
    messages: [{'EXIT',<0.12877.2>,
                   {unexpected_exit,
                    {'EXIT',<0.22253.6>,
                     {{wait_checkpoint_persisted_failed,"default",916,249,
                       [{'ns_1@10.111.66.215',
                         {'EXIT',
                          {noproc,
                           {gen_server,call,
                            [{'janitor_agent-default','ns_1@10.111.66.215'},
                             {if_rebalance,<0.12877.2>,
                              {wait_checkpoint_persisted,916,249}},
                             infinity]}}}}]},
                      [{ns_single_vbucket_mover,
                        '-wait_checkpoint_persisted_many/5-fun-1-',5}]}}}}]
    links: [<0.12877.2>,<0.22671.6>]
    dictionary: [{cleanup_list,[<0.22355.6>,<0.22393.6>]}]
    trap_exit: true
    status: running
    heap_size: 377
    stack_size: 24
    reductions: 5554
  neighbours:

[ns_server:debug,2012-11-20T5:02:38.827,ns_1@10.110.206.19:<0.12884.2>:ns_pubsub:do_subscribe_link:132]Parent process of subscription {ns_node_disco_events,<0.12877.2>} exited with reason {unexpected_exit,
                                                                                      {'EXIT',
                                                                                       <0.22253.6>,
                                                                                       {{wait_checkpoint_persisted_failed,
                                                                                         "default",
                                                                                         916,
                                                                                         249,
                                                                                         [{'ns_1@10.111.66.215',
                                                                                           {'EXIT',
                                                                                            {noproc,
                                                                                             {gen_server,
                                                                                              call,
                                                                                              [{'janitor_agent-default',
                                                                                                'ns_1@10.111.66.215'},
                                                                                               {if_rebalance,
                                                                                                <0.12877.2>,
                                                                                                {wait_checkpoint_persisted,
                                                                                                 916,
                                                                                                 249}},
                                                                                               infinity]}}}}]},
                                                                                        [{ns_single_vbucket_mover,
                                                                                          '-wait_checkpoint_persisted_many/5-fun-1-',
                                                                                          5}]}}}
[user:info,2012-11-20T5:02:38.842,ns_1@10.110.206.19:<0.10210.2>:ns_orchestrator:handle_info:319]Rebalance exited with reason {unexpected_exit,
                              {'EXIT',<0.22253.6>,
                               {{wait_checkpoint_persisted_failed,"default",
                                 916,249,
                                 [{'ns_1@10.111.66.215',
                                   {'EXIT',
                                    {noproc,
                                     {gen_server,call,
                                      [{'janitor_agent-default',
                                        'ns_1@10.111.66.215'},
                                       {if_rebalance,<0.12877.2>,
                                        {wait_checkpoint_persisted,916,249}},
                                       infinity]}}}}]},
                                [{ns_single_vbucket_mover,
                                  '-wait_checkpoint_persisted_many/5-fun-1-',
                                  5}]}}}

[ns_server:info,2012-11-20T5:02:38.842,ns_1@10.110.206.19:<0.23178.6>:diag_handler:log_all_tap_and_checkpoint_stats:127]logging tap & checkpoint stats
[ns_server:info,2012-11-20T5:02:38.858,ns_1@10.110.206.19:janitor_agent-default<0.2184.0>:janitor_agent:handle_info:676]Undoing temporary vbucket states caused by rebalance
[ns_server:debug,2012-11-20T5:02:38.858,ns_1@10.110.206.19:capi_set_view_manager-default<0.2157.0>:capi_set_view_manager:handle_info:349]doing replicate_newnodes_docs
[ns_server:info,2012-11-20T5:02:38.998,ns_1@10.110.206.19:ns_port_memcached<0.2135.0>:ns_port_server:log:171]memcached<0.2135.0>: Tue Nov 20 05:02:38.773775 GMT Standard Time 3: TAP (Consumer) eq_tapq:anon_402 - disconnected

[error_logger:error,2012-11-20T5:02:38.827,ns_1@10.110.206.19:error_logger<0.5.0>:ale_error_logger_handler:log_msg:76]** Generic server <0.12877.2> terminating
** Last message in was {'EXIT',<0.20543.6>,
                        {unexpected_exit,
                         {'EXIT',<0.22253.6>,
                          {{wait_checkpoint_persisted_failed,"default",916,
                            249,
                            [{'ns_1@10.111.66.215',
                              {'EXIT',
                               {noproc,
                                {gen_server,call,
                                 [{'janitor_agent-default',
                                   'ns_1@10.111.66.215'},
                                  {if_rebalance,<0.12877.2>,
                                   {wait_checkpoint_persisted,916,249}},
                                  infinity]}}}}]},
                           [{ns_single_vbucket_mover,
                             '-wait_checkpoint_persisted_many/5-fun-1-',
                             5}]}}}}


Link to manifest file of this build http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_x86_64_2.0.0-1952-rel.setup.exe.manifest.xml
** Do rebalance again, rebalance passed.


 Comments   
Comment by Thuan Nguyen [ 20/Nov/12 ]
Link to collect info of all nodes https://s3.amazonaws.com/bugdb/jira/MB-7225/8nodes-ci-1952-reb-wait_checkpoint_persisted_failed-20121120-131457.tgz
Comment by Jin Lim [ 20/Nov/12 ]
Assign it to NS_SERVER team for triaging first from their end. It appears to me that the shutdown of bucket got stuck (or failed) somewhere. Please assign back to me afterwards. Thanks!
Comment by Dipti Borkar [ 21/Nov/12 ]
Bug Scrub: Aliaksey A can you take a look at this for 2.0.1 ?
Comment by Andrei Baranouski [ 07/Dec/12 ]
seems like the same issue
./testrunner -i /tmp/rebalance_regression.ini get-logs=True,wait_timeout=100 -t swaprebalance.SwapRebalanceFailedTests.test_add_back_failed_node,replica=2,num-buckets=3,num-swap=1,keys-count=1000000,swap-orchestrator=False
http://qa.hq.northscale.net/job/centos-64-2.0-rebalance-regressions/151/consoleFull

[2012-12-06 13:29:05,721] - [rest_client:883] INFO - rebalance params : password=password&ejectedNodes=&user=Administrator&knownNodes=ns_1%4010.3.121.94%2Cns_1%4010.3.121.92%2Cns_1%4010.3.121.98%2Cns_1%4010.3.121.96%2Cns_1%4010.3.121.93%2Cns_1%4010.3.121.97%2Cns_1%4010.3.121.95
[2012-12-06 13:29:05,888] - [rest_client:890] INFO - rebalance operation started
[2012-12-06 13:29:06,158] - [rest_client:986] INFO - rebalance percentage : 0 %
....
2012-12-06 14:02:21,927] - [rest_client:986] INFO - rebalance percentage : 28.3558937832 %
[2012-12-06 14:02:22,938] - [data_helper:289] INFO - creating direct client 10.3.121.93:11210 bucket-1
[2012-12-06 14:02:23,048] - [data_helper:289] INFO - creating direct client 10.3.121.92:11210 bucket-2
[2012-12-06 14:02:23,209] - [data_helper:289] INFO - creating direct client 10.3.121.93:11210 bucket-0
[2012-12-06 14:02:25,825] - [rest_client:986] INFO - rebalance percentage : 28.3558937832 %
[2012-12-06 14:02:26,952] - [data_helper:289] INFO - creating direct client 10.3.121.92:11210 bucket-1
[2012-12-06 14:02:27,191] - [data_helper:289] INFO - creating direct client 10.3.121.93:11210 bucket-2
[2012-12-06 14:02:27,608] - [data_helper:289] INFO - creating direct client 10.3.121.92:11210 bucket-0
[2012-12-06 14:02:29,619] - [rest_client:971] ERROR - {u'status': u'none', u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try rebalance again.'} - rebalance failed
ERROR
RebalanceFailedException: Rebalance Failed: {u'status': u'none', u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try rebalance again.'} - rebalance failed






[ns_server:error,2012-12-06T14:04:58.426,ns_1@10.3.121.93:<0.9035.81>:misc:inner_wait_shutdown:1426]Expected exit signal from <0.9040.81> but could not get it in 5 seconds. This is a bug, but process we're waiting for is dead (noproc), so trying to ignore...
[ns_server:debug,2012-12-06T14:04:58.475,ns_1@10.3.121.93:<0.9035.81>:misc:inner_wait_shutdown:1427]Here's messages:
{messages,[{'EXIT',<0.8898.81>,shutdown}]}
[ns_server:error,2012-12-06T14:04:58.475,ns_1@10.3.121.93:<0.9035.81>:misc:sync_shutdown_many_i_am_trapping_exits:1408]Shutdown of the following failed: [{<0.9040.81>,noproc}]
[stats:warn,2012-12-06T14:04:58.811,ns_1@10.3.121.93:<0.15386.38>:stats_collector:latest_tick:223]Dropped 1 ticks
[stats:warn,2012-12-06T14:04:58.811,ns_1@10.3.121.93:<0.15338.38>:stats_collector:latest_tick:223]Dropped 1 ticks
[ns_server:info,2012-12-06T14:04:58.837,ns_1@10.3.121.93:<0.9035.81>:ns_replicas_builder_utils:kill_a_bunch_of_tap_names:59]Killed the following tap names on 'ns_1@10.3.121.92': [<<"replication_building_736_'ns_1@10.3.121.97'">>,
                                                       <<"replication_building_736_'ns_1@10.3.121.93'">>]
[ns_server:error,2012-12-06T14:04:58.838,ns_1@10.3.121.93:<0.8898.81>:misc:sync_shutdown_many_i_am_trapping_exits:1408]Shutdown of the following failed: [{<0.9035.81>,
                                    {{badmatch,[{<0.9040.81>,noproc}]},
                                     [{misc,
                                       sync_shutdown_many_i_am_trapping_exits,
                                       1},
                                      {misc,try_with_maybe_ignorant_after,2},
                                      {gen_server,terminate,6},
                                      {proc_lib,init_p_do_apply,3}]}},
                                   {<0.9256.81>,
                                    {{wait_checkpoint_persisted_failed,
                                      "bucket-2",736,10,
                                      [{'ns_1@10.3.121.93',
                                        {'EXIT',
                                         {{{{badmatch,{error,closed}},
                                            [{mc_client_binary,
                                              cmd_binary_vocal_recv,5},
                                             {mc_client_binary,select_bucket,
                                              2},
                                             {ns_memcached,ensure_bucket,2},
                                             {ns_memcached,handle_info,2},
                                             {gen_server,handle_msg,5},
                                             {proc_lib,init_p_do_apply,3}]},
                                           {gen_server,call,
                                            ['ns_memcached-bucket-2',
                                             {set_vbucket,811,replica},
                                             180000]}},
                                          {gen_server,call,
                                           [{'janitor_agent-bucket-2',
                                             'ns_1@10.3.121.93'},
                                            {if_rebalance,<0.18991.71>,
                                             {wait_checkpoint_persisted,736,
                                              10}},
                                            infinity]}}}}]},
                                     [{ns_single_vbucket_mover,
                                       '-wait_checkpoint_persisted_many/5-fun-1-',
                                       5}]}}]
[ns_server:error,2012-12-06T14:04:58.840,ns_1@10.3.121.93:<0.8898.81>:misc:try_with_maybe_ignorant_after:1444]Eating exception from ignorant after-block:
{error,
    {badmatch,
        [{<0.9035.81>,
          {{badmatch,[{<0.9040.81>,noproc}]},
           [{misc,sync_shutdown_many_i_am_trapping_exits,1},
            {misc,try_with_maybe_ignorant_after,2},
            {gen_server,terminate,6},
            {proc_lib,init_p_do_apply,3}]}},
         {<0.9256.81>,
          {{wait_checkpoint_persisted_failed,"bucket-2",736,10,
               [{'ns_1@10.3.121.93',
                    {'EXIT',
                        {{{{badmatch,{error,closed}},
                           [{mc_client_binary,cmd_binary_vocal_recv,5},
                            {mc_client_binary,select_bucket,2},
                            {ns_memcached,ensure_bucket,2},
                            {ns_memcached,handle_info,2},
                            {gen_server,handle_msg,5},
                            {proc_lib,init_p_do_apply,3}]},
                          {gen_server,call,
                              ['ns_memcached-bucket-2',
                               {set_vbucket,811,replica},
                               180000]}},
                         {gen_server,call,
                             [{'janitor_agent-bucket-2','ns_1@10.3.121.93'},
                              {if_rebalance,<0.18991.71>,
                                  {wait_checkpoint_persisted,736,10}},
                              infinity]}}}}]},
           [{ns_single_vbucket_mover,
                '-wait_checkpoint_persisted_many/5-fun-1-',5}]}}]},
    [{misc,sync_shutdown_many_i_am_trapping_exits,1},
     {misc,try_with_maybe_ignorant_after,2},
     {ns_single_vbucket_mover,mover,6},
     {proc_lib,init_p_do_apply,3}]}
[error_logger:error,2012-12-06T14:04:58.854,ns_1@10.3.121.93:error_logger<0.5.0>:ale_error_logger_handler:log_msg:76]** Generic server <0.9035.81> terminating
** Last message in was {'EXIT',<0.9040.81>,downstream_closed}
** When Server state == {state,"bucket-2",736,'ns_1@10.3.121.92',
                               [{'ns_1@10.3.121.97',<20182.18667.52>},
                                {'ns_1@10.3.121.93',<0.9040.81>}]}
** Reason for termination ==
** {{badmatch,[{<0.9040.81>,noproc}]},
    [{misc,sync_shutdown_many_i_am_trapping_exits,1},
     {misc,try_with_maybe_ignorant_after,2},
     {gen_server,terminate,6},
     {proc_lib,init_p_do_apply,3}]}

[error_logger:error,2012-12-06T14:04:58.856,ns_1@10.3.121.93:error_logger<0.5.0>:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
  crasher:
    initial call: new_ns_replicas_builder:init/1
    pid: <0.9035.81>
    registered_name: []
    exception exit: {{badmatch,[{<0.9040.81>,noproc}]},
                     [{misc,sync_shutdown_many_i_am_trapping_exits,1},
                      {misc,try_with_maybe_ignorant_after,2},
                      {gen_server,terminate,6},
                      {proc_lib,init_p_do_apply,3}]}
      in function gen_server:terminate/6
    ancestors: [<0.8898.81>,<0.18991.71>,<0.18809.71>]
    messages: [{'EXIT',<0.8898.81>,shutdown}]
    links: [<0.8898.81>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 514229
    stack_size: 24
    reductions: 43190
  neighbours:

[error_logger:error,2012-12-06T14:04:58.878,ns_1@10.3.121.93:error_logger<0.5.0>:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
  crasher:
    initial call: ns_single_vbucket_mover:mover/6
    pid: <0.8898.81>
    registered_name: []
    exception exit: {unexpected_exit,
                     {'EXIT',<0.9256.81>,
                      {{wait_checkpoint_persisted_failed,"bucket-2",736,10,
                        [{'ns_1@10.3.121.93',
                          {'EXIT',
                           {{{{badmatch,{error,closed}},
                              [{mc_client_binary,cmd_binary_vocal_recv,5},
                               {mc_client_binary,select_bucket,2},
                               {ns_memcached,ensure_bucket,2},
                               {ns_memcached,handle_info,2},
                               {gen_server,handle_msg,5},
                               {proc_lib,init_p_do_apply,3}]},
                             {gen_server,call,
                              ['ns_memcached-bucket-2',
                               {set_vbucket,811,replica},
                               180000]}},
                            {gen_server,call,
                             [{'janitor_agent-bucket-2','ns_1@10.3.121.93'},
                              {if_rebalance,<0.18991.71>,
                               {wait_checkpoint_persisted,736,10}},
                              infinity]}}}}]},
                       [{ns_single_vbucket_mover,
                         '-wait_checkpoint_persisted_many/5-fun-1-',5}]}}}
      in function ns_single_vbucket_mover:spawn_and_wait/1
      in call from ns_single_vbucket_mover:mover_inner/6
      in call from misc:try_with_maybe_ignorant_after/2
      in call from ns_single_vbucket_mover:mover/6
    ancestors: [<0.18991.71>,<0.18809.71>]
    messages: [{'EXIT',<0.18991.71>,
                   {{bulk_set_vbucket_state_failed,
                     [{'ns_1@10.3.121.93',
                       {'EXIT',
                        {{{{badmatch,{error,closed}},
                           [{mc_client_binary,cmd_binary_vocal_recv,5},
                            {mc_client_binary,select_bucket,2},
                            {ns_memcached,ensure_bucket,2},
                            {ns_memcached,handle_info,2},
                            {gen_server,handle_msg,5},
                            {proc_lib,init_p_do_apply,3}]},
                          {gen_server,call,
                           ['ns_memcached-bucket-2',
                            {set_vbucket,811,replica},
                            180000]}},
                         {gen_server,call,
                          [{'janitor_agent-bucket-2','ns_1@10.3.121.93'},
                           {if_rebalance,<0.18991.71>,
                            {update_vbucket_state,811,replica,undefined,
                             'ns_1@10.3.121.95'}},
                           infinity]}}}}]},
                    [{janitor_agent,bulk_set_vbucket_state,4},
                     {ns_vbucket_mover,update_replication_post_move,3},
                     {ns_vbucket_mover,handle_info,2},
                     {gen_server,handle_msg,5},
                     {proc_lib,init_p_do_apply,3}]}}]
    links: [<0.18991.71>,<0.9800.81>]
    dictionary: [{cleanup_list,[<0.9035.81>,<0.9256.81>]}]
    trap_exit: true
    status: running
    heap_size: 6765
    stack_size: 24
    reductions: 14002
  neighbours:

[user:info,2012-12-06T14:04:58.905,ns_1@10.3.121.93:<0.2696.0>:ns_orchestrator:handle_info:319]Rebalance exited with reason {{bulk_set_vbucket_state_failed,
                               [{'ns_1@10.3.121.93',
                                 {'EXIT',
                                  {{{{badmatch,{error,closed}},
                                     [{mc_client_binary,
                                       cmd_binary_vocal_recv,5},
                                      {mc_client_binary,select_bucket,2},
                                      {ns_memcached,ensure_bucket,2},
                                      {ns_memcached,handle_info,2},
                                      {gen_server,handle_msg,5},
                                      {proc_lib,init_p_do_apply,3}]},
                                    {gen_server,call,
                                     ['ns_memcached-bucket-2',
                                      {set_vbucket,811,replica},
                                      180000]}},
                                   {gen_server,call,
                                    [{'janitor_agent-bucket-2',
                                      'ns_1@10.3.121.93'},
                                     {if_rebalance,<0.18991.71>,
                                      {update_vbucket_state,811,replica,
                                       undefined,'ns_1@10.3.121.95'}},
                                     infinity]}}}}]},
                              [{janitor_agent,bulk_set_vbucket_state,4},
                               {ns_vbucket_mover,
                                update_replication_post_move,3},
                               {ns_vbucket_mover,handle_info,2},
                               {gen_server,handle_msg,5},
                               {proc_lib,init_p_do_apply,3}]}

[ns_server:debug,2012-12-06T14:04:58.912,ns_1@10.3.121.93:<0.19026.71>:ns_pubsub:do_subscribe_link:132]Parent process of subscription {ns_node_disco_events,<0.18991.71>} exited with reason {{bulk_set_vbucket_state_failed,
                                                                                        [{'ns_1@10.3.121.93',
                                                                                          {'EXIT',
                                                                                           {{{{badmatch,
                                                                                               {error,
                                                                                                closed}},
                                                                                              [{mc_client_binary,
                                                                                                cmd_binary_vocal_recv,
                                                                                                5},
                                                                                               {mc_client_binary,
                                                                                                select_bucket,
                                                                                                2},
                                                                                               {ns_memcached,
                                                                                                ensure_bucket,
                                                                                                2},
                                                                                               {ns_memcached,
                                                                                                handle_info,
                                                                                                2},
                                                                                               {gen_server,
                                                                                                handle_msg,
                                                                                                5},
                                                                                               {proc_lib,
                                                                                                init_p_do_apply,
                                                                                                3}]},
                                                                                             {gen_server,
                                                                                              call,
                                                                                              ['ns_memcached-bucket-2',
                                                                                               {set_vbucket,
                                                                                                811,
                                                                                                replica},
                                                                                               180000]}},
                                                                                            {gen_server,
                                                                                             call,
                                                                                             [{'janitor_agent-bucket-2',
                                                                                               'ns_1@10.3.121.93'},
                                                                                              {if_rebalance,
                                                                                               <0.18991.71>,
                                                                                               {update_vbucket_state,
                                                                                                811,
                                                                                                replica,
                                                                                                undefined,
                                                                                                'ns_1@10.3.121.95'}},
                                                                                              infinity]}}}}]},
                                                                                       [{janitor_agent,
                                                                                         bulk_set_vbucket_state,
                                                                                         4},
                                                                                        {ns_vbucket_mover,
                                                                                         update_replication_post_move,
                                                                                         3},
                                                                                        {ns_vbucket_mover,
                                                                                         handle_info,
                                                                                         2},
                                                                                        {gen_server,
                                                                                         handle_msg,
                                                                                         5},
                                                                                        {proc_lib,
                                                                                         init_p_do_apply,
                                                                                         3}]}
[stats:warn,2012-12-06T14:04:58.920,ns_1@10.3.121.93:<0.15437.38>:stats_collector:latest_tick:223]Dropped 1 ticks
[ns_server:debug,2012-12-06T14:04:58.998,ns_1@10.3.121.93:capi_set_view_manager-bucket-0<0.15312.38>:capi_set_view_manager:handle_info:349]doing replicate_newnodes_docs
[ns_server:debug,2012-12-06T14:04:58.999,ns_1@10.3.121.93:capi_set_view_manager-bucket-1<0.15355.38>:capi_set_view_manager:handle_info:349]doing replicate_newnodes_docs
[ns_server:debug,2012-12-06T14:04:58.999,ns_1@10.3.121.93:capi_set_view_manager-bucket-2<0.15405.38>:capi_set_view_manager:handle_info:349]doing replicate_newnodes_docs
[ns_server:info,2012-12-06T14:04:59.061,ns_1@10.3.121.93:<0.10836.81>:diag_handler:log_all_tap_and_checkpoint_stats:127]logging tap & checkpoint stats
[ns_server:debug,2012-12-06T14:04:59.122,ns_1@10.3.121.93:ns_config_rep<0.360.0>:ns_config_rep:do_push_keys:317]Replicating some config keys ([counters,rebalance_status,rebalancer_pid]..)
[error_logger:error,2012-12-06T14:04:58.885,ns_1@10.3.121.93:error_logger<0.5.0>:ale_error_logger_handler:log_msg:76]** Generic server <0.18991.71> terminating

https://s3.amazonaws.com/bugdb/jira/MB-6595/d72d6e9d-b034-4838-b582-0ffdfeb5b204-10.3.121.92-diag.txt.gz
https://s3.amazonaws.com/bugdb/jira/MB-6595/d72d6e9d-b034-4838-b582-0ffdfeb5b204-10.3.121.93-diag.txt.gz
https://s3.amazonaws.com/bugdb/jira/MB-6595/d72d6e9d-b034-4838-b582-0ffdfeb5b204-10.3.121.94-diag.txt.gz
https://s3.amazonaws.com/bugdb/jira/MB-6595/d72d6e9d-b034-4838-b582-0ffdfeb5b204-10.3.121.95-diag.txt.gz
https://s3.amazonaws.com/bugdb/jira/MB-6595/d72d6e9d-b034-4838-b582-0ffdfeb5b204-10.3.121.96-diag.txt.gz
https://s3.amazonaws.com/bugdb/jira/MB-6595/d72d6e9d-b034-4838-b582-0ffdfeb5b204-10.3.121.97-diag.txt.gz
https://s3.amazonaws.com/bugdb/jira/MB-6595/d72d6e9d-b034-4838-b582-0ffdfeb5b204-10.3.121.98-diag.txt.gz


Comment by Andrei Baranouski [ 07/Dec/12 ]
test on centos 64
Comment by Jin Lim [ 07/Dec/12 ]
ep engine bt from the crashing thread:

#0 0x00007ff450f1e2d2 in setQueuedTime (this=0x3d8b79a0, qi=..., checkpointManager=<value optimized out>)
    at src/queueditem.hh:54
#1 Checkpoint::queueDirty (this=0x3d8b79a0, qi=..., checkpointManager=<value optimized out>) at src/checkpoint.cc:118
#2 0x00007ff450f20658 in CheckpointManager::closeOpenCheckpoint_UNLOCKED (this=0x96774a8, id=9) at src/checkpoint.cc:315
#3 0x00007ff450f206d7 in CheckpointManager::closeOpenCheckpoint (this=0x96774a8, id=9) at src/checkpoint.cc:323
#4 0x00007ff450f7c0b2 in TapConsumer::processCheckpointCommand(<anonymous enum>, uint16_t, uint64_t) (this=0x6ad3930,
    event=TAP_CHECKPOINT_END, vbucket=128, checkpointId=9) at src/tapconnection.cc:1490
#5 0x00007ff450f50aa2 in EventuallyPersistentEngine::tapNotify(const void *, void *, uint16_t, uint8_t, uint16_t, <anonymous enum>, uint32_t, const void *, size_t, uint32_t, uint32_t, uint64_t, const void *, size_t, uint16_t) (this=0xbcfd600,
    cookie=<value optimized out>, engine_specific=<value optimized out>, nengine=<value optimized out>, tap_flags=1,
---Type <return> to continue, or q <return> to quit---
    tap_event=TAP_CHECKPOINT_END, tap_seqno=106326, key=0x7a4e028, nkey=14, flags=0, exptime=0, cas=0, data=0x7a4e036,
    ndata=8, vbucket=30850) at src/ep_engine.cc:2010
#6 0x00007ff450f515a8 in EvpTapNotify(ENGINE_HANDLE *, const void *, void *, uint16_t, uint8_t, uint16_t, <anonymous enum>, uint32_t, const void *, size_t, uint32_t, uint32_t, uint64_t, const void *, size_t, uint16_t) (handle=0xbcfd600,
    cookie=0x6a27b80, engine_specific=0x7a4e028, nengine=8, ttl=254 '\376', tap_flags=1, tap_event=TAP_CHECKPOINT_END,
    tap_seqno=106326, key=0x7a4e028, nkey=14, flags=0, exptime=0, cas=0, data=0x7a4e036, ndata=8,
    vbucket=<value optimized out>) at src/ep_engine.cc:1040
#7 0x00007ff453a08a04 in bucket_tap_notify (handle=<value optimized out>, cookie=0x6a27b80, engine_specific=0x7a4e028,
    nengine=8, ttl=254 '\376', tap_flags=42136, tap_event=TAP_CHECKPOINT_END, tap_seqno=106326, key=0x7a4e028, nkey=14,
    flags=0, exptime=0, cas=0, data=0x7a4e036, ndata=8, vbucket=<value optimized out>) at bucket_engine.c:1942
#8 0x0000000000409eb2 in process_bin_tap_packet (event=TAP_CHECKPOINT_END, c=0x6a27b80) at daemon/memcached.c:3031
#9 0x00000000004121fa in process_bin_packet (c=0x6a27b80) at daemon/memcached.c:3129
#10 complete_nread_binary (c=0x6a27b80) at daemon/memcached.c:3738
#11 complete_nread (c=0x6a27b80) at daemon/memcached.c:3820
#12 conn_nread (c=0x6a27b80) at daemon/memcached.c:5673
#13 0x00000000004068a5 in event_handler (fd=<value optimized out>, which=<value optimized out>, arg=0x6a27b80)
    at daemon/memcached.c:5936
#14 0x00007ff4565ef48c in event_process_active_single_queue (base=0x6a84c80, flags=<value optimized out>) at event.c:1308
#15 event_process_active (base=0x6a84c80, flags=<value optimized out>) at event.c:1375
#16 event_base_loop (base=0x6a84c80, flags=<value optimized out>) at event.c:1572
#17 0x0000000000414b94 in worker_libevent (arg=0x221f5f0) at daemon/thread.c:301
#18 0x00007ff455aead8c in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#19 0x00007ff455834fdd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#20 0x0000000000000000 in ?? ()
Comment by Jin Lim [ 07/Dec/12 ]
Andrei, the node 93 which crashed was it a node that being removed or failed over? This will give us clear understanding of why and what this node was doing prior to the crash.
Thanks much!
Jin
Comment by Jin Lim [ 07/Dec/12 ]
Just talked to Chiyoung for his input. Given that these nodes are all Windwos, can you please also verify if all nodes had been upgraded with Service Pack 1? We believe this is a memory corruption issue and we wanted to make sure that all nodes have proper Windows environment. Thanks.
Comment by Jin Lim [ 07/Dec/12 ]
Please assign back to Jin (or ep engine team) after addressing above questions. Thanks.
Comment by Aliaksey Artamonau [ 07/Dec/12 ]
See Andrei's comment above. He ran the test on centos 64.
Comment by Jin Lim [ 07/Dec/12 ]
EP Engine team will try to reproduce the crash (segfault) with more debugging. In the mean time QE team please see if the manual test steps described above can be automated or incorporated into a new test case. Thanks.
Comment by Andrei Baranouski [ 09/Dec/12 ]
93 was a master originally, and it was never removed from the cluster in this test on centos 64
Comment by Farshid Ghods (Inactive) [ 10/Dec/12 ]
deferring to 2.1 per bug scrub meeting ( Dipti & Farshid -December 7th )
Comment by Jin Lim [ 14/Aug/13 ]
can not reproduce
Generated at Wed Nov 26 11:23:01 CST 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.