[MB-7554] Rebalance fails with "bad match wait_backfill_determination" error on a very small load Created: 17/Jan/13  Updated: 04/Feb/13  Resolved: 22/Jan/13

Status: Closed
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.0.1
Fix Version/s: 2.0.1
Security Level: Public

Type: Bug
Priority: Major
Reporter: Ketaki Gangal
Assignee: Ketaki Gangal
Resolution: Incomplete
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 2.0.1-125

Attachments: ns-diag-20130117231257.txt-1.zip, ns-diag-20130117231257.txt.zip

 Description   
Load 1M items on a 4 node cluster.
Rebalance in 2 nodes.

Rebalance and Compaction start in parallel.

Rebalance is very slow for the first few minutes, then catches up, but fails with a timeout exit:

The load and cluster are a very basic configuration. This works on 2.0.

** Reason for termination ==
** {unexpected_exit,
       {'EXIT',<0.896.2>,
           {{badmatch,
                [{'EXIT',
                     {timeout,
                         {gen_server,call,
                             [<20117.4759.0>,had_backfill,30000]}}}]},
            [{ns_single_vbucket_mover,
                 '-wait_backfill_determination/1-fun-1-',1}]}}}
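
For readers unfamiliar with the shape of this term, the following is a minimal sketch of how a `gen_server:call` timeout can surface as a `badmatch` on a list of `{'EXIT', ...}` tuples. It is not the ns_server source: the module name, the stand-in process, the 100 ms timeout, and the boolean filter are all assumptions made for illustration. It only shows that catching a timed-out call and then asserting an empty failure list would produce a term of the same shape as the one above.

```erlang
%% Illustrative sketch only; not taken from ns_single_vbucket_mover.
-module(backfill_timeout_sketch).
-export([demo/0]).

demo() ->
    %% Hypothetical stand-in for the remote process: it receives the
    %% call message but never replies, so the caller times out.
    Pid = spawn(fun() -> receive stop -> ok end end),
    %% `catch` turns the timeout exit into an {'EXIT', Reason} term.
    %% The timeout is shortened to 100 ms here (the report shows 30000).
    Replies = [catch gen_server:call(Pid, had_backfill, 100)],
    %% Assumption: anything that is not a boolean reply is a failure.
    Failures = [R || R <- Replies, not is_boolean(R)],
    %% With a non-empty Failures list this match raises
    %% {badmatch, Failures}.
    [] = Failures,
    ok.
```

Calling `backfill_timeout_sketch:demo()` in an Erlang shell should fail with a `badmatch` whose value is a one-element list holding `{'EXIT', {timeout, {gen_server, call, [Pid, had_backfill, 100]}}}`. If the real code follows a similar pattern, the error in this report would mean that the process `<20117.4759.0>` on a remote node did not answer `had_backfill` within 30 seconds.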

[error_logger:error,2013-01-17T23:11:39.476,ns_1@10.176.169.6:error_logger<0.6.0>:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
  crasher:
    initial call: ns_vbucket_mover:init/1
    pid: <0.24152.1>
    registered_name: []
    exception exit: {unexpected_exit,
                     {'EXIT',<0.896.2>,
                      {{badmatch,
                        [{'EXIT',
                          {timeout,
                           {gen_server,call,
                            [<20117.4759.0>,had_backfill,30000]}}}]},
                       [{ns_single_vbucket_mover,
                         '-wait_backfill_determination/1-fun-1-',1}]}}}
      in function gen_server:terminate/6
    ancestors: [<0.13807.1>]
    messages: [{backfill_done,
                      {'ns_1@10.176.169.6',1019,
                          ['ns_1@10.176.169.6','ns_1@10.169.54.218'],
                          ['ns_1@10.176.155.132','ns_1@10.168.94.60']}},
                  {move_done_new_style,
                      {'ns_1@10.176.169.6',1019,
                          ['ns_1@10.176.169.6','ns_1@10.169.54.218'],
                          ['ns_1@10.176.155.132','ns_1@10.168.94.60']}},
                  {'EXIT',<0.6884.2>,normal},
                  {backfill_done,
                      {'ns_1@10.169.54.218',678,
                          ['ns_1@10.169.54.218','ns_1@10.168.173.242'],
                          ['ns_1@10.168.94.60','ns_1@10.176.155.132']}},
                  {move_done_new_style,
                      {'ns_1@10.169.54.218',678,
                          ['ns_1@10.169.54.218','ns_1@10.168.173.242'],
                          ['ns_1@10.168.94.60','ns_1@10.176.155.132']}},
                  {'EXIT',<0.7095.2>,normal}]
    links: [<0.13807.1>,<0.24159.1>,<0.57.0>]
    dictionary: [{bucket_name,"default"},
                  {i_am_master_mover,true},
                  {child_processes,[<0.7095.2>,<0.6884.2>,<0.6760.2>,
                                    <0.3866.2>,<0.3858.2>,<0.862.2>,<0.855.2>,
                                    <0.852.2>,<0.787.2>,<0.26181.1>,
                                    <0.26131.1>,<0.26086.1>,<0.24174.1>,
                                    <0.24173.1>]}]
    trap_exit: true
    status: running
    heap_size: 28657
    stack_size: 24
    reductions: 1198089

Logs at


 Comments   
Comment by Aliaksey Artamonau [ 17/Jan/13 ]
I need diags from other nodes. From 'ns_1@10.176.155.132' in particular.
Comment by Ketaki Gangal [ 17/Jan/13 ]
The cluster is no longer around.

Do we have any idea what is causing these timeouts, based on these limited logs?
Comment by Aliaksey Artamonau [ 21/Jan/13 ]
No, unfortunately there's not enough information there.
Comment by Farshid Ghods (Inactive) [ 22/Jan/13 ]
Please reopen if this case occurs again.
Generated at Sat Aug 23 12:21:09 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.