[MB-6493] Rebalance exits with reason "timeout, ns_memcached-bucket','ns_1@10.1.3.238'}, {get_vbucket,". Successive attempts to rebalance also fail. Created: 30/Aug/12  Updated: 26/Sep/12  Resolved: 26/Sep/12

Status: Closed
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.0-beta
Fix Version/s: 2.0-beta
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Abhinav Dangeti Assignee: Abhinav Dangeti
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: vBuckets: 1024
Env: Centos 64bit

Attachments: GZip Archive 10.1.3.236-8091-diag.txt.gz     GZip Archive 10.1.3.238-8091-diag.txt.gz     GZip Archive 10.3.2.54-8091-diag.txt.gz     GZip Archive 10.3.2.55-8091-diag.txt.gz    

 Description   
- Create a 2:2 cluster.
- Setup unidirectional replication of 2M items
- Rebalance out one of the destination nodes

Example scenario:
Source: 10.1.3.236, 10.3.2.54
Destination: 10.1.3.238, 10.3.2.55
Rebalance out 10.3.2.55

Towards the end of rebalancing on the destination, rebalance fails with the following reason:
Rebalance exited with reason {timeout,
{gen_server,call,
[{'ns_memcached-bucket','ns_1@10.1.3.238'},
{get_vbucket,749},
60000]}}


 Comments   
Comment by Abhinav Dangeti [ 30/Aug/12 ]
Also seen in bug MB-6385.
Comment by Junyi Xie (Inactive) [ 30/Aug/12 ]
Looks like a bunch of ns_server crash in 10.1.3.238-8091-diag.txt.


ns_server:debug,2012-08-30T14:17:41.431,ns_1@10.1.3.238:<0.7445.3>:supervisor_cushion:init:43]starting ns_port_server with delay of 5000
[error_logger:error,2012-08-30T14:17:41.431,ns_1@10.1.3.238:error_logger:ale_error_logger_handler:log_report:72]
=========================SUPERVISOR REPORT=========================
     Supervisor: {local,ns_port_sup}
     Context: child_terminated
     Reason: {abnormal,71}
     Offender: [{pid,<0.7380.3>},
                  {name,
                      {memcached,"/opt/couchbase/bin/memcached",
                          ["-X",
                           "/opt/couchbase/lib/memcached/stdin_term_handler.so",
                           "-X",
                           "/opt/couchbase/lib/memcached/file_logger.so,cyclesize=104857600;sleeptime=19;filename=/opt/couchbase/var/lib/couchbase/logs/memcached.log",
                           "-l","0.0.0.0:11210,0.0.0.0:11209:1000","-p",
                           "11210","-E",
                           "/opt/couchbase/lib/memcached/bucket_engine.so",
                           "-B","binary","-r","-c","10000","-e",
                           "admin=_admin;default_bucket_name=default;auto_create=false",
                           []],
                          [{env,
                               [{"EVENT_NOSELECT","1"},
                                {"MEMCACHED_TOP_KEYS","100"},
                                {"ISASL_PWFILE",
                                 "/opt/couchbase/var/lib/couchbase/data/isasl.pw"},
                                {"ISASL_DB_CHECK_TIME","1"}]},
                           use_stdio,stderr_to_stdout,exit_status,
                           port_server_send_eol,stream]}},
                  {mfargs,
                      {erlang,apply,
                          [#Fun<ns_port_sup.3.119727222>,
                           [memcached,"/opt/couchbase/bin/memcached",
                            ["-X",
                             "/opt/couchbase/lib/memcached/stdin_term_handler.so",
                             "-X",
                             "/opt/couchbase/lib/memcached/file_logger.so,cyclesize=104857600;sleeptime=19;filename=/opt/couchbase/var/lib/couchbase/logs/memcached.log",
                             "-l","0.0.0.0:11210,0.0.0.0:11209:1000","-p",
                             "11210","-E",
                             "/opt/couchbase/lib/memcached/bucket_engine.so",
                             "-B","binary","-r","-c","10000","-e",
                             "admin=_admin;default_bucket_name=default;auto_create=false",
                             []],
                            [{env,
                                 [{"EVENT_NOSELECT","1"},
                                  {"MEMCACHED_TOP_KEYS","100"},
                                  {"ISASL_PWFILE",
                                   "/opt/couchbase/var/lib/couchbase/data/isasl.pw"},
                                  {"ISASL_DB_CHECK_TIME","1"}]},
                             use_stdio,stderr_to_stdout,exit_status,
                             port_server_send_eol,stream]]]}},
                  {restart_type,permanent},
                  {shutdown,86400000},
                  {child_type,worker}]


[error_logger:info,2012-08-30T14:17:41.433,ns_1@10.1.3.238:error_logger:ale_error_logger_handler:log_report:72]
=========================PROGRESS REPORT=========================
          supervisor: {local,ns_port_sup}
             started: [{pid,<0.7445.3>},
                       {name,
                           {memcached,"/opt/couchbase/bin/memcached",
                               ["-X",
                                "/opt/couchbase/lib/memcached/stdin_term_handler.so",
                                "-X",
                                "/opt/couchbase/lib/memcached/file_logger.so,cyclesize=104857600;sleeptime=19;filename=/opt/couchbase/var/lib/couchbase/logs/memcached.log",
                                "-l","0.0.0.0:11210,0.0.0.0:11209:1000","-p",
                                "11210","-E",
                                "/opt/couchbase/lib/memcached/bucket_engine.so",
                                "-B","binary","-r","-c","10000","-e",
                                "admin=_admin;default_bucket_name=default;auto_create=false",
                                []],
                               [{env,
                                    [{"EVENT_NOSELECT","1"},
                                     {"MEMCACHED_TOP_KEYS","100"},
                                     {"ISASL_PWFILE",
                                      "/opt/couchbase/var/lib/couchbase/data/isasl.pw"},
                                     {"ISASL_DB_CHECK_TIME","1"}]},
                                use_stdio,stderr_to_stdout,exit_status,
                                port_server_send_eol,stream]}},
                       {mfargs,
                           {erlang,apply,
                               [#Fun<ns_port_sup.3.119727222>,
                                [memcached,"/opt/couchbase/bin/memcached",
                                 ["-X",
                                  "/opt/couchbase/lib/memcached/stdin_term_handler.so",
                                  "-X",
                                  "/opt/couchbase/lib/memcached/file_logger.so,cyclesize=104857600;sleeptime=19;filename=/opt/couchbase/var/lib/couchbase/logs/memcached.log",
                                  "-l","0.0.0.0:11210,0.0.0.0:11209:1000",
                                  "-p","11210","-E",
                                  "/opt/couchbase/lib/memcached/bucket_engine.so",
                                  "-B","binary","-r","-c","10000","-e",
                                  "admin=_admin;default_bucket_name=default;auto_create=false",
                                  []],
                                 [{env,
                                      [{"EVENT_NOSELECT","1"},
                                       {"MEMCACHED_TOP_KEYS","100"},
                                       {"ISASL_PWFILE",
                                        "/opt/couchbase/var/lib/couchbase/data/isasl.pw"},
                                       {"ISASL_DB_CHECK_TIME","1"}]},
                                  use_stdio,stderr_to_stdout,exit_status,
                                  port_server_send_eol,stream]]]}},
                       {restart_type,permanent},
                       {shutdown,86400000},
                       {child_type,worker}]

[user:info,2012-08-30T14:17:41.518,ns_1@10.1.3.238:ns_port_memcached:ns_port_server:handle_info:107]Port server memcached on node 'ns_1@10.1.3.238' exited with status 71. Restarting. Messages: Thu Aug 30 21:17:41.512134 3: failed to listen on TCP port 11210: Address already in use
[ns_server:info,2012-08-30T14:17:41.518,ns_1@10.1.3.238:ns_port_memcached:ns_port_server:log:169]memcached<0.7446.3>: Thu Aug 30 21:17:41.512134 3: failed to listen on TCP port 11210: Address already in use

[ns_server:info,2012-08-30T14:17:41.519,ns_1@10.1.3.238:<0.7445.3>:supervisor_cushion:handle_info:62]Cushion managed supervisor for memcached failed: {abnormal,71}
[user:warn,2012-08-30T14:17:41.519,ns_1@10.1.3.238:<0.7445.3>:supervisor_cushion:die_slowly:81]Service memcached exited on node 'ns_1@10.1.3.238' in 0.09s

[error_logger:error,2012-08-30T14:17:41.519,ns_1@10.1.3.238:error_logger:ale_error_logger_handler:log_msg:76]** Generic server <0.7446.3> terminating
** Last message in was {#Port<0.127099>,{exit_status,71}}
** When Server state == {state,#Port<0.127099>,memcached,
                               {["Thu Aug 30 21:17:41.512134 3: failed to listen on TCP port 11210: Address already in use",
                                 empty,empty,empty,empty,empty,empty,empty,
                                 empty,empty,empty,empty,empty,empty,empty,
                                 empty,empty,empty,empty,empty,empty,empty,
                                 empty,empty,empty,empty,empty,empty,empty,
                                 empty,empty,empty,empty,empty,empty,empty,
                                 empty,empty,empty,empty,empty,empty,empty,
                                 empty,empty,empty,empty,empty,empty,empty,
                                 empty,empty,empty,empty,empty,empty,empty,
                                 empty,empty,empty,empty,empty,empty,empty,
                                 empty,empty,empty,empty,empty,empty,empty,
                                 empty,empty,empty,empty,empty,empty,empty,
                                 empty,empty,empty,empty,empty,empty,empty,
                                 empty,empty,empty,empty,empty,empty,empty,
                                 empty,empty,empty,empty,empty,empty,empty],
                                [empty]},
                               {ok,{1346361461712443,#Ref<0.0.35.116743>}},
                               ["Thu Aug 30 21:17:41.512134 3: failed to listen on TCP port 11210: Address already in use"],
                               0,true}
** Reason for termination ==
** {abnormal,71}

[ns_server:info,2012-08-30T14:17:41.519,ns_1@10.1.3.238:ns_log:ns_log:handle_cast:125]suppressing duplicate log supervisor_cushion:1("Service memcached exited on node 'ns_1@10.1.3.238' in 0.09s\n") because it's been seen 1 times in the past 36.083474 secs (last seen 36.083474 secs ago
[error_logger:error,2012-08-30T14:17:41.521,ns_1@10.1.3.238:error_logger:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
  crasher:
    initial call: ns_port_server:init/1
    pid: <0.7446.3>
    registered_name: ns_port_memcached
    exception exit: {abnormal,71}
      in function gen_server:terminate/6
    ancestors: [<0.7445.3>,ns_port_sup,ns_server_sup,ns_server_cluster_sup,
                  <0.60.0>]
    messages: [{'EXIT',#Port<0.127099>,normal}]
    links: [<0.7445.3>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 1597
    stack_size: 24
    reductions: 2399
  neighbours:

[stats:error,2012-08-30T14:17:41.626,ns_1@10.1.3.238:<0.5564.3>:stats_reader:log_bad_responses:191]Some nodes didn't respond: ['ns_1@10.1.3.238']
[stats:error,2012-08-30T14:17:41.628,ns_1@10.1.3.238:<0.5533.3>:stats_reader:log_bad_responses:191]Some nodes didn't respond: ['ns_1@10.1.3.238']
[stats:error,2012-08-30T14:17:41.630,ns_1@10.1.3.238:<0.5557.3>:stats_reader:log_bad_responses:191]Some nodes didn't respond: ['ns_1@10.1.3.238']
[error_logger:error,2012-08-30T14:17:42.564,ns_1@10.1.3.238:error_logger:ale_error_logger_handler:log_report:72]
=========================SUPERVISOR REPORT=========================
     Supervisor: {local,menelaus_sup}
     Context: child_terminated
     Reason: {noproc,
                     {gen_server,call,
                         [{'stats_reader-bucket','ns_1@10.1.3.238'},
                          {latest,minute,1}]}}
     Offender: [{pid,<0.7432.3>},
                  {name,menelaus_web_alerts_srv},
                  {mfargs,{menelaus_web_alerts_srv,start_link,[]}},
                  {restart_type,permanent},
                  {shutdown,5000},
                  {child_type,worker}]


[error_logger:info,2012-08-30T14:17:42.564,ns_1@10.1.3.238:error_logger:ale_error_logger_handler:log_report:72]
=========================PROGRESS REPORT=========================
          supervisor: {local,menelaus_sup}
             started: [{pid,<0.7458.3>},
                       {name,menelaus_web_alerts_srv},
                       {mfargs,{menelaus_web_alerts_srv,start_link,[]}},
                       {restart_type,permanent},
                       {shutdown,5000},
                       {child_type,worker}]

[ns_server:debug,2012-08-30T14:17:43.182,ns_1@10.1.3.238:ns_heart:ns_heart:current_status:155]Ignoring failure to get stats for bucket: "bucket":
{'EXIT',{noproc,{gen_server,call,
                            [{'stats_reader-bucket','ns_1@10.1.3.238'},
                             {latest,minute}]}}}








                                                                                                                                                                                            
Comment by Junyi Xie (Inactive) [ 30/Aug/12 ]
Abhinav, please assign to ns_server team.

Comment by Abhinav Dangeti [ 30/Aug/12 ]
Both ns_server and memcached seem to have crashsed, aleksey can you please take a look at this?
Comment by Ketaki Gangal [ 30/Aug/12 ]
Repeated rebalance on this cluster also fails with other errors. Abhinav can you post other rebalance failures as well?
Comment by Abhinav Dangeti [ 30/Aug/12 ]
Also seen, once rebalance failed if we try to rebalance again,
rebalance "is failed with reason not_all_nodes_are_ready_yet".
Comment by Aleksey Kondratenko [ 30/Aug/12 ]
From logs I'm seeing there's some swap usage.

That's causing long memcached request execution times. And we're doing quite a lot of memcached activity that understandably times out in such condition.
Comment by Junyi Xie (Inactive) [ 31/Aug/12 ]
Hi Peter,

I am not sure what I can fix. By Alk, it is "understandably times out in such condition". I am not sure why rebalance should fail even memcached takes long execution time when using swap.

But it sounds like a generic rebalancing problem, that rebalancing may fail if there is swap usage.


Assume it is expected that rebalance fails in this case, I am not sure what XDCR can do from source node. Please advise.


Thanks,

Junyi
Comment by Junyi Xie (Inactive) [ 31/Aug/12 ]
Also sounds it is similar to MB-6284, both are cased by swap usage and caused time out at ns_server and eventually made rebalance failed. Mark it duplicate
Comment by Farshid Ghods (Inactive) [ 31/Aug/12 ]
i see ns-server tried to restart memcached , then i see memcached crashed and then i also see the bug is also being closed as timeout issue.

I am going to assign this bug back to QE to dig deeper.
Comment by Farshid Ghods (Inactive) [ 31/Aug/12 ]
also i find it very hard to believe that when memcached mem_used is reported as only 400 MB and the box has 2 GB ram

[stats:debug,2012-08-30T13:39:15.175,ns_1@10.1.3.238:<0.1206.0>:stats_collector:maybe_log_stats:126](at {{1970,1,16},{5,59,19}}) Stats for bucket "bucket":
curr_items 287939
curr_items_tot 635438
mem_used 429145136


[couchdb:error,2012-08-30T14:16:53.148,ns_1@10.1.3.238:<0.3227.3>:couch_log:error:42]Uncaught error in HTTP request: {exit,
                                 {timeout,
                                  {gen_server,call,
                                   ['ns_memcached-bucket',
                                    {stats,<<>>},
                                    60000]}}}


As alk mentioned there are some swap usage

730582: {swap_used,629829632}]},
743695: {swap_used,629829632}]},
743872: {swap_used,629829632}]},

but again there are as you see 500 MB free memory available and we dont have indexing going on here so not sure why we are using swap


1549173: {free_memory,5756739584},
1550434: {free_memory,5756342272},
1550518: {free_memory,5757095936},
1550601: {free_memory,579661824},
1550682: {free_memory,5757095936},
1550765: {free_memory,579661824},
1551091: {free_memory,5143658496},
1761876: {free_memory,321335296},
1761946: {free_memory,5732896768},


yes this machine has only 2 GB ram but memcached is only using 400 MB and we have 500 MB free space.

math does not add up for me and we can not close this bug until QE digs deepers into logs and find out more about what went wrong


box also has 4 cpus

processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 44
model name : Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
stepping : 2
cpu MHz : 2660.000
cache size : 12288 KB
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc ida nonstop_tsc arat pni ssse3 cx16 sse4_1 sse4_2 popcnt lahf_lm
bogomips : 5320.00
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: [8]
Comment by Farshid Ghods (Inactive) [ 05/Sep/12 ]
Junyi/Alk,

feel free to declare this as won't fix but please look at my latest comment regarding the environment and stats gathered here.
Comment by Junyi Xie (Inactive) [ 05/Sep/12 ]
Based on Fashid's comment.
Comment by Peter Wansch (Inactive) [ 06/Sep/12 ]
Damien is changing the XDCR replication manager initialization code which may address this problem.
Comment by Aleksey Kondratenko [ 06/Sep/12 ]
Apparently Damien is working on this. Thus getting this off my list
Comment by Thuan Nguyen [ 07/Sep/12 ]
Integrated in github-ns-server-2-0 #461 (See [http://qa.hq.northscale.net/job/github-ns-server-2-0/461/])
    MB-6493: Throttle xdc_vbucket_rep initialization (Revision ff26c4fda40322ba2c0eb5d039edf5fedbc282f5)
MB-6493: Add supervisor2 and use for restart of xdc_vbucker_rep (Revision 3f5ee80e654f7b83621e25b4f0f5c52c3a84b6b3)
MB-6493: Fix vb map cache regression when xdcr init fails (Revision 807184854ac3c67740830dc83b5b17170282fe60)

     Result = SUCCESS
pwansch :
Files :
* src/xdc_replication.erl
* src/xdc_vbucket_rep.erl

pwansch :
Files :
* src/xdc_replication.erl
* src/xdc_vbucket_rep.erl
* src/ns_config_default.erl
* src/supervisor2.erl
* include/xdc_replicator.hrl
* src/xdc_vbucket_rep_sup.erl

pwansch :
Files :
* src/xdc_vbucket_rep.erl
Comment by damien [ 07/Sep/12 ]
Abhinav, can you verify if this is still a problem with my latest changes?
Comment by Abhinav Dangeti [ 07/Sep/12 ]
Reproduced the scenario on build 1695: Problem doesn't persist.
Comment by Farshid Ghods (Inactive) [ 08/Sep/12 ]
Ketaki,

are you still running more tests to verify the fix ?

Comment by Ketaki Gangal [ 08/Sep/12 ]
2.0-1700. Rebalance is working as expected, seeing no timeout errors.
Generated at Mon Sep 22 20:17:02 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.