[RN 2.0.1][longevity] Something unknown is causing severe timeouts in ns_server, particularly while building views and/or during compaction, which causes rebalance to fail and other kinds of badness. (MB-6595)

[MB-7212] Rebalance stopped by janitor when xdcr replication is created on source cluster (Resetting rebalance status since it's not really running) Created: 18/Nov/12  Updated: 07/May/13  Resolved: 30/Apr/13

Status: Closed
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.0
Fix Version/s: 2.1.0
Security Level: Public

Type: Technical task Priority: Major
Reporter: Andrei Baranouski Assignee: Andrei Baranouski
Resolution: Fixed Votes: 0
Labels: 2.0-release-notes
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: GZip Archive 10.3.121.112-8091-diag.txt.gz     GZip Archive 10.3.121.113-8091-diag.txt.gz     GZip Archive 10.3.121.115-8091-diag.txt.gz     GZip Archive 10.3.121.116-8091-diag.txt.gz     PNG File replication_buckets.png     PNG File replication.png    

 Description   
steps:

1. cluster #1 (2 nodes: 10.3.121.112, 10.3.121.113)
3 buckets with ~700K items in each:
default,
sasl:sasl (replica=2),
standard

cluster #2 (10.3.121.115)
3 buckets:
 default1:default1 (replica=2)
 sasl (no password, replicas disabled)

2. on cluster #2, rebalance in 10.3.121.116

3. during the rebalance, create XDCR replications on 10.3.121.112:

default    this cluster bucket "default1" on cluster "http://10.3.121.115/"   Replicating   (Last 10 errors)
sasl       this cluster bucket "sasl" on cluster "http://10.3.121.115/"       Starting Up
standard   this cluster bucket "standard" on cluster "http://10.3.121.115/"   Replicating   (Last 10 errors)

and on 10.3.121.115 create:
standard   this cluster bucket "standard" on cluster "http://10.3.121.112:8091/"   Replicating

See the attached screenshots for details.

result:

[ns_server:debug,2012-11-18T4:22:25.816,ns_1@10.3.121.115:ns_config_log<0.338.0>:ns_config_log:log_common:111]config change:
rebalance_status ->
{none,<<"Rebalance stopped by janitor.">>}
[ns_server:info,2012-11-18T4:22:25.992,ns_1@10.3.121.115:<0.5807.4>:compaction_daemon:spawn_vbucket_compactor:639]Compacting <<"default1/209">>
[ns_server:info,2012-11-18T4:22:26.405,ns_1@10.3.121.115:<0.5839.4>:compaction_daemon:spawn_vbucket_compactor:639]Compacting <<"default1/210">>
[ns_server:info,2012-11-18T4:22:26.725,ns_1@10.3.121.115:<0.5860.4>:compaction_daemon:spawn_vbucket_compactor:639]Compacting <<"default1/211">>
[ns_server:info,2012-11-18T4:22:27.334,ns_1@10.3.121.115:<0.5915.4>:compaction_daemon:spawn_vbucket_compactor:639]Compacting <<"default1/212">>
[ns_server:info,2012-11-18T4:22:27.810,ns_1@10.3.121.115:<0.5937.4>:compaction_daemon:spawn_vbucket_compactor:639]Compacting <<"default1/213">>
[ns_server:info,2012-11-18T4:22:28.086,ns_1@10.3.121.115:<0.5980.4>:compaction_daemon:spawn_vbucket_compactor:639]Compacting <<"default1/214">>
[ns_server:error,2012-11-18T4:22:28.602,ns_1@10.3.121.115:ns_memcached-standard<0.21850.0>:ns_memcached:handle_info:630]handle_info(ensure_bucket,..) took too long: 853680 us
[ns_server:info,2012-11-18T4:22:28.785,ns_1@10.3.121.115:<0.6043.4>:compaction_daemon:spawn_vbucket_compactor:639]Compacting <<"default1/215">>
[ns_server:error,2012-11-18T4:22:29.450,ns_1@10.3.121.115:<0.9097.0>:ns_memcached:verify_report_long_call:297]call {get_meta,<<"default111-5c34112c98cb69d2">>,883} took too long: 551954 us
[ns_server:error,2012-11-18T4:22:29.451,ns_1@10.3.121.115:<0.21854.0>:ns_memcached:verify_report_long_call:297]call {get_meta,<<"standard-5e1b18c4c6a6d316">>,751} took too long: 549644 us
[ns_server:info,2012-11-18T4:22:29.652,ns_1@10.3.121.115:<0.6111.4>:compaction_daemon:spawn_vbucket_compactor:639]Compacting <<"default1/216">>
[ns_server:info,2012-11-18T4:22:29.918,ns_1@10.3.121.115:<0.6151.4>:compaction_daemon:spawn_vbucket_compactor:639]Compacting <<"default1/217">>
[ns_server:error,2012-11-18T4:22:29.968,ns_1@10.3.121.115:ns_memcached-sasl<0.15429.0>:ns_memcached:handle_info:630]handle_info(ensure_bucket,..) took too long: 1629801 us
[ns_server:info,2012-11-18T4:22:30.052,ns_1@10.3.121.115:<0.6192.4>:compaction_daemon:spawn_vbucket_compactor:639]Compacting <<"default1/218">>
[ns_server:info,2012-11-18T4:22:30.400,ns_1@10.3.121.115:<0.6233.4>:compaction_daemon:spawn_vbucket_compactor:639]Compacting <<"default1/219">>
[ns_server:info,2012-11-18T4:22:30.762,ns_1@10.3.121.115:<0.6261.4>:compaction_daemon:spawn_vbucket_compactor:639]Compacting <<"default1/220">>
[ns_server:info,2012-11-18T4:22:30.900,ns_1@10.3.121.115:<0.6285.4>:compaction_daemon:spawn_vbucket_compactor:639]Compacting <<"default1/221">>
[ns_server:info,2012-11-18T4:22:31.548,ns_1@10.3.121.115:<0.6348.4>:compaction_daemon:spawn_vbucket_compactor:639]Compacting <<"default1/222">>
[ns_server:error,2012-11-18T4:22:31.882,ns_1@10.3.121.115:<0.9099.0>:ns_memcached:verify_report_long_call:297]call {stats,<<>>} took too long: 627784 us
[ns_server:info,2012-11-18T4:22:31.920,ns_1@10.3.121.115:<0.6451.4>:compaction_daemon:spawn_vbucket_compactor:639]Compacting <<"default1/223">>
[ns_server:error,2012-11-18T4:22:31.947,ns_1@10.3.121.115:<0.21855.0>:ns_memcached:verify_report_long_call:297]call topkeys took too long: 5994193 us
[stats:warn,2012-11-18T4:22:32.332,ns_1@10.3.121.115:<0.21861.0>:stats_collector:latest_tick:223]Dropped 6 ticks
[ns_server:info,2012-11-18T4:22:32.661,ns_1@10.3.121.115:<0.6494.4>:compaction_daemon:spawn_vbucket_compactor:639]Compacting <<"default1/224">>
[ns_server:info,2012-11-18T4:22:33.516,ns_1@10.3.121.115:<0.6536.4>:compaction_daemon:spawn_vbucket_compactor:639]Compacting <<"default1/225">>
[ns_server:error,2012-11-18T4:22:33.848,ns_1@10.3.121.115:<0.21855.0>:ns_memcached:verify_report_long_call:297]call {stats,<<>>} took too long: 692405 us
[ns_server:info,2012-11-18T4:22:34.464,ns_1@10.3.121.115:<0.6579.4>:compaction_daemon:spawn_vbucket_compactor:639]Compacting <<"default1/226">>
[ns_server:error,2012-11-18T4:22:34.947,ns_1@10.3.121.115:<0.21855.0>:ns_memcached:verify_report_long_call:297]call list_vbuckets took too long: 700928 us
[stats:warn,2012-11-18T4:22:35.000,ns_1@10.3.121.115:<0.21861.0>:stats_collector:latest_tick:223]Dropped 1 ticks
[ns_server:info,2012-11-18T4:22:35.363,ns_1@10.3.121.115:<0.6615.4>:compaction_daemon:spawn_vbucket_compactor:639]Compacting <<"default1/227">>
[ns_server:info,2012-11-18T4:22:35.662,ns_1@10.3.121.115:<0.6639.4>:compaction_daemon:spawn_vbucket_compactor:639]Compacting <<"default1/228">>
[ns_server:error,2012-11-18T4:22:36.373,ns_1@10.3.121.115:<0.15433.0>:ns_memcached:verify_report_long_call:297]call topkeys took too long: 4081382 us
[ns_server:info,2012-11-18T4:22:36.576,ns_1@10.3.121.115:<0.6716.4>:compaction_daemon:spawn_vbucket_compactor:639]Compacting <<"default1/229">>
[ns_server:info,2012-11-18T4:22:37.330,ns_1@10.3.121.115:<0.6747.4>:compaction_daemon:spawn_vbucket_compactor:639]Compacting <<"default1/230">>
[ns_server:info,2012-11-18T4:22:37.680,ns_1@10.3.121.115:<0.6772.4>:compaction_daemon:spawn_vbucket_compactor:639]Compacting <<"default1/231">>
[ns_server:info,2012-11-18T4:22:37.960,ns_1@10.3.121.115:<0.6809.4>:compaction_daemon:spawn_vbucket_compactor:639]Compacting <<"default1/232">>
[ns_server:error,2012-11-18T4:22:37.843,ns_1@10.3.121.115:ns_memcached-default1<0.9095.0>:ns_memcached:handle_info:630]handle_info(ensure_bucket,..) took too long: 1210948 us
[ns_server:info,2012-11-18T4:22:38.306,ns_1@10.3.121.115:<0.6865.4>:compaction_daemon:spawn_vbucket_compactor:639]Compacting <<"default1/233">>
[ns_server:info,2012-11-18T4:22:38.777,ns_1@10.3.121.115:<0.6932.4>:compaction_daemon:spawn_vbucket_compactor:639]Compacting <<"default1/234">>
[ns_server:error,2012-11-18T4:22:38.955,ns_1@10.3.121.115:<0.15434.0>:ns_memcached:verify_report_long_call:297]call {stats,<<>>} took too long: 843491 us
[ns_server:error,2012-11-18T4:22:39.098,ns_1@10.3.121.115:ns_memcached-standard<0.21850.0>:ns_memcached:handle_info:630]handle_info(ensure_bucket,..) took too long: 1347335 us
[ns_server:error,2012-11-18T4:22:39.272,ns_1@10.3.121.115:ns_memcached-sasl<0.15429.0>:ns_memcached:handle_info:630]handle_info(ensure_bucket,..) took too long: 913668 us
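The "took too long" lines above come from ns_server timing each memcached call and logging an error when the duration crosses a threshold (the `verify_report_long_call` frames in the log). A minimal Python sketch of such a long-call watchdog, purely illustrative (the threshold value and function names here are assumptions, not the actual Erlang implementation):

```python
import logging
import time
from functools import wraps

logging.basicConfig(level=logging.WARNING)
# Assumed threshold for illustration; the real value lives inside ns_server.
LONG_CALL_THRESHOLD_US = 500_000

def report_long_call(fn):
    """Run fn normally, but log an error if it exceeds the threshold,
    in the spirit of ns_memcached:verify_report_long_call."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            return fn(*args, **kwargs)
        finally:
            elapsed_us = int((time.monotonic() - start) * 1_000_000)
            if elapsed_us > LONG_CALL_THRESHOLD_US:
                logging.error("call %s took too long: %d us",
                              fn.__name__, elapsed_us)
    return wrapper

@report_long_call
def get_meta(key):
    # Simulate a memcached call slowed down by compaction / view building load.
    time.sleep(0.6)
    return {"key": key}
```

The call still succeeds; the watchdog only reports it, which is why the logs show many "took too long" errors without the operations themselves failing.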




                           {cluster_compat_version,
                            [{'_vclock',[{'ns_1@127.0.0.1',{1,63520456425}}]},
                             2,0]},
                           {counters,
                            [{'_vclock',
                              [{'ns_1@10.3.121.115',{1,63520459705}}]},
                             {rebalance_start,1}]},
                           {dynamic_config_version,
                            [{'_vclock',[{'ns_1@127.0.0.1',{4,63520456425}}]},
                             2,0]},
                           {email_alerts,
                            [{'_vclock',[{'ns_1@127.0.0.1',{1,63520456424}}]},
                             {recipients,["root@localhost"]},
                             {sender,"couchbase@localhost"},
                             {enabled,false},
                             {email_server,
                              [{user,[]},
                               {pass,[]},
                               {host,"localhost"},
                               {port,25},
                               {encrypt,false}]},
                             {alerts,
                              [auto_failover_node,
                               auto_failover_maximum_reached,
                               auto_failover_other_nodes_down,
                               auto_failover_cluster_too_small,ip,disk,
                               overhead,ep_oom_errors,
                               ep_item_commit_failed]}]},
                           {fast_warmup,
                            [{fast_warmup_enabled,true},
                             {min_memory_threshold,10},
                             {min_items_threshold,10}]},
                           {index_aware_rebalance_disabled,false},
                           {max_bucket_count,10},
                           {memory_quota,
                            [{'_vclock',
                              [{'ns_1@127.0.0.1',{1,63520456426}}]}|
                             3082]},
                           {nodes_wanted,
                            [{'_vclock',
                              [{'ns_1@10.3.121.115',{2,63520459703}}]},
                             'ns_1@10.3.121.115','ns_1@10.3.121.116']},
                           {otp,
                            [{'_vclock',
                              [{'ns_1@10.3.121.116',{2,63520459703}},
                               {'ns_1@127.0.0.1',{1,63520456425}}]},
                             {cookie,khxnkzsiiskgxwef}]},
                           {rebalance_status,
                            [{'_vclock',
                              [{'ns_1@10.3.121.115',{2,63520460544}}]}|
                             {none,<<"Rebalance stopped by janitor.">>}]},
                           {rebalancer_pid,
                            [{'_vclock',
                              [{'ns_1@10.3.121.115',{2,63520460544}}]}|
                             undefined]},
                           {remote_clusters,
                            [{'_vclock',
                              [{'ns_1@10.3.121.115',{1,63520459931}}]},
                             [{uuid,<<"274a6cf7338d4686038a03878a2a176d">>},
                              {name,"http://10.3.121.112:8091/"},
                              {hostname,"10.3.121.112"},
                              {username,"Administrator"},
                              {password,"password"}]]},
                           {replication,[{enabled,true}]},
                           {rest,[{port,8091}]},


[ns_server:info,2012-11-18T4:22:23.761,ns_1@10.3.121.115:<0.5617.4>:compaction_daemon:spawn_vbucket_compactor:639]Compacting <<"default1/203">>
[ns_server:info,2012-11-18T4:22:24.425,ns_1@10.3.121.115:<0.5681.4>:compaction_daemon:spawn_vbucket_compactor:639]Compacting <<"default1/204">>
[user:info,2012-11-18T4:22:24.496,ns_1@10.3.121.115:ns_config<0.331.0>:ns_janitor:maybe_stop_rebalance_status:147]Resetting rebalance status since it's not really running


After the failure, the XDCR replication status for the sasl bucket remains "Starting Up", and only 871 items were transferred.




 Comments   
Comment by Steve Yen [ 20/Nov/12 ]
moved to 2.0.1 per bug scrub
Comment by kzeller [ 05/Dec/12 ]
Added to RN as:

   One of the internal maintenance processes within Couchbase Server for checking vBuckets
   will stop node rebalance when you create replication via XDCR on a source cluster.
Comment by Junyi Xie (Inactive) [ 24/Jan/13 ]
This looks to me like a rebalance issue (rebalance stopped while the system is under some workload) and unrelated to the XDCR component; XDCR just exposed the issue.

Removing xdcr from the components and watching this bug.
Comment by Aleksey Kondratenko [ 24/Jan/13 ]
Agree with Junyi. It's about timeouts, not directly about xdcr.

However, XDCR evidently pushes the system more heavily into a timeout-prone state.

Let me also note that it's perhaps fixed already by +A and other timeout-fixing commits.
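(For context: "+A" refers to the Erlang VM flag that sizes the async I/O thread pool; raising it keeps slow disk operations, e.g. during compaction, from blocking the schedulers. An illustrative fragment of how the flag is passed when launching the VM; the value 16 is an assumption, not what Couchbase shipped:)

```
# Illustrative only: launch the Erlang VM with a larger async I/O thread pool.
# In a Couchbase install this flag would live in the VM arguments used to
# start ns_server, not be typed by hand.
erl +A 16
```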
Comment by Aleksey Kondratenko [ 24/Jan/13 ]
Worth re-testing
Comment by Maria McDuff (Inactive) [ 27/Mar/13 ]
QE --- pls retest in latest 2.0.2 build. Thanks.
Comment by Maria McDuff (Inactive) [ 29/Apr/13 ]
andrei, can u pls re-test and update this bug? Thanks.
Comment by Andrei Baranouski [ 30/Apr/13 ]
rebalance passed with the same steps on 2.0.2-789
Comment by Andrei Baranouski [ 30/Apr/13 ]
Maria, can we close it now?
Comment by Maria McDuff (Inactive) [ 06/May/13 ]
yes, pls close. thanks.
Generated at Thu Oct 23 10:28:49 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.