[MB-5546] Increasing the default timeouts on ns_server to avoid rebalance failures due to ep-engine stats timeout issues in large cluster or clusters where some nodes are actively using swap Created: 13/Jun/12 Updated: 09/Jan/13 Resolved: 13/Jun/12 |
|
| Status: | Closed |
| Project: | Couchbase Server |
| Component/s: | ns_server |
| Affects Version/s: | 1.8.1-release-candidate |
| Fix Version/s: | 1.8.1 |
| Security Level: | Public |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Karan Kumar | Assignee: | Aleksey Kondratenko |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Windows small/large cluster
Linux small/large cluster
Bucket 1, default
vbuckets 1024
RAM 18.7G
Nodes 4 (2 form the base-cluster)
Items Setup for 20M items |
||
| Description |
|
Related issues:-
We have multiple bugs related to the timeouts we are hitting in ns_server:
1) When nodes are in swap
2) On Windows, even on a small cluster

This bug is to recommend increasing the default timeouts. We used the following timeouts (in milliseconds) for most of the params. It's not an all-in-one solution, but it should hopefully cover the basic scenarios.

ns_memcached_outer, 60000
ns_memcached_open_checkpoint, 60000
ns_memcached_outer_heavy, 60000
ns_memcached_outer_very_heavy, 120000
ns_memcached_connected, 10000
ebucketmigrator_connect, 60000

Summary of some error messages and the fixes that worked:

1) Rebalance exited with reason {exited} {'EXIT',<0.22700.12>,{timeout,{gen_server,call,[{'ns_memcached-default','ns_1@10.3.2.81'},{stats,<<"tap">>},30000]}}}}
Fix: Adjust timeout value - 120 sec - ns_memcached_outer_very_heavy

2) Rebalance exited with reason {exited, {replicator_died,
Fix: Adjust timeout value - 120 sec - ns_memcached_outer_heavy

3) Rebalance exited with reason {exited, {'EXIT',<0.24287.15>, {timeout, {gen_server,call, [{'ns_memcached-default','ns_1@10.3.2.81'}, {stats,<<"tap">>}, 30000]}}}}
Fix: Adjust timeout to 120 sec

4) Rebalance exited with reason {{change_filter_failed, {'EXIT', {timeout,
Fix: Adjust timeout values - ebucketmigrator_connect 120 secs, ns_memcached_connected 1 sec |
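For illustration only, the named-timeout idea in the description can be sketched as a lookup table with a fallback default. This is not ns_server code: the function names are hypothetical, the 30000 ms default is an assumption taken from the value visible in the gen_server call errors, and the table uses the values from the parameter list in the description (not the per-error fixes, which differ for some keys).

```python
# Assumption: 30000 ms stands in for the old default, as seen in the
# gen_server call errors quoted in the description.
DEFAULT_TIMEOUT_MS = 30000

# Values from the parameter list in the description, in milliseconds.
PROPOSED_TIMEOUTS_MS = {
    "ns_memcached_outer": 60000,
    "ns_memcached_open_checkpoint": 60000,
    "ns_memcached_outer_heavy": 60000,
    "ns_memcached_outer_very_heavy": 120000,
    "ns_memcached_connected": 10000,
    "ebucketmigrator_connect": 60000,
}


def get_timeout_ms(name, overrides=PROPOSED_TIMEOUTS_MS, default=DEFAULT_TIMEOUT_MS):
    """Return the timeout for a named operation, or the default if it has no override."""
    return overrides.get(name, default)


def would_time_out(name, call_duration_ms, overrides=PROPOSED_TIMEOUTS_MS):
    """True if a call taking call_duration_ms would exceed the timeout for `name`."""
    return call_duration_ms > get_timeout_ms(name, overrides)


if __name__ == "__main__":
    # Hypothetical 45 s stats call on a node that is swapping.
    slow_call_ms = 45000
    # With no overrides, the call is checked against the assumed 30 s default.
    print(would_time_out("ns_memcached_outer_very_heavy", slow_call_ms, overrides={}))
    # With the proposed table, it is checked against 120 s.
    print(would_time_out("ns_memcached_outer_very_heavy", slow_call_ms))
```

Keeping each timeout under its own key means one value can be left alone while the others are raised.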
| Comments |
| Comment by Aleksey Kondratenko [ 13/Jun/12 ] |
| I'm a little bit reluctant to change the ns_memcached_connected timeout. It's the timeout we use when asking whether ns_memcached is alive. When it expires, the bucket is just marked as not quite healthy, without failing anything. So raising this timeout would have some effects on autofailover and other things, which is something I don't want to do. |
| Comment by Aleksey Kondratenko [ 13/Jun/12 ] |
| Timeouts were bumped in a commit merged to branch-181 and merged up to master, except, as noted above, the ns_memcached_connected timeout. |
| Comment by Karan Kumar [ 13/Jun/12 ] |
|
Thanks Alk. Duly noted the concerns. http://review.couchbase.org/#change,17230 |
| Comment by Thuan Nguyen [ 13/Jun/12 ] |
|
Integrated in github-ns-server-2-0 #374 (See [http://qa.hq.northscale.net/job/github-ns-server-2-0/374/]) Result = SUCCESS Aliaksey Artamonau : Files : * src/ns_memcached.erl * src/ebucketmigrator_srv.erl |