Details
Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: 2.0
Fix Version/s: 2.0.1
Component/s: ns_server, view-engine
Security Level: Public
Labels:
Flagged: Release Note
Description
SUBJ.
In many diags we're seeing occasional timeouts here and there. Sometimes, and perhaps most of the time, they don't affect correct operation of the product. After all, Erlang is famous for its fault resiliency.
But sometimes they cause rebalance to fail. E.g. see MB-7166, where mb_master, which supervised ns_orchestrator, which in turn supervised the rebalancer, died due to a timeout. According to Erlang's normal error handling behavior, that caused its restart. But part of the restart was shutting down its child processes, including, obviously, the rebalancer.
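The failure chain above can be sketched as a toy model. This is not ns_server code and not Erlang: the `Proc` class, the `simulate` function, and the 5000 ms timeout are all assumptions made for illustration. Only the process names (mb_master, ns_orchestrator, rebalancer) and the order of events come from this ticket. The sketch shows why a timeout in the top process loses the in-flight rebalance: restarting the parent stops every child first, and only the permanent children come back.

```python
class CallTimeout(Exception):
    """Stands in for the exit raised when a synchronous call times out."""


class Proc:
    """Hypothetical stand-in for a supervised process."""

    def __init__(self, name, children=(), permanent=True):
        self.name = name
        self.children = list(children)
        self.permanent = permanent  # False: started on demand, not on restart
        self.alive = False

    def stop(self, log):
        # Children are shut down before their parent.
        for child in reversed(self.children):
            child.stop(log)
        if self.alive:
            self.alive = False
            log.append("stop " + self.name)

    def start(self, log):
        self.alive = True
        log.append("start " + self.name)
        # Only permanent children are brought back automatically.
        for child in self.children:
            if child.permanent:
                child.start(log)


def simulate(call_ms, timeout_ms=5000):
    """Return (event log, whether the rebalancer survived)."""
    log = []
    rebalancer = Proc("rebalancer", permanent=False)
    orchestrator = Proc("ns_orchestrator", [rebalancer])
    master = Proc("mb_master", [orchestrator])

    master.start(log)
    rebalancer.start(log)  # a rebalance is requested and is now in flight

    try:
        # Assumed: some call made by mb_master takes call_ms to answer.
        if call_ms > timeout_ms:
            raise CallTimeout("call made by mb_master timed out")
    except CallTimeout:
        # mb_master dies and is restarted; its whole subtree goes down first.
        master.stop(log)
        master.start(log)

    return log, rebalancer.alive


if __name__ == "__main__":
    events, rebalance_survived = simulate(call_ms=6000)
    for event in events:
        print(event)
    print("rebalance survived:", rebalance_survived)
```

In this model a slow call (6000 ms against the assumed 5000 ms limit) leaves mb_master and ns_orchestrator running again after the restart, but the rebalancer is gone, which matches the rebalance failure described here.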
In my personal experience this is quite easy to hit on physical hardware and spinning disks. But apparently we're now getting it on Xen and SSDs, as well as potentially (MB-7152) on physical hardware and SSDs.
MB-6595, which we started hitting before switching off async io threads. We're investigating it as well, and so far we have some evidence that it is somehow related (we're not sure how) to a somewhat excessive rate of page allocations/deallocations by Erlang.
Anyway, MB-6595 is most likely a distinct issue. In my own experience timeouts are easier to hit with async io threads off, so there seem to be two causes of timeouts. And this ticket (lack of async io threads) seems like the worse one.