Adding Alk's informative email on various options for 2.0.0...
---------------
Folks, we have observed that the idle CPU consumption of our current product is 15-20 percent of a fairly fast CPU, and we've found that running 128 Erlang scheduler threads is the cause.
I'm also aware that the perf folks found up to a 50% performance drop when we enabled 128 Erlang schedulers.
It looks like we should consider either a different number of schedulers or a different setup altogether.
I think the main difficulty is the fairly long cycle needed to verify whether a particular setting works. Given that, I cannot easily propose anything.
Our options are:
* delay any reaction until after 2.0
* decrease to 64 or 32 scheduler threads for a more acceptable level of overhead
* try the single scheduler queue option of the Erlang VM with 12-16 scheduler threads (more on that below)
* find a way to make async threads work, e.g. by kidnapping Erlang VM developers and threatening/torturing them :)
* split the Erlang VM, allowing the more latency-insensitive couchdb side of our project to run with async IO threads off. Yes, it's non-trivial work and it's late.
* something else that's smarter and perhaps crazier
Clearly, any of these options requires extensive testing.
Regarding option three:
I've found that Erlang normally has one runqueue per scheduler thread. That is supposedly the more scalable setup: since by default each scheduler thread runs on its own dedicated CPU, the CPUs do not have to touch a shared data structure.
But R14 still has a way to request a single runqueue. I have no idea whether it works. Notably, R15 no longer has this option, as of the following commit:
commit 8781932b3b8769b6f208ac7c00471122ec7dd055
Author: Rickard Green <rickard@erlang.org>
Date: Fri Nov 18 15:19:46 2011 +0100

    Remove common run-queue in SMP case

    The common run-queue implementation is removed since it is unused,
    untested, undocumented, unsupported, and only complicates the code.
    A spinlock used by the run-queue management sometimes got heavily
    contended. This code has now been rewritten, and the spinlock
    has been removed.

But if it works, it would solve the potential delays caused by schedulers doing blocking IO.
I.e. assume we have a bunch of runnable processes. In the default mode they are assigned to scheduler threads, potentially many per scheduler. We've seen that when a scheduler is blocked in IO (which happens when async IO is disabled), its runqueue is not served by any other scheduler, which causes some processes to be starved and delayed. There is most likely some sort of work stealing between the scheduler threads' runqueues, but apparently it does not work in this particular use case. In other words, the inherently unfair queuing of runnable processes is what causes the delays. To prevent this from happening, we decided to be very generous with the Erlang scheduler thread count. We did that assuming there was little overhead, which is clearly not true.
If we have a single shared runqueue, then non-IO processes can be starved only if _all_ scheduler threads are busy doing IO. That is inherently fairer and thus allows a much lower scheduler thread count.
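
To make the fairness argument concrete, here is a toy model in Python. It is only a sketch and not a model of the real Erlang VM: it assumes no work stealing at all (a simplification of the behaviour described above), and the function names, process names, and round-robin placement are invented for illustration.

```python
def starved_with_per_scheduler_queues(process_to_scheduler, blocked):
    """Return the processes whose own scheduler is blocked in IO.

    Assumption: no work stealing, so a process queued on a blocked
    scheduler cannot be picked up by any other scheduler.
    """
    return sorted(p for p, s in process_to_scheduler.items() if s in blocked)


def starved_with_shared_queue(processes, n_schedulers, blocked):
    """Return the starved processes when all schedulers share one runqueue.

    Any free scheduler can run any process, so processes are starved
    only when every scheduler is blocked.
    """
    free = n_schedulers - len(blocked)
    return sorted(processes) if free == 0 else []


if __name__ == "__main__":
    n_schedulers = 4
    # Hypothetical round-robin placement of 8 runnable processes.
    placement = {f"p{i}": i % n_schedulers for i in range(8)}
    blocked = {1}  # scheduler 1 is stuck in a blocking IO call

    print("per-scheduler queues:", starved_with_per_scheduler_queues(placement, blocked))
    print("shared queue:        ", starved_with_shared_queue(placement, n_schedulers, blocked))
```

With one of four schedulers blocked, the per-scheduler model leaves the two processes placed on that scheduler waiting, while the shared-queue model leaves none waiting until all four schedulers are blocked.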
128 schedulers:
CPU CMD
42% beam.smp
32% beam.smp
32% beam.smp
31% beam.smp
64 schedulers:
CPU CMD
23% beam.smp
18% beam.smp
18% beam.smp
18% beam.smp
32 schedulers:
CPU CMD
17% beam.smp
13% beam.smp
13% beam.smp
13% beam.smp
16 schedulers:
CPU CMD
11% beam.smp
9% beam.smp
9% beam.smp
8% beam.smp
8 schedulers:
CPU CMD
10% beam.smp
6% beam.smp
6% beam.smp
6% beam.smp
4 schedulers:
CPU CMD
8% beam.smp
6% beam.smp
5% beam.smp
5% beam.smp
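
For a quick comparison of the listings above, the following Python sketch averages the four beam.smp rows for each scheduler count. The numbers are copied from the listings; treating the four rows as comparable samples that can be averaged is my assumption, and the names `MEASUREMENTS`, `mean_cpu`, and `summarize` are made up for this example.

```python
# CPU percentages copied from the listings above:
# scheduler count -> the four beam.smp rows reported for it.
MEASUREMENTS = {
    128: [42, 32, 32, 31],
    64: [23, 18, 18, 18],
    32: [17, 13, 13, 13],
    16: [11, 9, 9, 8],
    8: [10, 6, 6, 6],
    4: [8, 6, 5, 5],
}


def mean_cpu(samples):
    """Arithmetic mean of a list of CPU percentages."""
    return sum(samples) / len(samples)


def summarize(measurements):
    """Map each scheduler count to the mean CPU percentage of its rows."""
    return {n: mean_cpu(rows) for n, rows in measurements.items()}


if __name__ == "__main__":
    for n, avg in sorted(summarize(MEASUREMENTS).items(), reverse=True):
        print(f"{n:>3} schedulers: mean {avg:.2f}% CPU per beam.smp row")
```

Under that assumption the mean falls from 34.25% at 128 schedulers to 9.25% at 16, and going from 16 down to 4 only brings it to 6%.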