4.1.0-EE vs 4.1.1-EE: indexer too slow

Hello, it’s @grep, but for now I am @egrep (temporarily, I hope, because of Help needed: strange login behavoir / can't login to forum)

So, what do I see for Loading... ?

  1. Closed
  2. Won’t Do

I would like to post the following to JIRA, but the @egrep credentials are not allowed to log in, so…

Jim Walker has explained the problem as “undersizing” (too few CPUs). Let’s look at what is, from my point of view, the core of that explanation:

There are times where we begin a DCP task, get data ready to send, but many of the other background tasks run before the frontend thread gets its chance to run and send the data, hence why sometimes you get pauses on DCP.

I think, there is a “logical flaw” in this statement:

  1. Nothing (I suppose) prevents the same situation on an N-CPU system. When I try to scale this situation (in my imagination) to N CPUs, I see no difference at all: the DCP/frontend task has the same chance of losing its fight for CPU against lots of background tasks on an N-CPU system (if the number of background tasks increases too), because the OS task-scheduling algorithm is the same. (I should mention that I don’t know exactly how the OS task-scheduling algorithm works for N CPUs, and this is a “logical flaw” in my own position.)
  2. This problem, from my point of view, is a QoS problem: some tasks within CB must have “high priority” for execution. Building a more complex solution for prioritizing CB tasks is a real way to solve the described problem.
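To make the QoS idea in point 2 concrete, here is a minimal sketch of a priority-ordered run queue. It is only an illustration of the general technique, not how Couchbase schedules its tasks: the class `PriorityTaskQueue`, the priority constants and the task names are all hypothetical.

```python
import heapq
import itertools

# Hypothetical priority levels (assumption): a lower number runs first.
FRONTEND_SEND = 0
BACKGROUND = 10


class PriorityTaskQueue:
    """Single-worker run queue ordered by priority, FIFO within a priority."""

    def __init__(self):
        self._heap = []
        # Monotonic counter used as a tie-breaker, so tasks with equal
        # priority keep their submission order and callables are never compared.
        self._seq = itertools.count()

    def submit(self, priority, name, fn):
        heapq.heappush(self._heap, (priority, next(self._seq), name, fn))

    def run_all(self):
        """Run every queued task; return task names in execution order."""
        order = []
        while self._heap:
            _, _, name, fn = heapq.heappop(self._heap)
            fn()
            order.append(name)
        return order


def demo_priority():
    q = PriorityTaskQueue()
    # Background work is queued first...
    for i in range(3):
        q.submit(BACKGROUND, f"background-{i}", lambda: None)
    # ...but the latency-sensitive send is picked up before it.
    q.submit(FRONTEND_SEND, "dcp-send", lambda: None)
    return q.run_all()


execution_order = demo_priority()
print(execution_order)
```

With a plain FIFO queue the `dcp-send` task would wait behind all background tasks, which is the kind of pause described in the quote above; with priorities it does not depend on how many background tasks happen to be queued.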

A little bit of emotion:

Hope is not a strategy.

  • Traditional SRE saying (quoted in Site Reliability Engineering: How Google Runs Production Systems)

Should we really hope that “for N CPUs the problem is going to be resolved by the OS kernel scheduler”?

ok, no more emotions
:wink:

So, I think the problem exists, and “all versions of Couchbase are affected under certain circumstances”. A lack of CPUs just makes such circumstances easier to reproduce. But, imho, this is more of a “complex architecture problem”. I assume there was no need to think about QoS for CB tasks before (am I right?). And probably there is no need now, because the “minimum requirement” of 4 CPUs hides the problem: QoS, by definition, is needed when “there is unmanaged low-level contention that does not allow high-level problems to be solved” (for example, the well-known jitter problems of voice traffic without QoS), and the same load causes much less contention on 4 CPUs than on 1 CPU (or on 2 CPUs, because 2 is not enough either: the problem still shows itself).

I would like to thank @vmx, who clarified 2 things for me (with the caveat that @vmx is not a developer of that part of the server):

  1. The frontend threads get priority over DCP (I am still wondering: am I right to assume there is no QoS for CB tasks? Maybe there is something like a preliminary QoS implementation?)
  2. It is better to let DCP block than other operations

Ok, I got it. But maybe now is the occasion to think about implementing backpressure (as one of the steps) toward a “system without pauses caused by overloading”? (Sorry, yes, I understand that it is much easier to give advice than “to sit and implement”.)
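By backpressure I mean roughly the following: the producer is slowed down by a bounded buffer instead of piling up work faster than the consumer can handle it. A minimal generic sketch, assuming a single producer and a single slow consumer (the function `run_pipeline` and its parameters are hypothetical and say nothing about how DCP is actually implemented):

```python
import queue
import threading
import time


def run_pipeline(n_items=10, max_in_flight=2):
    """Push n_items through a bounded buffer; return (received, max_depth)."""
    # Bounded queue: put() blocks while the buffer is full, which is
    # the backpressure signal to the producer.
    buf = queue.Queue(maxsize=max_in_flight)
    received = []
    max_depth = 0

    def producer():
        nonlocal max_depth
        for i in range(n_items):
            buf.put(i)  # waits here whenever the consumer is behind
            max_depth = max(max_depth, buf.qsize())
        buf.put(None)  # sentinel: no more items

    def consumer():
        while True:
            item = buf.get()
            if item is None:
                break
            time.sleep(0.001)  # simulate a slow consumer
            received.append(item)

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received, max_depth


received, max_depth = run_pipeline()
print(len(received), max_depth)
```

The point of the sketch is that the amount of queued work never exceeds `max_in_flight`, so overload shows up as a slower producer rather than as an unbounded backlog followed by a long pause.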

And finally, about the “bad luck for 4.1.X users”. Jim made a good assumption (without proof, but my knowledge of “how it works” makes me agree) that the change of synchronization primitives caused less “forced rescheduling”, which had “helped” versions prior to 4.1.1 work fine. But those changes (especially the use of cmpxchg instead of mutexes) brought much better performance (google it), so it was the right way to go, I think.

So:

  1. Seems like there will be no fixes for 4.1.X that could help “undersizers”; the solution is to buy more CPUs or use 4.1.0 (with all its unresolved issues) :frowning:
  2. The problem still exists (?)

:pensive: