Service 'query' exited with status 134. Restarting. Messages:

CB Enterprise Edition 5.5.0 build 2958.
15 Data-only nodes, 4 Index-only nodes, 4 Query-only nodes, all with 32 GB RAM.

We have been getting a lot of the above errors/failures lately.
(And the beauty is: in cbq you just get the prompt back with no error response; in the GUI you get “unexpected error”.)

I tested one of the queries that gets this a lot, and it works fine if I run it over a smaller subset of data (e.g. only events from the past 10 days), but if I grow the horizon (say even to 15 days) it starts to fail. If I shut down all other systems interfacing with Couchbase, it works fine, so obviously something is competing for resources.

I saw that people mentioned OOM for a similar issue, but I don’t think that’s it. I don’t think we see any OOM-related errors. Also, from the admin GUI, the RAM % on the Query servers never goes above 50%. (BTW, why can’t we see CPU/RAM utilization graphs for a given server as a whole, and not just per bucket?)

In the GUI log it shows as:

*base).sendItem(0xc422993a40, 0x18d0980, 0xc420dac240, 0xe31340)
goproj/src/ +0x57 fp=0xc421040db8 sp=0xc421040d78
*IndexScan3).RunOnce.func1()
goproj/src/ +0x5e4 fp=0xc421040f50 sp=0xc421040db8
*Once).Do(0xc422993b38, 0xc420b10f88)
goproj/src/ +0x68 fp=0xc421040f78 sp=0xc421040f50
*IndexScan3).RunOnce(0xc422993a40, 0xc420927080, 0x0, 0x0)
goproj/src/ +0x9d fp=0xc421040fc0 sp=0xc421040f78
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.8.5/go/src/runtime/asm_amd64.s:2197 +0x1 fp=0xc421040fc8 sp=0xc421040fc0
created by *base).runConsumer.func1
goproj/src/ +0x2f6
[goport(/opt/couchbase/bin/cbq-engine)] 2018/08/24 12:56:52 child process exited with status 134
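For reference, exit status 134 follows the Unix convention that a status above 128 means the process was killed by signal (status − 128); 134 − 128 = 6, i.e. SIGABRT, which is typically what a Go process raises on a fatal runtime error or unrecovered panic. A quick sketch of the decoding (plain Python, nothing Couchbase-specific):

```python
import signal

# Exit status reported by goport for the cbq-engine child process.
status = 134

# Unix convention: a status above 128 means "killed by signal (status - 128)".
signum = status - 128
print(signum, signal.Signals(signum).name)  # 6 SIGABRT
```

So the engine isn’t returning a query error at all; the process itself is aborting, which matches the empty response in cbq and the “unexpected error” in the GUI.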

Also attached is a section of query.log from when this happens. (60.6 KB)

Any idea?

Post the query, its EXPLAIN output, and the index definition. cc @Marco_Greco

We have many different queries that get this, and it happens in only some percentage of executions.

I’m not trying to resolve this particular query.
My question is: from looking at the errors posted, what is actually breaking (e.g. OOM, timeout, etc.), and what can be done to improve the cluster to handle these cases?

Hi @uris, thank you for providing the log.
I think I have identified an issue, and if you were to provide this particular query, it would help me confirm it and fix it.

I have fixed the issue in 5.5.2 and later.
FYR, it’s MB-31049.
You should be able to download the binary in a couple of weeks or so.