Query node goes down with single query triggered over workbench

Hi all,

I have created a CTE-based N1QL query that works fine in lower environments, where the data volume is very low compared to PROD.
We have deleted the primary index, so the query picks up whichever existing index it can (we have not created a specific index for this particular query).

When I ran this query in the PROD environment, one of my query nodes went down and was failed over.

My doubt is: even if the query is not picking the right index and the data is huge, isn't the expected behaviour for the query to time out (kept at the default timeout of 600 seconds) rather than to bring the entire query node (a single service running on 34 GB RAM) down and fail it over?

We can only tune a query for indexes if the query service stays up and we can re-run the query more than once to tune it further.

Please suggest.

A CTE must materialize all of its data, and that data must remain in memory until the end of the query. Depending on the query, it may need additional memory beyond that. Run your CTE separately and see how many documents it generates. Avoid unnecessary fields in the CTE. If the result is huge, you may want to consider a JOIN instead of a CTE.
All the results are held in memory. In 7.0 there is a per-request memory_quota; see The N1QL Query Language Now Has a Per-Request Memory Quota.
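For example, the CTE can be run on its own to gauge its size before embedding it in the larger query (a sketch using illustrative field names; adjust to your documents):

```sql
-- Count how many distinct values the CTE would materialize.
-- Every one of these is held in memory for the life of the request.
SELECT COUNT(DISTINCT cte.refId) AS cteSize
FROM bucket1 AS cte
WHERE cte.docType = 'xxx'
      AND cte.refNo IS NOT MISSING
      AND cte.stage <> 'Concluded';
```

If cteSize is large, that memory is held per request, so several concurrent requests multiply the footprint.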

@bora.mohan The Index service has a memory quota and is very good at staying within that limit, but the Query service does not have similar enforcement to limit memory usage. The Index service can also spill over to disk. Thus it may be to your benefit to create a secondary index for this query.

This is my query:

WITH nonConcludedRefId AS (SELECT DISTINCT RAW cte.refId
                           FROM bucket1 AS cte
                           WHERE cte.docType = 'xxx'
                                 AND cte.refNo IS NOT MISSING
                                 AND cte.stage <> 'Concluded')
SELECT META().id AS docId, *
FROM bucket2
USE KEYS (SELECT DISTINCT RAW META(a).id
          FROM bucket1 AS i
          INNER JOIN bucket2 AS a USE HASH(BUILD) ON i.ref = a.ref
          LET x = (SELECT RAW c FROM nonConcludedRefId AS c)
          WHERE i.docType = 'xxx'
                AND a.docType = 'yyy'
                AND i.stage = 'Concluded'
                AND a.ref NOT IN x
                AND i.refNo IS NOT MISSING
                AND STR_TO_MILLIS(a.modifiedTimestamp) < STR_TO_MILLIS(DATE_ADD_STR(CLOCK_LOCAL(), -10, 'day'))
                AND a.modifiedTimestamp BETWEEN '2021-07-01' AND '2021-07-02');

The CTE returns approximately ~1k refIds.
Bucket2 has ~90k docs in the date range; bucket1 has ~10k records to inner-join with bucket2.

WITH dockeys AS (WITH nonConcludedRefId AS ( SELECT DISTINCT RAW cte.refId
                                             FROM bucket1 AS cte
                                             WHERE cte.docType="xxx" AND cte.refNo IS NOT MISSING
                                                   AND cte.stage <> "Concluded")
                 SELECT DISTINCT RAW meta(a).id
                 FROM bucket1 AS i
                 INNER JOIN bucket2 AS a USE HASH(BUILD) ON i.ref = a.ref
                 WHERE i.docType="xxx"
                       AND a.docType="yyy"
                       AND (i.stage = "Concluded")
                       AND i.refNo IS NOT MISSING
                       AND a.ref NOT IN nonConcludedRefId
                       AND a.modifiedTimestamp < DATE_ADD_STR(CLOCK_LOCAL(), -10, "day")
                       AND a.modifiedTimestamp BETWEEN "2021-07-01" AND "2021-07-02")
SELECT META().id AS docId, *
FROM bucket2 USE KEYS dockeys;

CREATE INDEX ix1 ON bucket1(stage,refNo, ref, refId ) WHERE docType="xxx";
CREATE INDEX ix2 ON bucket2(modifiedTimestamp, ref) WHERE docType="yyy";

Also try PROBE instead of BUILD
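To illustrate, only the hint changes (a sketch against the same join as above):

```sql
-- USE HASH(PROBE) makes bucket2 (a) the probe side, so the
-- in-memory hash table is built from bucket1 (i) instead.
FROM bucket1 AS i
INNER JOIN bucket2 AS a USE HASH(PROBE) ON i.ref = a.ref
```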

The BUILD side should be the smallest. Which side that is depends on how many documents each keyspace produces after its constant predicates are applied.

I hope you are using 6.5; see IN List Handling Improvements in Couchbase Server 6.5 | The Couchbase Blog.

OR

Avoid the JOIN altogether:

WITH nonConcludedRefIds AS ( SELECT DISTINCT RAW cte.refId
                            FROM bucket1 AS cte
                            WHERE cte.docType="xxx" AND cte.refNo IS NOT MISSING
                                   AND cte.stage <> "Concluded"),
ConcludedRefIds AS ( SELECT DISTINCT RAW cte1.ref
                    FROM bucket1 AS cte1
                    WHERE cte1.docType="xxx" AND cte1.refNo IS NOT MISSING
                          AND cte1.stage = "Concluded")
SELECT meta(a).id, a.*
FROM bucket2 AS a
WHERE a.docType="yyy"
      AND a.ref NOT IN nonConcludedRefIds
      AND a.ref IN ConcludedRefIds
      AND a.modifiedTimestamp < DATE_ADD_STR(CLOCK_LOCAL(), -10, "day")
      AND a.modifiedTimestamp BETWEEN "2021-07-01" AND "2021-07-02";

CREATE INDEX ix1 ON bucket1(stage, refNo, ref, refId ) WHERE docType="xxx";
CREATE INDEX ix2 ON bucket2(modifiedTimestamp) WHERE docType="yyy";

Thank you for the queries, but I am still not sure why the node had to go down instead of the query timing out.
I had planned to optimise my query, provided it merely timed out or performed poorly.
How do I ensure that a node will not go down in future, as there will be ad-hoc queries executed in the system to serve customers?
If the primary index is deleted and the query is running on a less-than-ideal index (almost all predicate fields are indexed), that should give me some room to analyse the query rather than bringing the node down.

Yes, we are using v6.5 EE

Before the query timeout was reached, the OOM killer killed the process. As the blog explains, you need 7.0 and a request-level memory quota.
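For example (a sketch; the host, credentials, and statement are placeholders, and memory_quota is in MB, available from 7.0 onward):

```shell
# Set a per-request memory quota via the Query REST API.
# If the request exceeds the quota it fails with an error
# instead of exhausting the node's memory.
curl -u Administrator:password http://localhost:8093/query/service \
     -d 'statement=SELECT META().id FROM bucket2 WHERE docType = "yyy"' \
     -d 'memory_quota=100'
```

The same quota can also be set per session in cbq or cluster-wide in the query settings, so ad-hoc workbench queries get a safety net without changing each request.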