Max-parallelism (and all other admin settings) going back to defaults all the time

We are not able to reproduce the issue. The panic is coming from an UPDATE statement.
If possible, can you run all the UPDATE statements you issue one by one and tell us which one causes the panic?

It’s hard to do them one by one, also because the data has already changed.

The other issue is that I believe load/scale has something to do with it - I think that if I were to run all these updates one at a time, it wouldn’t happen. But we run multiple processes/clients, and that’s when the problem starts.

I will try the following: stop all clients, change max-parallelism, start a single client, and wait 1-2 minutes. If the problem does not show, start a second client, and so on.

Hi there.

Any update on this or MB-32412?

We are now seeing a phenomenon where some queries return what seems to be old or outdated data (this happens both via the Python API and N1QL).
Again, it’s hard to pinpoint and reproduce, as it happens only a small percentage of the time and only when the system is loaded with a large number of insert/update processes - however, I have a feeling this might be related to the query service continuously restarting.

thanks

We are not able to reproduce internally.

Are there any other logs I can supply to help?

Hi @uris - just having a look at the stack, I do not believe that a single document update is what is causing the panic - you need multiple documents being updated within the same statement, where, crucially, a single document needs to be part of the result set twice, in quick succession.
Do you have any updates with a WHERE clause rather than USE KEYS, or an update with an array as part of the USE KEYS clause?
If yes, could you provide the update and a sample document?
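To illustrate the two shapes of statement I’m asking about, here is just a sketch (the bucket, field, and key names below are made up, not taken from your workload):

/* an UPDATE driven by a WHERE clause (index scan) rather than USE KEYS */
UPDATE `some_bucket` SET flagged = true WHERE `type` = "action1";

/* an UPDATE with an array of keys in the USE KEYS clause */
UPDATE `some_bucket` USE KEYS ["key-1", "key-2", "key-3"] SET flagged = true;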


We are building multiple sub-systems (based on our current system and DB) to work with Couchbase. There are multiple processes. The two main buckets for this purpose are events and users. Events come in from different sources and are then processed by several Python scripts (multiple instances running in parallel, but they lock the events and do not process the same one). The events that are pending processing are picked up using a view.

Below I have gathered the main UPDATE statements (the ones that run at high frequency).

/* --------------------------------------- */

UPSERT INTO `{BUCKET_NAME}` (key _k, value _v)
SELECT
	"depl::hourly::{DAY}_" || in1.id  _k,
	{
		"id": in1.id,
		"cnt_action1"    : SUM(in1.action1_count),
		"cnt_action2"     : SUM(in1.action2_count) 
    } _v
FROM
(
	SELECT
		CASE WHEN e.`type` == "action1" THEN 1 ELSE 0 END as `action1_count`,
		CASE WHEN e.`type` == "action2" THEN 1 ELSE 0 END as `action2_count`,		
		CASE WHEN e.`id` IS NOT NULL THEN TOSTRING(e.`id`) ELSE "-1" END AS `id`
	FROM
		events e 
	WHERE 
		(e.`type` in ["action1","action2"] AND e.`subtype`="redirect")	
		AND e.event_time_millis >= {EVENT_TIME_FROM}
		AND e.event_time_millis <  {EVENT_TIME_TO}	
) in1
GROUP BY
	in1.`id`
ORDER BY
	in1.`id`
RETURNING META().id, *;

/* --------------------------------------- */

UPDATE 
    events
SET 
    picked_up_for_processing=1547074751327, picked_up_by='01' 
WHERE 
    meta().id in ['0001546896940788-42af08fd-65f5-0001',...]
    AND (processed!=true OR processed is missing) AND picked_up_for_processing is missing
LIMIT 200;

/* --------------------------------------- */

UPDATE `events` 
SET `processed`=true , `processed_by_types`=IFMISSING(processed_by_types, {}), `processed_by_types`.`internal_lists`=true, `processed_by_types`.`account_stats`=true 
UNSET picked_up_for_processing, picked_up_by 
WHERE meta().id in ['0001546896940788-42af08fd-65f5-0001', ...];

/* --------------------------------------- */

UPDATE
    users
SET 
    users.array1 = ARRAY_PUT(IFMISSING(users.array1, []), 1),
    users.array2 = ARRAY_PUT(IFMISSING(users.array2, []), "item1"), 
    users.array3 = ARRAY_PUT(IFMISSING(users.array3, []), 8)
WHERE meta().id = 'some_user_id' ;

/* --------------------------------------- */

UPDATE
    users USE KEYS 'some_user_id' 
SET 
    users.events = ARRAY_APPEND(IFMISSING(users.events, []), "0001546896940788-42af08fd-65f5-0001"),
    last_millis = GREATEST(last_millis, 1546893180000),
    last = MILLIS_TO_LOCAL(GREATEST(last_millis, 1546893180000), "1111-11-11 11:11:11"),
    cnt = IFMISSING(cnt, 0)+1,
    is_signup = True
RETURNING *;

/* --------------------------------------- */

UPDATE events
UNSET picked_up_for_processing, picked_up_by
WHERE type='type1'
    AND IFMISSING(processed, false)=false
    AND IFMISSING(picked_up_for_processing, NOW_MILLIS())<DATE_ADD_MILLIS(NOW_MILLIS(), -6, "hour")
LIMIT 10000;

/* --------------------------------------- */

UPDATE `users` SET `processed`=true WHERE meta().id in ['some_user_id',...];

/* --------------------------------------- */

I can explain any of the queries in more detail. We also have some processes using the Python Couchbase API (non-N1QL - some using the SD API).

Thanks for looking into this.

Guys - any help here would be appreciated.

Also - it might not be something that happens to many people, but clearly there is a problem somewhere: simply by running queries that work fine most of the time, we manage to make the service panic and restart every few minutes.

This is clearly a bug somewhere and needs attention.

BTW - we just upgraded the whole cluster to 6.0.0 in the desperate hope this would help - it didn’t.

We need the statement that caused the panic and a sample document.
Can you run all the UPDATE statements you issue one by one and tell us which one causes the panic?

The system runs several million updates a day in each of the two buckets. It will be hard to run them and find the one that causes this crash.
Also, I’m 99.9% sure that if I ran them manually one at a time, nothing would crash. It’s the fact that we run them from multiple processes on multiple client machines at a high rate that causes it.

The UPDATEs with meta().id and the IN clause look interesting.
Is it possible that the array in the WHERE clause contains the same document key twice?
Could you share a sample document?
I believe the schema of your document may give me a clue as to the source of the issue.

I doubt there are duplicate IDs in those, as the list of IDs comes from a view. But we will double check.

Re sharing the documents, it’s not straightforward, as there are multiple document types in the events bucket (and then there’s the user doc). Also, the docs contain some proprietary and privacy-sensitive fields that I will need to clean before sharing, so I need to know what it is you’re looking for.

Also, even if there were a duplicate ID in the array, are you saying that

UPDATE `some_bucket`
SET x.a = x.a + 1
WHERE meta().id IN [1, 2, 2, 3];

could break things? (If that were the case, it would sound like a bug in the language.) I don’t think the same doc gets updated twice in such a case. IN is IN - it’s either in or out.

Also tested this:

SELECT SUM(1)
WHERE 'a' IN ['a','b','a'];
-- returns:
[ { "$1": 1 } ]

(and not 2)

Also - we just verified, and all the queries with meta().id IN [..] have unique IDs.

@Marco_Greco - can you expand on the WHERE issue you thought was a potential problem here - the IN part - considering my later comments?

Thanks

Hi @uris - so the long and the short of it is that, from the stack, it seems like while one document is being marshalled (transformed from the internal format to JSON) to be written to disk, the same document is being modified by another thread.

In general terms, different requests do not share documents (meaning that, if you fetch the same document in two different requests, each gets its own copy).

Index scans only ever return a document once, so normally any meta().id predicate pushed to the scan is not an issue.
Even for clauses like

  • meta().id IN ["something", "something"]
  • meta().id = "something" OR meta().id = "something"

we normalize the spans, and each term only appears in one scan.
This leaves two options open (sketched below):

  • USE KEYS ["something", "something"]
  • meta().id IN [$1, $2]
    but $1 and $2 passed from the client are the same value
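To make those concrete, here are sketched statements of the kind I have in mind (bucket, field, and key names are made up for illustration):

/* a USE KEYS clause carrying the same key twice */
UPDATE `some_bucket` USE KEYS ["key-1", "key-1"] SET flagged = true;

/* placeholders that the client happens to bind to the same value */
UPDATE `some_bucket` SET flagged = true WHERE meta().id IN [$1, $2];
/* ...executed with $1 = "key-1" and $2 = "key-1" */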

We have logic in place to detect this kind of situation at the Fetch operator level, and make a copy of the document, and my thinking is that we are either not detecting it properly, or not making the copy properly.
Of course there may be other scenarios where the same key is returned by the scan twice, but I haven’t thought of them.
Hence the request for the actual statement and a sample document.

As I wrote before - we fire millions of UPDATEs and I can’t tell you which one(s) trigger the panic.
But I can say that the chance of the SAME update including the same ID twice is practically zero, as in ALL the requests I mentioned above with “meta().id IN”, the array is a list of doc IDs returned from a Couchbase view - which can’t (shouldn’t?) return the same doc twice.

What can happen, in odd cases, is that two different processes run their own updates with some overlap in IDs. If the database can’t handle two requests updating the same record at the same time with locking, I think you have a potentially bigger problem. It could be that our specific combination of implementation and scale (large scale with heavy dependency on multiple updates to the same docs) is triggering an edge case that most other customers haven’t seen.

Hi @uris - I should clarify: whenever I refer to a document, I refer to a document value. A value is an internal N1QL structure that is created in various circumstances - in the context of this discussion, when a document is read and then passed around in a pipeline of operators executing a specific request.

Values are not shared between requests, and are not supposed to be shared between operators in the pipeline - this has no relation to concurrent access of documents in the KV.

In this specific case a value is being accessed by two operators - which is wrong - one being the update_send operator, which is assembling a document that will later be sent to the KV for storage; the other operator is unknown, but presumably something following the fetch (the document value does not exist as such before the fetch, only the key).

We need a little help isolating what could be the culprit - or determining whether the document value copies are done incorrectly.

I understand that you cannot determine the statement - it is unfortunate that the thread executing the function reporting the issue sits outside the N1QL package, and is therefore not aware of the statement text (otherwise it would be in the log) - but it would help us, to start with, to know whether you are using placeholders or assembling the statement text with the keys on the fly.

I am asking this because the statement is assembled differently in each case, and knowing which may help me put together a repro.
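In other words, which of these two styles is closer to what your client does? This is only a sketch - the bucket, field, and key names are invented:

/* style 1: a fixed statement text with a placeholder, keys passed as a parameter */
UPDATE `some_bucket` SET flagged = true WHERE meta().id IN $keys;
/* ...prepared once and executed with, e.g., $keys = ["key-1", "key-2"] */

/* style 2: the key list is spliced into the statement text by the client on the fly */
UPDATE `some_bucket` SET flagged = true WHERE meta().id IN ["key-1", "key-2"];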

Hi Marco,

It got a bit too technical for me for a second… but re placeholders - we were not using placeholders at all until very recently; in the last few days we converted one or two queries to prepared statements with placeholders. So this was happening without placeholders.

Hi @uris, I may have a fix. A new build is on its way for you to try…