Max-parallelism (and all other admin settings) going back to defaults all the time

We are not able to reproduce the issue. The panic is coming from an UPDATE statement.
If possible, can you run all the UPDATE statements you issue one by one and tell us which one causes the panic?

It’s hard to do them one by one, also because the data has already changed.

The other issue is that I believe load/scale has something to do with it - I think that if I were to run all these updates one at a time, it wouldn’t happen. But we run multiple processes/clients, and that’s when the problem starts.

I will try the following: stop all clients, change max-parallelism, start a single client, and wait 1-2 minutes. If the problem does not show, start a second client, and so on.

Hi there.

Any update on this or MB-32412?

We are now seeing a phenomenon where some queries return what seems to be old or outdated data (this happens both via the Python API and N1QL).
Again, it’s hard to pinpoint and reproduce, as it happens only a small percentage of the time and only when the system is loaded with a large number of insert/update processes - however, I have a feeling this might be related to the query service continuously restarting.

thanks

We are not able to reproduce internally.

Are there any other logs I can supply to help?

Hi @uris - just having a look at the stack, I do not believe that a single document update is what is causing the panic - you need multiple documents being updated within the same statement, where, crucially, a single document needs to be part of the result set twice, in quick succession.
Do you have any updates with a WHERE clause rather than USE KEYS, or an update with an array as part of the USE KEYS clause?
If yes, could you provide the update and a sample document?
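To illustrate the two shapes of statement I’m asking about, here is just a sketch (the bucket, field, and key names below are made up, not taken from your workload):

/* an UPDATE driven by a WHERE clause (index scan) rather than USE KEYS */
UPDATE `some_bucket` SET flagged = true WHERE `type` = "action1";

/* an UPDATE with an array of keys in the USE KEYS clause */
UPDATE `some_bucket` USE KEYS ["key-1", "key-2", "key-3"] SET flagged = true;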


We are building multiple sub-systems (based on our current system and DB) to work with Couchbase. There are multiple processes. The two main buckets for this purpose are events and users. Events come in from different sources and are then processed by several Python scripts (multiple instances running in parallel, but they lock the events and do not process the same one). The events that are pending processing are picked up using a view.

Below I have gathered the main UPDATE statements (the ones that run at high frequency).

/* --------------------------------------- */

UPSERT INTO `{BUCKET_NAME}` (key _k, value _v)
SELECT
	"depl::hourly::{DAY}_" || in1.id  _k,
	{
		"id": in1.id,
		"cnt_action1"    : SUM(in1.action1_count),
		"cnt_action2"     : SUM(in1.action2_count) 
    } _v
FROM
(
	SELECT
		CASE WHEN e.`type` == "action1" THEN 1 ELSE 0 END as `action1_count`,
		CASE WHEN e.`type` == "action2" THEN 1 ELSE 0 END as `action2_count`,		
		CASE WHEN e.`id` IS NOT NULL THEN TOSTRING(e.`id`) ELSE "-1" END AS `id`
	FROM
		events e 
	WHERE 
		(e.`type` in ["action1","action2"] AND e.`subtype`="redirect")	
		AND e.event_time_millis >= {EVENT_TIME_FROM}
		AND e.event_time_millis <  {EVENT_TIME_TO}	
) in1
GROUP BY
	in1.`id`
ORDER BY
	in1.`id`
RETURNING META().id, *;

/* --------------------------------------- */

UPDATE 
    events
SET 
    picked_up_for_processing=1547074751327, picked_up_by='01' 
WHERE 
    meta().id in ['0001546896940788-42af08fd-65f5-0001',...]
    AND (processed!=true OR processed is missing) AND picked_up_for_processing is missing
LIMIT 200;

/* --------------------------------------- */

UPDATE `events` 
SET `processed`=true , `processed_by_types`=IFMISSING(processed_by_types, {}), `processed_by_types`.`internal_lists`=true, `processed_by_types`.`account_stats`=true 
UNSET picked_up_for_processing, picked_up_by 
WHERE meta().id in ['0001546896940788-42af08fd-65f5-0001', ...];

/* --------------------------------------- */

UPDATE
    users
SET 
    users.array1 = ARRAY_PUT(IFMISSING(users.array1, []), 1),
    users.array2 = ARRAY_PUT(IFMISSING(users.array2, []), "item1"), 
    users.array3 = ARRAY_PUT(IFMISSING(users.array3, []), 8)
WHERE meta().id = 'some_user_id' ;

/* --------------------------------------- */

UPDATE
    users USE KEYS 'some_user_id' 
SET 
    users.events = ARRAY_APPEND(IFMISSING(users.events, []), "0001546896940788-42af08fd-65f5-0001"),
    last_millis = GREATEST(last_millis, 1546893180000),
    last = MILLIS_TO_LOCAL(GREATEST(last_millis, 1546893180000), "1111-11-11 11:11:11"),
    cnt = IFMISSING(cnt, 0)+1,
    is_signup = True
RETURNING *;

/* --------------------------------------- */

UPDATE events
UNSET picked_up_for_processing, picked_up_by
WHERE type='type1'
    AND IFMISSING(processed, false)=false
    AND IFMISSING(picked_up_for_processing, NOW_MILLIS())<DATE_ADD_MILLIS(NOW_MILLIS(), -6, "hour")
LIMIT 10000;

/* --------------------------------------- */

UPDATE `users` SET `processed`=true WHERE meta().id in ['some_user_id',...];

/* --------------------------------------- */

I can explain any of the queries in more detail. We also have some processes using the Python Couchbase API (non-N1QL - some using the SD API).

Thanks for looking into this.

Guys - any help here would be appreciated.

Also - it might not be something that happens to many people, but clearly there is a problem somewhere: simply by running queries that work fine most of the time, we manage to make the service panic and restart every few minutes.

This is clearly a bug somewhere and needs attention.

BTW - we just upgraded the whole cluster to 6.0.0 in the desperate hope this would help - it didn’t.

We need the statement that caused the panic and a sample document.
Can you run all the UPDATE statements you issue one by one and tell us which one causes the panic?

The system runs several million updates a day in each of the two buckets. It will be hard to run them and find the one that causes this crash.
Also, I’m 99.9% sure that if I ran them manually one at a time, nothing would crash. It’s the fact that we run them from multiple processes on multiple client machines at a high rate that causes it.

The UPDATEs with meta().id and the IN clause look interesting.
Is it possible that the array in the WHERE clause contains the same document key twice?
Could you share a sample document?
I believe the schema of your document may give me a clue as to the source of the issue.

I doubt there are duplicate IDs in those, as the list of IDs comes from a view. But we will double check.

Re sharing the documents, it’s not straightforward, as there are multiple document types in the events bucket (and then there’s the user doc). Also, the docs contain some proprietary and privacy-sensitive fields that I will need to clean before sharing, so I need to know what it is you’re looking for.

Also, even if there were a duplicate ID in the array, are you saying that

UPDATE `some_bucket`
SET x.a = x.a + 1
WHERE meta().id IN [1, 2, 2, 3];

could break things? (If that were the case, it would sound like a bug in the language.) I don’t think the same doc gets updated twice in such a case. IN is IN - it’s either in or out.

Also tested this:

SELECT SUM(1)
WHERE 'a' IN ['a','b','a'];
-- returns:
[ { "$1": 1 } ]

(and not 2)

Also - we just verified, and all the queries with meta().id IN [..] have unique IDs.

@Marco_Greco - can you expand on the WHERE issue you thought was a potential problem here - the IN part - considering my later comments?

Thanks

Hi @uris - so the long and the short of it is that, from the stack, it seems like while one document is being marshalled (transformed from the internal format to JSON) to be written to disk, the same document is being modified by another thread.

In general terms, different requests do not share documents (meaning that, if you fetch the same document in two different requests, each gets its own copy).

Index scans only ever return a document once, so normally any meta().id predicate pushed to the scan is not an issue.
Even for clauses like

  • meta().id IN ["something", "something"]
  • meta().id = "something" OR meta().id = "something"

we normalize the spans, and each term only appears in one scan.
This leaves two options open (sketched below):

  • USE KEYS ["something", "something"]
  • meta().id IN [$1, $2]
    but $1 and $2 passed from the client are the same value
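To make those concrete, here are sketched statements of the kind I have in mind (bucket, field, and key names are made up for illustration):

/* a USE KEYS clause carrying the same key twice */
UPDATE `some_bucket` USE KEYS ["key-1", "key-1"] SET flagged = true;

/* placeholders that the client happens to bind to the same value */
UPDATE `some_bucket` SET flagged = true WHERE meta().id IN [$1, $2];
/* ...executed with $1 = "key-1" and $2 = "key-1" */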

We have logic in place to detect this kind of situation at the Fetch operator level, and make a copy of the document, and my thinking is that we are either not detecting it properly, or not making the copy properly.
Of course there may be other scenarios where the same key is returned by the scan twice, but I haven’t thought of them.
Hence the request for the actual statement and a sample document.

As I wrote before - we fire millions of UPDATEs and I can’t tell you which one(s) trigger the panic.
But I can say that the chance of the SAME update including the same ID twice is practically zero, as in ALL the requests I mentioned above with “meta().id IN”, the array is a list of doc IDs returned from a Couchbase view - which can’t (shouldn’t?) return the same doc twice.

What can happen, in odd cases, is that two different processes run their own updates with some overlap in IDs. If the database can’t handle two requests updating the same record at the same time with locking, I think you have a potentially bigger problem. It could be that our specific combination of implementation and scale (large scale with heavy dependency on multiple updates to the same docs) is triggering an edge case that most other customers haven’t seen.

Hi @uris - I should clarify: whenever I refer to a document, I refer to a document value. A value is an internal N1QL structure that is created in various circumstances - in the context of this discussion, when a document is read and then passed around in a pipeline of operators executing a specific request.

Values are not shared between requests, and are not supposed to be shared between operators in the pipeline - this has no relation to concurrent access of documents in the KV.

In this specific case a value is being accessed by two operators - which is wrong - one being the update_send operator, which is assembling a document that will later be sent to the KV for storage; the other operator is unknown, but presumably something following the fetch (the document value does not exist as such before the fetch, only the key).

We need a little help isolating what could be the culprit - or determining whether the document value copies are done incorrectly.

I understand that you cannot determine the statement - it is unfortunate that the thread executing the function reporting the issue sits outside the N1QL package, and is therefore not aware of the statement text (otherwise it would be in the log) - but it would help us, to start with, to know whether you are using placeholders or assembling the statement text with the keys on the fly.

I am asking this because the statement is assembled differently in each case, and knowing which may help me put together a repro.
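In other words, which of these two styles is closer to what your client does? This is only a sketch - the bucket, field, and key names are invented:

/* style 1: a fixed statement text with a placeholder, keys passed as a parameter */
UPDATE `some_bucket` SET flagged = true WHERE meta().id IN $keys;
/* ...prepared once and executed with, e.g., $keys = ["key-1", "key-2"] */

/* style 2: the key list is spliced into the statement text by the client on the fly */
UPDATE `some_bucket` SET flagged = true WHERE meta().id IN ["key-1", "key-2"];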

Hi Marco,

It got a bit too technical for me for a second… but re placeholders - we were not using placeholders at all until very recently; in the last few days we converted one or two queries to prepared statements with placeholders. So this was happening without placeholders.

Hi @uris, I may have a fix. A new build is on its way for you to try…