We are not able to reproduce the issue. The panic is coming from an UPDATE statement.
If possible, can you run all the UPDATE statements you run, one by one, and tell us which one causes the panic?
It's hard to do them one by one, also because the data has changed already.
The other issue is that I believe load/scale has something to do with it - I think if I were to run all these updates one at a time it won't happen. But we run multiple processes/clients - and that's when the problem starts.
I will try the following: stop all clients, change max-para, start a single client - and wait 1-2 minutes. If the problem doesn't show, start a second client, etc.
Hi there.
Any update on this or MB-32412?
We are now seeing a phenomenon where some queries return what seems to be old or outdated data (this happens with both the Python API and N1QL).
Again, it's hard to pinpoint and reproduce, as it happens only a small percentage of the time and only when the system is loaded with a large number of insert/update processes - however - I have a feeling this might be related to the query service continuously restarting.
thanks
We are not able to reproduce internally.
Are there any other logs I can supply to help?
Hi @uris - just having a look at the stack, I do not believe that a single-document update is what is causing the panic - you need multiple documents being updated within the same statement where, crucially, a single document is part of the result set twice, and in quick succession.
Do you have any updates with a WHERE clause rather than USE KEYS, or an update with an array as part of the USE KEYS clause?
If yes, could you provide the update and a sample document?
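For example, the kind of statement I am asking about would look roughly like this (a sketch only - the second key is hypothetical, the first is one you posted earlier):
/* illustrative sketch - an UPDATE whose USE KEYS clause takes an array of keys */
UPDATE events
USE KEYS ["0001546896940788-42af08fd-65f5-0001",
          "0001546896940788-42af08fd-65f5-0002"]
SET picked_up_by = '01';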
We are building multiple sub-systems (based on our current system and DB) to work with Couchbase. There are multiple processes. The main two buckets for the purposes of this discussion are events and users. Events come in from different sources and then get processed by several Python scripts (multiple instances running in parallel, but they lock the events and do not process the same one). Events that are pending processing are picked up using a view.
Below I have gathered the main UPDATE statements (which run at high frequency).
/* --------------------------------------- */
UPSERT INTO `{BUCKET_NAME}` (key _k, value _v)
SELECT
"depl::hourly::{DAY}_" || in1.id _k,
{
"id": in1.id,
"cnt_action1" : SUM(in1.action1_count),
"cnt_action2" : SUM(in1.action2_count)
} _v
FROM
(
SELECT
CASE WHEN e.`type` == "action1" THEN 1 ELSE 0 END as `action1_count`,
CASE WHEN e.`type` == "action2" THEN 1 ELSE 0 END as `action2_count`,
CASE WHEN e.`id` IS NOT NULL THEN TOSTRING(e.`id`) ELSE "-1" END AS `id`
FROM
events e
WHERE
(e.`type` in ["action1","action2"] AND e.`subtype`="redirect")
AND e.event_time_millis >= {EVENT_TIME_FROM}
AND e.event_time_millis < {EVENT_TIME_TO}
) in1
GROUP BY
in1.`id`
ORDER BY
in1.`id`
RETURNING META().id, *;
/* --------------------------------------- */
UPDATE
events
SET
picked_up_for_processing=1547074751327, picked_up_by='01'
WHERE
meta().id in ['0001546896940788-42af08fd-65f5-0001',...]
AND (processed!=true OR processed is missing) AND picked_up_for_processing is missing
LIMIT 200;
/* --------------------------------------- */
UPDATE `events`
SET `processed`=true , `processed_by_types`=IFMISSING(processed_by_types, {}), `processed_by_types`.`internal_lists`=true, `processed_by_types`.`account_stats`=true
UNSET picked_up_for_processing, picked_up_by
WHERE meta().id in ['0001546896940788-42af08fd-65f5-0001', ...];
/* --------------------------------------- */
UPDATE
users
SET
users.array1 = ARRAY_PUT(IFMISSING(users.array1, []), 1),
users.array2 = ARRAY_PUT(IFMISSING(users.array2, []), "item1"),
users.array3 = ARRAY_PUT(IFMISSING(users.array3, []), 8)
WHERE meta().id = 'some_user_id' ;
/* --------------------------------------- */
UPDATE
users USE KEYS 'some_user_id'
SET
users.events = ARRAY_APPEND(IFMISSING(users.events, []), "0001546896940788-42af08fd-65f5-0001"),
last_millis = GREATEST(last_millis, 1546893180000),
last = MILLIS_TO_LOCAL(GREATEST(last_millis, 1546893180000), "1111-11-11 11:11:11"),
cnt = IFMISSING(cnt, 0)+1,
is_signup = True
RETURNING *;
/* --------------------------------------- */
UPDATE events
UNSET picked_up_for_processing, picked_up_by
WHERE type='type1'
AND IFMISSING(processed, false)=false
AND IFMISSING(picked_up_for_processing, NOW_MILLIS())<DATE_ADD_MILLIS(NOW_MILLIS(), -6, "hour")
LIMIT 10000;
/* --------------------------------------- */
UPDATE `users` SET `processed`=true WHERE meta().id in ['some_user_id',...];
/* --------------------------------------- */
I can explain any of the queries in more detail. We also have some processes using the Python Couchbase API (non-N1QL - some using the SD API).
Thanks for looking into this.
Guys - any help here would be appreciated.
Also - it might not be something that happens to a lot of people, but clearly there is a problem somewhere: by simply running queries that work fine most of the time, we manage to make the service panic and restart every few minutes.
This is clearly a bug somewhere and needs attention.
BTW - we just upgraded the whole cluster to 6.0.0 in the desperate hope this would help - it didn't.
We need the statement that caused the panic and a sample document.
Can you run all the UPDATE statements you run, one by one, and tell us which one causes the panic?
The system runs several million updates a day against each of the two buckets. It will be hard to find the one that causes this crash.
Also, I'm 99.9% sure that if I ran them manually one at a time, nothing would crash. It's the fact that we run these from multiple processes on multiple client machines at a high rate that causes it.
The UPDATEs with meta().id and the IN clause look interesting.
Is it possible that the array in the WHERE clause contains the same document key twice?
Could you share a sample document?
I believe the schema of your document may give me a clue as to the source of the issue.
I doubt there are duplicate IDs in those, as the list of IDs comes from a view. But we will double-check.
Re sharing the documents, it's not straightforward, as there are multiple document types in the events bucket (and then there's the user doc). Also, the docs contain some proprietary and private data that I will need to clean before sharing, so I need to know what it is you're looking for.
Also, even if there were a duplicate ID in the array - are you saying that
SET x.a = x.a + 1
WHERE meta().id IN [1,2,2,3];
could break things? (If that were the case, it sounds like a bug in the language.) I don't think the same doc gets updated twice in such a case. IN is IN - it's either in or out.
Also tested this:
SELECT SUM(1)
WHERE 'a' IN ['a','b','a'];
-- returns: [ { "$1": 1 } ]
-- (and not 2)
Also - we just verified, and all the queries with meta().id IN [..] have unique IDs.
@Marco_Greco - can you expand on the WHERE issue you thought was a potential problem here - the 'IN' part - considering my later comments?
Thanks
Hi @uris - the long and the short of it is that, from the stack, it seems like while one document is being marshalled (transformed from the internal format to JSON) to be written to disk, the same document is being modified by another thread.
In general terms, different requests do not share documents (meaning that, if you fetch the same document in two different requests, each gets its own copy).
Index scans only ever return a document once, so normally any meta().id predicate that gets pushed down is not an issue.
Even for clauses like
- meta().id IN ['something', 'something']
- meta().id = 'something' OR meta().id = 'something'
we normalize the spans, and each term only appears in one scan.
This leaves two options open (sketched just below):
- USE KEYS ['something', 'something']
- meta().id IN [$1, $2], where $1 and $2 passed from the client are the same value
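For illustration, those two cases would look roughly like this (the keys and the processed field are placeholders borrowed from the statements above):
/* case 1: the same key listed twice in USE KEYS */
UPDATE events USE KEYS ['k1', 'k1']
SET processed = true;
/* case 2: two placeholders that happen to resolve to the same key */
UPDATE events
SET processed = true
WHERE meta().id IN [$1, $2]; -- with the client passing $1 = 'k1' and $2 = 'k1'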
We have logic in place to detect this kind of situation at the Fetch operator level, and make a copy of the document, and my thinking is that we are either not detecting it properly, or not making the copy properly.
Of course there may be other scenarios where the same key is returned by the scan twice, but I haven't thought of one.
Hence the request for the actual statement and a sample document.
Like I wrote before - we fire millions of UPDATEs and I can't tell you which one(s) trigger the panic.
But I can say that the chance of the SAME update including the same ID twice is practically zero, as in ALL the requests I mentioned above with 'meta().id IN', the array is a list of doc IDs returned from a Couchbase view - which can't (shouldn't?) return the same doc twice.
What can happen, in odd cases, is two different processes running their own updates, but with some overlap of IDs. If the database can't handle two requests updating the same record at the same time with locking - I think you have a potentially bigger problem. It could be that our specific combination of implementation and scale (large scale with heavy reliance on multiple updates to the same docs) is triggering an edge case that most other customers haven't seen.
Hi @uris - I should clarify: whenever I refer to a document, I refer to a document value. A value is an internal N1QL structure that is created in various circumstances - in the context of this discussion, when a document is read - and is then passed around in a pipeline of operators executing a specific request.
Values are not shared between requests, and are not supposed to be shared between operators in the pipeline - this has no relation to concurrent access of documents in the KV.
In this specific case a value is being accessed by two operators - which is wrong. One is the update_send operator, which is assembling a document that will later be sent to the KV for storage; the other operator is unknown, but is presumably something following the fetch (the document value does not exist as such before the fetch, only the key).
We need a little hand isolating what the culprit could be - or whether the document value copies are being done incorrectly.
I understand that you cannot determine the statement - it is unfortunate that the thread executing the function reporting the issue sits outside the n1ql package, and is therefore not aware of the statement text (otherwise it would be in the log) - but it would help us, to start with, to know whether you are using placeholders or assembling the text with the keys on the fly.
I am asking because the statement is assembled differently in each case, and knowing which may help me put together a repro.
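For clarity, the two forms I mean are roughly these (sketches only, reusing the events bucket and a key from your earlier posts; the $keys parameter name is made up):
/* form 1: keys assembled into the statement text on the fly */
UPDATE events
SET processed = true
WHERE meta().id IN ['0001546896940788-42af08fd-65f5-0001', '...'];
/* form 2: the statement text is fixed, the keys arrive as a placeholder */
UPDATE events
SET processed = true
WHERE meta().id IN $keys; -- $keys supplied by the client as a JSON array of doc IDs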
Hi Marco,
It got a bit too technical for me for a sec... but re placeholders - we were not using placeholders at all until very recently; in the last few days we converted one or two queries to prepared statements and placeholders as an experiment. So this was happening without placeholders.
Hi @uris, I may have a fix. A new build is on its way for you to try...