We have a file of approximately 270 MB containing around 1600 `INSERT` statements, each inserting a single JSON document. Each document is relatively large and includes arrays. Executing this file through `cbq` takes about 45 minutes, which seems excessively slow.
Here’s what we’ve tried so far:
- Parallel Execution: Splitting the file and running multiple `cbq` sessions in parallel reduced execution time, but productionizing this approach would require additional time and effort.
- Batch Inserts: We attempted batching by inserting 10 documents in a single `INSERT` statement (see the sketch after this list), but this didn't improve performance.
- Using `cbimport`: While `cbimport` is significantly faster, it requires the file to be in JSON format, which involves additional preprocessing (a sketch of that preprocessing is at the end of this post).
- Ruling Out Network and Hardware Constraints:
- To eliminate network overhead, we ran the N1QL queries directly on the machine hosting Couchbase Server.
- We placed the file on a RAM disk to minimize disk I/O overhead.
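
For reference, the batched variant we tried looked roughly like the following; `mybucket`, the keys, and the document bodies are simplified placeholders for our actual data:

```sql
INSERT INTO `mybucket` (KEY, VALUE)
VALUES ("doc::0001", {"type": "order", "items": [1, 2, 3]}),
       ("doc::0002", {"type": "order", "items": [4, 5, 6]}),
       /* ... eight more documents elided ... */
       ("doc::0010", {"type": "order", "items": [7, 8, 9]});
```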
Despite these efforts, the performance improvement has been minimal. We are keen to understand why processing takes so long. The documents are large, but not so large that delays of this magnitude seem justified.
Is there any parameter or option in `cbq` that could help speed up the process? We've already tried disabling logging to stdout as well as the other logs, but that didn't help. Any insights would be greatly appreciated!
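
For completeness, here is a minimal sketch of the kind of preprocessing the `cbimport` route would need from us. It assumes one statement per line, shaped like `INSERT INTO \`mybucket\` (KEY, VALUE) VALUES ("<key>", {<json>});`; the regex and the `docid` field are our own assumptions to adapt, not anything `cbimport` mandates:

```python
#!/usr/bin/env python3
"""Convert a file of single-document N1QL INSERT statements into
line-delimited JSON suitable for `cbimport json -f lines`.

Assumes one statement per line, shaped like:
  INSERT INTO `mybucket` (KEY, VALUE) VALUES ("<key>", {<json>});
Adjust the regex if the real file differs.
"""
import json
import re
import sys

# Hypothetical statement shape -- tune this to the actual file.
STMT = re.compile(
    r'VALUES\s*\(\s*"(?P<key>[^"]+)"\s*,\s*(?P<doc>\{.*\})\s*\)\s*;?\s*$'
)

def convert(src_path: str, dst_path: str) -> None:
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            m = STMT.search(line)
            if not m:
                continue  # skip blank lines and anything that isn't an INSERT
            doc = json.loads(m.group("doc"))  # also validates the JSON payload
            doc["docid"] = m.group("key")     # carry the key for -g %docid%
            dst.write(json.dumps(doc) + "\n")

if __name__ == "__main__":
    convert(sys.argv[1], sys.argv[2])
```

The output could then be loaded with something along the lines of `cbimport json -c couchbase://127.0.0.1 -u Administrator -p password -b mybucket -d file:///tmp/docs.jsonl -f lines -g %docid%` (credentials, bucket, and paths are placeholders).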