N1ql query service dies silently

moon0326 · October 15, 2016, 3:30am

Hello,

While importing data into my local development setup, I noticed that the query service died silently. I have a test script that reproduces it on my local server.

My local couchbase and bucket configuration

Cluster RAM Quota
Data RAM Quota: 8586 MB
Index RAM Quota: 1000 MB

Bucket
Per Node RAM Quota: 2236

I’ve uploaded my test script at https://github.com/moon0326/cbn1qltest

Steps to replicate

1. Prepare a bucket and name it ‘test’
2. npm install
3. node load.js
4. node test.js

Step 4 makes the server unreachable for about 30 seconds, after which it becomes accessible again. The problem is that the query service does not come back up, and nothing is logged.

When I access cbq, it throws the following error message.

 ERROR 100 : Get http://localhost:8093/admin/clusters/default/nodes: dial tcp 127.0.0.1:8093: connection refused

Here are screenshots: http://imgur.com/a/c7P3M

I have a few questions.

1. Why is it going down? Without a log, I have no clue at all.
2. The query service does not seem to come back up, yet the server shows “UP” with a green mark. If this happens, how do we detect it?
3. How can I prevent this?

geraldss · October 15, 2016, 2:33pm

Which Couchbase version? @prasad @keshav_m

moon0326 · October 15, 2016, 7:37pm

Hello,

I tested it on 4.5.0 enterprise.

Thanks,
Moon

geraldss · October 16, 2016, 4:12pm

Got it. So you are killing the server deliberately as part of a test. Can you put a delay after the server comes back up, say 30 seconds? Maybe there is a delay in the server starting up the query service. Also, how do you know that the Query Service was enabled initially?

moon0326 · October 16, 2016, 11:34pm

Hello,

Well, that wasn’t my intention at first. I ran into the problem while running my real import script, then wrote this test script to confirm it.

Before running the test script, I’ve confirmed that the query tab comes up as well as cbq. They both worked just fine.

I waited to see if the query service comes up, but it did not.

My only concern is that there would be no way to find out if this happened in prod, because the server reports “up” and nothing appears in the log tab.
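One way to notice this condition from outside the server is to probe the query port directly instead of relying on the node status. The following is a minimal sketch, assuming the symptom is the one cbq reported above (connections to port 8093 being refused or hanging). probePort is a hypothetical helper written for this example, not part of Couchbase or its SDK, and the 2 second timeout is an arbitrary choice.

```javascript
const net = require('net');

// Hypothetical helper: resolves to true if a TCP connection to host:port
// opens within timeoutMs, and to false on refusal, other errors, or timeout.
function probePort(host, port, timeoutMs) {
  return new Promise((resolve) => {
    const socket = net.connect({ host, port });
    let settled = false;

    const finish = (ok) => {
      if (settled) return;
      settled = true;
      socket.destroy(); // always release the file handle
      resolve(ok);
    };

    socket.setTimeout(timeoutMs);
    socket.once('connect', () => finish(true));
    socket.once('timeout', () => finish(false));
    socket.once('error', () => finish(false));
  });
}

// Example: check the query port named in the cbq error message.
probePort('localhost', 8093, 2000).then((up) => {
  console.log(up ? 'query port is accepting connections'
                 : 'query port is NOT reachable');
});
```

A probe like this only shows whether the port accepts connections. It does not prove that queries are being answered, so a real monitor would probably also run a trivial query.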

geraldss · October 16, 2016, 11:41pm

So does the Query Service come up eventually?

moon0326 · October 16, 2016, 11:45pm

No, it does not (at least within a few mins).

geraldss · October 16, 2016, 11:52pm

Ok, thanks. @prasad @keshav_m @Prerna.Manaktala we should try to reproduce this.

moon0326 · October 16, 2016, 11:57pm

FYI, I was not able to reproduce it when I ran the script as-is. I had to reduce the “numberOfItems” variable in “test.js” to 200000. With a higher value I got “CouchbaseError: Client-Side timeout exceeded for operation. Inspect network conditions or increase the timeout”. Thank you for the replies over the weekend.

I ran the script again and waited 5 mins so far. The server is still down. 10 mins so far, still down.

Prerna.Manaktala · October 17, 2016, 6:43am

Able to reproduce this issue with 4.5.0-2601 on mac os.

geraldss · October 17, 2016, 2:07pm

Cool. Please file a bug and let’s troubleshoot and fix it. Thx.

johan_larson · October 20, 2016, 1:10pm

I investigated this issue yesterday. The underlying problem is that the system runs out of file handles when the test is run, and this causes the query engine to lock up. It doesn’t die, it just stops responding. Using ps -ax | grep cbq-engine to find the query engine process, followed by kill -9 <process_id> to kill it fixes the problem; the babysitter brings up another query engine immediately.

I’m not entirely sure what is going wrong, but interestingly, the problem goes away if I move the Node instance and test code to another machine. Then the query engine runs just fine. That makes me suspect the problem is either in Node itself or possibly in the client-side code the test is using.

I should add that even with Node moved to another machine, the test is somewhat sensitive. Things worked fine with numberOfItems set to 1000, but when I increased it to 10,000, no requests reached the query engine at all. Nothing went wrong on the server side; there just weren’t any requests. Node didn’t crash or display any errors, either, so I’m not sure what was happening on the client side. (I’ve never used Node before.)

I recommend finding some way to control the degree of parallelism from the client side, perhaps using a task queue. Indiscriminately throwing hundreds of thousands of simultaneous requests at a server is asking for trouble.
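The recommendation above could be sketched in Node roughly as follows. This is not code from the test repository: runWithLimit and fakeQuery are hypothetical names, fakeQuery stands in for whatever SDK call issues one N1QL query, and the limit of 8 is an arbitrary value to tune.

```javascript
// Run an array of promise-returning functions with at most `limit` of them
// in flight at any time. Results are returned in the order of `tasks`.
async function runWithLimit(tasks, limit) {
  if (!(limit >= 1)) throw new RangeError('limit must be at least 1');
  const results = new Array(tasks.length);
  let next = 0; // index of the next task that has not been started

  // Each worker pulls the next unstarted task until none remain.
  async function worker() {
    while (next < tasks.length) {
      const index = next++;
      results[index] = await tasks[index]();
    }
  }

  const workers = [];
  for (let i = 0; i < Math.min(limit, tasks.length); i++) {
    workers.push(worker());
  }
  // If any task rejects, this rejects too; handle errors inside the tasks
  // if the remaining results are still wanted.
  await Promise.all(workers);
  return results;
}

// Hypothetical stand-in for an SDK call that issues one N1QL query.
function fakeQuery(i) {
  return new Promise((resolve) => setTimeout(() => resolve(i), 10));
}

const numberOfItems = 50; // the thread used values up to 200000
const tasks = [];
for (let i = 0; i < numberOfItems; i++) {
  tasks.push(() => fakeQuery(i));
}

runWithLimit(tasks, 8).then((results) => {
  console.log('completed', results.length, 'queries');
});
```

The point of the design is that tasks are functions rather than already-started promises, so a request is only issued (and a socket only opened) when a worker picks it up.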

moon0326 · October 20, 2016, 6:45pm

Hi @johan_larson

Thank you for the post!

If a local Node process can do that, doesn’t that mean the query service would behave the same way under heavy load?

That was my only concern.

johan_larson · October 26, 2016, 7:02pm

@moon0326
If my hypothesis is correct that the Node.js instance itself is consuming the file handles, then no, this problem would not happen with the client off-machine, as it typically would be in a production deployment.

We have never observed behavior like this in production, but if you see something suspicious by all means drop us a line.

moon0326 · October 26, 2016, 7:38pm

That is nice to hear. As long as this is a local Node issue, I think everything is good.

Thank you for the response.
