While I was importing data into my local setup for development, I noticed that the query service died silently. I have a test script that replicates it on my local server.
My local Couchbase and bucket configuration:
Cluster RAM Quota
Data RAM Quota: 8586 MB
Index RAM Quota: 1000 MB
Per Node RAM Quota: 2236
I’ve uploaded my test script at https://github.com/moon0326/cbn1qltest
Steps to replicate
- Prepare a bucket and name it ‘test’
- npm install
- node load.js
- node test.js
Step #4 kills the server for about 30 seconds, then the server becomes accessible again. The problem is that the query service does not come back up, and nothing is logged.
When I access cbq, it throws the following error message.
ERROR 100 : Get http://localhost:8093/admin/clusters/default/nodes: dial tcp 127.0.0.1:8093: connection refused
Here are screenshots: http://imgur.com/a/c7P3M
I have a few questions.
- Why is it going down? Without a log, I have no clue at all.
- It seems that the query service does not come up again, but the server shows “UP” with a green mark. If this happens, how do we detect it?
- How can I prevent this?
Which Couchbase version? @prasad @keshav_m
I tested it on 4.5.0 enterprise.
Got it. So you are killing the server deliberately as part of a test. Can you put a delay after the server comes back up, say 30 seconds? Maybe there is a delay in the server starting up the query service. Also, how do you know that the Query Service was enabled initially?
Well, it wasn’t my intention at first. I found this out after running my real import script, then created this test script to confirm it.
Before running the test script, I’ve confirmed that the query tab comes up as well as cbq. They both worked just fine.
I waited to see if the query service comes up, but it did not.
My only concern is that there is no way to find this out if it happens in prod, because the server reports “up” and there is nothing in the log tab.
So does the Query Service come up eventually?
No, it does not (at least within a few mins).
Ok, thanks. @prasad @keshav_m @Prerna.Manaktala we should try to reproduce this.
FYI, I was not able to reproduce it when I just ran the script as is. I had to reduce the value of the “numberOfItems” variable to 200000 in “test.js”. With a higher number, I got “CouchbaseError: Client-Side timeout exceeded for operation. Inspect network conditions or increase the timeout”. Thank you for the replies over the weekend.
I ran the script again and waited 5 mins so far. The server is still down. 10 mins so far, still down.
I was able to reproduce this issue with 4.5.0-2601 on macOS.
Cool. Please file a bug and let’s troubleshoot and fix it. Thx.
I investigated this issue yesterday. The underlying problem is that the system runs out of file handles when the test is run, and this causes the query engine to lock up. It doesn’t die, it just stops responding. Using ps -ax | grep cbq-engine to find the query engine process, followed by kill -9 <process_id> to kill it fixes the problem; the babysitter brings up another query engine immediately.
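For anyone who wants to script a check for the symptom reported earlier in this thread (cbq getting "connection refused" on 127.0.0.1:8093), here is a minimal sketch of a TCP probe in Node. The function name `isPortOpen` and the timeout value are my own choices for illustration, not part of Couchbase or its SDK. Note the limitation: it only tells you whether the port accepts a connection, so an engine that is hung but still accepting connections would need an HTTP-level check instead.

```javascript
const net = require('net');

// Hypothetical helper: resolves to true if a TCP connection to host:port
// opens within timeoutMs, and to false if it is refused, fails, or times out.
function isPortOpen(host, port, timeoutMs) {
  return new Promise((resolve) => {
    const socket = net.connect({ host, port });
    let settled = false;

    const finish = (ok) => {
      if (settled) return;      // report only the first outcome
      settled = true;
      clearTimeout(timer);
      socket.destroy();         // always release the file handle
      resolve(ok);
    };

    const timer = setTimeout(() => finish(false), timeoutMs);
    socket.once('connect', () => finish(true));
    socket.once('error', () => finish(false));
  });
}

// Example (8093 is the query port from the error message above):
// isPortOpen('127.0.0.1', 8093, 2000).then((up) => {
//   console.log(up ? 'query port is accepting connections'
//                  : 'query port is unreachable');
// });
```

Running something like this on a schedule would at least flag the case where the UI still shows the node as up while the query port refuses connections.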
I’m not entirely sure what is going wrong, but interestingly, the problem goes away if I move the Node instance and test code to another machine. Then the query engine runs just fine. That makes me suspect the problem is either in Node itself or possibly in the client-side code the test is using.
I should add that even with Node moved to another machine, the test is somewhat sensitive. Things worked fine with numberOfItems set to 1000, but when I increased it to 10,000, no requests reached the query engine at all. Nothing went wrong on the server side; there just weren’t any requests. Node didn’t crash or display any errors, either, so I’m not sure what was happening on the client side. (I’ve never used Node before.)
I recommend finding some way to control the degree of parallelism from the client side, perhaps using a task queue. Indiscriminately throwing hundreds of thousands of simultaneous requests at a server is asking for trouble.
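To make the task queue idea concrete, here is a rough sketch of one way to cap the number of requests in flight from Node. It is not taken from the test repository: `runWithLimit` is a hypothetical helper, and `worker` stands for whatever promise-returning function wraps a single request to the server. The right limit depends on the machine, so treat any number as a starting point.

```javascript
// Hypothetical helper: calls worker(item, index) for every item, keeping at
// most `limit` calls in flight at once. `worker` should return a Promise.
// Resolves with the results in input order; rejects on the first failure.
function runWithLimit(items, limit, worker) {
  return new Promise((resolve, reject) => {
    if (!(limit >= 1)) throw new TypeError('limit must be at least 1');

    const results = new Array(items.length);
    let next = 0;       // index of the next item to start
    let active = 0;     // number of calls currently in flight
    let failed = false;

    function launch() {
      if (failed) return;
      if (next >= items.length) {
        if (active === 0) resolve(results);   // everything has finished
        return;
      }
      // Start new calls until the cap is reached or the items run out.
      while (active < limit && next < items.length) {
        const index = next++;
        active++;
        Promise.resolve()
          .then(() => worker(items[index], index))
          .then((value) => {
            results[index] = value;
            active--;
            launch();                         // a slot opened up, refill it
          })
          .catch((err) => { failed = true; reject(err); });
      }
    }

    launch();
  });
}

// Example, assuming a hypothetical promise-returning sendOne(doc):
// runWithLimit(docs, 100, (doc) => sendOne(doc)).then(() => console.log('done'));
```

With a cap like this the client opens only a bounded number of connections at a time instead of hundreds of thousands at once, which should keep it well away from the file handle limit.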
Thank you for the post!
If a local Node instance can do that, doesn’t that mean that the query service would behave the same way under heavy load?
That was my only concern.
If my hypothesis that the Node.js instance itself is consuming the file handles is correct, then no, this problem would not happen with the client off-machine, as it typically would be in a production deployment.
We have never observed behavior like this in production, but if you see something suspicious by all means drop us a line.
That is nice to hear. As long as this is a local Node issue, I think everything is good.
Thank you for the response.