I recently updated an existing service and published a new one using the Couchbase Node.js SDK v3.0.1 and v3.0.7, and I noticed that both services got OOMKilled in our Kubernetes cluster. This is the shape of the memory consumption of one of the services (the new one), which is doing almost nothing:
The service waits for a message from a RabbitMQ AMQP client to perform some operation on a cluster/bucket/collection that is opened once when the process starts. The same RabbitMQ AMQP client code is used in other services that have a flat memory profile. Both affected services have a liveness probe; the probes work in really different ways in the two services, and I have now increased the frequency of one of them to exclude that they are related to the increasing memory consumption.
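For reference, the "open the connection once at process start, reuse it in every message handler" pattern described above can be sketched like this. Everything here is hypothetical (the helper name `getCluster` and the factory argument are mine); in the real service the factory would be something like `() => couchbase.connect('couchbase://…', { username, password })`:

```javascript
// Sketch of the connect-once pattern: the first caller triggers the
// connection, every later caller reuses the same pending/resolved promise.
let clusterPromise = null;

function getCluster(connectFn) {
  // connectFn: () => Promise<Cluster>. Hypothetical; in the real service
  // this would wrap couchbase.connect(...).
  if (!clusterPromise) {
    clusterPromise = connectFn();
  }
  return clusterPromise;
}

module.exports = { getCluster };
```

Each AMQP message handler then awaits `getCluster(...)` instead of reconnecting, which matches the behaviour described above.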
Anyone experiencing something like this?
Taking a look at this and trying to loop in the right person to evaluate this.
Thanks @ericb for taking this into account.
I drastically decreased the liveness-probing frequency and see no change in the memory profile, so I feel I can exclude that the slope is related to it. Activating the SDK’s logging, I see the following lines repeating several times a minute:
2020-11-17T19:07:14.062Z couchnode:lcb:trace (bootstrap @ ../deps/lcb/src/bootstrap.cc:169) Background-polling for new configuration
2020-11-17T19:07:14.062Z couchnode:lcb:trace (confmon @ ../deps/lcb/src/bucketconfig/confmon.cc:298) Refreshing current cluster map (bucket: dedalo-quality)
2020-11-17T19:07:14.062Z couchnode:lcb:trace (server @ ../deps/lcb/src/mcserver/mcserver.cc:880) <cb-xxxxxx-0000.cb-xxxxxx.couchbase.svc:11210> (CTX=0x5601b3b5ffe0,memcached,SRV=0x5601b3d06640,IX=0) Scheduling next timeout for 2500 ms. This is not an error
2020-11-17T19:07:14.062Z couchnode:lcb:trace (server @ ../deps/lcb/src/mcserver/mcserver.cc:880) <cb-xxxxxx-0001.cb-xxxxxx.couchbase.svc:11210> (CTX=0x5601b3e47e20,memcached,SRV=0x5601b3d07560,IX=1) Scheduling next timeout for 2500 ms. This is not an error
2020-11-17T19:07:14.062Z couchnode:lcb:trace (server @ ../deps/lcb/src/mcserver/mcserver.cc:880) <cb-xxxxxx-0002.cb-xxxxxx.couchbase.svc:11210> (CTX=0x5601b3e4a6e0,memcached,SRV=0x5601b3d07e00,IX=2) Scheduling next timeout for 2500 ms. This is not an error
2020-11-17T19:07:14.062Z couchnode:lcb:trace (confmon @ ../deps/lcb/src/bucketconfig/confmon.cc:157) Not applying configuration received via CCCP (bucket=dedalo-quality). No changes detected. A.rev=62, B.rev=62
2020-11-17T19:07:14.062Z couchnode:lcb:trace (confmon @ ../deps/lcb/src/bucketconfig/confmon.cc:284) Attempting to retrieve cluster map via CCCP
2020-11-17T19:07:14.062Z couchnode:lcb:trace (cccp @ ../deps/lcb/src/bucketconfig/bc_cccp.cc:150) Re-Issuing CCCP Command on server struct 0x5601b3d06640 (cb-xxxxxx-0000.cb-xxxxxx.couchbase.svc:11210)
2020-11-17T19:07:14.063Z couchnode:lcb:trace (confmon @ ../deps/lcb/src/bucketconfig/confmon.cc:157) Not applying configuration received via CCCP (bucket=dedalo-quality). No changes detected. A.rev=62, B.rev=62
This doesn’t sound like an issue, but it’s constant SDK activity that might cause memory consumption to increase. I’m now going to disable the RabbitMQ connection in my code to exclude any other possible contribution to the memory increase.
With the RabbitMQ client excluded, the memory profile is the same, so it seems to be related to the CB SDK.
Can you try with the latest release on npm (3.0.7)? There was a fix to one of our underlying libraries (debug) related to a memory leak.
My latest tests (without the liveness probe or RabbitMQ) were performed with SDK v3.0.7.
We are able to replicate this on our K8s infrastructure. Everything was working perfectly with v2.6; after updating to v3.0.7 we started seeing OOMs over the last two weeks. The node processes are also taking more resources than before.
We ended up increasing processor and memory 1.5x and it seems better now, but I am still seeing OOMs randomly since the Node.js CB client update.
Hey @paolo.morgano, have you found any workaround or solution?
I haven’t found a workaround yet. It seems we have encountered a memory leak in the debug logging system, as detailed in JSCBC-820. The fix should be included in release 3.1.0 of the SDK. Meanwhile I disabled logging, but it made no difference.
Thanks @paolo.morgano for the quick response.
Let’s hope we get the 3.1.0 release sooner rather than later.
Hey @paolo.morgano, @RishiKapadia,
Take a look here for more information about the memory leak and the fix. Note that there are a few workarounds: pass a custom logging function to the Node.js SDK to prevent our automatic use of the debug library (which is where the leak exists); explicitly install the latest debug library version, which has a fix (4.3.1); or pull the master of couchnode, which has our own internal fix to avoid the leak. Lastly, the 3.1.0 release should be out sometime in the next 24 hours and contains both an updated debug library and our internal fix.
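As a sketch of the first workaround mentioned above (passing a custom logging function so the SDK never touches the leaky `debug` package), assuming the connect option is named `logFunc` and that log entries carry `severity`, `subsys`, and `message` fields with larger severities being more severe — check your SDK version’s logging docs before relying on these names:

```javascript
// Custom log handler to hand to the SDK instead of its default
// debug-based logger. Entry field names and the severity threshold
// below are assumptions for this sketch.
function customLogFunc(entry) {
  if (entry.severity < 3) {
    return null; // drop trace/debug/info noise
  }
  const line = `[couchbase/${entry.subsys}] ${entry.message}`;
  process.stderr.write(line + '\n'); // forward warnings and errors
  return line; // returned only so the behaviour is easy to test
}

// Hypothetical usage at startup (requires a reachable cluster):
//   const couchbase = require('couchbase');
//   const cluster = await couchbase.connect('couchbase://cb-host', {
//     username: 'user',
//     password: 'pass',
//     logFunc: customLogFunc,
//   });

module.exports = { customLogFunc };
```

Because the SDK calls your function directly, the `debug` package is never loaded, which sidesteps the leak until 3.1.0 lands.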
Thank you @brett19. We will update to 3.1.0 as soon as it releases and test it on our staging environment. I’m excited to try it and see the performance.
Hey @RishiKapadia, 3.1.0 is published, just working on the release notes now
Thank you so much @brett19, That was quick, and yes I have just updated to 3.1.0 and pushed to the staging environment, I am testing it.
Awesome. Let me know if you still see the leak; if there is another leak I didn’t catch, I’ll try to track it down and slip in a 3.1.1 fix this week before anyone notices.
So far I haven’t noticed any issue. We have pushed this to our prod environments, and the node processes look much more stable now.
And we appreciate your willingness to fix the leak if we find another one.
Thanks @brett19, we updated to 3.1.0 too and the leak seems to be gone.