Sync Gateway 1.1.1: high CPU usage and gateway stops responding

Hi,

I am running Couchbase Server 4.0 with Sync Gateway 1.1.1 on 5 nodes (Community Edition).
(After this issue started, I upgraded Couchbase Server from 3.0.1 to 4.0 and Sync Gateway from 1.1.0 to 1.1.1.)

Since last week, Sync Gateway CPU usage has been suddenly going high, and it never comes back down until I restart the server or restart Sync Gateway. It happens randomly on 1 or 2 nodes, while CPU utilization on the rest of the nodes stays below 5%.

When this happens, Sync Gateway sometimes stops responding altogether, and the only option is to restart the server and start Sync Gateway again.

Please see the Sync Gateway logs captured during high CPU:

15:48:10.052627 2016-03-29T15:48:10.052+11:00 HTTP: #788877: --> 599 Write error: write tcp 172.16.1.76:17733: broken pipe (0.0 ms)
15:48:10.150459 2016-03-29T15:48:10.150+11:00 HTTP: #786922: --> 599 Write error: write tcp 172.16.1.76:9273: broken pipe (0.0 ms)
15:48:10.150843 2016-03-29T15:48:10.150+11:00 HTTP: #786921: --> 599 Write error: write tcp 172.16.1.76:62977: broken pipe (0.0 ms)
15:48:10.151158 2016-03-29T15:48:10.151+11:00 HTTP: #786923: --> 599 Write error: write tcp 172.16.1.76:54098: broken pipe (0.0 ms)
15:48:10.152551 2016-03-29T15:48:10.152+11:00 HTTP: #786924: --> 599 Write error: write tcp 172.16.1.76:31611: broken pipe (0.0 ms)
15:48:10.207185 2016-03-29T15:48:10.207+11:00 HTTP auth failed for username=“user_1750535”

BTW, 172.16.1.76 is the IP of our hardware load balancer. External traffic comes through the load balancer, which then forwards it to the Sync Gateway nodes.

I have tried setting the revision limit to 20, increased “maxFileDescriptors” from the default of 5000 to 10000, and also raised the open files limit to “20000” in CentOS.
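Raising these limits does not by itself guarantee that the running process picked them up, so it can be worth checking the effective value. The following is a minimal sketch, assuming a Linux host with /proc available; the helper name max_open_files_soft is made up for illustration and is not part of Sync Gateway.

```shell
# Print the soft "Max open files" limit from a /proc/<pid>/limits table read on stdin.
# max_open_files_soft is a hypothetical helper name used only for this sketch.
max_open_files_soft() {
  # Row layout: "Max open files  <soft>  <hard>  files", so the soft limit is field 4.
  awk '/^Max open files/ { print $4 }'
}
```

Assuming SG_PID holds the Sync Gateway process ID, `max_open_files_soft < /proc/$SG_PID/limits` shows the limit the process is actually running with, and `ls /proc/$SG_PID/fd | wc -l` shows how many descriptors it currently has open.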

Please let me know what else I can try to fix this issue, as it is making the system unstable and taking it offline quite often. If you need more information, please let me know.

Thank you.

Regards,
Karunakar

The log information looks like a symptom of the problem (dropped client connections), and not the underlying cause.

There were several CPU utilization improvements included in Sync Gateway version 1.2, so you might try updating and see if it addresses your issue. If not, I’d request that you file an issue in the Sync Gateway GitHub repo (https://github.com/couchbase/sync_gateway/issues/new), and attach more detailed logs there.

Thanks.

Thanks Adam.

I am planning to upgrade Sync Gateway from version 1.1.1 to 1.2. If I upgrade, do I also need to upgrade Couchbase Lite to 1.2 on the mobile devices?

I took one of the Sync Gateway nodes out of the load balancer and waited until the load balancer's session count for it went down to “0”. I then ran the command below to check the Sync Gateway connections, and I still see 5000+ sessions on Sync Gateway even though they no longer exist on the LB.

[root@LPCOUCHBASE1 bin]# lsof -p 2308 | grep -i established | wc -l
5714

I have also attached a log of the established connections from one of the Sync Gateway nodes. They never go to zero, even though the LB has zero connections.

Do you have any idea why the Sync Gateway process holds on to those sessions? The only option I have is to restart Sync Gateway to clear them.

LPCOUCHBASE3Putty.zip (86.5 KB)

Thanks,
Karunakar

Hi @Karunakar, we had similar issues, and upgrading to 1.2 definitely improved performance. You don’t need to upgrade the client; we are running Sync Gateway 1.2 and Couchbase Lite 1.1 in production.

Regards,
Vlad

@Karunakar One of the enhancements included in 1.2 is improved detection and release of half-closed connections by Sync Gateway. I think that will address the issue you’re describing.
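To see which TCP states the lingering connections are in, the lsof output can be grouped by state instead of filtered for "established". This is a minimal sketch, assuming lsof prints the state in parentheses as the last column (as it does for TCP sockets on Linux); the helper name count_tcp_states is made up for illustration.

```shell
# Count TCP connections per state from lsof output read on stdin.
# count_tcp_states is a hypothetical helper name used only for this sketch.
count_tcp_states() {
  awk '$(NF-2) == "TCP" && $NF ~ /^\(.*\)$/ {
         state = substr($NF, 2, length($NF) - 2)   # strip the parentheses
         count[state]++
       }
       END { for (s in count) print s, count[s] }' | sort
}
```

For the process checked earlier in the thread, this could be run as `lsof -nP -p 2308 -a -iTCP | count_tcp_states`. Sockets in CLOSE_WAIT typically mean the peer has closed its side and the local process has not yet closed its own, while sockets that stay ESTABLISHED after the load balancer reports zero sessions suggest the gateway never saw the close at all.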

Thank you Adam and Vlad.

Yes, I am planning to upgrade. It seems that Sync Gateway 1.2 starts the service automatically and puts its logs and config files in a different location (/home/sync_gateway), so I plan to change where Sync Gateway saves its logs and will do the upgrade tomorrow. I also noticed that it puts a sync_config file in the /etc/init folder and creates a Sync Gateway user account.

In the meantime, I have installed Couchbase Mobile 1.2, tested it against Sync Gateway 1.1.1, and noticed the connections below. I would like to know whether this is normal behavior.

Summary of test below:

I would like to know whether this is normal Couchbase behavior or whether something in our app is making extra connections.

Thanks,
Karunakar

Hi @Karunakar, can you provide some more information, such as the type of replication you are using? Is it continuous or one-shot? Is it pull or push? Technically, push replication is one-shot: it sends the data and then closes the connection. Continuous pull replication keeps a connection open with the server; it can open several connections during bulk download, but then releases them.

I was playing with this recently and found a bug in our application, which was keeping the connection open through a long-running service in the background. You should make sure that your database manager is a singleton for your application and that you are not accidentally starting multiple replications on multiple instances of the object.

Hope all of this makes sense

Regards,
Vlad

@karunakar The initial connection count sounds reasonable. Sync Gateway maintains connections with services on the backing Couchbase Server nodes, so with a 5 node cluster that number is not surprising.

As Vlad mentions, the number of active connections triggered per client will depend on the number and type of replications you’ve got set up, as well as things like channel distribution, etc.
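To separate the connections Sync Gateway holds to the Couchbase Server nodes from those arriving through the load balancer, the same lsof output can be grouped by remote address. This is a minimal sketch, assuming lsof is run with -nP so that the NAME column has the form local:port->remote:port; the helper name count_by_remote_host is made up for illustration.

```shell
# Count TCP connections per remote host from `lsof -nP` output read on stdin.
# count_by_remote_host is a hypothetical helper name used only for this sketch.
count_by_remote_host() {
  awk '$(NF-2) == "TCP" && $(NF-1) ~ /->/ {
         split($(NF-1), ends, "->")      # ends[1] = local end, ends[2] = remote end
         remote = ends[2]
         sub(/:[0-9]+$/, "", remote)     # drop the remote port
         count[remote]++
       }
       END { for (h in count) print count[h], h }' | sort -rn
}
```

Run as `lsof -nP -p 2308 -a -iTCP | count_by_remote_host`, this prints one line per remote address, busiest first. Connections whose remote address is the load balancer (172.16.1.76 in this thread) are client traffic, while those pointing at the cluster nodes are the gateway's own connections to Couchbase Server.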

Thank you again for your valuable comments. I have informed our developers, and they are going to look into the continuous pull replication.

I hope everything goes back to normal once all servers are upgraded to Sync Gateway 1.2. I will post back with the results.

Thanks,
Karunakar