CBL 2.6: Does SG need more RAM? Initially only?

Setup:
CBL 2.7 CE Android snapshot from around 1 week ago.
SG 2.6

Current RAM usage: (screenshot)

1 week ago: (screenshot)

Most users started using the latest app version with CBL 2.6 today; before that it was CBL 1.4. I’m not sure what is happening at the moment: continuous sync stopped working, while one-time pull sync works fine. Restarting the SG service on both machines allows continuous sync to start again. I’ll add another machine running the SG service and keep an eye on this issue.

The primary feature released in Sync Gateway 2.6 was channel cache enhancements.
I can’t answer your question easily without understanding your usage of channels, as well as your configuration.

I’d recommend reading more about the new cache features here, it may give you some clues:
https://docs.couchbase.com/sync-gateway/2.6/index.html
https://docs.couchbase.com/sync-gateway/2.6/deployment.html#channel-and-revision-cache
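
If it helps, you can also check which cache settings a node is actually running with via the admin port. A quick sketch, assuming the default admin interface and the standard /_config admin endpoint:

curl http://localhost:4985/_config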

Thank you for the links. I read through the documentation. The channel and revision cache enhancements are for EE versions only, which is fine. I don’t think this has anything to do with SG 2.6 itself, since I had already been running this version for some time. The only change in the last few days is that an Android app update rolled out with CBL 2.6 - previously it was CBL 1.4.
Now I have 3 SG nodes and each uses between 500 MB and 1000 MB of RAM, so there’s plenty left unallocated. Sync works, and everything has been fine for the last ~3 hours.
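
For reference, SG’s own memory counters can also be pulled from the admin port, which is handy to compare against the OS-level graphs. A quick sketch, assuming the default admin interface and the expvar endpoint (python is only used here for pretty-printing):

curl -s http://localhost:4985/_expvar | python -m json.tool | less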

My SG config has been:

{
  "interface": ":4984",
  "adminInterface": ":4985",
  "maxFileDescriptors": 250000,
  "logging": {
    "log_file_path": "/home/sync_gateway/logs",
    "console": {
      "log_level": "info",
      "log_keys": [
        "HTTP"
      ],
      "color_enabled": true
    },
    "error": {
      "enabled": true,
      "rotation": {
        "max_size": 100,
        "max_age": 360,
        "localtime": false
      }
    },
    "warn": {
      "enabled": true,
      "rotation": {
        "max_size": 100,
        "max_age": 180,
        "localtime": false
      }
    },
    "info": {
      "enabled": true,
      "rotation": {
        "max_size": 100,
        "max_age": 6,
        "localtime": false
      }
    },
    "debug": {
      "enabled": false,
      "rotation": {
        "max_size": 100,
        "max_age": 2,
        "localtime": false
      }
    }
  },
  "databases": {
    "my_database": {
    "server": "couchbase://10.0.0.30,10.0.0.31,10.0.0.32",
    "bucket": "my_bucket",
      "password": "pass",
      "username": "sync_gateway_user",
      "num_index_replicas": 0,
      "revs_limit": 400,
      "allow_empty_password": true,
      "users": {
        "GUEST": {
          "disabled": true,
          "admin_channels": [
            "*"
          ]
        }
      },
      "cache": {
          "channel_cache_max_length": 1000
      },
      "enable_shared_bucket_access": true,
      "import_docs": true, <-- set at one SG instance only
      "sync":
`
function sync_gateway_fn(doc, oldDoc) {

I’d like to follow up and report that Thursday (the following day) and Friday all looked good. Here’s a 6-hour window from Friday; the screenshot shows all 3 SG nodes and their RAM usage.

Saturday morning was different: all SG services rapidly allocated RAM and syncing between clients stopped working. I needed to restart all 3 SG nodes. Here’s a screenshot, again showing all 3 nodes.

I still hope that this issue will go away on its own once all users have updated to the latest Android app version with CBL 2.x.

This is very strange behaviour that I’d like to get to the bottom of, given it was stable for a couple of days, but then suddenly ate up a bunch of memory. Maybe a client is doing something weird which is causing SG to leak memory somehow…

If possible, next time this happens - before restarting each node - can you grab a couple of profiles from each node using these URLs:

http://localhost:4985/_debug/pprof/heap
http://localhost:4985/_debug/pprof/goroutine

This should let us know what is using up memory inside Sync Gateway, which hopefully we can track down a fix for.


Just to clarify, I’ll download the files with:

wget http://localhost:4985/_debug/pprof/heap
wget http://localhost:4985/_debug/pprof/goroutine

Running file heap and file goroutine prints:

heap: gzip compressed data, max speed
goroutine: gzip compressed data, max speed

And then gunzip failed:

gzip: heap: unknown suffix -- ignored

Renaming heap to heap.gz and then running gunzip does work, but the extracted content is still unreadable. Once this occurs again, can I attach these files here? Or did I download them incorrectly?

(Also, the issue occurred again, but I didn’t get the files at that time.)

Hey, these are compressed protobuf files intended to be read by pprof.
If you leave them compressed, they can be fed into pprof directly.

In just the last couple of minutes the sync stopped working again. By that I mean that the CBL replicator status shows “BUSY” on all connected clients, and web app features that use the SG REST API, such as signing up new users, fail.

Current setup: only 1 SG node. Backstory: I had two SG nodes in case one failed. After the BUSY issues appeared I added a third SG node, and as this didn’t help I scaled down to a single SG node.

Here are the files. I had to zip them as this forum only allows certain file types to be uploaded. The files were not compressed, just saved.

goroutine.zip (6.3 KB) heap.zip (472.4 KB)

Screenshot of RAM usage for the last 6 hours. CPU and network resources are mostly unused. The last spike in the RAM graph is when I restarted the SG service.

(screenshot: SG_RAM_recent)

Based on the goroutine profile, I see around 200 clients connected, but nearly 12,000 revisions stuck waiting to get a sequence number to be written to the database.

This seems like it might be a client trying to push something, timing out, reconnecting and retrying, pushing more and more work into the queue. We fixed an issue that may be causing this, and the fix will land in an upcoming version of Sync Gateway (https://github.com/couchbase/sync_gateway/pull/4222). We expect that release to ship in early 2020.

I’d expect adding more SG nodes to help this situation. Did you see exactly the same behaviour, or just the same load more spread out over the nodes? I presume you’re routing traffic to SG through a load balancer so you can distribute requests from clients?

Interesting find! I tried to render the two files (heap and goroutine) myself using https://github.com/gperftools/gperftools, but I failed. Could you please share the command I’d need to execute? Will I be able to find the user ID of the user who is pushing all those revisions?

Back to the issue at hand. Since there was nearly no extra RAM allocated but the BUSY issue still appeared, I think this is linked to: Replicator in Android get stuck at busy state when it goes back to online from offline

Back when I had 3 SG nodes (currently I have 1), the BUSY issue appeared too, with one big BUT: I observed it with several clients connected to the cluster (through a load balancer, you are correct!). Some clients (group A) were able to update documents and push them up. Other clients (group B) could update documents locally, but the changes were not pushed up; the sync status on those devices was stuck at BUSY. Yet these group B clients still received document updates from group A and displayed them. In other words, most likely one SG node had the issue you describe in your last post while the other SG nodes did not. That makes it harder to figure out which SG service to restart, and only restarting solves the issue.

At the moment I’m coding a makeshift watchdog app which periodically updates a specific document, waits a few seconds, checks the replicator status and, if it is still in the BUSY state, restarts the SG service by ssh’ing into the SG node.
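
The rough idea, sketched in shell rather than as the actual app: this version only probes SG’s public REST API instead of checking the CBL replicator status, and the node address, credentials, heartbeat document and systemd service name are placeholders made up for the example.

#!/bin/sh
# Hypothetical watchdog sketch: write a tiny heartbeat document through
# Sync Gateway's public REST API and restart the service over ssh if the
# write does not succeed within a timeout.

SG_NODE="10.0.0.40"            # placeholder address of the SG node
DB="my_database"
CREDS="watchdog_user:secret"   # placeholder SG user allowed to write

while true; do
  # create a small heartbeat doc; --max-time guards against SG hanging
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 \
         -u "$CREDS" -H 'Content-Type: application/json' \
         -d '{"type":"heartbeat"}' \
         "http://$SG_NODE:4984/$DB/")

  case "$code" in
    2*) : ;;   # write went through, SG looks healthy
    *)  echo "heartbeat failed (HTTP $code), restarting SG on $SG_NODE"
        # assumes SG is installed as the usual sync_gateway systemd unit
        ssh "root@$SG_NODE" 'systemctl restart sync_gateway'
        ;;
  esac

  sleep 60
done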

Tagging @humpback_whale and @fatmonkey45: could it be that you are facing a similar issue, where many, many revisions get stuck and this results in the BUSY state? Please see the previous post.

You can use pprof to render the profile: https://github.com/google/pprof
like this: $ pprof -http=":6060" goroutine.out
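
If you don’t have the standalone pprof binary installed, the copy bundled with the Go toolchain reads these compressed profiles the same way (using the file names from your earlier wget commands):

go tool pprof -http=":6060" goroutine
go tool pprof -top heap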

You can use the Sync Gateway logs to identify which client is doing what, by looking at the username, and the context IDs.

The example below shows a device pulling documents from SG. You can see the username demo and the context ID [5b31e98] corresponding to that replication:

2019-01-09T16:49:38.623Z [INF] HTTP:  #001: GET /travel-sample/_blipsync (as demo)
2019-01-09T16:49:38.623Z [INF] HTTP+: #001:     --> 101 [5b31e98] Upgraded to BLIP+WebSocket protocol. User:demo.  (0.0 ms)
2019-01-09T16:49:38.623Z [INF] WS: c:[5b31e98] Start BLIP/Websocket handler
2019-01-09T16:49:38.624Z [DBG] WS+: c:[5b31e98] Sender starting
2019-01-09T16:49:38.624Z [DBG] WS+: c:[5b31e98] Received frame: MSG#1 (flags=       0, length=62)
2019-01-09T16:49:38.624Z [DBG] WSFrame+: c:[5b31e98] Incoming BLIP Request: MSG#1
2019-01-09T16:49:38.624Z [INF] SyncMsg: c:[5b31e98] #1: Type:getCheckpoint Client:cp-1cintY0NZiJ19PEsMC8205UKrZU= User:demo
2019-01-09T16:49:38.625Z [DBG] SyncMsg+: c:[5b31e98] #1: Type:getCheckpoint   --> OK Time:837.554µs User:demo
2019-01-09T16:49:38.625Z [DBG] WS+: c:[5b31e98] Push RPY#1
2019-01-09T16:49:38.626Z [DBG] WS+: c:[5b31e98] Sending frame: RPY#1 (flags=       1, size=   68)
2019-01-09T16:49:38.627Z [DBG] WS+: c:[5b31e98] Received frame: MSG#2 (flags=       0, length=109)
2019-01-09T16:49:38.627Z [DBG] WSFrame+: c:[5b31e98] Incoming BLIP Request: MSG#2
2019-01-09T16:49:38.628Z [INF] SyncMsg: c:[5b31e98] #2: Type:subChanges Since:8386 Continuous:true Filter:sync_gateway/bychannel Channels:channel.demo  User:demo
2019-01-09T16:49:38.628Z [DBG] WS+: c:[5b31e98] Push RPY#2
2019-01-09T16:49:38.628Z [DBG] WS+: c:[5b31e98] Sending frame: RPY#2 (flags=       1, size=    1)
2019-01-09T16:49:38.628Z [INF] Sync: c:[5b31e98] Sending changes since 8386. User:demo
2019-01-09T16:49:38.628Z [DBG] Changes+: c:[5b31e98] Int sequence multi changes feed...
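
Once you know the context ID (or username) of a misbehaving replication, grepping the log files pulls out everything that connection did. A small sketch, assuming the default log file names under your configured log_file_path:

# all activity for one replication, by context ID
grep '\[5b31e98\]' /home/sync_gateway/logs/sg_info.log

# everything a particular user did (add sg_debug.log if debug logging is enabled)
grep 'User:demo' /home/sync_gateway/logs/sg_info.log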

Hello

I was notified about an edit in the last post…

I was able to create the Go call stack back in November and also went through the logs, but this didn’t help me find the issue. Whenever the service hangs and needs to be restarted, I save the goroutine and heap files. There have been more than 100 automatic restarts and I have all the files saved if they are needed.
I was contacted by a user who has a bigger dataset (I believe it was about 1000 documents*). I was also unable to push-replicate that dataset to my production cluster with a one-time replication. It did work when pushing it to my dev cluster, which is a lot less powerful but only used by me. The production cluster is also very much overprovisioned; all resources are mostly idle on all nodes.

I think my best option is to wait for the next SG release.

*Documents in my data model are small: on average about 5 fields, rarely nested more than 1 level deep, and maybe around 200 characters each.

I installed SG version 2.7 (Couchbase Sync Gateway/2.7.0(166;a52b9c4) CE) and was now able to push the quoted dataset. That’s great! This might be an indicator that the issue is solved.

I still have a service running which checks whether the SG service stops working properly. I’ll post an update once it no longer detects this behaviour.

Hi @benjamin_glatzeder, That’s great news!

Apologies for the length of time this issue has persisted. I believe your particular situation was down to a combination of issues fixed in 2.7.0, the main one being a deadlock in the sequence allocator that is only triggered under certain load.

Please do let us know if you run into any other issues with 2.7.0

There have been no more issues since upgrading to SG version 2.7. Great work all around!
