We have a solution where we need to access the bucket from both web pages and mobile devices. Therefore we are using bucket shadowing.
The bucket is around 2 GB. Sync GW is configured to handle several channels (1 per day; right now we have around 900 channels). The revision limit in the config.json file is 20 revs. We are using 2 server nodes hosted on Amazon; the instance type is m4.2xlarge. (So total disk usage is around 4 GB.) Sync GW version is 1.3.
We were noticing low Sync GW performance, so we were told we should resolve DB conflicts. We did that, but we still observed Sync GW issues. Therefore we decided to do a backup/restore of the original bucket and let Sync GW re-synchronize the shadow bucket from scratch. This re-sync process took around 2 hours to reach the point where we saw a similar number of items in both buckets. After that, when the mobile devices (around 30) connected to the bucket, we observed that some new documents took up to 12 hours to appear on the server!
In the Sync GW log we see a bunch of “No old revision” messages. It also seems like synchronization is putting a lot of load on the server, as we see peaks of up to 100% processor load and also peaks of 2 GB of network in/out.
Any advice is very much appreciated.
Please, I appreciate (and need) your help.
There shouldn’t be a scenario where documents are delayed up to 12 hours, regardless of system load. However, there isn’t really enough information provided here to get an idea about what might be going on. Some additional questions are below. For clarity, I’m calling the bucket targeted by Sync Gateway the ‘SG bucket’, and the bucket being shadowed the ‘Source bucket’.
- What do you have running on the 2 server nodes? Sync Gateway and Couchbase Server on both?
- Having SG re-synchronize the bucket from scratch generally isn’t helpful/required. However - when you did your backup/restore, which bucket did you delete: the SG bucket, or the Source bucket?
- For the documents that took 12 hours to appear on the server:
- Is that 12 hours to appear in the SG bucket, the Source bucket, or both?
- When did push replication complete, according to the client logs?
- How many documents were being pushed (in total) by the 30 mobile devices?
Thanks for your reply Adamf, here are the answers:
CB Server on both, and 1 Sync GW on one of them.
We backed up the source bucket,
stopped the SGW process,
flushed the SGW bucket,
restored the source bucket,
and then restarted the SGW process.
It took 2 hours to see the same number of documents in both buckets (a little less than 900,000 items). But the problem came when the mobile devices started to connect to the SGW: we saw the CPU and network in/out become very busy.
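As a rough sanity check on that initial re-sync, the average import rate can be worked out from the two figures above. This is only back-of-the-envelope arithmetic in Python, and it assumes the import ran at a steady pace for the whole two hours, which may not have been the case.

```python
# Rough average import rate for the shadow re-sync described above.
# Assumption: a steady pace over the whole period (not a measured value).
items = 900_000            # "a little less than 900,000 items"
duration_s = 2 * 60 * 60   # about 2 hours, in seconds

docs_per_second = items / duration_s
print(f"~{docs_per_second:.0f} docs/second on average")
```

So the initial import averaged somewhat over a hundred documents per second, which gives a baseline to compare later activity against.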
We don’t keep any visible log that can tell us when the push replication completed (should we?). But we noticed the sync delay because we were expecting some documents that were created at around 10 am on one of the mobile devices, and they didn’t appear on the server until around 8-10 pm.
The mobile devices were synchronized with the server before the backup, so I don’t think there were many documents being pushed. Our app cleans the DB and keeps only documents that are at most 1 week old. Channels prevent mobile devices from downloading older documents. The total mobile DB may be around 2,000 - 4,000 documents on each device, but as I said, they were synchronized before the backup.
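To make the one-channel-per-day setup concrete, here is a minimal sketch of how a device could work out which daily channels cover the last week. The `day-YYYY-MM-DD` naming scheme and the `recent_channels` helper are hypothetical, chosen only for illustration; they are not the actual channel names or app code.

```python
from datetime import date, timedelta
from typing import List

def recent_channels(today: date, days: int = 7) -> List[str]:
    """Return one channel name per day for the last `days` days, newest first.

    The "day-YYYY-MM-DD" format is a hypothetical naming scheme.
    """
    return [f"day-{(today - timedelta(days=i)).isoformat()}" for i in range(days)]

# Example with an arbitrary date: seven channels, newest first.
channels = recent_channels(date(2016, 9, 15))
print(channels[0], channels[-1], len(channels))
```

With a scheme like this, a device only ever subscribes to 7 of the roughly 900 channels at a time.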
I appreciate your help,
I forgot to say that we decided to do the backup/restore because we were worried about the high workload we were observing on the server (on the main node, where the SGW runs). CPU load and network in/out were both showing a lot of activity.
CPU had peaks of almost 100%.
Network in/out had peaks of almost 2 GB.
Is this normal for the amount of data I described before?
Sync GW is behaving oddly; sometimes it just stops working, while the CPU continues under high load.
I appreciate any advice, we are kind of lost and really need some clue on what’s going on.
I’m sorry you’re seeing problems. Unfortunately it’s hard to diagnose what exactly is going on given the information that’s been provided.
A few comments/suggestions for follow-up:
Flushing the SG bucket will blow away all client checkpoints, and force clients to restart their replications from zero. That would explain the high amount of traffic from clients after backup/restore - each device would be doing a full push and full pull. This wouldn’t send documents that already exist, but would require the work to identify whether documents are present on the remote system (revs_diff for push, changes for pull).
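To illustrate why a restart from zero is costly even when no documents end up being transferred, here is a minimal sketch of the comparison a revs_diff-style check performs. It is a simplified in-memory model in Python, not Sync Gateway's implementation; the `revs_diff` function and the sample document IDs and revisions are hypothetical.

```python
from typing import Dict, List, Set

def revs_diff(server_revs: Dict[str, Set[str]],
              client_revs: Dict[str, List[str]]) -> Dict[str, dict]:
    """Return, per document, the client revisions the server does not have."""
    result = {}
    for doc_id, revs in client_revs.items():
        known = server_revs.get(doc_id, set())
        missing = [rev for rev in revs if rev not in known]
        if missing:  # documents the server already has in full are omitted
            result[doc_id] = {"missing": missing}
    return result

# Hypothetical data: doc1 is already in sync, doc2 has a newer revision, doc3 is new.
server = {"doc1": {"1-abc", "2-def"}, "doc2": {"1-xyz"}}
client = {"doc1": ["2-def"], "doc2": ["2-new"], "doc3": ["1-aaa"]}
print(revs_diff(server, client))
```

Note that every document on the device has to be listed in the request, even when the answer is "nothing missing". With checkpoints gone, each device repeats that check for its whole local database.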
It sounds like each client is replicating a set of channels, where each channel corresponds to a day. How many documents do you have per (daily) channel? Are these documents all unique to the channel, or would they have been replicated in another day’s channel?
What do you mean when you say Sync Gateway ‘stops working’? Are requests failing? Requests returning unexpected responses?
I’d probably recommend hosting Sync Gateway and Couchbase Server on separate nodes. That will make it easier to identify the source of the high CPU utilization, and avoid a feedback loop where high SG CPU usage results in slower Couchbase Server performance, which then results in additional SG CPU usage as requests back up.