Sync gateway taking 30 seconds to compute changes_view?

Hi, we are having some major performance problems with Sync Gateway, to the point that it is unusable. The issue seems to be some channel filtering we set up to filter documents by time for syncing. On our development server everything runs fine; we see logs like this when a user logs in and starts syncing:

2015-09-03T18:23:45.097Z changes_view: Query took 224.6128ms to return 110 rows, options = db.Body{"stale":false, "startkey":[]interface {}{"channel1", 0x1}, "endkey":[]interface {}{"channel1", 0x10342}}

On production though we see logs like this (30 seconds per channel!):

18:11:36.459583 2015-09-03T18:11:36.459Z changes_view: Query took 29.0641465s to return 3 rows, options = db.Body{"stale":false, "startkey":[]interface {}{"channel1", 0x1}, "endkey":[]interface {}{"channel1", 0xbc747}}
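To compare the two log lines above side by side, a small parser can pull out the duration, row count, and key range. This is only a minimal sketch: the regular expression is written against the two lines quoted here and handles just the `ms` and `s` units they contain, `parse_changes_view` is a hypothetical helper name, and treating the second element of each key as a sequence number is an assumption.

```python
import re

# Matches the "changes_view: Query took ..." lines quoted in this thread.
# Only the "ms" and "s" duration units seen in those lines are handled.
LINE_RE = re.compile(
    r'changes_view: Query took (?P<value>[\d.]+)(?P<unit>ms|s) to return (?P<rows>\d+) rows'
    r'.*?"startkey":\[\]interface \{\}\{"(?P<channel>[^"]+)", 0x(?P<start>[0-9a-f]+)\}'
    r'.*?"endkey":\[\]interface \{\}\{"[^"]+", 0x(?P<end>[0-9a-f]+)\}'
)


def parse_changes_view(line):
    """Return duration (seconds), row count and key range, or None if no match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    seconds = float(m.group("value"))
    if m.group("unit") == "ms":
        seconds /= 1000.0
    return {
        "channel": m.group("channel"),
        "seconds": seconds,
        "rows": int(m.group("rows")),
        # Assumption: the hex values in the keys are sequence numbers.
        "start_seq": int(m.group("start"), 16),
        "end_seq": int(m.group("end"), 16),
    }


DEV_LINE = '2015-09-03T18:23:45.097Z changes_view: Query took 224.6128ms to return 110 rows, options = db.Body{"stale":false, "startkey":[]interface {}{"channel1", 0x1}, "endkey":[]interface {}{"channel1", 0x10342}}'
PROD_LINE = '18:11:36.459583 2015-09-03T18:11:36.459Z changes_view: Query took 29.0641465s to return 3 rows, options = db.Body{"stale":false, "startkey":[]interface {}{"channel1", 0x1}, "endkey":[]interface {}{"channel1", 0xbc747}}'

for name, line in (("dev", DEV_LINE), ("prod", PROD_LINE)):
    print(name, parse_changes_view(line))
```

Decoded this way, the production query spans a key range of 1 to 771911 versus 1 to 66370 on development, yet returns far fewer rows.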

With our new setup some users have hundreds of channels, while others have only 10 or so. Syncing takes forever, though, and I can see Sync Gateway is using 51% of memory on one of our nodes and 35% on another, and it’s running very slowly. Before we added this channel filtering it usually used less than 10% of memory. Can Sync Gateway not handle large numbers of channels in general, or do we need a more powerful server? We’re currently using Ubuntu 12.04 with four 2.0 GHz cores and 8 GB of RAM (but Couchbase Server is also running on the same node and using about 4 GB of the RAM).

What are the differences between your development and production servers? Just the volume of data? Are they both running the same builds of Sync Gateway and Couchbase Server?

Yes, it’s just the volume of data and load on the server. On development it was only a few of us hitting the server every few hours. On production a hundred requests are coming in every second due to all the longpoll requests. On dev we have about 200000 documents, but on prod we have about 2 million. I also noticed these types of calls taking a long time, but I’m not sure what they refer to:

2015/09/03 20:32:03 go-couchbase: call to ViewCustom("sync_gateway", "channels") in took 13.2068436s

Log lines of this type report times anywhere from 8 to 17 seconds.

I think you’re on Sync Gateway 1.1.0 (correct me if I’m wrong) - can you let me know what version of Couchbase Server you are running?

That particular view request (“channels”) is used when Sync Gateway’s in-memory cache of recent changes for a channel doesn’t cover the full time range requested by the user. There are probably two things to investigate: how frequently clients are requesting data that isn’t in the recent change cache, and why the view is taking so long to run.
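The cache-versus-view behavior described above can be pictured with a toy model. This is purely a conceptual sketch and not Sync Gateway's actual implementation: `ChannelCache`, `valid_from`, `query_view`, and `slow_view` are all hypothetical names, and the sequence numbers are made up.

```python
class ChannelCache:
    """Toy per-channel cache of recent change sequences (hypothetical, not Sync Gateway code)."""

    def __init__(self, valid_from, sequences):
        # The cache is assumed complete for every sequence >= valid_from.
        self.valid_from = valid_from
        self.sequences = sorted(s for s in sequences if s >= valid_from)

    def changes_since(self, since, query_view):
        """Return (sequences newer than `since`, whether the view was queried)."""
        if since + 1 >= self.valid_from:
            # The requested range is fully covered by the in-memory cache.
            return [s for s in self.sequences if s > since], False
        # The request reaches back before the cache, so the gap comes from the view.
        older = query_view(since + 1, self.valid_from - 1)
        return sorted(older) + self.sequences, True


def slow_view(start_seq, end_seq):
    """Stand-in for the expensive view query; returns stored sequences in range."""
    stored = [3, 50, 100, 105, 110]
    return [s for s in stored if start_seq <= s <= end_seq]


cache = ChannelCache(valid_from=100, sequences=[100, 105, 110])
print(cache.changes_since(104, slow_view))  # recent client: cache only
print(cache.changes_since(0, slow_view))    # since=0: falls back to the view
```

In this model a client that syncs with since=0 always takes the slow branch, while a client that is nearly up to date never touches the view.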

A couple of questions:

  1. You mention that users have up to 100s of channels. How many total channels do you expect in the system (i.e. are there 100s of channels shared between all users, or far more channels with each channel only granted to a small number of users)?
  2. Typically initial sync might trigger the view call (when users are synchronizing since=0), but subsequent replications will be able to get what they need from the in-memory cache. It’s worth investigating whether subsequent replications are triggering the same view-based retrieval. If so, what’s the use case that’s driving retrieval of data that’s not in the recent cache?

We’re using Couchbase Server Enterprise v3.0.2 and Sync Gateway Enterprise v1.1.0 on Ubuntu 12.04, in a cluster with two nodes (both have the same Couchbase Server and Sync Gateway versions).

  1. We are creating weekly time-based channels, so each user has a channel for each week in which they posted a time-sensitive document. We have about 3000 active users currently, so there could be anywhere from 30000 to 300000 channels for now. This number will continue to grow over time. Since Sync Gateway only allows filtering by channels, and a user must have access to the channels they are filtering by, we need a channel for each week for each user. Normally a user would only have access at any one time to 5 - 10 channels, 3 of those being the time-based channels for last week, this week, and next week. An owner of a group must have access to all the users in their group, as well as their time-based channels, so they could have access to 300 or 400 channels (or more).
  2. We were testing a user logging in to the application after logging out. So they have lots of data but needed to log out for whatever reason, and then log back in and do a full sync. This isn’t a highly likely use case since users tend to log in and stay logged in forever, but it is a possible use case and will occur occasionally. We also just did a huge resync (it took about 12 hours to run) to assign these time-based channels so the users have them for the first time.
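The weekly channel scheme in point 1 can be sketched as follows. The channel name format, the use of ISO week numbers, and the helper names `weekly_channel`, `sync_window`, and `estimate_channels` are all assumptions made for illustration; the thread does not say how the channels are actually named. The estimate at the end simply assumes 10 to 100 posting weeks per user, which is one way to arrive at the 30000 to 300000 range quoted above.

```python
from datetime import date, timedelta


def weekly_channel(user_id, day):
    """Hypothetical channel name for the ISO week that contains `day`."""
    iso_year, iso_week, _ = day.isocalendar()
    return f"{user_id}-{iso_year}-W{iso_week:02d}"


def sync_window(user_id, today):
    """Channels for last week, this week, and next week."""
    return [weekly_channel(user_id, today + timedelta(weeks=offset))
            for offset in (-1, 0, 1)]


def estimate_channels(active_users, weeks_per_user):
    """Total weekly channels if every user posted in `weeks_per_user` weeks."""
    return active_users * weeks_per_user


print(sync_window("user42", date(2015, 9, 3)))
print(estimate_channels(3000, 10), estimate_channels(3000, 100))
```

Because one channel is created per user per week, the total grows linearly with both the number of users and the age of the system, even though each ordinary user only subscribes to three of them at a time.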

A couple of thoughts/questions:

  1. There’s a significant improvement in stale=false view query performance in Couchbase Server 3.1.0 (vs. 3.0.x), so that might provide some improvement for you.

  2. So when your users log out, you need to delete their local DBs and resync from scratch? Is that to support multiple users sharing a DB?

  3. Is the rationale for the weekly channels to ensure that users don’t have any access to documents older than last week? Or just to improve synchronization performance by ignoring older documents?

  1. Thanks, we might try that on a dev environment first since upgrading the server itself could have other repercussions.

  2. I’ll ask our app team why we delete all their data, but when a user logs out, generally that means they want to switch to someone else using the device, so you don’t want the old user’s data on there anymore. In general when someone logs out of an application it’s a best practice to clear their data so that someone else can’t access it without permission.

  3. It’s to improve synchronization performance by ignoring older documents. If we synced all their data for all time, then for group owners it could be so much data that the sync takes hours, and the app may crash trying to index the data once it’s loaded. Currently group owners cannot log into the app because of this, which is why we wanted to introduce time-based syncing. As time goes on users are going to be generating more and more data too, so if a long-time user logged out and then logged back in it would be a problem.

As a possible solution, is there a way to sync some documents, then pull the rest of the data as JSON documents from our server’s API, and still be able to store the pulled data in the Couchbase Lite db? Basically, can we store existing documents in the local Couchbase Lite db correctly so long as we specify the id and rev fields in the JSON document? That way we could bypass Sync Gateway, keep user channels to just a couple, and have the power of our own filtering logic. Or if we just sent the app the ids of the documents to pull, could it retrieve a list of specified documents through Couchbase Lite?

One of the strengths of the replication approach used by Couchbase Lite/Sync Gateway is avoiding the full resynchronization of documents every time a user logs in/uses the application. The usual approach for client-side multi-tenancy is to have a separate client DB for each user, instead of deleting data and resynchronizing from scratch each time you switch users. Moving to that approach sounds like it could mitigate your synchronization problems related to older documents. There would still be the overhead of the initial synchronization, but it would be a one-time hit.
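One way to picture the separate-DB-per-user approach is to derive a stable local database name from the username, so that switching users opens a different database rather than wiping the existing one. This is only a sketch under assumptions: `database_name_for` is a hypothetical helper, the `user-` prefix is arbitrary, and hashing to lowercase hex is simply a conservative way to stay within whatever naming restrictions the client library imposes. A real app would do this in its own platform language.

```python
import hashlib


def database_name_for(username):
    """Hypothetical mapping from a login name to a per-user local database name."""
    # Hashing avoids problematic characters (uppercase, '@', spaces) in the name.
    digest = hashlib.sha1(username.encode("utf-8")).hexdigest()
    return "user-" + digest


print(database_name_for("alice@example.com"))
print(database_name_for("bob@example.com"))
```

The same username always maps to the same name, so a returning user finds their previously synced data and only needs to pull changes made since their last session.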

That still won’t solve the issue where data gets corrupted and requires a full resync, though. For example, one of our Android clients has ~40k documents. His app currently won’t sync; it just keeps calling _changes with the same since value over and over and won’t sync any changes. We need to do a full resync, but with 40k documents that takes a very long time, and it generally crashes a few times in the middle because it runs out of memory. There is also the Android client bug where it erroneously reports that the sync is complete while it is still running, so it keeps trying to re-index while new documents are coming in. So reducing the number of documents synced is really important for cases like this, where the only fix is to uninstall the app and reinstall it. There is a separate bug in that, even after uninstalling and reinstalling, the app still keeps calling _changes instead of doing an actual sync, but that is a separate issue.

We also have a concept of group owners. Currently they don’t log in to devices because it would be too much data, so the members of their groups have been logged in on their own personal devices, creating more and more data. We are trying to update our app so that the group owners can log in once and then the members can use their device to do stuff and generate data. We can’t do this for existing group owners, though, because they have too much data already. Similarly, if a new user is added to the list of group owners then they will need to log in for the first time with tons of data. Also, if a group owner buys another device from us and wants to log in, they need to be able to do that.

We love the fact that the local db exists on the device and doesn’t require the user to be online, and as you mentioned only really needs to be synced once. But there will always be cases where someone with a lot of existing data needs to log in and sync. Since channel filtering is very limited in functionality, and doesn’t work in a production setting for us, we really need a different way to filter. I doubt we are the only app out there with this need, either, since way back when I first posted about how to filter by time I was contacted within a day by someone else who was trying to do the exact same thing.

This sounds like a separate issue that was fixed in the latest version of the Android client. @hideki Am I remembering that correctly?

@borrrden Agreed - this sounds like, which is fixed in release 1.1.1.