I set up a new cluster with user data, but I’m the only one using it.
1 Nginx load balancer; root server: 4 GB RAM, 4 cores
3 CB servers; each CB server is a root server with 16 GB RAM and 4 cores
250k documents, 100% in RAM
Cluster is in idle state as I’m the only one using it
Android CBL 1.3
CB CE 4.1
It’s a list maker for Android. Add/Delete items of lists
Push and pull replication is set to continuous
Changes will be consumed and UI will be updated
My own data is about 20 documents
A change takes about 2-5 seconds to appear on my second device. Ideally I’d aim for less than 1 second, partly because CB is advertised as a “submillisecond latencies at scale” database. So I allow an extra second for network latency and the Android UI update.
A more concerning issue is that sometimes a minute goes by and the change still has not been pulled down by the second device. Here is what I observed:
- make change on device 1
- quickly hit refresh in the Couchbase console to view the document in a browser. The changes are visible
- no change on device 2. Jumping between screens back and forth does not help, leaving the app by pressing the back button does not help, and removing the app from RAM and restarting it does not help. Making another change on device 1 does help.
What logs can I provide to find the issue?
- A routine in the Android app to check whether the changes arrived but the UI wasn’t updated? Are there specific Android CB methods for this?
- Other logs?
I’m planning to move my user base to the new cluster this Sunday. It’d be great if I could collect the logs beforehand so that they contain only my traffic. I’d very much appreciate a quick response.
We quote 2-3 seconds of latency. There is some extra time in there for buffering on the device; database transactions are expensive, so writing each new revision to the device individually isn’t efficient. That said, in a test scenario it should be a bit quicker. I don’t work on the Android version, so I can’t quote details, but there may be some more tuning that can be done on the buffering there.
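To make the buffering trade-off concrete, here is a toy sketch of the general pattern (my own illustration and an assumption about the mechanism, not Couchbase Lite’s actual code): incoming revisions are queued and written in one batch, so N revisions cost one database transaction instead of N, at the price of a little extra latency while the batch fills.

```python
# Toy revision batcher (an assumption about the general buffering pattern,
# NOT Couchbase Lite's actual implementation).
class RevisionBatcher:
    def __init__(self, batch_size=5):
        self.batch_size = batch_size
        self.queue = []       # revisions waiting to be written
        self.flushes = []     # each entry simulates one database transaction

    def add(self, rev):
        self.queue.append(rev)
        if len(self.queue) >= self.batch_size:
            self.flush()

    def flush(self):
        # One "transaction" writes the whole pending batch at once.
        if self.queue:
            self.flushes.append(list(self.queue))
            self.queue.clear()


b = RevisionBatcher(batch_size=5)
for rev in range(12):
    b.add(rev)
b.flush()
print(len(b.flushes))  # 3 transactions for 12 revisions
```

With a batch size of 5, twelve revisions cost three transactions instead of twelve; the last few revisions sit in the queue until the batch fills or a flush is forced, which is where the extra second or two of latency can come from.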
What logs can I provide to find the issue?
Logs from the 2nd Android device, since it sounds like that’s where the problem is.
I collected logs from Android device no. 2 and from all 3 SG nodes. I created a private gist and sent it to you via private message. I hope that’s OK. Feel free to pass it to a colleague if necessary.
I added comments to the Android log. I only made changes to a list/document on device 1. As soon as a change was not replicated, I waited for a bit, then pressed the back button on device 2 to leave the activity (screen) and opened the list again. The changes appeared. This shows up in the SG logs too: there is “Connection lost from client” and “Received unexpected out-of-order change”.
@benjamin_glatzeder For the ‘unexpected out-of-order change’ SG log message, can you share the full SG log output for that line? That’s an unusual warning for your scenario.
@adamf of course. Could you explain how to retrieve more SG logs, please?
My SG config contains:
I cleared the logs before the test and downloaded them after the test. I did not modify or shorten them.
I was asking specifically about the rest of that particular log line; I see it in the data you shared with Jens. The ‘unexpected’ warning indicates that Sync Gateway is receiving a sequence over the server mutation stream much later than it expects to.
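As a toy model of that warning (my own illustration, not Sync Gateway’s implementation): a consumer of the mutation stream keeps a high-water mark of the sequences it has seen and flags anything that arrives well below it. A restore that rewinds the bucket’s sequences underneath a running process produces exactly that pattern.

```python
# Toy model of a changes-feed consumer tracking a high-water-mark sequence
# (an illustration only, not Sync Gateway code).
def consume(feed):
    expected = 1
    warnings = []
    for seq in feed:
        if seq < expected:
            warnings.append(
                f"Received unexpected out-of-order change: seq {seq}, expected {expected}")
        expected = max(expected, seq + 1)
    return warnings


print(consume([1, 2, 3, 4]))      # a normal feed is quiet: []
print(consume([1, 2, 3, 50, 4]))  # a rewound sequence trips the warning for seq 4
```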
A few questions to help isolate/reproduce this issue:
- Are you restarting the Sync Gateway nodes prior to your test? From the logs it looks like there had only been ~700 requests against the SG before you hit the issue, but I’d like to confirm.
- What’s the expected write load during your test (from SG startup time)? Just the ~20 docs that you’re adding via the app, or more?
- What’s the time elapsed from SG startup to the time you ran your test?
- The cluster was newly set up (see first post). At the time of writing and collecting the logs, the cluster was just a few hours old
- Data was put into the cluster with the cbrestore command. The Android apps were reinstalled on devices 1 and 2. The initial sync downloads ~20 docs. Making one change in a list/document updates one document.
- Just a few hours
I’m going to pack my bags and move from my current VPS provider to another one. My plan: stop the load balancer at my current VPS provider, make a backup of the cluster (cbbackup), scp it to the new provider, import it via cbrestore, set a rule on the old load balancer to proxy all traffic to the new load balancer, start the old load balancer, and change the DNS entry. There will be some downtime for the app; that’s fine.
To set up the new cluster I restored an up-to-date backup and changed the Android app code to point to the new load balancer IP. The new cluster has all user data, but I’m the only one using it. Setup details are in the first post. I can try the following things, check whether the issues appear there too, and provide logs:
- start cluster with an empty bucket and test syncing
- use CB CE 4.1 and go back to SG 1.2 and test syncing
- other setup ideas to test?
What should I test, or are there other helpful things I can do?
Were the Sync Gateway nodes running when cbrestore was run? From the logs, it looks like a case where Sync Gateway thought the current sequence in the bucket was higher than the actual value - that could happen if the bucket was modified underneath SG.
If that’s not the case, the issue could potentially be related to the backup itself - running cbbackup while Sync Gateway is live can result in inconsistent sequence values, depending on the order that backup processes nodes.
Yes, all 3 nodes were live while cbrestore was running. In more detail: all 3 nodes have both CB and SG installed, and all SGs were running while cbrestore was executing. Would you recommend stopping all 3 SG services before running cbrestore?
Yes - I’d recommend stopping SG before running cbrestore. The general issue is that cbbackup and cbrestore aren’t locking operations; they work their way through the vbuckets/nodes sequentially. If Sync Gateway is live, it can update data either ahead of or behind the restore, which would result in the inconsistencies you’re seeing.
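The non-locking behaviour can be sketched with a toy model (mine, not cbbackup’s actual code): the backup copies nodes one at a time, so a write that lands mid-backup is captured on a node copied later but missed on a node copied earlier, yielding a snapshot state that never existed as a whole.

```python
# Toy model of a non-locking backup: nodes are copied sequentially, and
# writes that land while the backup is in flight are only captured on the
# nodes that haven't been copied yet. Illustration only, not cbbackup code.
def backup(nodes, writes_during_backup):
    """nodes: {name: latest_seq}; writes_during_backup: {step: [(name, seq)]}"""
    snapshot = {}
    for step, name in enumerate(sorted(nodes)):
        snapshot[name] = nodes[name]          # copy this node's current state
        for node, seq in writes_during_backup.get(step, []):
            nodes[node] = seq                 # concurrent write lands mid-backup
    return snapshot


cluster = {"node-a": 10, "node-b": 10, "node-c": 10}
# Right after node-a is copied, one logical update touches node-a and node-c:
snap = backup(cluster, {0: [("node-a", 11), ("node-c", 11)]})
print(snap)  # {'node-a': 10, 'node-b': 10, 'node-c': 11} -- only half the update
```

The resulting snapshot holds node-c at sequence 11 but node-a still at 10, i.e. exactly the kind of inconsistent sequence state that can confuse Sync Gateway after a restore.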
I did this last Saturday.
- deleted bucket, created new bucket
- started SG
And then I had the same issues. I didn’t collect logs, sorry! Then I removed SG 1.3, installed SG 1.2, and had no issues. Then I deleted the bucket, created a new bucket, and tested my app with an empty bucket and SG 1.3. I had no issues then either.
I’m very interested in using the latest SG version. What logs can I collect for you? What steps should I follow?
I still suspect that the root cause is related to data inconsistency in the backup data (due to Sync Gateway running while the backup was taken).
If there are any scenarios where you’re able to reproduce this with an empty bucket, I’d like to see the Sync Gateway logs. The log settings as described should be sufficient.
OK, got it! I could do this tonight (European time).
So I’d do the following:
- turn all SGs off. On Ubuntu that would be
service sync_gateway stop
- then run cbbackup
- turn SGs back on
- transfer the backup to the new cluster
- run cbrestore at the new cluster while the SGs there are off
- then start SGs and test sync on two devices
Are these steps correct?
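The steps above can be sketched as a dry-run script. All hosts, paths, bucket names, and credentials below are placeholders, and the cbbackup/cbrestore flags should be checked against the CB version in use before running anything for real:

```python
# Dry-run sketch of the migration runbook. Every host, path, bucket name,
# and credential is a placeholder -- review each command before flipping
# run=True.
import shlex
import subprocess

STEPS = [
    "service sync_gateway stop",                  # on every node of the old cluster
    "cbbackup http://localhost:8091 /backup -u Administrator -p password",
    "service sync_gateway start",                 # old cluster serves again
    "scp -r /backup new-node-1:/backup",          # placeholder destination host
    # on the new cluster, with its SGs still stopped:
    "cbrestore /backup http://localhost:8091 -b mybucket -u Administrator -p password",
    "service sync_gateway start",                 # on every node of the new cluster
]

def execute(steps, run=False):
    for cmd in steps:
        print(cmd)
        if run:                                   # flip only after reviewing the plan
            subprocess.check_call(shlex.split(cmd))

execute(STEPS)  # dry run: prints the plan only
```

The ordering is the important part: SG is stopped before cbbackup runs, and stays stopped on the new cluster until cbrestore has finished.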
And now I’m also interested in the “data inconsistency in the backup data (due to Sync Gateway running while the backup was taken)” part. What happens if I ever need a backup because a disaster happened? I can’t rely on a backup that has data inconsistencies.
That sequence of steps is correct, assuming that your database is starting in a good state (i.e. it’s not starting from a cbrestore that may have issues).
Currently the recommended approach for backing up a Sync Gateway bucket is to take backups while there are no write operations happening through Sync Gateway.
OK! I finished testing. It worked great for a number of changes (I kept a tally list). Then it stopped syncing. Switching between screens, backing out of the app, or removing the app from RAM did not force a sync, but changing a document (by pressing a button in the app) started the sync again.
I might have made another mistake after I restored the backup: I installed SG on all machines at the same time. I didn’t know that SG starts straight away after installation.
The backup may have errors. For a few weeks I had trouble with the stability of the cluster; nodes often ran out of RAM, etc. After upgrading the nodes, these issues were gone. The database was never restored and has been the same from the beginning. Also, you might remember my other issue, which might indicate that the database has problems. Is there any way to tell? And is there any way to fix such issues?
In any case, I collected logs and sent you a private message. Thanks!