Pull replication stops working after 5 minutes sitting idle

I’m experiencing a persistent issue where pull replication stops working after 5 minutes of the device sitting idle. Our device is currently using CouchbaseLite 2.8.1 for Android, but I’ve been able to replicate this behavior in a test app that is using CouchbaseLite 2.7.0 for Android as well.

Here’s a detailed breakdown of the problem, and what I’ve discovered through testing over the past few days:

  • I open the same workbook on two devices, and let them both sit idle for 5+ minutes.
  • I then write on one device (A), but the other device (B) does not receive the change.
  • If I write on the device (B) that did not receive the remote change, it will then pull in the change from (A).

One conclusion I’ve reached is that writing to the device after it’s been sitting idle for 5+ minutes will “wake up” the replicator, causing it to pull in any remote changes that are pending. Furthermore, I did a test where I wrote a document to the database every 3 minutes, and the problem went away – pulling in remote changes did not stop after 5 minutes. So this suggests that the replicator goes into some kind of dormant state after that time?

The replicator itself is a continuous push-and-pull type. During testing, I also set logging to verbose, but I see no change in the replicator status after the 5+ minutes. i.e. After opening the database and configuring the replicator, it correctly goes into the idle state. Then after the 5 minutes, I write on the other device, but nothing changes on the device I’m expecting the remote change to come into – there is no change of the replicator state to busy, and nothing to suggest it is pulling in data. That is, until I do a write to that database, which then touches the replicator and “wakes it up”.

Lastly, there is an iOS version of the app (though it’s currently still on CouchbaseLite 1.4), and pull replication works flawlessly on that platform. For legacy reasons, use of web sockets is explicitly turned off on iOS, but I turned them on so it uses web sockets a well, because I began wondering whether the issue is somehow specific to web sockets and its connection to Sync Gateway. However, pull replication continued to work just fine on iOS.

Regarding web sockets and Sync Gateway, I did come across some information that states that a replicator sends a heartbeat every 5 minutes to keep the connection open, and that any load balancers will need to have a higher value configured for its timeout. I can confirm that our load balancer’s timeout is set to 7200 minutes, so that doesn’t seem to be the issue (though it is curious that the heartbeat time coincides with the time I observe pull replication to stop working).

I’m still not sure exactly what the source of this problem is, whether it’s a Sync Gateway configuration issue, or an issue specific to the Android SDK (2.x)? I’m hoping someone might have some insight or ideas about this issue.

Thanks!

5 min is the default heartbeat interval and so if your land balancer’s keep alive is less than that, you will see websockets connection closing prematurely. I realize that you’ve checked this but would recommend that you double check your LB settings as described here . This has almost always been the cause ( the other instances have been buggy websockets implementations on LB typically resolved by an upgrade- perhaps not an issue here ) .

Other tips
What do the LB logs say ? That would give you insight into what’s going on.

Upgrade to 2.8.1 of CBL Android and retry .

We are using CBL 2.8.1, so I’ve tested with that SDK as well as 2.7.0. In fact, I spent most of today testing on our various supported platforms, and here’s what I’ve found:

  • CBL 2.7.0 on Android tablet device – pull replication stops working after 5 minutes of being idle
  • CBL 2.7.0 on Android emulator – pull replication stops working after 5 minutes of being idle
  • CBL 2.8.1 and Android tablet device – pull replication stops working after 5 minutes of being idle
  • CBL 1.4 on iPhone 8 – pull replication continues working
  • CBL 1.4 on iPhone simulator – pull replication continues working
  • CBL 2.8.1 on iPhone simulator – pull replication continues working

So this issue is not present on iOS, which seems to rule out it being a load balancer problem. Could there be an issue in the Android SDK heartbeat functionality of keeping the web socket alive? It’s looking more like that to me after confirming that there is no issue on iOS.

I will check with our dev ops team next week to get confirmation on our load balancer settings, and to check the load balancer logs.

@blake.meike Related to issue above - Anything that rings a bell on Android side ?

There are a few things that this might be. Some of these might sound irrelevant but… can I ask:

  1. is either of the devices plugged in to a power source?
  2. is either face down?
  3. can I get a description of the devices? Sounds like they are OEM. What version of Android?
  4. can I get a description of the network arrangement? Is this p2p or is there a SGW in between
  5. can I get logcats from both devices for the entire period from the last successful transfer to the first successful transfer after the device recovers.
  6. can I see the code that starts the replicator?

The first two questions are to rule out Doze. Since you tried this on an emulator, I doubt that that is the issue. Still would like to rule it out absolutely.

Thanks for getting back to me on this. And just to reiterate (after continued testing), we are able to reproduce this issue 100% of the time on our devices that use the Android 2.7.0/2.8.1 SDK. The iOS side, however, is fine.

  1. In my testing, yes, the device is plugged into my computer via micro USB, but the issue also happens on devices not plugged in.
  2. No.
  3. Our tablet device is an Eink tablet with an ARM Cortex-A53 (armv7I) processor, running Android 5.1.
  4. We use Sync Gateway.
  5. Here is the logcat for the tablet device exhibiting the problem (link). In the log you can see that the replicator goes into the IDLE state (as expected) at 10:13. At 10:19, changes were made on another device that should have been pulled in on this device, but as you can see, there is no activity by the replicators after 10:13. I don’t have the log output of the other participating device at this time.
  6. First we do a one-shot download (pull only), and once that is done, switch to a continuous push&pull mode. Here is the code snippet: link

Got it. Thanks.

It looks to me as if there is no activity in process 16612 for 7 minutes, after the transition to Idle at 10:23. Is that what you expect? The app is completely inactive, except for the replication?

I also see that at 10:13:24, the device is, in fact, dozing and has gone to sleep.

I would love to see the log from the SGW, to see what it thinks about its connection to this device…

Again, a couple more random questions just to be sure I’m on the right page:

  • ReplicatorListener, in the code snippet, is your class, right?
  • What is your goal in running a one-shot pull before running the continuous repls?

Yes, the app was completely sitting idle for those 7 minutes. Even though I was not interacting with the app on the device itself, changes were made on another device around the 10:19 mark that should have been pulled in by the replicator on this device. But that did not happen.

ReplicatorListener is my class, yes.

We run a one-shot pull when a cached (local) database does not exist on the device. In other words, the entire database needs to be pulled down to the device. In that case, we do the one-shot pull, then go into sync mode. If the database already exists locally, then we go straight to continuous push&pull.

I’ll need to check with Dev Ops tomorrow to see if I can get SG logs for you.

Here is the Sync Gateway log for the same period of time (the timestamp in the SGW log is UTC, while it’s local time for the logcat I provided yesterday). (link)

Thank you.

I am becoming convinced that this is simply Android Doze.

The device that connects to the SGW at 2020-11-25T17:12:58.506Z is still connected 8 minutes later. I don’t know which device that is, but if it is the stalled device, it is not stalled because the SGW has forgotten it.

There are some other things that are worth checking.

  1. In the SGW log, I see only one CBL 2.x device connecting. Is the other device 1.x or perhaps something else?
  2. I’m not seeing the verbose level logging that I expect to see from CouchbaseLite Core. I expect to see Verbose level logging. I’m not seeing any Core logs at all, in the log you provided, just Java logging. This makes it difficult to trace the interactions between CBL and the SGW.

That’s what we were thinking initially too, that it had something to do with Android sleep, but the fact that I can replicate this running on an emulator in Android Studio disproves that I would think.

  1. Yes, the iOS device that was generating the changes that should have been pulled in by the device running Android is still on CBL 1.4.
  2. I specified verbose logging via Database.log.getConsole().setLogLevel(LogLevel.VERBOSE). Isn’t that the “V” tag that we see in the logs?

This morning I recreated the issue running on an Android emulator in Android Studio (with the other application running on iOS), and captured those logs (including from SGW).

Here is the logcat from Android studio, filtered for the application only, with verbose logging enabled for Couchbase: link
Here is the logcat from Android studio, no application filter, with verbose logging enabled for Couchbase, from the latter part of the application-only log above (the whole un-filtered log is too big and noisy): link
And lastly, here is the SGW logs for that period of time: link

I don’t have the logs from the iOS device because I can’t for the life of me remember how to turn on verbose logging in CBL 1.4, and those docs are gone… But from the SGW log, we can see that at around 9:21 (again these logs are in UTC time), the iOS client pushes changes to the database that are received by SGW, but these changes are not pulled in by the Android device.

(And just to clarify, the steps I took here was to open the database and establish the connection to Sync Gateway on both devices, write a stroke on the iOS device to ensure replication is operational [it was], then wait for 5 minutes and write another stroke on iOS. As you can probably see from the logs, the waiting starts at around 9:14:45, then I waited until 9:21 to draw a stroke on iOS, and at that point, pull replication on the Android device was no longer functional.)

Got it. Will pursue.

I’ve created a Jira ticket to track this:

Great. Thanks very much!