Couchbase Lite C stops replicating after network interruption

Dear Sir,

We have a problem replicating data from Couchbase Lite C to Couchbase Server.

  • The setup:

    1. The server is Couchbase Server Community 7.0~7.2, running on Machine A.
    2. Sync Gateway Community is 2.8~3.03, running on Machine A.
    3. Couchbase Lite C Community is 3.0~3.1, running on Machine B: continuous replication mode, push only, documents expire after 48 hours.
    4. Both Machine A and Machine B run Ubuntu Desktop 20.04 LTS with fixed IP address assignment.
  • Normal Operation:
    We only push data from the local DB to the server, and everything works.

  • Abnormal Behavior:
    After confirming that replication is working properly, we disconnect the Ethernet of Machine A,
    then reconnect the Ethernet of Machine A.
    disconnect for short time, replication will work ok.
    disconnect for 0.5~2.5 HR, replication will stop

  • What we found:

    1. The replicator activity level may get stuck at any one of OFFLINE, CONNECTING, IDLE, BUSY.
    2. Within the Couchbase Lite C application, a replicator stop/start cycle does not resume the replication.
    3. Stopping and restarting the Couchbase Lite C application does resume the replication.
    4. Restarting Sync Gateway does not resume the replication.
    5. Here is the output of “ls -l ~sync_gateway/logs/bootstrap” from our test:
      166 Sep 7 16:41 sg_error.log
      3711825 Sep 7 17:18 sg_info.log
      6870131 Sep 7 21:15 sg_stats.log
      166 Sep 7 16:41 sg_warn.log
      All logs were cleared and sync_gateway was started at 16:41, as indicated by sg_error.log and sg_warn.log.
      Ethernet was disconnected at 17:16; sg_info.log stopped recording at 17:18.
      Ethernet was reconnected at 20:44; replication did not resume, as no new info was logged in sg_info.log.
      sync_gateway survived the entire test, as sg_stats.log continued until 21:15.

Have you seen similar problems? What might be going wrong? Is there any workaround?
Hope to hear from you soon.

Regards.

disconnect for 0.5~2.5 HR, replication will stop

Assuming that you are using continuous mode and that the error from disconnecting the Ethernet cable is one of the transient errors, the expected behavior is that the replicator stays in its retry cycle and waits to retry again. When disconnected for 0.5~2.5 HR, by default the replicator will wait for 5 mins (300 seconds) before retrying again. However, if the error when trying to connect is a permanent error, the replicator will stop.
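To make the expected retry cycle concrete, here is a minimal sketch of a capped backoff. Only the 300-second cap comes from the behavior described above; the doubling schedule and the helper names `retryDelaySeconds` and `retriesDuringOutage` are assumptions for illustration, not the library's implementation.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper: wait (in seconds) before retry number `attempt`
// (0-based). ASSUMPTION: the delay doubles each time, up to `maxWaitSeconds`.
std::uint32_t retryDelaySeconds(unsigned attempt, std::uint32_t maxWaitSeconds = 300) {
    const unsigned shift = std::min(attempt, 30u);           // keep the shift in range
    const std::uint32_t delay = std::uint32_t{1} << shift;   // 1, 2, 4, 8, ...
    // Never return 0, so callers that accumulate delays always make progress.
    return std::min(delay, std::max(maxWaitSeconds, std::uint32_t{1}));
}

// Hypothetical helper: number of retries that fit into an outage of
// `outageSeconds`, counting only the waits between attempts.
unsigned retriesDuringOutage(std::uint32_t outageSeconds, std::uint32_t maxWaitSeconds = 300) {
    unsigned attempts = 0;
    std::uint64_t elapsed = 0;
    while (true) {
        elapsed += retryDelaySeconds(attempts, maxWaitSeconds);
        if (elapsed > outageSeconds) break;  // the next retry falls after the outage
        ++attempts;
    }
    return attempts;
}
```

Under these assumptions, a 30-minute outage (1800 seconds) would see 13 retries, so a healthy replicator should keep logging connection attempts throughout the outage rather than going silent.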

Can you enable verbose logging and see the error when the replicator stops? Sharing the log would be very helpful.

  1. The replicator activity level may get stuck at any one of OFFLINE, CONNECTING, IDLE, BUSY.

Enabling verbose logging and sharing the log would be helpful. We have fixed some issues related to this problem from time to time, so updating to the latest CBL version might fix the problem.

Dear pasin,

Thanks for your prompt reply.
We are using Couchbase Lite C++, which comes with the DEB package.
On Ubuntu, Couchbase Lite C is installed by
"sudo apt install libcblite-community"
and
"sudo apt install libcblite-dev-community"

An embarrassing question:
How do we enable verbose logging in the C++ API?
There is no clear documentation on enabling logging.

Thanks for your help very much.

Regards.

You can call CBLLog_SetConsoleLevel(kCBLLogVerbose) to enable verbose logging. The default log level is kCBLLogWarning.

Dear Pasin,

Following your instructions, verbose logging is enabled by adding
CBLLog_SetConsoleLevel(kCBLLogVerbose)

Since this experiment lasts more than 4 hours, and the log files are huge, the files are shared with you by Google Drive.
https://drive.google.com/drive/folders/1UDPBFDU22mt9w8Lo7WEH6XOU5ixtTOfq?usp=sharing

There are two zip files in the Google Drive shared folder:

sg_log_xxx.zip ==> log from Sync Gateway

  • sg_err.log

cbl_verb_log_xxx.zip ==> console log of our couchbase lite application

  • cbl_verb.log ==> from application start, break network connection (for 2.5HR), reconnect network (then wait another 1HR)
  • cbl_verb_restart.log ==> Stop previous CBL application, start and log. Replication back to normal.
  • psLog_after_netbreak ==> the output of ps command after network break.
  • psLog_after_reconnect ==> the output of ps command after re-connect network.

One thing to note:

The timestamps generated by the Couchbase Lite log are 8 hours ahead of our local time. As shown in the first line of the logs, local time (generated by the date command) is 2023-09-21, Thursday, 15:09:08, but the verbose log stamps the time as 23:09:08.

Here we list the events in “couchbase lite verbose” time, with local time in parentheses.

  • 23:09:08 (15:09:08) start CouchbaseLite application
  • 23:24:05 (15:24:05) unplug the Ethernet cable of couchbase server machine ( which also run sync gateway)
  • 02:02:00 (18:02:00) re-plug the Ethernet cable of couchbase server machine.
  • 03:01:36 (19:01:36) re-start couchbase lite application. Console log in cbl_verb_restart.log
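For anyone reading the shared logs, the 8-hour offset can be applied mechanically. The following is a small sketch; `shiftClock` is a hypothetical helper written for this post, not part of Couchbase Lite, and it only handles plain "HH:MM:SS" readings.

```cpp
#include <cstdio>
#include <string>

// Shift an "HH:MM:SS" clock reading by a whole number of hours, wrapping
// around midnight. Returns an empty string if the input does not parse.
std::string shiftClock(const std::string& hhmmss, int hours) {
    int h = 0, m = 0, s = 0;
    if (std::sscanf(hhmmss.c_str(), "%d:%d:%d", &h, &m, &s) != 3) return "";
    if (h < 0 || h > 23 || m < 0 || m > 59 || s < 0 || s > 59) return "";
    // The double modulo keeps the result in 0..23 for negative shifts too.
    const int shifted = (((h + hours) % 24) + 24) % 24;
    char buf[16];
    std::snprintf(buf, sizeof buf, "%02d:%02d:%02d", shifted, m, s);
    return buf;
}
```

Calling `shiftClock("02:02:00", -8)` gives "18:02:00", which matches the re-plug time in the list above.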

As shown in the log file, the last [Sync] related verbose log entry is at 23:40:05 (16 min after unplugging the Ethernet cable). Even after re-plugging the Ethernet cable, there are no [Sync] related messages.

Please help us to solve this issue.
Thank you very much.

I have looked at cbl_verb.log and I see the same thing: the replicator seems to stop working after starting an attempt (attempt #4) to connect to SG. It’s strange that there is no log entry indicating that a BuiltInWebSocket is trying to connect either.

I have filed a ticket : CBL-4933

The only workaround I can think of is to listen to the replicator change event. If no event is reported for a specific period of time after the replicator went to offline or connecting status, just restart the replicator (stop the current one and start it again).
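A minimal sketch of that idea follows. It is deliberately independent of the Couchbase Lite headers: the `Activity` enum only mirrors the activity levels named in this thread, and `ReplicatorWatchdog` with its callback is a hypothetical helper, so treat it as an outline rather than tested integration code.

```cpp
#include <chrono>
#include <functional>
#include <mutex>
#include <utility>

// Mirrors the activity levels named in this thread; defined locally so the
// sketch compiles without the Couchbase Lite headers.
enum class Activity { Stopped, Offline, Connecting, Idle, Busy };

// Hypothetical watchdog: fires `onStalled` when the replicator has stayed in
// Offline or Connecting with no change event for at least `timeout`.
class ReplicatorWatchdog {
public:
    using Clock = std::chrono::steady_clock;

    ReplicatorWatchdog(std::chrono::seconds timeout, std::function<void()> onStalled)
        : timeout_(timeout), onStalled_(std::move(onStalled)) {}

    // Call from the replicator change listener on every status event.
    void onStatusChanged(Activity activity, Clock::time_point now = Clock::now()) {
        std::lock_guard<std::mutex> lock(mutex_);
        activity_ = activity;
        lastEvent_ = now;
    }

    // Call periodically from a housekeeping thread. Returns true if it fired.
    bool poll(Clock::time_point now = Clock::now()) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            const bool waiting = activity_ == Activity::Offline ||
                                 activity_ == Activity::Connecting;
            if (!waiting || now - lastEvent_ < timeout_) return false;
            lastEvent_ = now;  // re-arm so the action is not repeated on every poll
        }
        // Run the callback outside the lock: it may trigger new status events.
        if (onStalled_) onStalled_();
        return true;
    }

private:
    std::chrono::seconds timeout_;
    std::function<void()> onStalled_;
    std::mutex mutex_;
    Activity activity_ = Activity::Stopped;
    Clock::time_point lastEvent_ = Clock::now();
};
```

The callback is where the application would put its recovery action for the replicator. The explicit `now` parameters exist only so the timing logic can be exercised without waiting in real time.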

Dear Pasin,

If no event is reported for a specific period of time after
the replicator went to offline or connecting status,
just restart the replicator (stop the current one and start it again).

We tried something similar in a housekeeping thread, as shown below.
It does not work. Neither toggling host reachable nor stopping and starting the replicator resumes replication.

=========================== begin quote ==============
if (r.status().activity == kCBLReplicatorIdle) {
    idle_count++;
    if (idle_count > 20*10) { // idle for more than 10 minutes, 20 counts per minute
        idle_count = 0;
        need_action = true;
    }
} else {
    idle_count = 0;
}
if (r.status().activity == kCBLReplicatorConnecting) {
    connecting_count++;
    if (connecting_count > 20*10) { // connecting for more than 10 minutes, 20 counts per minute
        connecting_count = 0;
        need_action = true;
    }
} else {
    connecting_count = 0;
}
if (r.status().activity == kCBLReplicatorOffline) {
    offline_count++;
    if (offline_count > 20*10) { // offline for more than 10 minutes, 20 counts per minute
        offline_count = 0;
        need_action = true;
    }
} else {
    offline_count = 0;
}

if (need_action) {
    need_action = false;
#if (0)
    // Option 1 (disabled): toggle host reachability
    r.setHostReachable(false);
    QThread::currentThread()->msleep(1000*10); // 10 seconds
    r.setHostReachable(true);
#else
    // Option 2: stop the replicator, wait, then start it again
    r.stop();
    QThread::currentThread()->msleep(1000*60*10); // 10 minutes
    r.start(false);
#endif
}

==================== end quote ==================

That is really weird. My guess is that some internal flags are off, which prevents the replicator from actually restarting. What if you create a new replicator instead?

Dear Pasin,

With CBLReplicator_Create(), we know how to create a “new” replicator.

  1. But how do we delete (free) the old replicator?

  2. Do we need to call
    rep->stop();

    free(rep)

Regards.

Yes.

CBLListener_Remove(token); // If listening to the replicator change events
CBLReplicator_Stop(rep);
CBLReplicator_Release(rep);

Dear Pasin,

any progress on this issue <>?

Thanks

I picked up the issue last week and tried to reproduce it on my Mac, but I couldn’t. I have reviewed the code against the log. My guess is that the replicator is somehow waiting to get a lock to check for any pending conflicts that need to be resolved, but I just don’t have enough info to support the guess.

Can you reproduce the issue in your dev environment so that you can get a full backtrace of all threads when the replicator hangs while trying to start?

Without being able to reproduce the issue or getting the traces, it is hard to see where the problem is.

Dear Pasin,

Can you reproduce the issue in your dev environment so that you can get a
full backtrace of all threads when the replicator hangs while trying to start?

Do you mean: run our application, repeat our testing procedure ( unplug network cable for 0.5~2.5 HR , then re-plug), and break the program execution in debugger mode, then print back trace of all replicator threads? If yes, we will do the test.

Let me arrange the test ASAP.

Regards.