Couchbase Lite C stops replicating after network interruption

DustinWen · September 11, 2023, 4:33am

Dear Sir,

We have a problem replicating data from Couchbase Lite C to Couchbase Server.

The setup:
1. The server is Couchbase Server Community 7.0~7.2, running on Machine A.
2. Sync Gateway Community is 2.8~3.03, running on Machine A.
3. Couchbase Lite C Community is 3.0~3.1, running on Machine B. Continuous replication mode; push only; documents expire after 48 hours.
4. Both Machine A and Machine B run Ubuntu Desktop 20.04 LTS with fixed IP address assignment.
Normal Operation:
We only push data from the local DB to the server, and everything is OK.
Abnormal Behavior:
After confirming that replication is working properly, we disconnect the Ethernet of Machine A, then reconnect it.
If disconnected for a short time, replication works OK.
If disconnected for 0.5~2.5 hours, replication stops.
What we found:
1. The replicator activity level may get stuck at any one of OFFLINE, CONNECTING, IDLE, BUSY.
2. Within the Couchbase Lite C application, a replicator stop/start cycle does not resume replication.
3. Stopping and restarting the Couchbase Lite C application does resume replication.
4. Restarting Sync Gateway does not resume replication.
5. Here is the output of “ls -l ~sync_gateway/logs/bootstrap” from our test:
  166 Sep 7 16:41 sg_error.log
  3711825 Sep 7 17:18 sg_info.log
  6870131 Sep 7 21:15 sg_stats.log
  166 Sep 7 16:41 sg_warn.log
  All logs were cleared and sync_gateway was started at 16:41, as indicated by sg_error.log and sg_warn.log.
  Ethernet was disconnected at 17:16; sg_info.log stopped recording at 17:18.
  Ethernet was reconnected at 20:44; replication did not resume, as nothing new was logged in sg_info.log.
  sync_gateway survived the entire test, as sg_stats.log continued until 21:15.

Have you seen similar problems? What might be going wrong? Is there any workaround?
Hope to hear from you soon.

Regards.

pasin · September 11, 2023, 5:13am

If disconnected for 0.5~2.5 hours, replication stops.

Assuming that you are using continuous mode and that the error from disconnecting the Ethernet cable is one of the transient errors, the expected behavior is that the replicator enters its retry cycle and waits before retrying. When the disconnection lasts 0.5~2.5 hours, by default the replicator will wait 5 minutes (300 seconds) before retrying again. However, if the error when trying to connect is a permanent error, the replicator will stop.
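
For illustration, a retry cycle with a capped wait can be sketched as below. This is a minimal sketch, not the Couchbase Lite implementation: the doubling schedule and the name `retryDelaySeconds` are assumptions, and only the 300-second cap comes from the reply above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper (not a Couchbase Lite API): delay in seconds before
// retry number `attempt` (1-based). The delay doubles with each attempt
// (an assumed schedule) and is capped at `maxWaitSeconds`.
inline std::uint32_t retryDelaySeconds(unsigned attempt,
                                       std::uint32_t maxWaitSeconds = 300) {
    assert(attempt >= 1);
    std::uint32_t delay = 1;
    // Stop doubling once the cap is reached so the value cannot overflow.
    for (unsigned i = 0; i < attempt && delay < maxWaitSeconds; ++i) {
        delay *= 2;
    }
    return std::min(delay, maxWaitSeconds);
}
```

Under this assumed schedule, the wait reaches the 300-second cap by the ninth attempt, so after an outage of 0.5~2.5 hours a healthy replicator would be expected to retry within about 5 minutes of the cable being reconnected.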

Can you enable verbose logging and see the error when the replicator stops? Sharing the log would be very helpful.

The replicator activity level may get stuck at any one of OFFLINE, CONNECTING, IDLE, BUSY.

Enabling verbose logging and sharing the log would be helpful. We have fixed some issues related to this problem over time, so updating to the latest CBL version might fix the problem.

DustinWen · September 12, 2023, 3:34pm

Dear pasin,

Thanks for your prompt reply.
We are using the Couchbase Lite C++ API, which comes with the DEB package.
On Ubuntu, Couchbase Lite C is installed with
"sudo apt install libcblite-community"
and
"sudo apt install libcblite-dev-community"

An embarrassing question:
How do we enable verbose logging in the C++ API?
We could not find clear documentation on enabling logging.

Thanks for your help very much.

Regards.

pasin · September 12, 2023, 4:06pm

You can call CBLLog_SetConsoleLevel(kCBLLogVerbose) to enable verbose logging. The default log level is kCBLLogWarning.

DustinWen · September 21, 2023, 12:52pm

Dear Pasin,

Following your instructions, verbose logging is enabled by adding
CBLLog_SetConsoleLevel(kCBLLogVerbose)

Since this experiment lasted more than 4 hours and the log files are huge, the files are shared with you via Google Drive.
https://drive.google.com/drive/folders/1UDPBFDU22mt9w8Lo7WEH6XOU5ixtTOfq?usp=sharing

There are two zip files in the Google Drive shared folder:

sg_log_xxx.zip ==> log from Sync Gateway

sg_err.log

cbl_verb_log_xxx.zip ==> console log of our couchbase lite application

cbl_verb.log ==> from application start, break network connection (for 2.5HR), reconnect network (then wait another 1HR)
cbl_verb_restart.log ==> Stop previous CBL application, start and log. Replication back to normal.
psLog_after_netbreak ==> the output of ps command after network break.
psLog_after_reconnect ==> the output of ps command after re-connect network.

One thing to note:

The timestamps generated by the Couchbase Lite log are 8 hours ahead of our local time. As shown in the first line of the logs, the local time (generated by the date command) is 2023-09-21, Thursday, 15:09:08, but the verbose log stamps the time as 23:09:08.

Here we list the timed events in “Couchbase Lite verbose” time, with local time in parentheses.

23:09:08 (15:09:08) start the Couchbase Lite application
23:24:05 (15:24:05) unplug the Ethernet cable of the Couchbase Server machine (which also runs Sync Gateway)
02:02:00 (18:02:00) re-plug the Ethernet cable of the Couchbase Server machine
03:01:36 (19:01:36) restart the Couchbase Lite application; console log in cbl_verb_restart.log
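
The conversion between the two columns above is a fixed 8-hour shift that wraps around midnight. A minimal sketch of that arithmetic follows; the helper names `toSeconds`, `shiftTimeOfDay` and `elapsedSeconds` are hypothetical and not part of any library.

```cpp
#include <cassert>

const int kSecondsPerDay = 24 * 60 * 60;

// Seconds since midnight for a wall-clock time h:m:s.
inline int toSeconds(int h, int m, int s) {
    assert(h >= 0 && h < 24 && m >= 0 && m < 60 && s >= 0 && s < 60);
    return h * 3600 + m * 60 + s;
}

// Shift a time of day by `offsetSeconds` (may be negative),
// wrapping around midnight so the result stays in [0, 86400).
inline int shiftTimeOfDay(int secondsOfDay, int offsetSeconds) {
    int shifted = (secondsOfDay + offsetSeconds) % kSecondsPerDay;
    return (shifted + kSecondsPerDay) % kSecondsPerDay;
}

// Elapsed seconds from `start` to `end`, assuming the two
// times of day are less than 24 hours apart.
inline int elapsedSeconds(int start, int end) {
    return shiftTimeOfDay(end, -start);
}
```

For example, shifting the log time 02:02:00 by -8 hours gives 18:02:00 local, and the interval from 23:24:05 to 02:02:00 is 9475 seconds (2 h 37 min 55 s), which matches the roughly 2.5-hour outage described.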

As shown in the log file, 23:40:05 (16 minutes after unplugging the Ethernet cable) is the last [Sync]-related verbose log entry. Even after re-plugging the Ethernet cable, there are no [Sync]-related messages.

Please help us solve this issue.
Thank you very much.

pasin · September 21, 2023, 10:25pm

I have looked at cbl_verb.log and I see the same thing: the replicator seems to stop working after starting an attempt (attempt #4) to connect to SG. It’s strange that there was also no log indicating that a BuiltInWebSocket was trying to connect.

I have filed a ticket: CBL-4933

The only workaround I can think of is to listen to the replicator change events. If no events are reported for a certain period of time after the replicator goes to the offline or connecting status, just restart the replicator (stop the current one and start it again).
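
The workaround above could be sketched roughly as follows. This is a minimal sketch under stated assumptions: the `Activity` enum and the `ReplicatorWatchdog` class are hypothetical stand-ins rather than Couchbase Lite types, and the actual stop/start calls are left to the caller.

```cpp
#include <cassert>
#include <chrono>
#include <mutex>

// Hypothetical stand-in for the replicator activity levels named in this
// thread; a real program would use the library's own status type.
enum class Activity { Stopped, Offline, Connecting, Idle, Busy };

// Hypothetical watchdog: remembers when the last change event arrived and
// reports when the replicator has been silent for too long while it is
// offline or connecting.
class ReplicatorWatchdog {
public:
    using Clock = std::chrono::steady_clock;

    explicit ReplicatorWatchdog(std::chrono::seconds timeout)
        : timeout_(timeout) {
        assert(timeout.count() > 0);
    }

    // Call from the replicator change listener on every event.
    void onChange(Activity activity, Clock::time_point now) {
        std::lock_guard<std::mutex> lock(mutex_);
        activity_ = activity;
        lastEvent_ = now;
    }

    // Call periodically, for example from a housekeeping thread.
    // Returns true when a restart of the replicator looks warranted.
    bool needsRestart(Clock::time_point now) const {
        std::lock_guard<std::mutex> lock(mutex_);
        const bool waiting = activity_ == Activity::Offline ||
                             activity_ == Activity::Connecting;
        return waiting && (now - lastEvent_) >= timeout_;
    }

private:
    mutable std::mutex mutex_;  // listener and poller may be different threads
    std::chrono::seconds timeout_;
    Activity activity_ = Activity::Stopped;
    Clock::time_point lastEvent_{};
};
```

The timeout would presumably need to be longer than the 300-second retry wait mentioned earlier in the thread; otherwise the watchdog could restart a replicator that is simply waiting for its next retry.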

DustinWen · September 22, 2023, 1:59am

Dear Pasin,

If no events are reported for a certain period of time after
the replicator goes to the offline or connecting status,
just restart the replicator (stop the current one and start it again).

We tried something similar in a housekeeping thread, as shown below.
It does not work. Neither toggling host reachability nor stopping and starting the replicator helps.

=========================== begin quote ==============
// Count consecutive polls spent in each activity level (20 polls per minute).
if (r.status().activity == kCBLReplicatorIdle) {
    idle_count++;
    if (idle_count > 20*10) {  // idle for more than 10 minutes, 20 counts per minute
        idle_count = 0;
        need_action = true;
    }
} else {
    idle_count = 0;
}
if (r.status().activity == kCBLReplicatorConnecting) {
    connecting_count++;
    if (connecting_count > 20*10) {  // connecting for more than 10 minutes, 20 counts per minute
        connecting_count = 0;
        need_action = true;
    }
} else {
    connecting_count = 0;
}
if (r.status().activity == kCBLReplicatorOffline) {
    offline_count++;
    if (offline_count > 20*10) {  // offline for more than 10 minutes, 20 counts per minute
        offline_count = 0;
        need_action = true;
    }
} else {
    offline_count = 0;
}

if (need_action) {
    need_action = false;
#if (0)
    // Variant 1: toggle host reachability
    r.setHostReachable(false);
    QThread::currentThread()->msleep(1000*10);  // 10 seconds
    r.setHostReachable(true);
#else
    // Variant 2: stop the replicator, wait, then start it again
    r.stop();
    QThread::currentThread()->msleep(1000*60*10);  // 10 minutes
    r.start(false);
#endif
}

==================== end quote ==================

pasin · September 22, 2023, 2:53am

That is really weird. I am guessing that some internal flags are off and that this prevents the replicator from actually restarting. What if you create a new replicator instead?

DustinWen · September 23, 2023, 1:21pm

Dear Pasin,

We know how to create a “new” replicator with CBLReplicator_Create().

But how do we delete (free) the old replicator?
Do we need to call
rep->stop();

free(rep)

Regards.

pasin · September 23, 2023, 5:52pm

Yes.

CBLListener_Remove(token); // If listening to the replicator change events
CBLReplicator_Stop(rep);
CBLReplicator_Release(rep);

DustinWen · November 16, 2023, 12:56am

Dear Pasin,

Any progress on this issue?

Thanks

pasin · November 20, 2023, 9:26pm

I picked up the issue last week and tried to reproduce it on my Mac, but I couldn’t. I have reviewed the code against the log. My guess is that the replicator is somehow waiting to get a lock in order to check for pending conflicts that need to be resolved, but I don’t have enough information to support that guess.

Can you reproduce the issue in your dev environment so that you can get a full backtrace of all threads when the replicator hangs while trying to start?

Without being able to reproduce the issue or getting the traces, it is hard to see where the problem is.

DustinWen · December 6, 2023, 7:41am

Dear Pasin,

Can you reproduce the issue in your dev environment so that you can get a
full backtrace of all threads when the replicator hangs while trying to start?

Do you mean: run our application, repeat our testing procedure (unplug the network cable for 0.5~2.5 hours, then re-plug it), break the program execution in the debugger, and then print the backtrace of all replicator threads? If yes, we will do the test.

Let me arrange the test ASAP.

Regards.

system · March 5, 2024, 7:42am

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.
