Possible bug in handling of an app crash during sync

I’ve got something weird with the sync engine when there is an app crash.
It’s been around for a long time, but was rare enough that I ignored it.
It turns out this happens most of the time if the app crashes during startup.

Environmental Details:

  • Client: MacOS Ventura 13.5
  • Language/Framework: C#, .Net Core 7
  • CB Lib: Couchbase.Lite 3.1.1, Couchbase.Lite.Support.NetDesktop 3.1.1

How to reproduce:

  • Have a cb database with an account with a reasonable amount of data (in my case 94 MB with 260 documents). I can reproduce this with less - it just needs to be enough to take time to sync.
  • Start up my app on a machine. First time run so there is no local db.
  • Collect login credentials.
  • Local DB is created. The sync engine starts. I can see syncing start, and data begins to arrive.
  • Kill the app (using kill -9 or in the debugger).

Now restart the app.

  • Collect login credentials
  • The sync engine starts.
  • The sync engine pulls no data, it sends a SyncUpdate message of “Idle”.

There seems to be no way to tell the difference between an incomplete sync and a normal complete sync, that is I don’t get any indication of the failure. Further the only way to fix this is to delete the local db and restart the sync from scratch. That is, restarting the app doesn’t fix this. Once it believes sync is complete, it is always complete.

Logging
I have been collecting DB logs using a class that implements ILogger. In my case, I just dump the log entries into nlog using a shameless copy of your sample code:

    class DBSyncLogger : Couchbase.Lite.Logging.ILogger
    {
        private static Logger log = LogManager.GetCurrentClassLogger();

        public Couchbase.Lite.Logging.LogLevel Level { get; set; }

        public void Reset()
        {
        }

        public void Log(Couchbase.Lite.Logging.LogLevel level, LogDomain domain, string message)
        {
            switch (level)
            {
                case Couchbase.Lite.Logging.LogLevel.Info:
                    log.Info(message);
                    break;
                case Couchbase.Lite.Logging.LogLevel.Error:
                    log.Error(message);
                    break;
                case Couchbase.Lite.Logging.LogLevel.Warning:
                    log.Warn(message);
                    break;
                case Couchbase.Lite.Logging.LogLevel.Debug:
                    log.Debug(message);
                    break;
                case Couchbase.Lite.Logging.LogLevel.Verbose:
                    log.Trace(message);
                    break;
            }
        }
    }

I start the logging using the following code:

#if ENABLEDBLOGGING
            Database.Log.Custom = new DBSyncLogger { Level = Couchbase.Lite.Logging.LogLevel.Warning };
            Database.Log.Console.Domains = LogDomain.All;
            Database.Log.Console.Level = Couchbase.Lite.Logging.LogLevel.Verbose;
#endif

I ran my app, killed it, then ran it again (as outlined above).
I have attached the output of the second run (with as much extraneous logging information removed). You can see the sync being started up, the replicator connecting to the remote end, it doing a few reads/writes and SyncUpdate messages (to my code), then it goes into idle state with hundreds of docs unsynced.

I sure hope I’m doing something wrong during sync. Here is the code that kicks off the sync:

            this.syncCallBack = syncCallBack;
            var config = AppConfig.Instance;
            var url = new Uri(config.sync_url);
            var target = new URLEndpoint(url);
            var auth = new BasicAuthenticator(id, secure_server_password);
            var collection = database.GetCollection("_default");
            var collection_configuration = new CollectionConfiguration()
            {
                ConflictResolver = new SyncConflictResolver()
            };
            var sync_config = new ReplicatorConfiguration(target)
            {
                ReplicatorType = ReplicatorType.PushAndPull,
                Continuous = true,
                Authenticator = auth,
                //EnableAutoPurge = true
            }; 
            sync_config.AddCollection(collection, collection_configuration);
            replicator = new Replicator(sync_config);
            listenerToken = replicator.AddChangeListener(SyncStatusUpdate);

            log.Debug("StartSyncAgent: Using url: {0}, id: {1}".Fmt(url, id));
            replicator.Start();

Any thoughts/comments would be most appreciated.

p.s. I see this sometimes later during application execution. That is, the app starts and completes first sync. Then later the app crashes (presumably during a later sync), then it never syncs any more documents, it just says idle. I think this is the same instance of the problem, but haven’t proved it.

p.p.s I’m not seeing any errors that could indicate a locally corrupted db.

p.p.p.s if there are no crashes, then syncing works fine. Most of the time if there is an app crash after the initial sync completes (and perhaps when syncing isn’t happening), then syncing continues to work fine.

Thank you
Paul.

Hmmm. looks like you can’t upload a txt file to attach to the report.
So here is the log file I mention above:
[Apologies that this has got so long… but it needed to be]

12723|1|15:40:58.7814|Trace|CoreLibrary.ServerManager.StartSyncAgent|StartSyncAgent
12723|1|15:40:58.7913|Debug|CoreLibrary.ServerManager.StartSyncAgent|StartSyncAgent: Using url: wss://URL:PORT/DB, id: THE_ID

12723|1|15:40:58.8028|Info|CoreLibrary.DBSyncLogger.Log|(Replicator) [1] Replicator[<*> wss://URL:PORT/DB]: Starting
12723|1|15:40:58.8062|Info|CoreLibrary.DBSyncLogger.Log|(Replicator) [1] Replicator[<*> wss://URL:PORT/DB] is Connecting, progress 0/0
12723|1|15:40:58.8069|Info|CoreLibrary.DBSyncLogger.Log|(Replicator) [1] Replicator[<*> wss://URL:PORT/DB] is Connecting, progress 0/0

12723|21|15:40:58.9090|Trace|CoreLibrary.ServerManager.SyncStatusUpdate|Completed. Status now Connecting. message Connecting

12723|1|15:40:58.8384|Info|CoreLibrary.DBSyncLogger.Log|(XQuery) [1] Query encoded as {"WHERE":["AND",["=",[".type"],"Envelope"],["=",[".read"],false]],"WHAT":[["._id"]],"FROM":[{"SCOPE":"_default","COLLECTION":"_default","AS”:”DB”}]}

12723|1|15:40:58.8384|Debug|AppMac.MainContentVC.SyncCallback|Sync Status now 'NotStarted': ''

12723|22|15:40:58.8664|Info|CoreLibrary.DBSyncLogger.Log|(QueryResultSet) [.NET Long Running Task (22)] Beginning query enumeration (600000ab1510)
12723|22|15:40:58.8664|Info|CoreLibrary.DBSyncLogger.Log|(QueryResultSet) [.NET Long Running Task (22)] End of query enumeration (600000ab1510)
12723|13|15:40:58.8722|Trace|CoreLibrary.DBSyncLogger.Log|(SerialQueue) [13] DispatchAsync Void <Start>b__31_0()
12723|23|15:40:58.8746|Info|CoreLibrary.DBSyncLogger.Log|(WebSocketWrapper) [.NET Long Running Task (23)] c4Socket is retained, and reachability status monitor is starting.

12723|21|15:40:58.9090|Trace|CoreLibrary.ServerManager.SyncStatusUpdate|Completed. Status now Connecting. message Connecting

12723|11|15:40:59.0838|Trace|CoreLibrary.DBSyncLogger.Log|(SerialQueue) [.NET ThreadPool Worker (11)] DispatchAsync Void StartInternal()
12723|25|15:40:59.1070|Info|CoreLibrary.DBSyncLogger.Log|(WebSocketWrapper) [.NET Long Running Task (25)] WebSocket connecting to <URL:PORT/DB>

12723|12|15:40:59.5595|Trace|CoreLibrary.DBSyncLogger.Log|(SerialQueue) [.NET ThreadPool Worker (12)] DispatchAsync Void OnSocketReady()
12723|12|15:40:59.5649|Trace|CoreLibrary.DBSyncLogger.Log|(SerialQueue) [.NET ThreadPool Worker (12)] DispatchAsync Void HandleHTTPResponse()
12723|22|15:40:59.5674|Trace|CoreLibrary.DBSyncLogger.Log|(WebSocketWrapper) [.NET Long Running Task (22)] WebSocket sent HTTP request...
12723|12|15:40:59.7975|Info|CoreLibrary.DBSyncLogger.Log|(WebSocketWrapper) [.NET ThreadPool Worker (12)] WebSocket CONNECTED!
12723|21|15:40:59.7998|Trace|CoreLibrary.DBSyncLogger.Log|(WebSocketWrapper) [.NET Long Running Task (21)] c4Socket got http response.
12723|21|15:40:59.8004|Info|CoreLibrary.DBSyncLogger.Log|(WebSocketWrapper) [.NET Long Running Task (21)] c4Socket is open.
12723|14|15:40:59.8806|Info|CoreLibrary.DBSyncLogger.Log|(Replicator) [14] Replicator[<*> URL:PORT/DB] is Busy, progress 0/0

12723|19|15:40:59.9463|Trace|CoreLibrary.ServerManager.SyncStatusUpdate|Completed. Status now Idle. message Sync is idle

12723|18|15:40:59.9425|Trace|CoreLibrary.DBSyncLogger.Log|(WebSocketWrapper) [.NET Long Running Task (18)] c4Socket completed Write.
12723|17|15:41:00.1845|Trace|CoreLibrary.DBSyncLogger.Log|(WebSocketWrapper) [.NET Long Running Task (17)] <<< received 93 bytes [now 93 pending]
12723|16|15:41:00.1907|Trace|CoreLibrary.DBSyncLogger.Log|(WebSocketWrapper) [.NET Long Running Task (16)] c4Socket received.
12723|15|15:41:00.2678|Info|CoreLibrary.DBSyncLogger.Log|(Replicator) [15] Replicator[<*> URL:PORT/DB] is Busy, progress 0/102016
12723|27|15:41:00.3198|Trace|CoreLibrary.DBSyncLogger.Log|(WebSocketWrapper) [.NET Long Running Task (27)] c4Socket completed Write.
12723|27|15:41:00.3198|Trace|CoreLibrary.DBSyncLogger.Log|(WebSocketWrapper) [.NET Long Running Task (27)] c4Socket completed Write.
12723|27|15:41:00.3198|Trace|CoreLibrary.DBSyncLogger.Log|(WebSocketWrapper) [.NET Long Running Task (27)] c4Socket completed Write.
12723|26|15:41:00.3286|Trace|CoreLibrary.ServerManager.SyncStatusUpdate|Completed. Status now InProgress. message Sync in progress
12723|28|15:41:00.5140|Trace|CoreLibrary.DBSyncLogger.Log|(WebSocketWrapper) [.NET Long Running Task (28)] <<< received 9 bytes [now 9 pending]
12723|29|15:41:00.5451|Trace|CoreLibrary.DBSyncLogger.Log|(WebSocketWrapper) [.NET Long Running Task (29)] c4Socket received.
12723|31|15:41:00.6948|Trace|CoreLibrary.DBSyncLogger.Log|(WebSocketWrapper) [.NET Long Running Task (31)] <<< received 23 bytes [now 23 pending]
12723|32|15:41:00.7228|Trace|CoreLibrary.DBSyncLogger.Log|(WebSocketWrapper) [.NET Long Running Task (32)] c4Socket received.
12723|15|15:41:00.7739|Info|CoreLibrary.DBSyncLogger.Log|(Replicator) [15] Replicator[<*> URL:PORT/DB] is Busy, progress 102016/102016
12723|35|15:41:00.8295|Trace|CoreLibrary.DBSyncLogger.Log|(WebSocketWrapper) [.NET Long Running Task (35)] c4Socket completed Write.
12723|35|15:41:00.8295|Trace|CoreLibrary.DBSyncLogger.Log|(WebSocketWrapper) [.NET Long Running Task (35)] c4Socket completed Write.
12723|34|15:41:00.8579|Trace|CoreLibrary.ServerManager.SyncStatusUpdate|Completed. Status now Idle. message Sync is idle
12723|36|15:41:01.2189|Trace|CoreLibrary.DBSyncLogger.Log|(WebSocketWrapper) [.NET Long Running Task (36)] <<< received 17 bytes [now 17 pending]
12723|37|15:41:01.2470|Trace|CoreLibrary.DBSyncLogger.Log|(WebSocketWrapper) [.NET Long Running Task (37)] c4Socket received.


The logs don’t indicate that sync is facing any issues. If there were an actual error (non connectivity related) then it would stop instead of going idle. All going idle means is that a continuous replication has reached a point at which it would stop were it a one-shot replication. It’s not an indication of the sync being “finished” because a continuous sync is never actually “finished.” I think this warrants some investigation though so if you file an issue with the complete logs that could help get things started. I would expect if the replicator goes idle and there are still documents to sync, that it would quickly go busy again.

Yeah - that’s the weird thing.
Sync on the mac goes idle, and never syncs anything more. Even when there is more, and even when the remote database gets changed (add/delete). Sync just stays idle and there is no more traffic between it and the DB. And no errors.

I have tried restarting/rebooting the server end and the client end. No change.
Sync seems to think there is no more to fetch.

I’ll file a bug report.

@borrrden Issue created here.
Sorry its a bit of a mess… But I’ve copied over my above comments.

Hi @paulr

I had a similar issue with couchbase lite on the android SDK, the issue eventually disappeared after upgrading the Sync Gateway to version 3.0.4, this was after months of back and forth in support tickets and we never actually got to the root of the issue, but it may be worth double checking your sync gateway version.

@meirrosendorff Thanks for the hint. I’ll check and report back.

@meirrosendorff I was running 3.0.3, so I upgraded to 3.1.1.
Instant panic! I downgraded to 3.1.0, same again!

Since it seems to be in the config_manager, I’m guessing there is some config
change between 3.0.3 and 3.1.0 that is causing the problem.

I downgraded to 3.0.5 and it works. So now to test your theory.

 -- base.FatalPanicHandler() at util.go:1524
2023-08-17T22:21:25.251Z ==== Couchbase Sync Gateway/3.1.0(592;2a9837d) CE ==== -- base.LogSyncGatewayVersion() at logging.go:260
2023-08-17T22:21:25.251Z ==== Couchbase Sync Gateway/3.1.0(592;2a9837d) CE ====
2023-08-17T22:21:25.436Z [ERR] Unexpected panic: runtime error: invalid memory address or nil pointer dereference - stopping process
goroutine 1 [running]:
runtime/debug.Stack()
	/home/couchbase/cbdeps/go1.19.5/src/runtime/debug/stack.go:24 +0x65
github.com/couchbase/sync_gateway/base.FatalPanicHandler()
	/home/couchbase/jenkins/workspace/sgw-unix-build/3.1.0/community/sync_gateway/base/util.go:1524 +0x49
panic({0x10f0140, 0x1e81250})
	/home/couchbase/cbdeps/go1.19.5/src/runtime/panic.go:884 +0x212
github.com/couchbase/sync_gateway/rest.(*bootstrapContext).computeMetadataID(0xc0004d6900, {0x15e97c8, 0xc0007406c0}, 0xc00003b870?, 0xc000466038)
	/home/couchbase/jenkins/workspace/sgw-unix-build/3.1.0/community/sync_gateway/rest/config_manager.go:741 +0x186
github.com/couchbase/sync_gateway/rest.(*bootstrapContext).InsertConfig(0xc0004d6900, {0x15e9758?, 0xc00003a038?}, {0xc00003b870, 0x6}, {0x126e5f1, 0x7}, 0xc000466000)
	/home/couchbase/jenkins/workspace/sgw-unix-build/3.1.0/community/sync_gateway/rest/config_manager.go:80 +0x22d
github.com/couchbase/sync_gateway/rest.(*ServerContext).migrateV30Configs(0xc00011a2a0, {0x15e9758, 0xc00003a038})
	/home/couchbase/jenkins/workspace/sgw-unix-build/3.1.0/community/sync_gateway/rest/config.go:1374 +0x259
github.com/couchbase/sync_gateway/rest.(*ServerContext).initializeCouchbaseServerConnections(0xc00011a2a0, {0x15e9758?, 0xc00003a038})
	/home/couchbase/jenkins/workspace/sgw-unix-build/3.1.0/community/sync_gateway/rest/server_context.go:1845 +0x23a
github.com/couchbase/sync_gateway/rest.SetupServerContext({0x15e9758, 0xc00003a038}, 0xc000446d80, 0x20?)
	/home/couchbase/jenkins/workspace/sgw-unix-build/3.1.0/community/sync_gateway/rest/config.go:1264 +0x399
github.com/couchbase/sync_gateway/rest.serverMainPersistentConfig({0x15e9758, 0xc00003a038}, 0xc00040f320, 0xc000446900)
	/home/couchbase/jenkins/workspace/sgw-unix-build/3.1.0/community/sync_gateway/rest/main.go:145 +0x589
github.com/couchbase/sync_gateway/rest.serverMain({0x15e9758, 0xc00003a038}, {0xc000036080, 0x4, 0x4})
	/home/couchbase/jenkins/workspace/sgw-unix-build/3.1.0/community/sync_gateway/rest/main.go:58 +0x145
github.com/couchbase/sync_gateway/rest.ServerMain()
	/home/couchbase/jenkins/workspace/sgw-unix-build/3.1.0/community/sync_gateway/rest/main.go:32 +0x45
main.main()
	/home/couchbase/jenkins/workspace/sgw-unix-build/3.1.0/community/sync_gateway/main.go:24 +0x17
 -- base.FatalPanicHandler() at util.go:1524

@meirrosendorff On 3.0.8, the problem still occurs.

  • Clear out all local data.
  • Start the app. Syncing starts.
  • Kill my app suddenly well before syncing has completed.
  • Restart the app.
    Sync goes idle.

From the bug report I made - couchbase report that it is a bug and is fixed
in sync gateway 3.0.8.1 (which doesn’t seem to be released yet) and 3.1.0.

@paulr Can you provide your database configuration so we can debug the problem on Sync Gateway 3.1.0? This probably has nothing to do with the underlying sync issues.

@torcolvin Just checking - you have asked for more info on this forum post, to which I have replied. Is your comment here regarding the panic (which I also mention above) or the sync gateway bug with loss of sync?

My comment was about the panic on 3.1 which I’ve created https://issues.couchbase.com/browse/CBG-3353 to track.

The short term workaround until we release 3.1.2 would be to add a key bucket with value of the name of the bucket to the _sync:dbconfig:default document and it will remove the panic. This is a hack fix for a situation that we did not anticipate before. This should allow you to run 3.1.1

@torcolvin That workaround worked perfectly! Thank you.

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.