Some unexpected findings in CBL 2.x

benjamin_glatzeder · October 2, 2019, 1:48pm

CBL 2.6 Android
SG 2.6
CB Server 6.0 CE

Below are some unexpected findings. I put them in one post instead of creating several individual posts at once.

1) ReplicatorChangeListener

Adding a one-off listener does never fire, IF a replicator listener was not created before the replicator was started.

// in some class
tokenReplicationListener = MyCBLUtilClass.myReplicator.addChangeListener(new ReplicatorChangeListener() {
            @Override
            public void changed(@NonNull ReplicatorChange change) {
                Log.d(TAG, "in change status = " + change.getStatus().getActivityLevel());
                // once the correct change event was observerd, remove one-off listener
            }
        });

// in MyCblUtilClass
replicator = new Replicator(myConfig);
// If this listener is not added, one-off listeners added in other classes won't fire. 
// If this listeners is added, one-off listeners in other classes do fire
tokenReplicationListener = replicator.addChangeListener(change -> {
// code to handle events or no code at all
});
replicator.start();

Expected: A listener can be added at any moment and fire change events.

2) DocumentReplicationListener

Same as 1) but with DocumentReplicationListener

3) Purging documents

Purging documents does not fire events in a QueryChangeListener of a query. Create/Update/Delete events are fired. I believe this changed from 1.x to 2.x. I believe the old LiveQuery received purge events.
Expected: QueryChangeListener fires when documents are purged.

4) Session timeout with custom authentication and replicator

To test if the continuous replication stops once the session cookie expired I set the session TTL to 60 seconds. It does continue after the session time ran out and even after the bucket was compacted and the _sync:session:XXX document was compacted away.
Expected: Continuous replication stops once the session expired.

5) Updating rejected document by SG

Let me preface this issue that I could not consistently reproduce it. The scenario is:

Client A updates document with an invalid value. SG rejects it. All good so far.
Client B updates the same document with good values. SG accepts it.
Client A does not get the new updated document with only good values right away. Sometimes it works the first time. Sometimes Client B needs to update the document several times until it was synced to Client A

I can easily work around the findings but wanted to bring awareness to them. If all works as expected then it’s documented here now as I didn’t find it in the documentation.

Thanks!

blake.meike · October 2, 2019, 5:43pm

WRT #1 and #2, I would be interested in seeing code that could demonstrate your assertions. Adding a listener after a process starts is certainly a race. I would like to see evidence that a single listener, added after the, e.g., replicator starts, but before it is finished, does not receive notifications.

borrrden · October 2, 2019, 9:43pm

#4 is known but it requires work on the remote side to do correctly. Unlike 1.x, 2.0 and higher uses web sockets for its connections which means it only authenticates once at the beginning so if the TTL has not passed yet then you will authenticate successfully and stay connected until you disconnect for whatever reason (go offline, stop replication, etc).

#5 Sounds like a product of default conflict resolution. If each side only updated once then the winner of the conflict is pretty arbitrary. However if the other side updated more times then that side will win for sure. Otherwise it’s kind of a coin toss as to which gets chosen to be the winner. This blog post dives into more detail on that in the “Pull Replication” section.

benjamin_glatzeder · October 3, 2019, 5:36am

@blake.meike I’m happy to create a demo project and share it here. Before I do so please let me specify the issue. A scenario might be in a chat app. There a user leaves a chat room and an automated message is sent: “User xyz left the room”. To make sure the message is synced up before running other code let’s attach a DocumentReplicationListener and observe the change. The “other code” then could be to remove the chat room’s channel from the user via SG REST API. With this scenario I showed that there are use cases to have a one-off DocumentReplicationListener and that the Listener can be attached minutes after the replication started, .i.e when the user leaves a chat room by clicking a button.
If you were not talking about a race condition or if providing a sample project still makes sense, I’m happy to do so.

@borrrden #4

#5 I forgot to mention this. Yes, I now use the automated conflict resolution from CBL 2. Thanks for the explanation. This makes sense.

blake.meike · October 3, 2019, 3:59pm

@benjamin_glatzeder Yes, I would like to see an example that drives the behavior you describe. I certainly agree that if you attach a listener to a replicator after the replication is complete, that the listener will not receive notifications. If there is a case in which attaching the listener before the replication is complete does not cause the listener to receive state changes, that is a bug. I’ve looked at the code and, just by inspection, I don’t see how that could happen. … but that is exactly why bugs happen. If you can drive the behavior you describe, I’d like to see it. You will have to demonstrate, somehow, that the replication is not complete, when the listener is attached…
… and, FWIW, this seems like an odd concern to me. Even if there is a bug that causes a listener that is attached late to fail to receive notifications, attaching late is a race: sometimes you will get the listener registered before the replicator completes, sometimes you won’t. To avoid the race, attach the listener before starting the replicator.

jens · October 3, 2019, 10:25pm

For items 1 and 2, is this what you’re saying: That if no listener is added to a Replicator before it starts, then listeners added afterwards will never be called? That definitely sounds like a bug.

3 sounds like a bug … I think this happens because purging does not create a sequence, so it doesn’t increment the database’s current sequence number, and that may cause the query listener to think that nothing’s changed.

5 also sounds like a bug: a conflict between A’s local revision that was rejected, and the new incoming revision from B. I think A will keep deciding that its local revision wins the conflict, and ignoring the remote one. We may need to do some special handling for revisions that are known to have been rejected by the server.

jens · October 3, 2019, 10:31pm

I’ve filed issues CBL-444 to investigate 3, and CBL-445 for 5.

benjamin_glatzeder · October 7, 2019, 2:53pm

I will get to the demo app with 1) and 2) this week. Thank you for you patience.

benjamin_glatzeder · October 11, 2019, 4:22pm

I’m currently creating the demo app. It fails at using the BasicAuthorization

configuration.setAuthenticator(new BasicAuthenticator("test_user", "pass1234"));

The user exists which I created and checked with SG admin API

curl -X GET "http://localhost:4985/my_database/_user/test_user" -H  "accept: application/json"
{"name":"test_user","all_channels":["!"]}

And no authorization while allowing guest users fails, too.

"users": {
        "GUEST": {
          "disabled": false,
          "admin_channels": [
            "*"
          ]
        }
      }

This might be because of sg_info.log

2019-10-11T18:11:45.461+02:00 [INF] Reset guest user to config
2019-10-11T18:11:45.461+02:00 [INF] Starting admin server on :4985
2019-10-11T18:11:45.467+02:00 [INF] Starting server on :4984 ...
2019-10-11T18:11:45.492+02:00 [INF] CBGoUtilsLogger: Using plain authentication for user <ud>sync_gateway_test</ud>
2019-10-11T18:11:45.492+02:00 [INF] CBGoUtilsLogger: Using plain authentication for user <ud>sync_gateway_test</ud>
2019-10-11T18:11:45.522+02:00 [INF] DCP: Backfill in progress: 9% (1 / 11)

Backfill does not complete. It just sits there. Restarting the SG service or the machine did not help. Any pointers to use the BasicAuth or no auth at all are greatly appreciated. I use custom auth in production and using it in the demo app would add more dependencies, i.e. a web app authorizing the user.

The error message displayed in Android Studio in both cases is:

com.test.cbldemo E/CouchbaseLite/REPLICATOR: {Repl#9} Got LiteCore error: WebSocket error 1001 "WebSocket connection closed by peer"
2019-10-11 18:20:26.754 7775-7775/com.test.cbldemo W/System.err: CouchbaseLiteException{CouchbaseLite,11001,'WebSocket connection closed by peer'}
2019-10-11 18:20:26.754 7775-7775/com.test.cbldemo W/System.err:     at com.couchbase.lite.CBLStatus.convertException(CBLStatus.java:81)
2019-10-11 18:20:26.754 7775-7775/com.test.cbldemo W/System.err:     at com.couchbase.lite.CBLStatus.convertException(CBLStatus.java:32)
2019-10-11 18:20:26.754 7775-7775/com.test.cbldemo W/System.err:     at com.couchbase.lite.CBLStatus.convertException(CBLStatus.java:38)
2019-10-11 18:20:26.754 7775-7775/com.test.cbldemo W/System.err:     at com.couchbase.lite.AbstractReplicator.updateStateProperties(AbstractReplicator.java:872)
2019-10-11 18:20:26.754 7775-7775/com.test.cbldemo W/System.err:     at com.couchbase.lite.AbstractReplicator.c4StatusChanged(AbstractReplicator.java:569)
2019-10-11 18:20:26.754 7775-7775/com.test.cbldemo W/System.err:     at com.couchbase.lite.AbstractReplicator$ReplicatorListener.lambda$statusChanged$0(AbstractReplicator.java:195)
2019-10-11 18:20:26.754 7775-7775/com.test.cbldemo W/System.err:     at com.couchbase.lite.-$$Lambda$AbstractReplicator$ReplicatorListener$VQV3yukqXlbZQTDkVEhKdctJGcw.run(Unknown Source:4)
2019-10-11 18:20:26.754 7775-7775/com.test.cbldemo W/System.err:     at com.couchbase.lite.internal.AndroidExecutionService$SerialExecutor.lambda$execute$0$AndroidExecutionService$SerialExecutor(AndroidExecutionService.java:94)
2019-10-11 18:20:26.754 7775-7775/com.test.cbldemo W/System.err:     at com.couchbase.lite.internal.-$$Lambda$AndroidExecutionService$SerialExecutor$1kVjncGsdg5bUhp-PYtBQvtsUR4.run(Unknown Source:4)
2019-10-11 18:20:26.755 7775-7775/com.test.cbldemo W/System.err:     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
2019-10-11 18:20:26.755 7775-7775/com.test.cbldemo W/System.err:     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
2019-10-11 18:20:26.755 7775-7775/com.test.cbldemo W/System.err:     at java.lang.Thread.run(Thread.java:764)
2019-10-11 18:20:26.755 7775-7775/com.test.cbldemo W/System.err: Caused by: LiteCoreException{domain=6, code=1001, msg=WebSocket connection closed by peer}
2019-10-11 18:20:26.755 7775-7775/com.test.cbldemo W/System.err:     at com.couchbase.lite.CBLStatus.convertException(CBLStatus.java:41)
2019-10-11 18:20:26.755 7775-7775/com.test.cbldemo W/System.err: 	... 9 more

jens · October 11, 2019, 4:40pm

‘WebSocket connection closed by peer’ isn’t an auth failure. It means the other end abruptly closed the TCP socket. What’s the URL you’re giving the replicator? Is anything interesting logged if you turn on sync logging?

borrrden · October 11, 2019, 4:45pm

If I recall correctly then SG won’t accept any connections until it is done with its backfill. I don’t know the reason for it not completing but I suspect @bbrks might have some ideas.

benjamin_glatzeder · October 11, 2019, 4:49pm

All logs are on in the SG config file, including debug logs. This is all I get after restarting the service:

2019-10-11T18:42:54.389+02:00 ==== Couchbase Sync Gateway/2.6.0(127;b4c828d) CE ====
2019-10-11T18:42:54.390+02:00 [INF] Logging: Console to stderr
2019-10-11T18:42:54.390+02:00 [INF] Logging: Files to /home/sync_gateway/logs
2019-10-11T18:42:54.390+02:00 [INF] Logging: Console level: info
2019-10-11T18:42:54.390+02:00 [INF] Logging: Console keys: [HTTP]
2019-10-11T18:42:54.390+02:00 [INF] Logging: Redaction level: none
2019-10-11T18:42:54.390+02:00 [INF] requestedSoftFDLimit >= currentHardFdLimit (250000 >= 65535) capping at 65535
2019-10-11T18:42:54.390+02:00 [INF] Configured process to allow 65535 open file descriptors
2019-10-11T18:42:54.390+02:00 [INF] Logging stats with frequency: 1m0s
2019-10-11T18:42:54.390+02:00 [INF] Opening db /<my_database> as bucket "test_bucket", pool "default", server <couchbase://192.168.178.72,192.168.178.80>
2019-10-11T18:42:54.390+02:00 [WRN] Using deprecated config option: "channel_cache_max_length". Use "cache.channel_cache.max_length" instead. -- rest.(*ServerContext)._getOrAddDatabaseFromConfig() at server_context.go:384
2019-10-11T18:42:54.391+02:00 [INF] GoCBCustomSGTranscoder Opening Couchbase database test_bucket on <couchbase://192.168.178.72,192.168.178.80> as user "sync_gateway_test"
2019-10-11T18:42:54.392+02:00 [INF] Auth: Attempting credential authentication couchbase://192.168.178.72,192.168.178.80?http_idle_conn_timeout=90000&http_max_idle_conns=64000&http_max_idle_conns_per_host=256&n1ql_timeout=75000
2019-10-11T18:42:54.399+02:00 [INF] Successfully opened bucket test_bucket
2019-10-11T18:42:54.410+02:00 [INF] Set query timeouts for bucket test_bucket to cluster:1m15s, bucket:1m15s
2019-10-11T18:42:54.410+02:00 [INF] Initializing indexes with numReplicas: 0...
2019-10-11T18:42:54.723+02:00 [INF] Verifying index availability for bucket test_bucket...
2019-10-11T18:42:54.736+02:00 [INF] Indexes ready for bucket test_bucket.
2019-10-11T18:42:54.736+02:00 [INF] delta_sync enabled=false with rev_max_age_seconds=86400 for database lister_production_database
2019-10-11T18:42:54.737+02:00 [INF] Created background task: "CleanAgedItems" with interval 1m0s
2019-10-11T18:42:54.737+02:00 [INF] Cache: Initializing changes cache with options {ChannelCacheOptions:{ChannelCacheMinLength:50 ChannelCacheMaxLength:1000 ChannelCacheAge:1m0s MaxNumChannels:50000 CompactHighWatermarkPercent:80 CompactLowWatermarkPercent:60} CachePendingSeqMaxWait:5s CachePendingSeqMaxNum:10000 CacheSkippedSeqMaxWait:1h0m0s}
2019-10-11T18:42:54.737+02:00 [INF] Initializing changes cache for database lister_production_database
2019-10-11T18:42:54.737+02:00 [INF] Created background task: "InsertPendingEntries" with interval 2.5s
2019-10-11T18:42:54.737+02:00 [INF] Created background task: "CleanSkippedSequenceQueue" with interval 30m0s
2019-10-11T18:42:54.737+02:00 [INF] DCP: Starting mutation feed on bucket test_bucket due to either channel cache mode or doc tracking (auto-import/bucketshadow)
2019-10-11T18:42:54.737+02:00 [INF] DCP: Using DCP feed for bucket: "test_bucket" (based on feed_type specified in config file)
2019-10-11T18:42:54.738+02:00 [INF] DCP: Using DCP Logging Receiver.
2019-10-11T18:42:54.777+02:00 [INF] DCP: Previously persisted backfill sequences found - will resume
2019-10-11T18:42:54.777+02:00 [INF] DCP: Restarting vb 0 using metadata sequence 0  (backfill 0 not in [0-0])
2019-10-11T18:42:54.777+02:00 [INF] DCP: Restarting vb 1 using metadata sequence 0  (backfill 0 not in [0-0])
2019-10-11T18:42:54.778+02:00 [INF] DCP: Restarting vb 2 using metadata sequence 0  (backfill 0 not in [0-0])
2019-10-11T18:42:54.778+02:00 [INF] DCP: Restarting vb 3 using metadata sequence 0  (backfill 0 not in [0-0])
...
2019-10-11T18:42:55.189+02:00 [INF] DCP: Restarting vb 1022 using metadata sequence 0  (backfill 0 not in [0-0])
2019-10-11T18:42:55.189+02:00 [INF] DCP: Restarting vb 1023 using metadata sequence 0  (backfill 0 not in [0-0])
2019-10-11T18:42:55.204+02:00 [INF] Using metadata purge interval of 3.00 days for tombstone compaction.
2019-10-11T18:42:55.204+02:00 [INF] Created background task: "Compact" with interval 24h0m0s
2019-10-11T18:42:55.211+02:00 [INF] Reset guest user to config
2019-10-11T18:42:55.211+02:00 [INF] Starting admin server on :4985
2019-10-11T18:42:55.218+02:00 [INF] Starting server on :4984 ...
2019-10-11T18:42:55.228+02:00 [INF] CBGoUtilsLogger: Using plain authentication for user <ud>sync_gateway_test</ud>
2019-10-11T18:42:55.228+02:00 [INF] CBGoUtilsLogger: Using plain authentication for user <ud>sync_gateway_test</ud>
2019-10-11T18:42:55.257+02:00 [INF] DCP: Backfill in progress: 7% (1 / 14)

url is: URI syncGatewayURI = new URI("ws://192.168.178.68:4984/<my_database>");

There is 3 seperate machines for loadbalancer, SG, Couchbase Server nodes. Hence 3 different local IPs.

benjamin_glatzeder · October 12, 2019, 10:53am

Issue with SG sync is fixed. Any Android app needs internet permission and since Android 8 (?) it needs to allow unecrypted network traffic. With that out of the way please check the attached demo app.

CblDemo.zip (818.8 KB)

Short description:
Button 1) as the name suggests
Button 2) as the name suggests
Button 3) attach adhoc listeners with in MainActivity
Button 4) as the name suggests

I was wrong about the ReplicationChangeListener. This always fires! But the DocumentChangeListener only fires if it is attached before the replication is started.

@blake.meike I have to agree. There is no guarantee that an ad hoc listener is attached before the change event happened. I’d say if you ever need a replicatorChangeListener or documentChangeListener they should be attached before starting the replication. I will still keep the listeners in CouchbaseUtil.class empty as they don’t need to process events all the time. Only when the business logic requires. Then I’ll attach an ad hoc listener.

benjamin_glatzeder · January 24, 2020, 3:03pm

This is fixed in CBL 2.7.

Topic		Replies	Views
Some documents were not synced at initial sync. After trying 10x it worked without making any changes Couchbase Lite	20	6575	October 26, 2016
Sync Gateway replication missing document Couchbase Lite java	18	1097	December 5, 2022
Replicator in Android get stuck at busy state when it goes back to online from offline Couchbase Lite java	34	5178	February 4, 2022
[2.0/2.1] Can't stop Replication Couchbase Lite dot-net	6	1459	September 18, 2018
Sync gateway not syncing documents with cb lite, replicator stuck on busy Sync Gateway	34	1650	November 6, 2023