Sync Gateway replication missing document

Martin.l · November 15, 2022, 5:15pm

Hello,

I hoping this topic has not been created before, but I could not find one.
I’m using Couchbase Lite 3.0 for Android and I’m running into an issue, where a document is locally created without errors but it’s never replicated.

I’m using continuous mode and PUSH only configuration.
And I only reproduce this issue when changing network from 4G or 3G to wifi during replication.
In the DocumentReplicationListener, I managed to add some logs and the following documentId is actually returned in the pendingDocumentIds : fd052b01-44ec-4426-896d-cbb17b79802c

But it never gets replicated as they were no success or error for this document, and it does not exists on the cluster after the replication.

CREATE ATTACHMENT SUCCESS => ATTACHMENT DOC ID => ea8b5aea-7eb3-4dff-ab8e-48103b36d595
CREATE ATTACHMENT SUCCESS => ATTACHMENT DOC ID => fd052b01-44ec-4426-896d-cbb17b79802c
CREATE ATTACHMENT SUCCESS => ATTACHMENT DOC ID => 2f01c46f-1306-49ff-8d79-1bc6327e9285
CREATE ATTACHMENT SUCCESS => ATTACHMENT DOC ID => fe56b229-53a2-420c-81e6-e30993966590
CREATE ATTACHMENT SUCCESS => ATTACHMENT DOC ID => de03d0fe-658b-4114-96d4-8ca4d28d26fd
CREATE FORM SUCCESS => DOC ID => 74158f64-a2ff-4f78-88e3-7fe132026776
REPLICATOR PENDING DOCUMENT IDS => [ea8b5aea-7eb3-4dff-ab8e-48103b36d595, fd052b01-44ec-4426-896d-cbb17b79802c, fe56b229-53a2-420c-81e6-e30993966590, 2f01c46f-1306-49ff-8d79-1bc6327e9285, de03d0fe-658b-4114-96d4-8ca4d28d26fd]
REPLICATION SUCCESS => DOCUMENT ID => 74158f64-a2ff-4f78-88e3-7fe132026776
REPLICATOR PENDING DOCUMENT IDS => [fd052b01-44ec-4426-896d-cbb17b79802c, fe56b229-53a2-420c-81e6-e30993966590, 2f01c46f-1306-49ff-8d79-1bc6327e9285, de03d0fe-658b-4114-96d4-8ca4d28d26fd]
REPLICATION SUCCESS => DOCUMENT ID => ea8b5aea-7eb3-4dff-ab8e-48103b36d595
REPLICATOR PENDING DOCUMENT IDS => [2f01c46f-1306-49ff-8d79-1bc6327e9285, de03d0fe-658b-4114-96d4-8ca4d28d26fd]
REPLICATION SUCCESS => DOCUMENT ID => fe56b229-53a2-420c-81e6-e30993966590
REPLICATOR PENDING DOCUMENT IDS => []
REPLICATION SUCCESS => DOCUMENT ID => de03d0fe-658b-4114-96d4-8ca4d28d26fd
REPLICATION SUCCESS => DOCUMENT ID => 2f01c46f-1306-49ff-8d79-1bc6327e9285

I’m hoping someone can help, I can’t manage to understand why this is happening.
Thank you in advance.

blake.meike · November 15, 2022, 5:41pm

Hi @Martin.l
So… can we get a little more information on this? Which version of CBL are you using? 3.0.2? 3.0.4? On what kind of device are you running the software?
Can you get the console log for the device (adb logcat). Perhaps, instead, you could us a Custom Logger and log somewhere convenient. Without the logs it is going to be pretty hard to guess at what is going on.

Martin.l · November 16, 2022, 2:14am

Hi Blake,

Thank you for the response, sorry about my custom logs.
I’m using Couchbase 3.0.0, and reproduced the issue on multiple android devices (8.1.0, 11)
For more information, i’m waiting for replication to start and then kill the app or just turn off mobile data multiple times.

I only got these logs in the error file:

22:53:42.325814| [WS] ERROR: {N8litecore4repl12C4SocketImplE#6} WebSocket failed to connect! (reason=Network error 2)
22:53:42.327769| [Sync] ERROR: {Repl#4} Got LiteCore error: Network error 2, "java.net.UnknownHostException: Unable to resolve host "sync-data.apnl-cloud.com": No address associated with hostname"
22:53:42.328518| [Sync] ERROR: {C4Replicator#5} State: connecting, progress=0.00%, error=Network error 2, "java.net.UnknownHostException: Unable to resolve host "sync-data.apnl-cloud.com": No address associated with hostname"
22:53:42.332405| [Sync] ERROR: {C4Replicator#5} Transient error (Network error 2, "java.net.UnknownHostException: Unable to resolve host "sync-data.apnl-cloud.com": No address associated with hostname"); attempt #2 in 2 sec...
22:53:42.333207| [Sync] ERROR: {C4Replicator#5} State: offline, progress=0.00%, error=Network error 2, "java.net.UnknownHostException: Unable to resolve host "sync-data.apnl-cloud.com": No address associated with hostname"

However I did found this in the info logs:

22:54:49.054690| [WS]: [JAVA] WebSocket CLOSED with error
java.net.SocketException: Software caused connection abort
	at java.net.SocketInputStream.socketRead0(Native Method)
	at java.net.SocketInputStream.socketRead(SocketInputStream.java:114)
	at java.net.SocketInputStream.read(SocketInputStream.java:170)
	at java.net.SocketInputStream.read(SocketInputStream.java:139)
	at okio.Okio$2.read(Okio.java:140)
	at okio.AsyncTimeout$2.read(AsyncTimeout.java:237)
	at okio.RealBufferedSource.request(RealBufferedSource.java:72)
	at okio.RealBufferedSource.require(RealBufferedSource.java:65)
	at okio.RealBufferedSource.readByte(RealBufferedSource.java:78)
	at okhttp3.internal.ws.WebSocketReader.readHeader(WebSocketReader.java:117)
	at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:101)
	at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:273)
	at okhttp3.internal.ws.RealWebSocket$1.onResponse(RealWebSocket.java:209)
	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:174)
	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1162)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:636)
	at java.lang.Thread.run(Thread.java:764)

22:54:49.058852| [WS]: {N8litecore4repl12C4SocketImplE#6} sent 6391033 bytes, rcvd 1068, in 13.733 sec (465390/sec, 78/sec)
22:54:49.066116| [BLIP]: {N8litecore4blip10ConnectionE#3} Closed with Network error 22: java.net.SocketException: Software caused connection abort
22:54:49.066346| [BLIP]: {N8litecore4blip6BLIPIOE#9} Notifying 3 outgoing messages they're canceled
22:54:49.066433| [Sync]: {Repl#4} Connection closed with Network error 22: "java.net.SocketException: Software caused connection abort" (state=2)
22:54:49.066533| [BLIP]: {N8litecore4blip6BLIPIOE#9} Notifying 3 incoming messages they're canceled
22:54:49.066710| [Sync]: BLIP sent 11 msgs (6391033 bytes), rcvd 3 msgs (1068 bytes) in 13.740 sec. Max outbox depth was 1, avg 1.00
22:54:49.066838| [Sync]: {N8litecore4repl6PusherE#7} activityLevel=stopped: pendingResponseCount=3, caughtUp=1, changeLists=0, revsInFlight=0, blobsInFlight=3, awaitingReply=2883, revsToSend=0, pushingDocs=3, pendingSequences=3
22:54:49.067940| [Sync]: {Repl#4} activityLevel=busy: connectionState=-1, savingChkpt=0
22:54:49.068052| [Sync]: {N8litecore4repl6PusherE#7} activityLevel=stopped: pendingResponseCount=2, caughtUp=1, changeLists=0, revsInFlight=0, blobsInFlight=3, awaitingReply=2883, revsToSend=0, pushingDocs=2, pendingSequences=0

Please tell me if you need anything more.
Thank you very much

Martin.l · November 16, 2022, 4:11pm

Hello again @blake.meike

I managed to reproduce a simple case and found a few logs that may be relevant.
For information, i’m not deleting anything locally or even on the cluster.
I have the autoPurgeEnabled set to false on the replicator config
There’s this exception:

22:55:49.603345| [DB] ***: [JAVA] StateMachine@0x53238f3: unexpected state CLOSING [CLOSED, FAILED]
java.lang.Exception
	at com.couchbase.lite.internal.utils.StateMachine.assertState(StateMachine.java:124)
	at com.couchbase.lite.internal.replicator.AbstractCBLWebSocket$OkHttpRemote.onClosing(AbstractCBLWebSocket.java:198)
	at okhttp3.internal.ws.RealWebSocket.onReadClose(RealWebSocket.java:361)
	at okhttp3.internal.ws.WebSocketReader.readControlFrame(WebSocketReader.java:202)
	at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:103)
	at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:273)
	at okhttp3.internal.ws.RealWebSocket$1.onResponse(RealWebSocket.java:209)
	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:174)
	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1162)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:636)
	at java.lang.Thread.run(Thread.java:764)

And I also found a found a duplicate call to stop in the logs:

22:55:49.569499| [Sync]: {C4Replicator#1} Duplicate call to stop()...

Do you this this could be the cause of a document not replicated?
I can’t seem to find why this is happening.

Thank you in advance.

blake.meike · November 16, 2022, 10:02pm

That is not a failure. That is just an informational stack trace logged in an unusual condition.
It does look like something is interrupting the connection.
Really, though, I’m going to need logs to figure out what is happening. This is certainly not the cause but it might be part of the explanation.

Martin.l · November 18, 2022, 8:48am

Hi @blake.meike

Sorry about that
I added an attachment including Couchbase logs during the problem.
The title might be misleading as the document is not replicated but it also disappears from the pendingDocumentIds.

CBL.zip (23.1 KB)

Thank you in advance!

Martin.l · November 18, 2022, 1:00pm

Blake,

We did some more testing, the only way to reproduce the issue is to start replication with 4G/3G and switch to wifi during the sync.
Could there possibly be something to do when changing the network type? Restart the replicator maybe?
I’m a bit lost to be honest on why this is happening.

Thank you.

Martin.l · November 18, 2022, 2:36pm

Finally found a solution!

I’m not sure if this is normal but by setting the option to reset checkpoints when starting the replicator, and restarting the database, the ghost document finally got replicated to the cluster.

I would really like your opinion on this.
Tell me if you need anything like specific logs or else.

Thank you!

blake.meike · November 18, 2022, 5:36pm

I’m looking at the logs now. I’m the Android guy and replication is, in large part, handled by the native library, LiteCore. I may need help from them to figure out what is going on. Back to you as quickly as possible.

In the maintime, just FYI:

It is unlikely that the type of connection affects replication. Not impossible, but not likely. I don’t think LiteCore can even tell what the layer 2/3 connection looks like. It is more likely to be the interruption in the connection that is the issue.
Resetting the checkpoints causes the replication to re-check the entire dataset, to see what needs to be synchronized. There are two reasons this might work: a) because the checkpoint has, somehow, been corrupted, or b) because it re-establishes the connection, at which point everything proceeds normally.

Back to you sometime today

blake.meike · November 18, 2022, 6:54pm

Hey @Martin.l Can you tell me:

What kind of replication is this? Continuous? Push/Pull?
When do you start a replication and when do you stop it
I see a lot of this: Unable to resolve host "sync-data.apnl-cloud.com": No address associated with hostname". Can you verify that your SGW is visible on both networks?

Martin.l · November 18, 2022, 7:10pm

Hi @blake.meike,

Thank you for taking the time to check the logs.
The replication is continuous and push only and it is stopped when going background or killing the app.
I’ll check the SGW as soon as possible but I think this just might be turning the device offline I’m not sure.

blake.meike · November 18, 2022, 7:22pm

This reminds me a little bit of something that I’ve seen before: Both sides of the connection decide to break it at about the same time. I have, as I mentioned might be necessary, referred this to my LiteCore colleagues, for an opinion. Next update is likely to be next week.

… and, btw, I see that you said PUSH/continuous in your OP. Feh…

blake.meike · November 19, 2022, 12:33am

Ok. I’ve heard back from the LiteCore team. Neither of us sees anything odd about that log.

It shows a replicator that cannot, initially, find its target. Eventually, though, it does get the connection open. Replication proceeds until all of the documents have been pushed and the replicator goes idle. Sometime later, the local application calls stop() at least twice. The replicator shuts down the connection normally.

It looks to us as if the replication is behaving exactly as it should.

geraldapeoples · November 21, 2022, 7:31pm

Ahah … This sounds familiar, my app creates documents representing audit results locally, I’ve had a few instances (very few and very random) of the audit documents not synchronising, it’s push/pull, 3.0.2 client,the documents refuse to sync and the users have had to start and finish a new audit (time and money and reputation) … as it has been very intermittent I’ve never been able to reproduce in dev, I will try now with starting with 4G and switching to wifi to see if I can replicate … not sure that ‘replication is behaving exactly as it should’ is an answer.

blake.meike · November 21, 2022, 7:59pm

I, on the other hand, have yet to see any evidence that it is not.

Not cool to talk trash with nothing to back it up. If you can show me an actual problem, I’ll be all over it.

geraldapeoples · November 28, 2022, 9:43am

Not cool to tell a user they are talking trash.

blake.meike · November 28, 2022, 5:07pm

My apologies. Do you have any evidence to support your claim?

blake.meike · November 28, 2022, 8:52pm

@deepthi1 : Unfortunately, I don’t have domain specific knowledge of your app. If you have a replication problem I will, absolutely, pursue it!

pasin · December 5, 2022, 6:13pm

@Martin.l I just looked at the attached CBL.zip and I could find anything mentioning about “fd052b01-44ec-4426-896d-cbb17b79802c” doc. In general, if a change (revision) is not successfully sent to SG especially due to the connection issue, it will get retried next time. For this case, if I understand it correct, it doesn’t get retried. To debug this issue, we would need to have CBL logs and SG logs when the issue is reproduced. Otherwise it would be difficult to understand what was going on.

Can you reproduce the issue and re-send the logs at the time that the issue happens?

Topic		Replies	Views
Couchbase Lite Replicator Not Syncing after write Couchbase Lite android	14	1354	August 5, 2022
Replication sometimes not working Mobile	10	1976	November 28, 2016
Continuous Replication, New Documents are not created Sync Gateway	2	1192	May 2, 2018
Sync Gateway Replication Android Sync Gateway	0	1513	June 22, 2016
SG rejects newly created document with document ID of deleted document: Channel access missing Sync Gateway	3	2072	December 23, 2017

Sync Gateway replication missing document

Related topics