Sync Gateway replication missing document

Hello,

I hoping this topic has not been created before, but I could not find one.
I’m using Couchbase Lite 3.0 for Android and I’m running into an issue, where a document is locally created without errors but it’s never replicated.

I’m using continuous mode and PUSH only configuration.
And I only reproduce this issue when changing network from 4G or 3G to wifi during replication.
In the DocumentReplicationListener, I managed to add some logs and the following documentId is actually returned in the pendingDocumentIds : fd052b01-44ec-4426-896d-cbb17b79802c

But it never gets replicated as they were no success or error for this document, and it does not exists on the cluster after the replication.

CREATE ATTACHMENT SUCCESS => ATTACHMENT DOC ID => ea8b5aea-7eb3-4dff-ab8e-48103b36d595
CREATE ATTACHMENT SUCCESS => ATTACHMENT DOC ID => fd052b01-44ec-4426-896d-cbb17b79802c
CREATE ATTACHMENT SUCCESS => ATTACHMENT DOC ID => 2f01c46f-1306-49ff-8d79-1bc6327e9285
CREATE ATTACHMENT SUCCESS => ATTACHMENT DOC ID => fe56b229-53a2-420c-81e6-e30993966590
CREATE ATTACHMENT SUCCESS => ATTACHMENT DOC ID => de03d0fe-658b-4114-96d4-8ca4d28d26fd
CREATE FORM SUCCESS => DOC ID => 74158f64-a2ff-4f78-88e3-7fe132026776
REPLICATOR PENDING DOCUMENT IDS => [ea8b5aea-7eb3-4dff-ab8e-48103b36d595, fd052b01-44ec-4426-896d-cbb17b79802c, fe56b229-53a2-420c-81e6-e30993966590, 2f01c46f-1306-49ff-8d79-1bc6327e9285, de03d0fe-658b-4114-96d4-8ca4d28d26fd]
REPLICATION SUCCESS => DOCUMENT ID => 74158f64-a2ff-4f78-88e3-7fe132026776
REPLICATOR PENDING DOCUMENT IDS => [fd052b01-44ec-4426-896d-cbb17b79802c, fe56b229-53a2-420c-81e6-e30993966590, 2f01c46f-1306-49ff-8d79-1bc6327e9285, de03d0fe-658b-4114-96d4-8ca4d28d26fd]
REPLICATION SUCCESS => DOCUMENT ID => ea8b5aea-7eb3-4dff-ab8e-48103b36d595
REPLICATOR PENDING DOCUMENT IDS => [2f01c46f-1306-49ff-8d79-1bc6327e9285, de03d0fe-658b-4114-96d4-8ca4d28d26fd]
REPLICATION SUCCESS => DOCUMENT ID => fe56b229-53a2-420c-81e6-e30993966590
REPLICATOR PENDING DOCUMENT IDS => []
REPLICATION SUCCESS => DOCUMENT ID => de03d0fe-658b-4114-96d4-8ca4d28d26fd
REPLICATION SUCCESS => DOCUMENT ID => 2f01c46f-1306-49ff-8d79-1bc6327e9285

I’m hoping someone can help, I can’t manage to understand why this is happening.
Thank you in advance.

Hi @Martin.l
So… can we get a little more information on this? Which version of CBL are you using? 3.0.2? 3.0.4? On what kind of device are you running the software?
Can you get the console log for the device (adb logcat). Perhaps, instead, you could us a Custom Logger and log somewhere convenient. Without the logs it is going to be pretty hard to guess at what is going on.

Hi Blake,

Thank you for the response, sorry about my custom logs.
I’m using Couchbase 3.0.0, and reproduced the issue on multiple android devices (8.1.0, 11)
For more information, i’m waiting for replication to start and then kill the app or just turn off mobile data multiple times.

I only got these logs in the error file:

22:53:42.325814| [WS] ERROR: {N8litecore4repl12C4SocketImplE#6} WebSocket failed to connect! (reason=Network error 2)
22:53:42.327769| [Sync] ERROR: {Repl#4} Got LiteCore error: Network error 2, "java.net.UnknownHostException: Unable to resolve host "sync-data.apnl-cloud.com": No address associated with hostname"
22:53:42.328518| [Sync] ERROR: {C4Replicator#5} State: connecting, progress=0.00%, error=Network error 2, "java.net.UnknownHostException: Unable to resolve host "sync-data.apnl-cloud.com": No address associated with hostname"
22:53:42.332405| [Sync] ERROR: {C4Replicator#5} Transient error (Network error 2, "java.net.UnknownHostException: Unable to resolve host "sync-data.apnl-cloud.com": No address associated with hostname"); attempt #2 in 2 sec...
22:53:42.333207| [Sync] ERROR: {C4Replicator#5} State: offline, progress=0.00%, error=Network error 2, "java.net.UnknownHostException: Unable to resolve host "sync-data.apnl-cloud.com": No address associated with hostname"

However I did found this in the info logs:

22:54:49.054690| [WS]: [JAVA] WebSocket CLOSED with error
java.net.SocketException: Software caused connection abort
	at java.net.SocketInputStream.socketRead0(Native Method)
	at java.net.SocketInputStream.socketRead(SocketInputStream.java:114)
	at java.net.SocketInputStream.read(SocketInputStream.java:170)
	at java.net.SocketInputStream.read(SocketInputStream.java:139)
	at okio.Okio$2.read(Okio.java:140)
	at okio.AsyncTimeout$2.read(AsyncTimeout.java:237)
	at okio.RealBufferedSource.request(RealBufferedSource.java:72)
	at okio.RealBufferedSource.require(RealBufferedSource.java:65)
	at okio.RealBufferedSource.readByte(RealBufferedSource.java:78)
	at okhttp3.internal.ws.WebSocketReader.readHeader(WebSocketReader.java:117)
	at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:101)
	at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:273)
	at okhttp3.internal.ws.RealWebSocket$1.onResponse(RealWebSocket.java:209)
	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:174)
	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1162)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:636)
	at java.lang.Thread.run(Thread.java:764)

22:54:49.058852| [WS]: {N8litecore4repl12C4SocketImplE#6} sent 6391033 bytes, rcvd 1068, in 13.733 sec (465390/sec, 78/sec)
22:54:49.066116| [BLIP]: {N8litecore4blip10ConnectionE#3} Closed with Network error 22: java.net.SocketException: Software caused connection abort
22:54:49.066346| [BLIP]: {N8litecore4blip6BLIPIOE#9} Notifying 3 outgoing messages they're canceled
22:54:49.066433| [Sync]: {Repl#4} Connection closed with Network error 22: "java.net.SocketException: Software caused connection abort" (state=2)
22:54:49.066533| [BLIP]: {N8litecore4blip6BLIPIOE#9} Notifying 3 incoming messages they're canceled
22:54:49.066710| [Sync]: BLIP sent 11 msgs (6391033 bytes), rcvd 3 msgs (1068 bytes) in 13.740 sec. Max outbox depth was 1, avg 1.00
22:54:49.066838| [Sync]: {N8litecore4repl6PusherE#7} activityLevel=stopped: pendingResponseCount=3, caughtUp=1, changeLists=0, revsInFlight=0, blobsInFlight=3, awaitingReply=2883, revsToSend=0, pushingDocs=3, pendingSequences=3
22:54:49.067940| [Sync]: {Repl#4} activityLevel=busy: connectionState=-1, savingChkpt=0
22:54:49.068052| [Sync]: {N8litecore4repl6PusherE#7} activityLevel=stopped: pendingResponseCount=2, caughtUp=1, changeLists=0, revsInFlight=0, blobsInFlight=3, awaitingReply=2883, revsToSend=0, pushingDocs=2, pendingSequences=0

Please tell me if you need anything more.
Thank you very much

Hello again @blake.meike

I managed to reproduce a simple case and found a few logs that may be relevant.
For information, i’m not deleting anything locally or even on the cluster.
I have the autoPurgeEnabled set to false on the replicator config
There’s this exception:

22:55:49.603345| [DB] ***: [JAVA] StateMachine@0x53238f3: unexpected state CLOSING [CLOSED, FAILED]
java.lang.Exception
	at com.couchbase.lite.internal.utils.StateMachine.assertState(StateMachine.java:124)
	at com.couchbase.lite.internal.replicator.AbstractCBLWebSocket$OkHttpRemote.onClosing(AbstractCBLWebSocket.java:198)
	at okhttp3.internal.ws.RealWebSocket.onReadClose(RealWebSocket.java:361)
	at okhttp3.internal.ws.WebSocketReader.readControlFrame(WebSocketReader.java:202)
	at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:103)
	at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:273)
	at okhttp3.internal.ws.RealWebSocket$1.onResponse(RealWebSocket.java:209)
	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:174)
	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1162)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:636)
	at java.lang.Thread.run(Thread.java:764)

And I also found a found a duplicate call to stop in the logs:

22:55:49.569499| [Sync]: {C4Replicator#1} Duplicate call to stop()...

Do you this this could be the cause of a document not replicated?
I can’t seem to find why this is happening.

Thank you in advance.

That is not a failure. That is just an informational stack trace logged in an unusual condition.
It does look like something is interrupting the connection.
Really, though, I’m going to need logs to figure out what is happening. This is certainly not the cause but it might be part of the explanation.

Hi @blake.meike

Sorry about that
I added an attachment including Couchbase logs during the problem.
The title might be misleading as the document is not replicated but it also disappears from the pendingDocumentIds.

CBL.zip (23.1 KB)

Thank you in advance!

Blake,

We did some more testing, the only way to reproduce the issue is to start replication with 4G/3G and switch to wifi during the sync.
Could there possibly be something to do when changing the network type? Restart the replicator maybe?
I’m a bit lost to be honest on why this is happening.

Thank you.

Finally found a solution!

I’m not sure if this is normal but by setting the option to reset checkpoints when starting the replicator, and restarting the database, the ghost document finally got replicated to the cluster.

I would really like your opinion on this.
Tell me if you need anything like specific logs or else.

Thank you!

I’m looking at the logs now. I’m the Android guy and replication is, in large part, handled by the native library, LiteCore. I may need help from them to figure out what is going on. Back to you as quickly as possible.

In the maintime, just FYI:

  • It is unlikely that the type of connection affects replication. Not impossible, but not likely. I don’t think LiteCore can even tell what the layer 2/3 connection looks like. It is more likely to be the interruption in the connection that is the issue.
  • Resetting the checkpoints causes the replication to re-check the entire dataset, to see what needs to be synchronized. There are two reasons this might work: a) because the checkpoint has, somehow, been corrupted, or b) because it re-establishes the connection, at which point everything proceeds normally.

Back to you sometime today

Hey @Martin.l Can you tell me:

  • What kind of replication is this? Continuous? Push/Pull?
  • When do you start a replication and when do you stop it
  • I see a lot of this: Unable to resolve host "sync-data.apnl-cloud.com": No address associated with hostname". Can you verify that your SGW is visible on both networks?

Hi @blake.meike,

Thank you for taking the time to check the logs.
The replication is continuous and push only and it is stopped when going background or killing the app.
I’ll check the SGW as soon as possible but I think this just might be turning the device offline I’m not sure.

This reminds me a little bit of something that I’ve seen before: Both sides of the connection decide to break it at about the same time. I have, as I mentioned might be necessary, referred this to my LiteCore colleagues, for an opinion. Next update is likely to be next week.

… and, btw, I see that you said PUSH/continuous in your OP. Feh…

Ok. I’ve heard back from the LiteCore team. Neither of us sees anything odd about that log.

It shows a replicator that cannot, initially, find its target. Eventually, though, it does get the connection open. Replication proceeds until all of the documents have been pushed and the replicator goes idle. Sometime later, the local application calls stop() at least twice. The replicator shuts down the connection normally.

It looks to us as if the replication is behaving exactly as it should.

Ahah … This sounds familiar, my app creates documents representing audit results locally, I’ve had a few instances (very few and very random) of the audit documents not synchronising, it’s push/pull, 3.0.2 client,the documents refuse to sync and the users have had to start and finish a new audit (time and money and reputation) … as it has been very intermittent I’ve never been able to reproduce in dev, I will try now with starting with 4G and switching to wifi to see if I can replicate … not sure that ‘replication is behaving exactly as it should’ is an answer.

I, on the other hand, have yet to see any evidence that it is not.

Not cool to talk trash with nothing to back it up. If you can show me an actual problem, I’ll be all over it.

Not cool to tell a user they are talking trash.

My apologies. Do you have any evidence to support your claim?

@deepthi1 : Unfortunately, I don’t have domain specific knowledge of your app. If you have a replication problem I will, absolutely, pursue it!

@Martin.l I just looked at the attached CBL.zip and I could find anything mentioning about “fd052b01-44ec-4426-896d-cbb17b79802c” doc. In general, if a change (revision) is not successfully sent to SG especially due to the connection issue, it will get retried next time. For this case, if I understand it correct, it doesn’t get retried. To debug this issue, we would need to have CBL logs and SG logs when the issue is reproduced. Otherwise it would be difficult to understand what was going on.

Can you reproduce the issue and re-send the logs at the time that the issue happens?