I have a Structured Streaming job that streams from Couchbase; the persistence polling interval is 100 ms. I observed a strange case where a single batch carried a huge load and the Spark job went to a second attempt, and it missed processing a few records in that batch. How can this happen? I am maintaining a checkpoint folder in HDFS.
Whenever I see a report of documents missing from a query, I recall that queries on indexes only find what has been indexed. If there are documents that have not yet been indexed, the query will not find them. I wonder if that is what you are observing? It would be useful to have more details. Also, please open a ticket with customer support.
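To illustrate what I mean by "only find what is indexed", here is a toy Python sketch. It is not Couchbase or Spark code: `ToyBucket`, `upsert`, `run_indexer`, and `query_ids` are names I made up for this example. It only models the general idea that a write can be persisted before the index catches up. Whether this is what happened in your job is something only more details would show.

```python
class ToyBucket:
    """Toy model: a write lands in the store first; the index catches up later."""

    def __init__(self):
        self.store = {}     # every persisted document, keyed by id
        self.index = set()  # ids the index currently knows about
        self.pending = []   # ids written but not yet indexed

    def upsert(self, doc_id, doc):
        # The document is persisted immediately, but only queued for indexing.
        self.store[doc_id] = doc
        if doc_id not in self.index:
            self.pending.append(doc_id)

    def run_indexer(self):
        # Simulates the indexer catching up with the queued writes.
        self.index.update(self.pending)
        self.pending.clear()

    def query_ids(self):
        # A query served from the index sees only what has been indexed.
        return sorted(self.index)


bucket = ToyBucket()
bucket.upsert("a", {"v": 1})
bucket.run_indexer()
bucket.upsert("b", {"v": 2})    # persisted, but not indexed yet

before = bucket.query_ids()     # "b" is in the store but missing from the query
bucket.run_indexer()
after = bucket.query_ids()      # now the query finds both documents
```

In this sketch the first query returns only `a` even though `b` is already stored, and the second query returns both once the indexer has run.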