We have a background process that performs long polls of the _changes feed. It passes the since parameter, feeding it the last_seq value from the previous poll's results. last_seq is treated as an opaque string, and we don't pass any filters or other parameters.
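For reference, a minimal sketch of our polling pattern, assuming Sync Gateway's REST `_changes` endpoint with `feed=longpoll`; the URL and the injectable `fetch` callable are placeholders, not our actual code:

```python
import json
from urllib import request, parse

SG_URL = "http://localhost:4985/db"  # hypothetical Sync Gateway URL/db


def poll_changes(since, fetch=None):
    """One longpoll round-trip: returns (results, new last_seq).

    `since` and the returned last_seq are treated as opaque strings.
    `fetch` can be injected for testing; by default it hits the REST API.
    """
    if fetch is None:
        def fetch(url):
            with request.urlopen(url) as resp:
                return json.load(resp)
    params = {"feed": "longpoll"}
    if since is not None:
        params["since"] = since
    body = fetch(SG_URL + "/_changes?" + parse.urlencode(params))
    return body["results"], body["last_seq"]
```

Each round we simply feed the returned last_seq back in as the next `since`.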
One of our test clusters has been “missing” documents recently, and when we started to dig into the logs we found some lines we hadn’t seen before. Specifically, lines like:
- 2016-06-20T15:14:42.872Z Cache: Received unexpected out-of-order change - not in skippedSeqs (seq 838056, expecting 848757) doc “-3cgde20IM2tQptJVgC3KyB” / “1-79d850ffae862b7e3d578b756950b62a”
This is happening on documents we create locally as well (not synced from Couchbase Lite). The catch is that when a document is “missed”, it never shows up unless we “rewind” or restart the _changes feed from a much earlier starting sequence. We’ve also noticed that we’re getting very large split sequence ranges that stay open for many hours and usually encompass the range of the missed documents. In other words, we always see the “missing” documents/sequences when last_seq has the form “nnnn::mmmm”, where mmmm - nnnn has at times been greater than 10,000.
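To illustrate what we mean by a wide split range, here is a small helper (illustrative only; we otherwise treat last_seq as opaque, and the “::” form is just what we observe in our logs):

```python
def split_seq_gap(last_seq):
    """Return the width of a split sequence range like "837889::848997",
    or 0 if last_seq is a plain single sequence.
    """
    if "::" not in last_seq:
        return 0
    low, high = last_seq.split("::", 1)
    return int(high) - int(low)
```

We could use something like this to log a warning whenever the gap exceeds, say, 10,000.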
One last observation: if we look at the current _sync:seq value, say x, and immediately create a document, the sequence number assigned on this misbehaving cluster, y, is much smaller than x. For example, I just pulled _sync:seq and got 848997; I then created a document, and that document was assigned sequence 837889. As a counterpoint, on a “working” cluster I see a _sync:seq value of 651493, and a newly created document picked up sequence number 651494, as you would expect.
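That check is trivial to express; a sketch of the sanity test we are applying by hand (our assumption: on a healthy cluster a freshly created document gets a sequence above the _sync:seq value read just before the write):

```python
def seq_looks_sane(sync_seq, new_doc_seq):
    """True if a newly assigned document sequence is ahead of the
    _sync:seq value observed immediately before creating it."""
    return int(new_doc_seq) > int(sync_seq)
```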
We found references in some previous posts to issues that can occur with backup/restore, where the last sequence number gets out of whack. We’ve confirmed that errors like the ones above do not appear on our other clusters. We use spot instances for our test cluster and recently had to rebuild the cluster from backups after AWS terminated one or more spot instances.
Does anybody know more about the error above, and whether it could actually result in missed _changes when using longpoll? Is there a safe correction that can be applied, or is the cluster a wash at this point? We’re really interested in understanding what could have gone wrong to get us here.
PS: These are three-node clusters running Couchbase 4.0.0-4051-community and Sync Gateway 1.2.0 (79;9df63a5), with a Sync Gateway instance running on each node in the cluster.