Data loss on replication

Hey,

We recently experienced data loss during replication for one of our beta users and we have no idea how it happened.

We organize data by facility id: channels are defined per facility id, and on each Android device we use one database per facility. Any user with access to a facility can log on and ‘sync’ (initiate a push and a pull replication at the same time) to get any changes made remotely for that facility and/or push up changes made locally.

It’s been hard to get reliable info on what sequence of replications led up to the one where the data loss occurred. As far as I understand, one person logged into the account and synced (started push and pull replications) without any data on the device. The pull replication consisted of ~40,000 documents and no attachments; the push obviously consisted of no documents. Following this, a different user, who had generated a large quantity of new local data (which was not in the remote database at the point of the initial sync), performed a sync, after which a large number of documents were missing, all of which had been created in the last month and a half (approximately).

My sync function is:

    function (doc, oldDoc) {
        if (getType() == "Facility") {
            channel(doc._id);
        } else if (getType != "User") {
            channel(doc.facility_id);
        }

        channel(doc.channels);

        function getType() {
            return (isDelete(doc) ? oldDoc.type : doc.type);
        }

        function isCreate() {
            return (oldDoc == null && doc._deleted != true);
        }

        function isUpdate() {
            return (!isCreate(oldDoc) && !isDelete(doc));
        }

        function isDelete() {
            return (doc._deleted == true);
        }

        function validateNotEmpty(name, value) {
            if (value == null || value.length == 0 || value.trim().length == 0) {
                throw({forbidden: name + " is empty."});
            }
        }

        function validateReadOnly(name, value, oldValue) {
            if (value != oldValue) {
                throw({forbidden: name + " is read-only."});
            }
        }
    };

The other notable thing is that the server was running low on space.

I have no idea where to start with this, and am happy to post the log for the entire sync operation (one push and one pull), but it’s pretty long. Is there anything in particular that you would suggest I look for? Is it possible that my understanding about how replications are supposed to work is wrong? I know this isn’t much to go on, but I’m pretty lost.

@sam.wilks92 looks like you have a typo in the second call to getType. Also, I think you could end up with a null reference the way getType is written (slim chance that isDelete returns true and oldDoc is null).
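
To make those two points concrete, here is a minimal sketch of how they might be addressed. As posted, `getType != "User"` compares the function itself to a string, so that branch is always taken. The `channel` stub and the `assigned` array below are hypothetical and exist only so the snippet runs outside Sync Gateway, where `channel()` is built in. The name `syncFn` is also just for illustration, and the unused helper functions from the original are left out for brevity.

```javascript
// Hypothetical stand-in for Sync Gateway's built-in channel(), used only so
// this sketch can run locally. It records every channel that gets assigned.
var assigned = [];
function channel(name) {
    // Ignore null/undefined; accept a single name or an array of names.
    if (name != null) {
        [].concat(name).forEach(function (n) { assigned.push(n); });
    }
}

// Same routing logic as the posted sync function, with both fixes applied.
function syncFn(doc, oldDoc) {
    var type = getType();

    if (type == "Facility") {
        channel(doc._id);
    } else if (type != "User") {   // compares the type, not the function itself
        channel(doc.facility_id);
    }

    channel(doc.channels);

    function isDelete() {
        return doc._deleted == true;
    }

    function getType() {
        // A tombstone may arrive without a previous revision, so guard oldDoc.
        if (isDelete()) {
            return oldDoc ? oldDoc.type : undefined;
        }
        return doc.type;
    }
}
```

Inside Sync Gateway only the body of `syncFn` would be used, as an anonymous `function (doc, oldDoc) { ... }`, without the stub.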

Is there a pattern to which docs are missing (i.e. the locally created ones, or …)?

Overall your approach seems fine.

Hod

Hey @hod.greeley ,

It deleted every doc that was created after Sept 16th, both locally and remotely. That’s really the only pattern I have to go on.

@hod.greeley do you have any idea what clues I should look for in the logs?

@sam.wilks92 hmm, lots of directions to go with this. Pretty strange that everything’s gone.

Memory on SG shouldn’t be the issue except under really unusual circumstances. SG works off a memory cache, so you’ll start to slow down with garbage collection and delays loading from CB Server, but you have to really push it for it to fail. (This depends on server config like swap space. In effect, you have to be pushing it so hard that it forces the system to run out of virtual memory, at which point everything on the system will run into problems.)

  1. To confirm, where are you able to inspect for the documents? CB Server? The CB Lite local db?
  2. In the logs I’d look for errors, especially related to JavaScript.
  3. Look for doc deletion events. I don’t have it in front of me, but I think you’ll see lines with “delete” and/or “tombstone” in them.

Hod

Have you looked at the bucket using the Couchbase Server admin web UI? If the documents were deleted in the Couchbase Mobile sense, the docs will still be in the bucket; it’s just that their current contents will be a “tombstone” stub of just {"_deleted":true}.

Unfortunately the old revision will still be gone, because it expires after a while. But at least this can help us narrow down what happened.

Hey @jens, @hod.greeley

Turns out this is probably just user error - the client likely hadn’t synced their data since Sept 16th (we got incorrect reports from our sales staff). There weren’t any tombstones, there is no mechanism in our code to delete this many items, and I checked a backup we had from a few weeks before the client had their problem and there still weren’t any records following Sept 16th.

However, I’d like to confirm this in the sync gateway logs, but I’m not sure how to tell what qualifies as a push replication. There are a number of instances where it looks like they synced, but these are likely syncs into an empty database, in other words, just pulls. I can’t confirm this though, because I don’t know what to look for. (It’s also possible that the channel information was corrupted for this user at some point, in which case their data is on the server and I just don’t know how to access it.)

@sam.wilks92 A push replication that actually pushes documents will include a call to /{db}/_bulk_docs - checking the logs for a _bulk_docs call by the user in question is probably the quickest way to see whether they attempted a push replication.

Hey @adamf ,

Sorry I didn’t make this clear, I start a push replication every time the user selects the sync function in the app. Is there a way to tell if anything was actually pushed? Like a document count or something? It’s possible that push replications were initiated but no documents were ever actually pushed to the server (if a user initiates sync with an empty database they’ll end up with a pull replication, because the push won’t have anything to push up).

@sam.wilks92 Right - if the user actually has docs to push, you should see HTTP logging for a _bulk_docs request in the Sync Gateway logs, as the user in question. If the user starts a push replication but there isn’t any data to push, I don’t believe the client will generate any _bulk_docs traffic on Sync Gateway.
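
As a rough way to run that check, the sketch below counts log lines that mention a given endpoint together with a given user. It assumes each request is logged on a single line that contains both the request path and the user name. The exact Sync Gateway log format depends on version and logging settings, so the plain substring matching and the names `countRequests` and `summarize` are assumptions for illustration, not a parser for the real format.

```javascript
// Count log lines that mention both an endpoint (e.g. "_bulk_docs") and a user.
// Assumption: one request per line, with the path and user name on that line.
function countRequests(logText, endpoint, user) {
    return logText
        .split(/\r?\n/)
        .filter(function (line) {
            return line.indexOf(endpoint) !== -1 && line.indexOf(user) !== -1;
        })
        .length;
}

// Summarise the two replication endpoints for one user.
function summarize(logText, user) {
    return {
        bulkDocs: countRequests(logText, "_bulk_docs", user),
        revsDiff: countRequests(logText, "_revs_diff", user)
    };
}
```

With Node.js, the log could be loaded via `require("fs").readFileSync(path, "utf8")` and passed to `summarize` along with the user name. A non-zero `bulkDocs` count would indicate that the user sent at least one `_bulk_docs` request.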

@jens can probably shed more light on the client-side logging.

The Replication class has a property showing the number of docs transferred, and you can observe changes to it (with a platform-specific API). However, this property is meant for driving a progress bar, not for any critical app functions.

Hey @jens

Anything specifically in the logs? I don’t think I’ll be able to repro this.

Cheers

Anything specifically in the logs?

Can you be more specific? I’m not sure exactly what you’re asking here.

@jens Is there anywhere in the logs where I could see how many docs were transferred during a bulk post?

@jens @adamf Small update:

The last call to /{db}/_bulk_docs was on Oct 10th - after the expected date of Sept 16th. That said, there are only 4 calls to the endpoint on that date - I’m guessing that means there was a small push? There are, however, numerous calls to /{db}/_revs_diff after that date. I’m guessing that every replication makes numerous calls to that endpoint, but I wanted to confirm.

Thanks!