XDCR - replication skipping specific items

We’re currently investigating a problem in our couchbase server clusters where a select set of documents seems to be ignored by replication to one specific destination cluster.

No matter what we’ve tried, we can’t seem to make XDCR recognize that it should be replicating this document. We’ve tried:

  • upserting the document with the same content in the source datacenter
  • pausing and resuming the replication
  • restarting the couchbase-server service on both source and destination servers
  • deleting the problematic document from the source, waiting, and inserting the document back in
  • creating the document specifically in the datacenter where it is missing, then deleting from the source datacenter (it doesn’t get deleted from the destination cluster)
  • based on prior experience, we’ve created a different set of documents within the bucket and they all get replicated just fine (my script ensures that it inserts into every vBucket; see the sketch after this list)
  • we have multiple destination datacenters, and only one destination is missing the document
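
As a rough illustration of the vBucket-coverage part of that test (a sketch only, not the actual script; the key prefix and the assumption of the default 1024 vBuckets are mine), keys can be mapped to vBuckets with the same CRC32-based hashing that Couchbase clients use:

import zlib

NUM_VBUCKETS = 1024  # default; adjust if your cluster uses a different count

def vbucket_id(key):
    # Couchbase clients hash the key with CRC32 and take the upper bits,
    # masked to the number of vBuckets (a power of two).
    return (zlib.crc32(key.encode('utf-8')) >> 16) & (NUM_VBUCKETS - 1)

# Generate test keys until every vBucket has at least one key
covered = {}
i = 0
while len(covered) < NUM_VBUCKETS:
    key = 'xdcr-test-%d' % i  # hypothetical key prefix
    covered.setdefault(vbucket_id(key), key)
    i += 1

# 'covered' now maps each vBucket ID to a test key; upsert these in the
# source bucket and verify they all appear on each destination.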

Is there anything we can look into to see why couchbase is ignoring replication for just this one document?

In case it helps, right around the time the document was being inserted in the source, we were rebalancing the destination cluster and a server being added was powered off unexpectedly. We were able to successfully rebalance the cluster once the server was back online.

creating the document specifically in the datacenter where it is missing, then deleting from the source datacenter (it doesn’t get deleted from the destination cluster)

That to me means the target bucket’s document is winning the conflict resolution.
I would check to see if there’s any reason why the target bucket document keeps winning and thus the source doc is not being replicated, such as perhaps an application that is updating the document outside of the replication topology, etc…

Can you elaborate on how to “check if there’s a reason why the target bucket document keeps winning”? I’d like to follow through on this lead.

To clarify (sorry if this wasn’t clear):
Under normal circumstances, the document is only ever created in the source datacenter. This document would have been added to the source datacenter around the time that the target datacenter’s rebalance got interrupted. We only added it directly to the target datacenter after repeated upserts of the document in the source datacenter proved unhelpful.

The doc info on XDCR conflict resolution might be helpful – XDCR Conflict Resolution | Couchbase Docs

So, depending on the conflict resolution you are using (the default on a bucket is revId/sequence number), you can check the metadata of the documents and compare the revId and CAS values.

You can also check out the XDCR forum question below to see if any of the info there is helpful – Replication problem with Couchbase server 6.6.1
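
For reference, a rough sketch of pulling that metadata (the $document virtual xattr) from both the source and target buckets with the Python SDK – the hostnames, credentials, bucket name, and key below are placeholders:

import couchbase.subdocument as SD
from couchbase.cluster import Cluster
from couchbase.auth import PasswordAuthenticator

def doc_meta(host, user, password, bucket_name, key):
    # Open the bucket and fetch the $document virtual xattr
    # (CAS, seqno, revid where exposed, exptime, ...)
    cluster = Cluster('couchbase://' + host,
                      authenticator=PasswordAuthenticator(user, password))
    bucket = cluster.bucket(bucket_name)
    return bucket.lookup_in(key, [SD.get('$document', xattr=True)])

key = 'problem-doc-key'  # placeholder
print('source:', doc_meta('source-host', 'Administrator', 'password', 'mybucket', key))
print('target:', doc_meta('target-host', 'Administrator', 'password', 'mybucket', key))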

I grabbed the virtual extended attributes related to sequence number conflict resolution from each of our clusters. What’s interesting is that the sequence numbers are all over the place, so either I’m not getting the same sequence number that couchbase uses for conflict resolution, or the algorithm is more complex than “replicate if source[seqno] > dest[seqno]”. For example, “success1” had a higher seqno prior to the user upserting the document, while “success4” has a lower seqno even after the document is successfully replicated.

These are the xattrs prior to the user re-upserting the document into the source datacenter:

cluster   last_modified         seqno               CAS                 exptime
source    2021-09-02  02:36:38  0x0000000000023895  0x16a0e0c420b20000  0
missed    2021-09-02  02:53:54  0x000000000002d46e  0x16a0e1b566f10000  0
success1  2021-09-02  02:36:38  0x00000000000300f5  0x16a0e0c420b20000  0
success2  2021-09-02  02:36:38  0x000000000002f14c  0x16a0e0c420b20000  0
success3  2021-09-02  02:36:38  0x0000000000022a62  0x16a0e0c420b20000  0
success4  2021-09-02  02:36:38  0x000000000000e4c6  0x16a0e0c420b20000  0

These are the xattrs after the user re-upserted the document:

cluster   last_modified         seqno               CAS                 exptime
source    2021-09-08  15:28:09  0x00000000000238ae  0x16a2e2599a1e0000  0
missed    2021-09-02  02:53:54  0x000000000002d46e  0x16a0e1b566f10000  0
success1  2021-09-08  15:28:09  0x000000000003031c  0x16a2e2599a1e0000  0
success2  2021-09-08  15:28:09  0x000000000002f5f4  0x16a2e2599a1e0000  0
success3  2021-09-08  15:28:09  0x0000000000022bdd  0x16a2e2599a1e0000  0
success4  2021-09-08  15:28:09  0x000000000000e542  0x16a2e2599a1e0000  0

I reviewed the other post, and I do see that docs_failed_cr_source began increasing on the same date we started seeing this problem and is now at a constant 85 for this replication stream (it was 0). So it does seem that we’re on the right path.
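
In case it’s useful to anyone else tracking this, the stat can also be read over the REST API rather than the admin UI. A rough sketch follows; the host, credentials, bucket names, and remote cluster UUID are placeholders (the UUID comes from GET /pools/default/remoteClusters), and the stat path format is my reading of the documented XDCR REST stats, so double-check it against your server version:

import requests
from urllib.parse import quote

HOST = 'http://source-host:8091'        # placeholder
AUTH = ('Administrator', 'password')    # placeholder
SOURCE_BUCKET = 'mybucket'              # placeholder
TARGET_BUCKET = 'mybucket'              # placeholder
REMOTE_UUID = 'remote-cluster-uuid'     # from GET /pools/default/remoteClusters

# The XDCR stat name is "replications/<uuid>/<source>/<target>/<stat>",
# URL-encoded as a single path segment.
stat = quote('replications/%s/%s/%s/docs_failed_cr_source'
             % (REMOTE_UUID, SOURCE_BUCKET, TARGET_BUCKET), safe='')
url = '%s/pools/default/buckets/%s/stats/%s' % (HOST, SOURCE_BUCKET, stat)

resp = requests.get(url, auth=AUTH)
resp.raise_for_status()
print(resp.json())  # per-node samples of docs_failed_cr_source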

If you are using the default conflict resolution mode on a bucket, then the simplest check is to look at the revid (a simple counter incremented on every mutation). The default conflict resolution mode is also called “most updates wins”. In the virtual extended attributes, see “revid” – e.g. ‘revid’: ‘2’

The revId can also be seen in the Admin UI → Documents
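
To make the comparison concrete, here is a rough sketch of the order in which that metadata is compared under sequence-number (“most updates wins”) conflict resolution – revid first, then CAS, then expiration, then flags. This is only an illustration of the documented rules, not the server implementation, and the example values are made up:

def source_wins(source_meta, target_meta):
    # Compare revid, then CAS, then expiration, then flags; the source
    # mutation replaces the target copy only if it is strictly greater
    # at the first field where the two differ.
    for field in ('revid', 'cas', 'exptime', 'flags'):
        s = int(str(source_meta[field]), 0)
        t = int(str(target_meta[field]), 0)
        if s != t:
            return s > t
    return False  # identical metadata: nothing to replicate

source = {'revid': '2', 'cas': '0x16a2e2599a1e0000', 'exptime': 0, 'flags': 0}
target = {'revid': '5', 'cas': '0x16a0e1b566f10000', 'exptime': 0, 'flags': 0}
print(source_wins(source, target))  # False – the target's higher revid wins

Note that the seqno shown in the earlier tables is the per-vBucket sequence number, which is not what this comparison uses; the per-document revid is.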

We are using the “Sequence Number” conflict resolution.

Unfortunately, I don’t see rev or revid anywhere. The admin UI doesn’t allow me to see this metadata - “Warning: Editing of binary document is not allowed”. I’ve also tried N1QL (select meta() from bucket use keys [key]), but that doesn’t seem to work either. I get an empty result set for the binary document, and rev/revid is not included even for a JSON object I inserted separately.

This is the entirety of $document:

{
    "CAS": "0x16a2e2599a1e0000",
    "datatype": [
        "raw"
    ],
    "deleted": false,
    "exptime": 0,
    "last_modified": "1631114889",
    "seqno": "0x000000000002f5f4",
    "value_bytes": 440755,
    "vbucket_uuid": "0x0000217c2dd87e7b"
}

How can I get this revision ID, ideally through the Python SDK or a REST request? How can we resolve these conflicts? We are starting to see this pattern show up in other buckets and other remote datacenters, so we really need to find a solution to this problem.

I know for this specific document type, they only insert into the source cluster, and allow XDCR to replicate to each of our other datacenters. Yes, we manually attempted some upserts/deletes in the remote datacenter, but those were only AFTER we detected that the document wasn’t getting replicated.

Minor correction - my N1QL query was from the wrong bucket, but the revision id is still not available:

[
    {
        "$1": {
            "cas": 1631114889527164928,
            "expiration": 0,
            "flags": 67108868,
            "id": "<redacted>",
            "type": "base64"
        }
    }
]

The default conflict resolution (sequence number) is commonly referred to as the revId conflict resolution – just FYI. This is how I get the revid using python sdk.

$ python3 ./get_xttr.py 
SubdocResult<rc=0x0, key='airline_10', cas=0x16a454107d110000, tracing_context=140670996618832, tracing_output={}, specs=(Spec<GET, '$document', 262144>,), results=[(0, {'CAS': '0x16a454107d110000', 'datatype': ['json'], 'deleted': False, 'exptime': 0, 'flags': 0, 'last_modified': '1631521394', 'revid': '1', 'seqno': '0x00000000000001e2', 'value_bytes': 120, 'value_crc32c': '0x396ea2e5', 'vbucket_uuid': '0x00001d399caf82b3'})]>

where the contents of get_xttr.py are:
---- cut here ----
import couchbase.subdocument as SD

from couchbase.cluster import Cluster
from couchbase.auth import PasswordAuthenticator

# Connect to the cluster and open the bucket
cluster = Cluster(
    "couchbase://localhost",
    authenticator=PasswordAuthenticator(
        "Administrator",
        "mypassword"))
bucket = cluster.bucket("travel-sample")

# Fetch the $document virtual xattr, which holds the metadata (revid, seqno, CAS, ...)
key = "airline_10"
result = bucket.lookup_in(key, [SD.get("$document", xattr=True)])
print(result)
---- cut here ----

Thanks again for your reply. That is exactly how I got the results I provided before. In what version was revid added to $document? We are currently using couchbase server community 5.1. We want to upgrade, but we currently need moxi. I don’t want to derail this conversation, but I want to be clear that we can’t simply upgrade to solve this.

While I realize knowing the revid would be quite helpful in confirming this is definitely what’s happening here, is there anything we can do to force replication from the source to the target (bypassing conflict resolution)?

The concern we have right now is that we have a number of documents that are not getting replicated due to this conflict resolution (a few more have been recently detected). While we can manually create documents on the target side, that does nothing to allow the next update to get replicated.

Sounds like you need the target to purge the document metadata after deleting the document (i.e. purge tombstones) so that there will be nothing on the target for the source document (with same doc key) to be in conflict with. You can review the Couchbase docs for your version on when tombstones are purged.

I don’t see any documentation regarding tombstones in couchbase 5.1. Based on the documentation for 5.5, it would be the metadata purge interval, which we have set to 3 days.

That doesn’t seem to solve the problem, though. To keep this simple, I’ll refer to datacenters A, B, C, and D. We have XDCR replication from A->B, A->C, A->D. All four datacenters have the same 3-day metadata purge interval.

  1. 2021-09-01 - document was inserted into A: document was missing in B and C, but successfully replicated to D.
  2. 2021-09-02 - document was manually upserted into B: document is now present in A, B and D, yet still missing from C.
  3. 2021-09-08 - document was upserted into A: document in B had not been changed, document in A and D are the same, document is still missing in C.

With the timeline above, there would have been 2 metadata purge intervals somewhere between 2021-09-01 and 2021-09-08, so any tombstone present on C would have been purged prior to the upsert on 2021-09-08.

  • Is there any way to see if a tombstone exists for a given document?
  • Is there any way to get a list of all tombstones for a bucket?
  • Is there any way to force metadata to be purged immediately?
  • Is there any way to see what documents are being skipped due to conflict resolution?
  • Why would the docs_failed_cr_source stat only increase over time? It has not dropped (even by 1) in the past week.
    • Is this stat really a counter that is being displayed as if it were a gauge/rate in the admin UI?
    • Or is couchbase repeatedly trying and skipping the same documents over and over?

I really appreciate your help with this. The documentation seems to lack guidance on how to gather this detailed technical information. It’s good at describing the concepts and how couchbase uses them, but not how to inspect them.

@hyunjuV / @neilhuang - any guidance you can provide regarding these questions?

  • Is there any way to see if a tombstone exists for a given document?
  • Is there any way to get a list of all tombstones for a bucket?
  • Is there any way to force metadata (tombstones) to be purged immediately?
  • Is there any way to see what documents are being skipped due to conflict resolution?
  • Why would the docs_failed_cr_source stat only increase over time? It has not dropped (even by 1) in the past week.
    • Is this stat really a counter that is being displayed as if it were a gauge/rate in the admin UI?
    • Or is couchbase repeatedly trying and skipping the same documents over and over?
  1. A tombstone is just a deleted document (a doc with an empty/null body) – so, if the metadata for the doc key/id exists and shows “deleted”: true, then that would be a tombstone. I think that in one of your outputs, the $document showed “deleted”: false – so, clearly, that document was not a tombstone (since the metadata says the document has not been deleted).
  2. docs_failed_cr_source should be showing you the total count (running total) since that replication spec started.
  3. You can change the purge interval for tombstones but should be careful, as noted in the documentation. https://docs.couchbase.com/server/5.1/settings/configure-compact-settings.html#metadata-purge-interval
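
If it helps, the cluster-wide purge interval can also be read over REST rather than through the UI. A rough sketch, assuming the /settings/autoCompaction endpoint; the host and credentials are placeholders, and the exact field names should be checked against your server version:

import requests

HOST = 'http://source-host:8091'        # placeholder
AUTH = ('Administrator', 'password')    # placeholder

resp = requests.get(HOST + '/settings/autoCompaction', auth=AUTH)
resp.raise_for_status()
settings = resp.json()
print(settings)                                          # full auto-compaction settings
print('purgeInterval:', settings.get('purgeInterval'))   # tombstone purge interval, in days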

In the code this is typed as a Prometheus MetricTypeCounter, which is documented here:

" A counter is a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart."

There are also some comments in the code referencing this, saying:

docs failed source side conflict resolution. in this case the docs will be counted in docs_failed_cr_source stats

docs that will get rejected by target for other reasons, e.g., since target no longer owns the vbucket involved. in this case the docs will not be counted in docs_failed_cr_source stats

I just asked the user to re-upsert the document that has been missing from one of the remote clusters since 2021-09-02. The document is still not getting replicated. At that time, the docs_failed_cr_source stat did increment by one. To me, this suggests that a tombstone (or something like it) is not being cleaned up by the metadata purge interval.

Note that I do not seem to have a way to view tombstones (even immediately after deletion). Using this code, I get a DocumentNotFoundException:

import couchbase.subdocument

# 'source' is the already-opened bucket on the source cluster
source.quiet = True
key = 'conflict-tombstone-test'
meta_lookup = couchbase.subdocument.get('$document', xattr=True)

# Create the document, delete it, then try to read the tombstone's metadata
source.upsert(key, key, ttl=7*86400)
source.remove(key)
meta = source.lookup_in(key, [meta_lookup])
print('Metadata:', meta)

Results in:

couchbase.exceptions.DocumentNotFoundException: <Key='conflict-tombstone-test', RC=0x12D[LCB_ERR_DOCUMENT_NOT_FOUND (301)], Operational Error, Results=1, C Source=(src/multiresult.c,316), Context={'status_code': 1, 'opaque': 2, 'cas': 0, 'key': 'conflict-tombstone-test', 'bucket': 'Static', 'collection': '_default', 'scope': '_default', 'context': '', 'ref': '', 'endpoint': ':11210', 'type': 'KVErrorContext'}, Tracing Output={"conflict-tombstone-test": {"debug_info": {"FILE": "src/callbacks.c", "FUNC": "subdoc_callback", "LINE": 985}}}>