We have an issue with Couchbase replication, and are now wondering if our expectations are perhaps a bit off.
The relevant parts of our stack are:
- Couchbase - version 5.0.x Community edition, clustered
- Kafka Connect with the Couchbase Source connector
- Our application
The flow of data and logic is as follows:
- Data comes into Couchbase - documents being written
- Kafka Connect Couchbase source connector picks up changes and writes events to a topic in Kafka
- Our application sees the events in the Kafka topic and starts reading from Couchbase
Our setup normally consists of a single Couchbase cluster in a single data center, and this has been working fine for a long time. Recently, however, one of our clients has wanted to expand into a secondary data center. Our first thought was to let Couchbase handle all the replication, since everything else in the setup is derived in one way or another from Couchbase data (more or less). We set this up using uni-directional replication (master -> slave).
What we have been experiencing however is that while we do get the events in the slave cluster as we would expect, the order is not the same as in the master cluster. I can’t be sure whether this means the documents in Couchbase are written in this order or whether the issue lies with the source connector and the events, but I’m guessing Couchbase actually replicates vbucket by vbucket separately, and thus this is actually “expected” behaviour.
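To illustrate why we suspect per-vbucket ordering: as far as we can tell from the client SDK sources, each document key is mapped to a vbucket by a CRC32 hash of the key. The sketch below is our understanding rather than anything confirmed by the documentation; the function name, the 1024-vbucket default and the exact bit manipulation are assumptions on our part.

```python
import zlib

# Assumption: 1024 vbuckets per bucket (we believe this is the usual default).
NUM_VBUCKETS = 1024


def vbucket_for_key(key: str, num_vbuckets: int = NUM_VBUCKETS) -> int:
    """Map a document key to a vbucket id (our understanding, not an official API)."""
    # CRC32 of the UTF-8 key bytes, forced to an unsigned 32-bit value.
    crc = zlib.crc32(key.encode("utf-8")) & 0xFFFFFFFF
    # Assumption: the SDKs use the upper bits of the checksum, then reduce
    # the result to the number of vbuckets.
    return ((crc >> 16) & 0x7FFF) % num_vbuckets
```

If this is roughly right, the 3 documents of one logical entity will usually land in different vbuckets unless their keys happen to hash to the same one, which would explain why they can arrive in the slave cluster in a different order.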
It’s a bit hard to find exact details on how Couchbase replication works (at least in the main documentation), but that might be because, as users, we are not supposed to care about the details.
Now, our application doesn’t exactly depend on the order of events, but there are some dependencies between documents in Couchbase: one single logical entity consists of 3 separate documents. What seems to be happening is that one of the documents (the one triggering the Kafka event the application cares about) is written to Couchbase, and the resulting event reaches the application before the other documents have been replicated.
What are the semantics of Couchbase replication? I realize we probably shouldn’t depend on a certain replication order always holding, and we can certainly handle this case with fallbacks and retries, but if there is something intrinsically incorrect in our understanding of Couchbase replication that makes this kind of behaviour unfeasible, we certainly need to educate ourselves. Also, what kind of “lag” / latency can we realistically expect in an otherwise fully working and connected environment?
If the issue here lies with the replication being done on a vbucket level, I guess the only real solution is either to fall back / retry until the documents do exist, or to make sure all linked documents are hashed into the same vbucket?
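For reference, the fallback / retry option we have in mind would look roughly like the sketch below. It is only a minimal illustration: `fetch_entity`, `get_doc`, the key names and the backoff parameters are all hypothetical, and `get_doc` stands in for whatever lookup the application uses against the slave cluster (assumed here to return `None` when a document has not been replicated yet).

```python
import time
from typing import Callable, Dict, Iterable, Optional


def fetch_entity(
    get_doc: Callable[[str], Optional[dict]],
    keys: Iterable[str],
    attempts: int = 5,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, dict]:
    """Return every document of one logical entity, retrying while any is missing."""
    keys = list(keys)
    found: Dict[str, dict] = {}
    for attempt in range(attempts):
        for key in keys:
            if key not in found:
                # Hypothetical lookup; None means "not replicated yet".
                doc = get_doc(key)
                if doc is not None:
                    found[key] = doc
        if len(found) == len(keys):
            return found
        if attempt < attempts - 1:
            # Exponential backoff before the next attempt.
            sleep(base_delay * 2 ** attempt)
    missing = [k for k in keys if k not in found]
    raise LookupError(f"documents still missing after {attempts} attempts: {missing}")
```

The application would call something like `fetch_entity(get_doc, ["order::1", "order::1::lines", "order::1::customer"])` when the Kafka event arrives, and only process the entity once all 3 documents are present.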
Thanks in advance for any pointers or tips!