I am doing lots of small appends to Couchbase documents (binary) until they don’t reach the size of about 50Kb and then I start a new document. Reaching the 50Kb takes about 5000 appends to the given document.
Looking at the usage stats, as the document is growing and getting closer to 50Kb the replication between the Couchbase nodes start taking up more and more bandwidth and when the process rolls over the document and starts writing into a new document and appending to that then the bandwidth usage drops and starts creeping up again as the document’s size grows.
This tells me that at each append the whole document is being replicated across the nodes (that seems the be the only explanation why the same append takes up more and more bandwidth as the document size grows).
This sounds like a bad design - making appends operations very non-optimal (having the whole document replicated at each append).
Would be my understanding incorrect? Any way to get around this, to make this more efficient?
you are right. with existing versions of couchbase server, we do replicate the full document. There is obvious benefit in network efficiency to replicate less data. However it isn’t free to be smart about replicated differences between full vs delta changes between all sorts of updates. My question is, would be ok trading network efficiency for higher latency? or higher cpu utilization?
Yes, I understand that there could be some tradeoffs, although in an append-only document type that might not necessarily be the case. For example you can take a look at how Apache Kafka runs an append-only data structure to keep appending to a “topic” in a distributed cluster without rewriting the whole topic at each append on each cluster causing high I/O and high inter-node network traffic.
In either case, I am not saying that Couchbase necessarily needs to implement what Kafka is doing, as for that we can just use Kafka. I was just looking for a confirmation for the behavior I was seeing, so I can design around it (by making smaller, “caching” appending docs which would then roll up to bigger “storage” append docs, which would get append to in bulk).
Again, thanks for confirming it! Now I know how to work around this.
Ps.: Maybe it would be beneficial for others to mention this in the “raw append” part of the documentation (http://developer.couchbase.com/documentation/server/4.0/developer-guide/raw-append-prepend.html), which is even mentioning the raw append to store logs in Couchbase documents and considers this “efficient”. Maybe a note here saying that this is only “efficient” for the client (nor read of the whole doc and then write), not necessarily for the nodes (where there is a whole doc read and write). There is a note about increasing document size here, but making the nodes’ I/O and network traffic penalty clear might help others to understand it better too.