(I should caveat this by saying that I’m a relatively new Couchbase employee. These are my personal opinions, and not necessarily the official Couchbase line.)
Quite honestly, I feel that achieving absolute 100% certainty on a write with any database is a very tough ask, and one that takes a lot of work at the application layer. Forget Couchbase and transactions for a sec and pretend you’re writing a single SQL UPDATE to a classical single-node RDBMS over a local network, something of an ideal case for durability. Even here, you can hit problems. Say you send the UPDATE, but then your application immediately crashes (or the network goes down). What do you do now when the app comes back up (or the network comes back)? That update may or may not have reached the database, and you may or may not have lost it - no fault of the database, simply a reality of networks and code being fallible.
If you absolutely can’t lose an update under any perfect storm of edge cases, no matter what, then it’s going to be a lot of work. I’m not even certain it’s possible to get there, but I’d probably start by maintaining some sort of persistent log, where I store mutations before putting them in the database. That is, before you write the mutation to the database, you write it to the log. Then if your application crashes and resumes, it can read this log, find any mutations that the database didn’t acknowledge, and read the database to check whether each of those mutations was successfully written or not. (You may wonder why the database doesn’t maintain that log for you, but you have similar problems - what if you ask the database to write, then your app immediately crashes? You have no idea if that mutation made it to the database’s log or not.)
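A rough sketch of that log-then-write idea, under some loud assumptions: the log is a local JSON-lines file, the "database" is just a Python dict, and `MutationLog` and `recover` are names made up for illustration rather than anything from a Couchbase SDK. It only shows the bookkeeping; it doesn't close the gaps discussed in the surrounding text.

```python
import json
import os
import tempfile
import uuid


class MutationLog:
    """Append-only JSON-lines log of intended and acknowledged mutations."""

    def __init__(self, path):
        self.path = path

    def _append(self, record):
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the record to disk

    def record_intent(self, doc_id, value):
        """Log a mutation BEFORE sending it to the database."""
        mutation_id = str(uuid.uuid4())
        self._append({"type": "intent", "id": mutation_id,
                      "doc_id": doc_id, "value": value})
        return mutation_id

    def record_ack(self, mutation_id):
        """Log that the database acknowledged the mutation."""
        self._append({"type": "ack", "id": mutation_id})

    def unacknowledged(self):
        """Return logged intents that have no matching ack, in log order."""
        if not os.path.exists(self.path):
            return []
        intents, acked = {}, set()
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    continue  # torn line, e.g. from a crash mid-write
                if record["type"] == "intent":
                    intents[record["id"]] = record
                else:
                    acked.add(record["id"])
        return [r for mid, r in intents.items() if mid not in acked]


def recover(log, database):
    """On restart, check each unacknowledged mutation against the database.

    `database` is a plain dict standing in for a real read. Mutations that
    did land are acked in the log; the rest are returned so the caller can
    decide whether to reapply them.
    """
    missing = []
    for record in log.unacknowledged():
        if database.get(record["doc_id"]) == record["value"]:
            log.record_ack(record["id"])
        else:
            missing.append(record)
    return missing
```

In normal operation the app would call `record_intent`, perform the write, then call `record_ack`; `recover` runs once at startup. Comparing the stored value with the logged value only makes sense when nothing else could have changed the document in the meantime, which is an assumption of this sketch.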
Of course, there are still problems: what if your app crashes before you can write the mutation to the log? But say you can solve that, perhaps by storing all end-user interactions as soon as possible into a local persistent event source queue. Though of course, you could have the hard drive fail just after you attempt the write, and then your app crashes…
You may be seeing where I’m coming from. This is just trying to write a single SQL UPDATE into a single-node RDBMS, and achieving the 100% durability you want across the whole system is already near impossible. Add a distributed database and multiple-document transactions into the mix, and it gets much harder. And to stress the point, you’re not trying to solve a Couchbase problem here, you’re trying to solve the fundamentally very hard problem of durability in the face of unreliable networks, hardware, and applications.
Ultimately, I feel a human factor needs to be considered with durability. The end-user, waiting for her account transfer on the banking website to complete, will get frustrated at the lack of confirmation, refresh the webpage, see it’s not gone through, and try it again (or call support). Your automated systems will detect two identical transactions in quick succession and flag them for human review. That kind of thing.
Maybe that’s all a bit wishy-washy and theoretical. Taking your specific question, i.e. what to do if a durable Couchbase mutation fails:
If it’s an idempotent mutation, try it again. And aim to be idempotent as much as possible. E.g. if you’ve got an amount to debit, perhaps create a key based on the event’s time plus the amount, and write that as a subdoc upsert into a map, so that repeating the same write leaves the document unchanged.
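A minimal sketch of that idea, using a plain Python dict in place of a Couchbase document so it runs on its own. The `debits` map name, the key format, and the helper names are all assumptions for illustration; in a real application the assignment inside `upsert_debit` would be a sub-document upsert at a path like `debits.<key>`.

```python
from datetime import datetime, timezone


def debit_key(event_time, amount_cents):
    """Derive a stable map key from the event's time and amount.

    Epoch milliseconds are used so the key contains no dots, which a
    sub-document path would treat as separators.
    """
    millis = int(event_time.timestamp() * 1000)
    return f"{millis}_{amount_cents}"


def upsert_debit(account_doc, event_time, amount_cents):
    """Upsert one debit into the document's 'debits' map.

    Applying the same event twice writes the same key with the same
    value, so a retry cannot debit the account a second time.
    """
    debits = account_doc.setdefault("debits", {})
    debits[debit_key(event_time, amount_cents)] = amount_cents
    return account_doc


def total_debited(account_doc):
    """Sum all recorded debits; the balance is derived, not decremented."""
    return sum(account_doc.get("debits", {}).values())


# A retried write leaves the document exactly as the first write did.
doc = {"opening_balance_cents": 10000}
when = datetime(2024, 1, 15, 12, 0, 0, tzinfo=timezone.utc)
upsert_debit(doc, when, 2500)
upsert_debit(doc, when, 2500)  # retry after a failed or unacknowledged write
```

One limitation of keying on time plus amount: two genuinely separate debits of the same amount at the same millisecond would collapse into one entry, so a unique event ID may be a safer key where one is available.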
Now, if that fails to durably write, you can just retry that subdoc upsert.
If you can’t make it idempotent, it’s tricky. Say your write failed, your app crashed, you’ve restarted and looked at your log and found that a mutation wasn’t acknowledged. I think you’d need to do getFromReplica on all nodes to see if it was written to all. If not, you’d work out what the doc should look like (no subdoc mutations), and write the full doc to make sure all nodes are correctly set.
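Roughly, that check-then-rewrite step could look like the following. This is only a sketch: `read_all_copies` and `write_full_doc` are hypothetical callables standing in for "read the document from the active node and every replica" and "durably write the whole document", and it assumes nothing else is modifying the document while the repair runs.

```python
def copies_agree(copies, expected_doc):
    """True only if every copy (active and replicas) equals the expected doc."""
    return all(copy == expected_doc for copy in copies)


def repair(read_all_copies, write_full_doc, doc_id, expected_doc):
    """Rewrite the whole document unless every copy already matches it.

    Returns True if a rewrite was issued, False if nothing needed doing.
    An empty result from read_all_copies is treated as "unknown", so the
    document is rewritten in that case too.
    """
    copies = read_all_copies(doc_id)
    if copies and copies_agree(copies, expected_doc):
        return False
    write_full_doc(doc_id, expected_doc)  # full doc, not a subdoc mutation
    return True
```

Writing the full document sidesteps the question of whether the original non-idempotent mutation was half-applied, at the cost of overwriting any concurrent change, which is why the no-other-writers assumption matters here.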
Anyway, that was a very long answer, apologies. But this is a very complex topic.