Couchbase Transactions in Java API 1.1.3 fails when multiple transactions are running in parallel

Server Couchbase 7.0 Beta

Test code below:

val lockingWhatnot = ensureUpserted[Barcode, Whatnot](Whatnot.dummy)
// code will block until above item was retrieved back from Couchbase
// lockingWhatnot is really in the database

      (1 to 20).par.foreach {
        _ =>
          Whatnots.inTransaction {
            implicit $ =>
              Whatnots.inTransaction.insert(Whatnot.dummy)

              Whatnots.inTransaction.replace {
                Whatnots.inTransaction.need(lockingWhatnot.ID)
              }
              Whatnots.inTransaction.insert(Whatnot.dummy)
          }()
      }
    }

Where inTransactions:


  def inTransaction[R](f: AttemptContext => R)(t: Throwable => Boolean = _ => true): R = {
    var result: Option[R] = None
    var transactionID: Option[String] = None
    try {
      transactions.run {
        (t: AttemptContext) =>
          transactionID = Some(t.transactionId())
          result = Some(f(t))
      }
    } catch {
      case f: TransactionFailed =>
        f.result().log().logs().toArray.foreach(System.err.println(_))
        if (Option(f.getCause).isDefined && t(f.getCause)) {
          throw f.getCause
        } else {
          throw f
        }
      case f: Throwable =>
        throw f
    } finally {
      transactionID.foreach(cleanupTransaction)
    }
    result.get
  }

Produces the below logs:

23:09:11.080 57/scala-execution-context-global-57/f4b7a config: atrs=20480, expiry=30000msecs durability=MAJORITY per-txn config= durability=Optional.empty, supported=Supported {extensions=[EXT_TRANSACTION_ID, EXT_DEFERRED_COMMIT, EXT_TIME_OPT_UNSTAGING, EXT_BINARY_METADATA, EXT_CUSTOM_METADATA, EXT_QUERY], protocol=2.0}
23:09:11.081 57/scala-execution-context-global-57/f4b7a/e6516 First mutated doc in txn is 'default.default.cihcs-Whatnot_c.hub.constraint:𪜂𣻕ᘲ漳简𩞿뵣𥵞⼟𫼗𤜣𨀄屄ę𦝇𬙌옺𒃠ཟ𦓚ﭭ' on vbucket 156, so using atr _txn:atr-156-#36e8
23:09:11.081 57/scala-execution-context-global-57/f4b7a/e6516 expiration time remaining is 30000 millis (capped from 30000)
23:09:11.081 57/scala-execution-context-global-57/f4b7a/e6516 about to set ATR default:default.transactions._txn:atr-156-#36e8 to Pending
23:09:11.119 57/scala-execution-context-global-57/f4b7a/e6516 set ATR default:default.transactions._txn:atr-156-#36e8 to Pending in 38ms, got CAS (start time) cas=1606774150946357248,seqno=315,vbucket=156
23:09:11.119 57/scala-execution-context-global-57/f4b7a/e6516 about to insert staged doc default.default.cihcs-Whatnot_c.hub.constraint:𪜂𣻕ᘲ漳简𩞿뵣𥵞⼟𫼗𤜣𨀄屄ę𦝇𬙌옺𒃠ཟ𦓚ﭭ as shadow document, cas=Optional.empty
23:09:11.151 57/scala-execution-context-global-57/f4b7a/e6516 inserted doc default.default.cihcs-Whatnot_c.hub.constraint:𪜂𣻕ᘲ漳简𩞿뵣𥵞⼟𫼗𤜣𨀄屄ę𦝇𬙌옺𒃠ཟ𦓚ﭭ got cas=1606774150978994176,seqno=317,vbucket=156, in 31ms
23:09:11.152 57/scala-execution-context-global-57/f4b7a/e6516 about to insert staged doc default.default.cihcs-Whatnot_e.fvp3tpTSfG5wwZlfhbGyRgKoJF2pf37V as shadow document, cas=Optional.empty
23:09:11.200 57/scala-execution-context-global-57/f4b7a/e6516 inserted doc default.default.cihcs-Whatnot_e.fvp3tpTSfG5wwZlfhbGyRgKoJF2pf37V got cas=1606774151020019712,seqno=321,vbucket=428, in 48ms
23:09:11.200 57/scala-execution-context-global-57/f4b7a/e6516 getting doc default.default.cihcs-Whatnot_e.2OxjCDuTU4M9uUsCGsnOmzacYOm47ABg, resolvingMissingATREntry=<empty>
23:09:11.232 57/scala-execution-context-global-57/f4b7a/e6516 doc default.default.cihcs-Whatnot_e.2OxjCDuTU4M9uUsCGsnOmzacYOm47ABg is in a transaction, looking up its status from ATR (MAV read)
23:09:11.263 57/scala-execution-context-global-57/f4b7a/e6516 got error while getting doc default.default.cihcs-Whatnot_e.2OxjCDuTU4M9uUsCGsnOmzacYOm47ABg: {err=ActiveTransactionRecordNotFound,ec=FAIL_OTHER,msg='43d01a06-cada-423f-942a-cbb5be7cc491'}
23:09:11.263 57/scala-execution-context-global-57/f4b7a/e6516 caught exception 'TransactionOperationFailed {ec:FAIL_OTHER, cause:com.couchbase.transactions.error.attempts.ActiveTransactionRecordNotFound, retry:false, autoRollback:true, raise:TRANSACTION_FAILED}' in runBlocking, rethrowing
23:09:11.263 57/scala-execution-context-global-57/f4b7a/e6516           com.couchbase.transactions.error.internal.TransactionOperationFailedBuilder.build(TransactionOperationFailedBuilder.java:59)
23:09:11.263 57/scala-execution-context-global-57/f4b7a/e6516           com.couchbase.transactions.AttemptContextReactive.lambda$null$14(AttemptContextReactive.java:606)
23:09:11.263 57/scala-execution-context-global-57/f4b7a/<> finishing attempt off after error 'TransactionOperationFailed {ec:FAIL_OTHER, cause:com.couchbase.transactions.error.attempts.ActiveTransactionRecordNotFound, retry:false, autoRollback:true, raise:TRANSACTION_FAILED}'
23:09:11.263 57/scala-execution-context-global-57/f4b7a/e6516 auto-rolling-back on error
23:09:11.263 57/scala-execution-context-global-57/f4b7a/e6516 rollback AttemptContextReactive{id=e6516,state=ROLLED_BACK,atr=default:default.transactions._txn:atr-156-#36e8,staged=[INSERT default.default.cihcs-Whatnot_c.hub.constraint:𪜂𣻕ᘲ漳简𩞿뵣𥵞⼟𫼗𤜣𨀄屄ę𦝇𬙌옺𒃠ཟ𦓚ﭭ, INSERT default.default.cihcs-Whatnot_e.fvp3tpTSfG5wwZlfhbGyRgKoJF2pf37V]} expiryOvertimeMode=false isAppRollback=false
23:09:11.096 60/scala-execution-context-global-60/67044 starting attempt 0/6704485b-ab33-41d4-9f28-e066de1e028b/ccc7b28a-f5fd-4b48-906e-da84401d05e9
23:09:11.096 60/scala-execution-context-global-60/67044 config: atrs=20480, expiry=30000msecs durability=MAJORITY per-txn config= durability=Optional.empty, supported=Supported {extensions=[EXT_TRANSACTION_ID, EXT_DEFERRED_COMMIT, EXT_TIME_OPT_UNSTAGING, EXT_BINARY_METADATA, EXT_CUSTOM_METADATA, EXT_QUERY], protocol=2.0}
23:09:11.263 57/scala-execution-context-global-57/f4b7a/e6516 aborting ATR default:default.transactions._txn:atr-156-#36e8 isAppRollback=false ambiguityResolutionMode=false
23:09:11.097 60/scala-execution-context-global-60/67044/ccc7b First mutated doc in txn is 'default.default.cihcs-Whatnot_c.hub.constraint:𐬇𝔓蠥𑒛뼌𨏐𓂙𤸿谧𗇲顯𦬗𘉲𡬧䤮𢇥𫽑𥥣콴' on vbucket 414, so using atr _txn:atr-414-#aeeb
23:09:11.298 57/scala-execution-context-global-57/f4b7a/e6516 aborted ATR default:default.transactions._txn:atr-156-#36e8
23:09:11.097 60/scala-execution-context-global-60/67044/ccc7b expiration time remaining is 29999 millis (capped from 29999)
23:09:11.298 57/scala-execution-context-global-57/f4b7a/e6516 deleting inserted doc default.default.cihcs-Whatnot_c.hub.constraint:𪜂𣻕ᘲ漳简𩞿뵣𥵞⼟𫼗𤜣𨀄屄ę𦝇𬙌옺𒃠ཟ𦓚ﭭ with cas 1606774150978994176
23:09:11.097 60/scala-execution-context-global-60/67044/ccc7b about to set ATR default:default.transactions._txn:atr-414-#aeeb to Pending
23:09:11.123 60/scala-execution-context-global-60/67044/ccc7b set ATR default:default.transactions._txn:atr-414-#aeeb to Pending in 25ms, got CAS (start time) cas=1606774150947733504,seqno=340,vbucket=414
23:09:11.321 57/scala-execution-context-global-57/f4b7a/e6516 deleted inserted doc default.default.cihcs-Whatnot_c.hub.constraint:𪜂𣻕ᘲ漳简𩞿뵣𥵞⼟𫼗𤜣𨀄屄ę𦝇𬙌옺𒃠ཟ𦓚ﭭ, mt Optional[mt{vbID=156, vbUUID=247073107826988, seqno=321, bucket=default}]
23:09:11.123 60/scala-execution-context-global-60/67044/ccc7b about to insert staged doc default.default.cihcs-Whatnot_c.hub.constraint:𐬇𝔓蠥𑒛뼌𨏐𓂙𤸿谧𗇲顯𦬗𘉲𡬧䤮𢇥𫽑𥥣콴 as shadow document, cas=Optional.empty
23:09:11.321 57/scala-execution-context-global-57/f4b7a/e6516 deleting inserted doc default.default.cihcs-Whatnot_e.fvp3tpTSfG5wwZlfhbGyRgKoJF2pf37V with cas 1606774151020019712
23:09:11.154 60/scala-execution-context-global-60/67044/ccc7b inserted doc default.default.cihcs-Whatnot_c.hub.constraint:𐬇𝔓蠥𑒛뼌𨏐𓂙𤸿谧𗇲顯𦬗𘉲𡬧䤮𢇥𫽑𥥣콴 got cas=1606774150980435968,seqno=342,vbucket=414, in 30ms
23:09:11.362 57/scala-execution-context-global-57/f4b7a/e6516 deleted inserted doc default.default.cihcs-Whatnot_e.fvp3tpTSfG5wwZlfhbGyRgKoJF2pf37V, mt Optional[mt{vbID=428, vbUUID=56783730644498, seqno=323, bucket=default}]
23:09:11.155 60/scala-execution-context-global-60/67044/ccc7b about to insert staged doc default.default.cihcs-Whatnot_e.NeR2huD3LRxweN3mFuI9cEhhVOzApW5w as shadow document, cas=Optional.empty
23:09:11.363 57/scala-execution-context-global-57/f4b7a/e6516 marking ATR default:default.transactions._txn:atr-156-#36e8 as rollback complete
23:09:11.400 57/scala-execution-context-global-57/f4b7a/e6516 rollback - atr rolled back
23:09:11.202 60/scala-execution-context-global-60/67044/ccc7b inserted doc default.default.cihcs-Whatnot_e.NeR2huD3LRxweN3mFuI9cEhhVOzApW5w got cas=1606774151020937216,seqno=384,vbucket=947, in 47ms
23:09:11.400 57/scala-execution-context-global-57/f4b7a/e6516 Adding cleanup request for transactions/_txn:atr-156-#36e8 to run in 30000 msecs
23:09:11.202 60/scala-execution-context-global-60/67044/ccc7b getting doc default.default.cihcs-Whatnot_e.2OxjCDuTU4M9uUsCGsnOmzacYOm47ABg, resolvingMissingATREntry=<empty>
23:09:11.400 57/scala-execution-context-global-57/f4b7a/e6516 added attempt TransactionAttempt{id=e6516,state=ROLLED_BACK,atr=default:default.transactions._txn:atr-156-#36e8} after error
23:09:11.238 60/scala-execution-context-global-60/67044/ccc7b doc default.default.cihcs-Whatnot_e.2OxjCDuTU4M9uUsCGsnOmzacYOm47ABg is in a transaction, looking up its status from ATR (MAV read)
23:09:11.400 57/scala-execution-context-global-57/f4b7a/<> reraising original exception 'TransactionOperationFailed' com.couchbase.transactions.error.attempts.ActiveTransactionRecordNotFound: 43d01a06-cada-423f-942a-cbb5be7cc491
23:09:11.400 57/scala-execution-context-global-57/f4b7a TransactionOperationFailed retryTransaction=false
23:09:11.265 60/scala-execution-context-global-60/67044/ccc7b got error while getting doc default.default.cihcs-Whatnot_e.2OxjCDuTU4M9uUsCGsnOmzacYOm47ABg: {err=ActiveTransactionRecordNotFound,ec=FAIL_OTHER,msg='43d01a06-cada-423f-942a-cbb5be7cc491'}
23:09:11.400 57/scala-execution-context-global-57/f4b7a/e6516 converted TransactionOperationFailed TRANSACTION_FAILED to final error com.couchbase.transactions.error.TransactionFailed: Transaction has failed with cause 'com.couchbase.transactions.error.attempts.ActiveTransactionRecordNotFound: 43d01a06-cada-423f-942a-cbb5be7cc491'
23:09:11.265 60/scala-execution-context-global-60/67044/ccc7b caught exception 'TransactionOperationFailed {ec:FAIL_OTHER, cause:com.couchbase.transactions.error.attempts.ActiveTransactionRecordNotFound, retry:false, autoRollback:true, raise:TRANSACTION_FAILED}' in runBlocking, rethrowing
23:09:11.265 60/scala-execution-context-global-60/67044/ccc7b           com.couchbase.transactions.error.internal.TransactionOperationFailedBuilder.build(TransactionOperationFailedBuilder.java:59)
23:09:11.265 60/scala-execution-context-global-60/67044/ccc7b           com.couchbase.transactions.AttemptContextReactive.lambda$null$14(AttemptContextReactive.java:606)
23:09:11.265 60/scala-execution-context-global-60/67044/<> finishing attempt off after error 'TransactionOperationFailed {ec:FAIL_OTHER, cause:com.couchbase.transactions.error.attempts.ActiveTransactionRecordNotFound, retry:false, autoRollback:true, raise:TRANSACTION_FAILED}'
23:09:11.265 60/scala-execution-context-global-60/67044/ccc7b auto-rolling-back on error
23:09:11.265 60/scala-execution-context-global-60/67044/ccc7b rollback AttemptContextReactive{id=ccc7b,state=ROLLED_BACK,atr=default:default.transactions._txn:atr-414-#aeeb,staged=[INSERT default.default.cihcs-Whatnot_c.hub.constraint:𐬇𝔓蠥𑒛뼌𨏐𓂙𤸿谧𗇲顯𦬗𘉲𡬧䤮𢇥𫽑𥥣콴, INSERT default.default.cihcs-Whatnot_e.NeR2huD3LRxweN3mFuI9cEhhVOzApW5w]} expiryOvertimeMode=false isAppRollback=false
23:09:11.265 60/scala-execution-context-global-60/67044/ccc7b aborting ATR default:default.transactions._txn:atr-414-#aeeb isAppRollback=false ambiguityResolutionMode=false
23:09:11.301 60/scala-execution-context-global-60/67044/ccc7b aborted ATR default:default.transactions._txn:atr-414-#aeeb
23:09:11.301 60/scala-execution-context-global-60/67044/ccc7b deleting inserted doc default.default.cihcs-Whatnot_c.hub.constraint:𐬇𝔓蠥𑒛뼌𨏐𓂙𤸿谧𗇲顯𦬗𘉲𡬧䤮𢇥𫽑𥥣콴 with cas 1606774150980435968
23:09:11.339 60/scala-execution-context-global-60/67044/ccc7b deleted inserted doc default.default.cihcs-Whatnot_c.hub.constraint:𐬇𝔓蠥𑒛뼌𨏐𓂙𤸿谧𗇲顯𦬗𘉲𡬧䤮𢇥𫽑𥥣콴, mt Optional[mt{vbID=414, vbUUID=229571635143498, seqno=346, bucket=default}]
23:09:11.340 60/scala-execution-context-global-60/67044/ccc7b deleting inserted doc default.default.cihcs-Whatnot_e.NeR2huD3LRxweN3mFuI9cEhhVOzApW5w with cas 1606774151020937216
23:09:11.368 60/scala-execution-context-global-60/67044/ccc7b deleted inserted doc default.default.cihcs-Whatnot_e.NeR2huD3LRxweN3mFuI9cEhhVOzApW5w, mt Optional[mt{vbID=947, vbUUID=1968439985624, seqno=386, bucket=default}]
23:09:11.368 60/scala-execution-context-global-60/67044/ccc7b marking ATR default:default.transactions._txn:atr-414-#aeeb as rollback complete
23:09:11.401 60/scala-execution-context-global-60/67044/ccc7b rollback - atr rolled back
23:09:11.401 60/scala-execution-context-global-60/67044/ccc7b Adding cleanup request for transactions/_txn:atr-414-#aeeb to run in 30000 msecs
23:09:11.401 60/scala-execution-context-global-60/67044/ccc7b added attempt TransactionAttempt{id=ccc7b,state=ROLLED_BACK,atr=default:default.transactions._txn:atr-414-#aeeb} after error
23:09:11.401 60/scala-execution-context-global-60/67044/<> reraising original exception 'TransactionOperationFailed' com.couchbase.transactions.error.attempts.ActiveTransactionRecordNotFound: 43d01a06-cada-423f-942a-cbb5be7cc491
23:09:11.402 60/scala-execution-context-global-60/67044 TransactionOperationFailed retryTransaction=false
23:09:11.402 60/scala-execution-context-global-60/67044/ccc7b converted TransactionOperationFailed TRANSACTION_FAILED to final error com.couchbase.transactions.error.TransactionFailed: Transaction has failed with cause 'com.couchbase.transactions.error.attempts.ActiveTransactionRecordNotFound: 43d01a06-cada-423f-942a-cbb5be7cc491'


40a5a9a0-338e-45a4-bb98-94e185c1c2f7
com.couchbase.transactions.error.attempts.ActiveTransactionRecordNotFound: 40a5a9a0-338e-45a4-bb98-94e185c1c2f7
	at com.couchbase.transactions.components.DocumentGetter.lambda$lookupStatusFromATR$7(DocumentGetter.java:198)
	at reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber.onError(FluxOnErrorResume.java:88)
	at reactor.core.publisher.FluxMap$MapSubscriber.onError(FluxMap.java:126)
	at reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber.onError(FluxOnErrorResume.java:100)
	at reactor.core.publisher.Operators.error(Operators.java:196)
	at reactor.core.publisher.MonoError.subscribe(MonoError.java:52)
	at reactor.core.publisher.Mono.subscribe(Mono.java:4213)
	at reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber.onError(FluxOnErrorResume.java:97)
	at reactor.core.publisher.FluxDoFinally$DoFinallySubscriber.onError(FluxDoFinally.java:129)
	at com.couchbase.client.core.Reactor$SilentMonoCompletionStage.lambda$subscribe$0(Reactor.java:172)
	at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859)
	at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837)
	at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
	at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2073)
	at com.couchbase.client.core.msg.BaseRequest.succeed(BaseRequest.java:143)
	at com.couchbase.client.core.io.netty.kv.KeyValueMessageHandler.decodeAndComplete(KeyValueMessageHandler.java:322)
	at com.couchbase.client.core.io.netty.kv.KeyValueMessageHandler.decode(KeyValueMessageHandler.java:301)
	at com.couchbase.client.core.io.netty.kv.KeyValueMessageHandler.channelRead(KeyValueMessageHandler.java:228)
	at com.couchbase.client.core.deps.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	at com.couchbase.client.core.deps.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	at com.couchbase.client.core.deps.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
	at com.couchbase.client.core.io.netty.kv.MemcacheProtocolVerificationHandler.channelRead(MemcacheProtocolVerificationHandler.java:84)
	at com.couchbase.client.core.deps.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	at com.couchbase.client.core.deps.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	at com.couchbase.client.core.deps.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
	at com.couchbase.client.core.deps.io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:324)
	at com.couchbase.client.core.deps.io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:311)
	at com.couchbase.client.core.deps.io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:425)
	at com.couchbase.client.core.deps.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
	at com.couchbase.client.core.deps.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	at com.couchbase.client.core.deps.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	at com.couchbase.client.core.deps.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
	at com.couchbase.client.core.deps.io.netty.handler.flush.FlushConsolidationHandler.channelRead(FlushConsolidationHandler.java:152)
	at com.couchbase.client.core.deps.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	at com.couchbase.client.core.deps.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	at com.couchbase.client.core.deps.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
	at com.couchbase.client.core.deps.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
	at com.couchbase.client.core.deps.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	at com.couchbase.client.core.deps.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	at com.couchbase.client.core.deps.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
	at com.couchbase.client.core.deps.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
	at com.couchbase.client.core.deps.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714)
	at com.couchbase.client.core.deps.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
	at com.couchbase.client.core.deps.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
	at com.couchbase.client.core.deps.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
	at com.couchbase.client.core.deps.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
	at com.couchbase.client.core.deps.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at com.couchbase.client.core.deps.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:834)
  • When transactions.run is synchronized, there is no problem.
  • Code never fails with (1 to 20).foreach (notice that there is no par).
  • Code rarely fails with (1 to 2).par.
  • Code sometimes fails with (1 to 5).par, but more likely.
  • When I create an Apache Pool with 8 Transaction objects and borrow them while using par, code fails just like (1 to 20).par fails (in other words, I have 8 transaction objects and I make sure that they can not be used in parallel)
  • Same problem from 1.1.0 to 1.1.3.
  • Transactions are set to a metadata collection that has a primary index only.

Configuration:

transactions {
        # Couchbase default is 15 seconds.
        expiration-time = "30 seconds"
        # Couchbase default is 15 seconds.
        key-value-timeout = "30 seconds"
        # Couchbase default is 1000 and must be less than 20480!
        maximum-number-of-transaction-records = 20480
        # Couchbase default is 60 seconds.
        cleanup-window = "60 seconds"
      }

What does attempts.ActiveTransactionRecordNotFound? It seems fishy to me, can transactions run in Couchbase in parallel? Why are active transaction records not being found?

Thanks!

In general, I’m seeing a lot of these ActiveTransactionRecordNotFound errors, and transactions are not retried then. The application must retry these, but then, after a while, it seems that the object that is frequently contested somehow gets corrupted, or the cluster gets corrupted and not a single transaction would go through.

Code that otherwise worked with CB 6.5 now throws ActiveTransactionRecordNotFound errors with CB 7.0.

@daschl / @graham.pople can you please assist ?

1 Like

Hi Zoltan,

It could be a case that document “default.default.cihcs-Whatnot_e.2OxjCDuTU4M9uUsCGsnOmzacYOm47ABg” was involved in some previous transaction and the ATR related to that transaction was deleted.

Please let us know if this has happened,It yes, we have a bug already filed for it and we are working on it.

1 Like

Hi @praneeth.bokka!

I’m not sure about that exact ID, but in the use case yes, there is 1 document that is involved in a lot of transactions.

In our use-case, we have a global counter-like object, that must be updated in a transaction since no number must be wasted during “count”.

I’m not sure I fully understand the mechanics of ATRs, at least we did not delete that. I can see that the ActiveTransactionRecordNotFound transaction ID has been added to that counter as metadata and that it can not be modified by “hand” through the Couchbase UI.

We further experimented and found out that the more a single document is involved in a transaction at the same time, the more we see this problem and the application must retry. But (and this is a bigger problem), somehow the metadata can get corrupted in a way that no further retries would work - the document gets “corrupted”.

hi @zoltan.zvara

Could you please set maximum-number-of-transaction-records to 20479 or lower . For now lets go with 1024

ATR = ActiveTransactionRecord. This record contains the metadata information on a transaction
Each transaction could have multiple attempts and each attempt’s metadata is saved in these ATR’s.
If ATR is deleted manually/otherwise or write permissions on its collection are revoked then it interferes with the Transaction cleanup process and thus might effect future transactions working on the documents related to earlier transaction.
This is a very rare case and we have a filed bug and are working on it.

In your case:

  1. May be atr is deleted(You have confirmed its not the case)
  2. Please check if write permissions on its collection are revoked

It is strange to notice that number of threads is affecting the likelihood of the issue occurrence. On the top of it, the issue looks to be due to ATR of the previously failed transaction. However to help debug this, could you please try using another document for global counter to see if issue still happens.

1 Like

Hi @praneeth.bokka!

We experimented with a different value of maximum-number-of-transaction-records, but the problem still persisted. One other problem that came around with low maxATRs is that in the development environment, cleaning was not kicking off, therefore 1024 got quickly used up. We see that there is no way to manually kick off the cleaning in the Transaction client. It would be nice for development environments.

I can confirm that we do not delete ATRs nor revoke permissions.

We specified a separate transactions collection for the Transaction Client, so all metadata should go there. If any metadata goes to the collection where documents are stored, well that may be it, but hopefully not.

Also, we reproduced the same problem in two separate projects with completely different objects. In each case, we observe the same parallelization effect that I posted in the first place.

Hi @zoltan.zvara, I’m back from vacation and rather concerned by this issue. Let me catch up with your questions in order:

can transactions run in Couchbase in parallel?

Yes, absolutely :slight_smile:

I’m not sure I fully understand the mechanics of ATRs, at least we did not delete that.

There’s a bit on the underlying mechanics of transactions here that’s worth a read:

Essentially Active Transaction Records (ATRs) are metadata documents created and used by the transactions library. The ActiveTransactionRecordNotFound error indicates that the transactions library needed to read or write such an ATR but could not locate it, and is hence unable to make progress. It should only happen if the user has accidentally deleted the ATR, but you mention you’re sure that’s not happening so let’s proceed on that basis.

In which case, the only thing I can think of is there’s something suspect with the code, and my lack of Scala 3-fu is not aiding me here. Ah, but I think I may have it… (and thank you for the excellent and complete problem description, I wouldn’t have got here without it):

(1 to 20).par.foreach will parallelise those operations. First, parallelising of operations has only been tested with the async (reactive) API, not the blocking one you’re using. In Java, it is not so easy and intuitive to parallelise operations in this way, and it’s assumed that those power-users wanting parallelism will reach for the reactive/async API. That said, I cannot think of a reason why parallelising the blocking API wouldn’t work - just want to mention that you’re a little off the beaten path here.

So parallelising the blocking API isn’t itself the issue, but what I think is, is that there are certain restrictions to bear in mind when you’re parallelising operations, detailed here:

In particular, the first mutating operation (which also writes the ATR), must be done in serial.

I suspect what’s happening is that:
Operation 1 starts, chooses an ATR, and saves that decision in internal state.
Before it can actually write the ATR, operation 2 starts. It sees an ATR is already selected, and continues, and then hits some transient problem - perhaps a write-write conflict with another transaction, since you mention only seeing this problem in contended situations with concurrent transactions.
Operation 2 tries to abort this attempt, which involves aborting the ATR - but this hasn’t been created yet - so you get the error seen.

Essentially, it’s a known, documented race condition, and one that must be solved currently by the user using the documented methods (e.g. serialising the first operation). This is something of a trade-off between allowing power-users to parallelise operations, and not penalising the more common path with the locks it would require to handle this automagically inside the transactions library. (I should mention - it’s on my todo list to investigate how much of a penalty the locking would introduce, and to possibly re-evaluate that trade-off.)

I would suggest that unless you truly need the improved latency of parallelising operations (e.g. for larger transactions) that you perform them in serial, purely as it puts you on the path more travelled (and tested).

2 Likes

Hi @graham.pople,

& welcome back, I hope you had a pleasant time! Thanks for catching up with this issue!

It should only happen if the user has accidentally deleted the ATR, but you mention you’re sure that’s not happening so let’s proceed on that basis.

I want to make sure that this is not the case, I would like to know where these ATRs are stored. If these are stored where the Transaction API is configured, like:


          TransactionConfigBuilder.create()
            .logOnFailure(true, Event.Severity.TRACING)
            .cleanupLostAttempts(true)
            .cleanupClientAttempts(true)
            .metadataCollection(javaTransactionsCollection) // here?

we have a dedicated collection for those:

              val javaTransactionsCollection =
                Storage.getOrCreateCollection("transactions", javaScope, javaBucket) // the name of the collection will be "transactions"

in the application’s scope.

image

(It seems that the UI would not show all binary documents, only the first one.) Maybe this is a UI bug?

I suspect what’s happening is that: […]

I’ll make an attempt to confirm this with the async bucket.

I would suggest that unless you truly need the improved latency of parallelising operations (e.g. for larger transactions) that you perform them in serial, purely as it puts you on the path more travelled (and tested).

However, the serial operation is not an option, since these transactions are invoked by HTTP calls. Neverthless, synchronizing the API that initiates the transactions solves the problem and forces serial execution.

Let me get back to you with more tests based on your suggestions.

1 Like

So, the default out-of-the-box behaviour is to use the default collection, of the bucket, of the first mutated document in the transaction.
As of 1.1.3 you can use the new metadataCollection() config parameter (as in your example) to specify the collection to use instead.
This certainly isn’t mandatory, we think the default collection strategy will be fine for the majority of users, but metadataCollection() gives you that control if you need or want it. There’s some discussion in the docs on when you might choose to use metadataCollection(). So, absolutely fine to create that dedicated “transactions” collection for this.

I’m rereading your code this morning and am less sure about my diagnosis… I see “Whatnots.inTransaction {” will start a separate transaction (“transactions.run((ctx) → {}}”), so I think that code spawns 20 parallel transactions, but inside each transaction, the insert/replace/insert is done in serial? If so, then this should be fine…

Ah yes, now I think to check the logs you posted, I see I was wrong. The issue is instead here:

23:09:11.200 57/scala-execution-context-global-57/f4b7a/e6516 getting doc default.default.cihcs-Whatnot_e.2OxjCDuTU4M9uUsCGsnOmzacYOm47ABg, resolvingMissingATREntry=
23:09:11.232 57/scala-execution-context-global-57/f4b7a/e6516 doc default.default.cihcs-Whatnot_e.2OxjCDuTU4M9uUsCGsnOmzacYOm47ABg is in a transaction, looking up its status from ATR (MAV read)
23:09:11.263 57/scala-execution-context-global-57/f4b7a/e6516 got error while getting doc default.default.cihcs-Whatnot_e.2OxjCDuTU4M9uUsCGsnOmzacYOm47ABg: {err=ActiveTransactionRecordNotFound,ec=FAIL_OTHER,msg=‘43d01a06-cada-423f-942a-cbb5be7cc491’}
So, the transactions library is trying to get a document (presumably triggered by your replace code), it’s discovered that document is in another transaction, and it’s trying to fetch the ATR entry for that other transaction to determine whether it is allowed to overwrite that document (e.g. if the other transaction is expired or failed or completed), or if it must wait. But the ATR document no longer exists…
Ah, I’m doing a static analysis of the transactions code base now, and I might have spotted the issue. I’ll investigate further and get back to you.

1 Like

So, absolutely fine to create that dedicated “transactions” collection for this.

Good to hear!

but inside each transaction, the insert/replace/insert is done in serial? If so, then this should be fine…

Yes, all KV operations within the scope of a transaction context are executed in a single thread, sequentially.

So, the transactions library is trying to get a document (presumably triggered by your replace code), it’s discovered that document is in another transaction, and it’s trying to fetch the ATR entry for that other transaction to determine whether it is allowed to overwrite that document (e.g. if the other transaction is expired or failed or completed), or if it must wait. But the ATR document no longer exists…

How are ATRs removed and by whom and in which cases? I read somewhere, that ATRs are managed by both the client and the server.

How are ATRs removed and by whom and in which cases? I read somewhere, that ATRs are managed by both the client and the server.

Nothing should be removing the ATR documents. There is a cleanup process (update: I should say cleanup algorithm instead, distributed between all transactional clients), that will find ATR entries (e.g. records inside the ATRs) and remove those, but that won’t cause what you’re seeing. I suspect that there is a bug and the transactions library is looking in the wrong place for the ATR here - will get back to you once I’ve confirmed/denied that.

ATRs are managed by the transactions protocol. Until very recently this was purely on the client side, but with the introduction of support for queries in transactions library 1.1.3 and Couchbase Server 7.0 beta, elements of the transactions protocol will be executed server-side, in the query engine, if the transaction involves one or more queries.

Btw, can you let me know where you saw that the ATRs are partly managed by the server? I’d like to check that documentation.

1 Like

I can confirm that this is a bug in the transactions Java library. It’s found a document that’s involved in another transaction, and is looking up the ATR entry for that transaction to decide how to proceed. But it’s looking for the ATR on the document’s collection, instead of using the ATR collection information in the document’s metadata. So it reports the ATRNotFound error you see

I’ve fixed the bug and forthcoming release 1.1.4 will contain it - ETA for transactions releases is 3rd Tuesday of the month, e.g. 15th December. Thank you for reporting this and for the great problem description, and apologies for the inconvenience.

1 Like

@graham.pople thank you for your kind help and I’m happy that you figured it out!

A quick question, will 1.1.4 be compatible with the latest Scala library? As far as I know, 1.1.3 lags behind in some core-io(?) libraries, rendering it binary-incompatible with the latest Scala library, as of this speaking (Scala 1.1.0).

Do you plan to sync releases for the transactions library as well as for the Java/Scala libraries? We automate library upgrades and testing in CI using Scala Steward, and due to the above problem, Couchbase libraries must be upgraded “by hand” each month. Maybe others have or will have the same problem.

The aim is that we release Java SDK, Scala SDK, and core-io jars together on the first Tuesday of each month, and transactions on the third Tuesday, in a regular ‘train’ release model. Transactions has a transitive dependency on the Java SDK’s jar, which in turn has a transitive dependency on core-io. Generally, that transitive dependency will reflect the latest Java SDK release, as the intention is that people just pull in the transactions library, and allow the Java SDK to be transitively pulled in with it - which ensures binary compatibility.

I understand there’s more friction with your case, because you use the Scala SDK and hence can’t rely on the transitive dependency, and because of this Scala Steward automation, and that you would prefer it if all the jars were released atomically together, so you can automatically upgrade to latest versions of everything without needing to consider binary compatibility. It’s a reasonable request, I’ll take it on board and kick off an internal discussion on it.

1 Like

Oh, forgot to mention this one, I can replicate and my colleague @praneeth.bokka is looking into this. It’s filed as TXNJ-336.

1 Like

Great, I think everyone would benefit from quicker and more automated release/integration/upgrade cycles!

Related to this, we do have automated publication of snapshots which does pass a pretty high bar in testing. We have another level of testing/validation that goes behind the monthly cadence.

Point is, we agree we should ship early and often! We’re balancing how often with how quickly people can take things up and some additional longevity testing.

If you want in on the snapshots, we can point you in the right direction.

@ingenthr thanks, then we will remove these libraries from automation tools and update them by hand eventually.

Snapshots would be nice for regression testing, but for production patches, it feels like living on the edge all the time.

Hi @zoltan.zvara, this bug is fixed in Transactions for Java 1.1.4, released today. My apologies that you encountered it.

This build supports (and requires) Couchbase Java SDK 3.1.0.

We are continuing to discuss your request to have the Java SDK, Scala SDK, and transactions library released atomically.

1 Like