Ambiguous & UnAmbiguous timeout errors

I’m seeing a lot of AmbiguousTimeoutException and UnAmbiguousTimeoutException errors in the past couple of weeks. This wasn’t an issue earlier, and we haven’t introduced any major load recently. Let me break it down:

  1. AmbiguousTimeoutException with KeyValueErrorContext
  2. AmbiguousTimeoutException with SubdocumentErrorContext
    • Reason: key_value_sync_write_in_progress
    • We observed this while performing many mutations in quick succession (with SD). To mitigate, we added a 100ms sleep between mutations, which reduced, but did not eliminate, the errors.
  3. UnAmbiguousTimeoutException with KeyValueErrorContext

An example error looks like this:

UnAmbiguousTimeoutException(<ec=14, category=couchbase.common, message=unambiguous_timeout (14), context=KeyValueErrorContext:{'retry_attempts': 0, 'key': '85dc77f0-ccc4-4a6a-94ff-51770258cdde', 'bucket_name': 'xxx', 'scope_name': 'xxx', 'collection_name': 'eventlist', 'opaque': 405}, C Source=/couchbase-python-client/src/kv_ops.cxx:218>)

Cluster setup:
Couchbase 7.6.1 EE with 3 Data nodes, 2 nodes with query & eventing, 1 Index node
SDK: 4.3.5 on Python 3.12

# Imports assumed for this snippet (Python SDK 4.x module layout)
from datetime import timedelta

from acouchbase.cluster import AsyncCluster
from couchbase.auth import PasswordAuthenticator
from couchbase.options import ClusterOptions, ClusterTimeoutOptions


async def __new_cnx_cluster():
    global __CBS_CLUSTER

    cluster = __CBS_CLUSTER
    cnx_str = config.dbcnx

    if cluster:
        return cluster

    print(f"Connect to ({cnx_str}) with {config.dbuid} creds.", flush=True)

    _authenticator = PasswordAuthenticator(config.dbuid, config.dbpwd)
    _timeout_options = ClusterTimeoutOptions(
        query_timeout=timedelta(seconds=36),
        kv_timeout=timedelta(seconds=60),
        views_timeout=timedelta(seconds=36),
    )

    try:
        cluster = await AsyncCluster.connect(
            cnx_str,
            options=ClusterOptions(
                authenticator=_authenticator,
                timeout_options=_timeout_options
            ),
            authenticator=_authenticator,
        )
        await cluster.on_connect()
    except Exception as e:
        print("excCBSClusterCNX", e)
        print("Sentry Logger", e)

    __CBS_CLUSTER = cluster
  1. Why is it required to pass the authenticator both to ClusterOptions and also separately to connect()? If I remove it from one, I get an error.
  2. It feels like the timeout settings aren’t always being respected. Could something be overriding them?
  3. I noticed BestEffortRetryStrategy is available for Java SDK—do we have an equivalent or similar strategy available in the Python SDK?
  4. Are there any suggestions to completely eliminate these timeout issues? As mentioned, this setup was stable until a couple of weeks ago.

We observed this while performing many mutations in quick succession

You might want to slow down. The SDK has to hold all the requests from the time they are requested until there is a response (or timeout). If you are performing, say, 100,000 requests per second and your timeout is 36 seconds (that's really long, btw), then the SDK will have to hold up to 3.6 million requests. If each request is to insert a 10k document, that's 36GB of requests. You might want to keep a counter (or counting semaphore) of in-flight requests: increment it when a request is submitted, decrement it when the request completes (or times out), and don't let it exceed a "reasonable" number (100? 500?). Think of it this way: 100 requests that are 100% complete is more "progress" than 10,000 requests that are 1% complete.
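
In asyncio terms, the counter/counting-semaphore idea looks roughly like this (just a sketch; MAX_IN_FLIGHT and bounded_mutate are made-up names, and the limit needs tuning):

import asyncio

MAX_IN_FLIGHT = 100
_in_flight = asyncio.Semaphore(MAX_IN_FLIGHT)

async def bounded_mutate(collection, key, specs):
    # Blocks while MAX_IN_FLIGHT operations are already outstanding, so the
    # SDK never has to hold more than that many requests at once.
    async with _in_flight:
        return await collection.mutate_in(key, specs)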

  1. It should only be necessary to pass the authenticator in options. See Start Using the Python SDK | Couchbase Docs (there is a sketch after this list).
    If you show the logging and errors, I might be able to figure out what is going on in your case.

  2. There are lots of different timeouts. Your code sets query_timeout, kv_timeout and views_timeout. There is also a timeout for making connections. There is a tiny bit of fudging (100ms?) in the timeouts to ensure that, if a timeout could be enforced on both the server and the client, the timeout on the server expires first, as that allows the server to clean up nicely. If you are having doubts about connectivity, it's useful to add a cluster.wait_until_ready(timedelta(seconds=10)) between connect() and using the connection.

  3. If I'm not mistaken, the default behavior of the Python SDK (which relies on the C++ SDK) is the equivalent of BestEffort (i.e. if the operation is retryable, and the failure is retryable, the SDK will retry until the timeout is reached).

  4. Making the connection and using the connection are decoupled and asynchronous. Because of (3), any failure can result in a timeout. The timeout from the operation just means “I tried and tried and tried to do the operation, but the timeout expired without success”. (Unambiguous means that it is safe to retry the operation; Ambiguous means that perhaps the operation did get applied on the server, and that retrying may result in it being applied a second time.) Sometimes there is a “cause” along with the timeout exception. Otherwise it's necessary to look at the logging of what happened leading up to the timeout. There is information here Logging | Couchbase Docs regarding how to get more logging.
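
To make 1, 2 and 4 concrete, here is a rough sketch (host, credentials, bucket/collection names and the key are placeholders, and it is not tested against your exact SDK version):

from datetime import timedelta

from acouchbase.cluster import AsyncCluster
from couchbase.auth import PasswordAuthenticator
from couchbase.exceptions import AmbiguousTimeoutException, UnAmbiguousTimeoutException
from couchbase.options import ClusterOptions

async def example():
    auth = PasswordAuthenticator("user", "pass")
    # (1) the authenticator only needs to go into ClusterOptions
    cluster = await AsyncCluster.connect("couchbase://host", ClusterOptions(auth))
    # (2) make sure the cluster is actually usable before issuing operations
    await cluster.wait_until_ready(timedelta(seconds=10))

    bucket = cluster.bucket("xxx")
    await bucket.on_connect()
    collection = bucket.scope("xxx").collection("eventlist")

    try:
        await collection.upsert("some-key", {"v": 1})
    except UnAmbiguousTimeoutException:
        # (4) the write definitely did not happen, so it is safe to retry
        await collection.upsert("some-key", {"v": 1})
    except AmbiguousTimeoutException:
        # the write may or may not have been applied; retrying a non-idempotent
        # operation here could apply it twice, so verify before retrying
        pass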

SDK Doctor is useful for diagnosing SDK connectivity issues.

If you post the errors and the logs, I might be able to help you with them.

Hi Michael,

Thanks for your valuable inputs and suggestions!

Quick updates and a few follow-up questions from my side:

  1. The CSD timeouts we're seeing now are mostly limited to key_value_sync_write_in_progress; the other timeout types have become rare since we increased the sleep time. However, we still notice these timeouts even when a single mutate_in operation updates just 3–4 paths. Is there anything else I should be checking or tweaking? Any ideas or alternatives to fix this issue?

  2. Would you recommend reducing the kv_timeout to 16 seconds instead of the current 36 seconds to better handle these operations?

  3. Is there any supported way to update more than 16 paths in a single mutate_in call for CSD?

  4. I haven't yet tried making the connection purely via the options block; I'll test that shortly and share the logs for better insights.

Just for context:
We're using FastAPI for the API layer, where the Couchbase cluster and buckets are initialized during application startup and reused across all endpoint calls. There is ample time between connection creation and first use.

Thanks again for your help! Let me know if you require additional info.

Without seeing the code, I have to make some deductions and guesses that I would not have to make if I could see the code.
I was guessing that the timeouts are mostly due to having a large number of concurrent requests. The best way to control that is to count them with a counting semaphore that blocks when that count is reached. Given that the timeouts are on key_value_sync_write_in_progress, the calls evidently have a Durability specified: is that needed for your application? If the calls do not wait for the durability to be satisfied, they will be much, much faster. A kv operation can take less than a microsecond (not millisecond, microsecond) on the server. The SDK timeouts also include the network round-trip. Still, 16 seconds is an eternity; the default is 2.5 seconds.
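
For reference, this is roughly what the durability knob looks like on a sub-document call in the Python SDK (the key, path and value are made up, and this is only a sketch); if the application does not need the write replicated before the call returns, leaving the durability option off makes the call much cheaper:

import couchbase.subdocument as SD
from couchbase.durability import DurabilityLevel, ServerDurability
from couchbase.options import MutateInOptions

# (inside an async function, with `collection` already opened)
# Waits until the mutation is replicated to a majority of nodes; concurrent
# sync-writes on the same key are what produce key_value_sync_write_in_progress.
await collection.mutate_in(
    "some-key",
    [SD.upsert("available.vt1", True)],
    MutateInOptions(durability=ServerDurability(level=DurabilityLevel.MAJORITY)),
)

# The same mutation with no durability requirement (acknowledged as soon as
# the active node has applied it):
await collection.mutate_in("some-key", [SD.upsert("available.vt1", True)])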
There is no way to update more than 16 paths; that limitation is in the server. The reason is that there is complexity to handling the atomicity, and that becomes excessive beyond a small number of paths. (I wonder if this might be making subdoc operations with multiple paths that specify durability excessively long.)
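
If you do end up needing more than 16 paths on one document, the usual workaround is to split the specs into batches of at most 16 and issue several mutate_in calls, accepting that each call is only atomic on its own (rough sketch; mutate_in_chunks is a made-up helper):

MAX_SUBDOC_PATHS = 16  # server-side limit per mutate_in call

async def mutate_in_chunks(collection, key, specs):
    # Each chunk is applied atomically, but the chunks are separate operations.
    for i in range(0, len(specs), MAX_SUBDOC_PATHS):
        await collection.mutate_in(key, specs[i:i + MAX_SUBDOC_PATHS])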

Hi, I’ve shared the code here — solving this could potentially help us address some of the other related issues as well.

The errors are primarily being raised inside the save_vtlist coroutine.

For reference, the cluster and bucket initialization happens in the __init__.py
I’ve copied the relevant parts as docstrings in the file itself. Let me know if you need any additional information.

I’ll post another file separately regarding why cluster initialization fails when only options are provided

Thank you so much for the support!

I don't see anything in the code that limits the number of concurrent in-flight requests. Or even counts the number of concurrent requests. Or logs every operation. To troubleshoot execution, it helps to be able to see the execution.

Perhaps not directly related to the timeout, but it looks like if DocumentNotFoundException is raised when _ == 0 or _ == 1, the same mutate_in() call will simply be made again and get the same DocumentNotFoundException; the upsert only happens once _ == 2.

for _ in range(3):
    try:
        await collection_vt.mutate_in(sid, csd_list)
        break
    except DocumentNotFoundException:
        if _ == 2:
            await collection_vt.upsert(sid, {"sid": sid, "available": vtlist})
            break
    except CouchbaseException:
        await aio.sleep(0.1)
        continue
    except Exception as e:
        print(f"record_available_vtlist error({sid}) : ", e)
        break
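
Something along these lines would avoid those pointless retries (just a sketch of the control flow, keeping your names):

for attempt in range(3):
    try:
        await collection_vt.mutate_in(sid, csd_list)
        break
    except DocumentNotFoundException:
        # The document will still be missing on the next attempt, so create it
        # right away instead of retrying mutate_in.
        await collection_vt.upsert(sid, {"sid": sid, "available": vtlist})
        break
    except CouchbaseException:
        await aio.sleep(0.1)
    except Exception as e:
        print(f"record_available_vtlist error({sid}) : ", e)
        break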

Yeah, this is how the current setup is. Could you share a reference or code snippet that demonstrates handling concurrent in-flight requests using an async connector?

You could use the synchronous API.
You could use only one WORKER instead of 5.
You could use a rate-limiting semaphore (see loaddriver/src/main/java/com/example/load/LoadThread.java at master · mikereiche/loaddriver · GitHub); a Python sketch of the same idea is below.
You could use a counter such as maxRequestsInParallel (also in LoadThread).
You could use a longer timeout, such as 7.5 seconds, since your code makes up to 3 attempts (of 2.5 seconds each).
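
In Python, the rate-limiting-semaphore approach from LoadThread looks roughly like this (a sketch, not a drop-in for your code; run_mutations and the limit of 100 are made up):

import asyncio

MAX_REQUESTS_IN_PARALLEL = 100          # like maxRequestsInParallel in LoadThread

async def run_mutations(collection, work):
    # work: an iterable of (key, spec_list) pairs
    sem = asyncio.Semaphore(MAX_REQUESTS_IN_PARALLEL)

    async def one(key, specs):
        async with sem:                 # at most 100 operations in flight at once
            await collection.mutate_in(key, specs)

    await asyncio.gather(*(one(k, s) for k, s in work))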