Client-Side timeout in Docker Swarm

Greetings,

we have been running CouchBase in Docker for more then a year in prototype configuration. The setup was quite simple, single-node CouchBase as a container (official community image 4.5.1), another container Node.js/Express application. It did not use Swarm or overlay networking because of older Docker setup.

In the meantime the project grew and we have prepared new platform based on Docker Swarm / Stack. Everything went well, migration was successful. After a while we noticed random HTTP 404s in our application which we could not reason about. Long story short, we found out a problem with CouchBase:

"CouchbaseError: Client-Side timeout exceeded for operation. Inspect network conditions or increase the timeout","code":23

After heavy digging we were able to raise log levels and found more detail, like:

7663ms [I2d453069] {1} [ERROR] (cccp - L:165) <NOHOST:NOPORT> (CTX=0,) Could not get configuration: LCB_ETIMEDOUT (0x17)

Anyway in such simple setup it did not make sense to us, why we are getting timeouts. And the problem grew because we got these upon startup of our application where we perform “migrations”. Migration means we compare documents whether schema matches requirements and if not we sequentially migrate it to current version - one-by-one - no magic, no concurrency, just going through some hundreds of documents at the beginning.

We came across couple of indications, like Docker Overlay networking and Idle connections:

As a workaround we tried to upgrade to 5.1.1 - same and new problems and code adaptations which did not go well.

Last resort was to increase operation timeout to 10sec, which seemed enough … but no … now we are at 30sec.

And if you ask some basic questions like:

  • is the server under heavy load - no - it’s been idle
  • was there anything special in logs - nope :frowning:
  • what version of SDK do you use - 2.5.1
  • have you increased log level - yes
  • how many buckets do you have - 4, but only 1 have such amount of documents to migrate
  • is there any firewall - no, all ports are open between app and db

We do not want to keep such state with high timeouts because we feel there is something fishy going on. Do you have similar experience with Docker Overlay networking? Do you have some other suggestions how to resolve the issue or at least investigate?

Thank you,
Ladislav

Hi @lgazo,

It looks like some operation is timing out. Are you running any N1QL queries? I ask because there might be an inefficient index (or missing index) that would not be a problem at first, but could manifest later as your data grows in size.

Hi,

yes it times out. We don’t use N1QL. There are 5 views, pretty simple ones. And it is interesting that also simple operations like getting a document by ID do timeout.

I am still trying to find out a way how to calculate and monitor idle connections, as there was one topic in the forum pointing out, but I am not sure if this might be the root cause. Although Docker overlay network is slower, other applications perform normally.

I agree with @matthew.groves, and also note that the timeout there is related to fetching the configuration proactively to ensure the client’s configuration is kept up to date. The “NOHOST:NOPORT” is a bit suspicious there. You don’t say which SDK, but just a version. I see libcouchbase, but you could have Node.js/PHP/Python atop it.

At the moment, one constraint to using Couchbase Server in a docker setup is making sure the hostnames and ports resolve the same way they do in the server supplied configuration. My guess is there’s something about your setup that may not match that.

You can generally validate correctness and find any problems with SDK doctor. You’ll find pre-built binaries on the release page.

Usually the summary at the end is pretty easy to interpret. If you need help with interpretation, please feel free to post it or a link to it here.

Also, there’s a feature being worked on that should make multi-network configurations simpler, @anil may be able to point to some info on it or an issue to track.

Thanks @ingenthr, I ran SDK Doctor inside CouchBase container. I attached also the results of the analysis at the bottom. What I can tell so far is, that hostname ‘yeself-db.service.consul’ is resolved to the same IP from application and from inside DB container:

/opt/yeself-guide-server # ping yeself-db.service.consul
PING yeself-db.service.consul (10.0.11.8): 56 data bytes
64 bytes from 10.0.11.8: seq=0 ttl=64 time=0.805 ms

root@55cbc0433fdf:/tmp# ping yeself-db.service.consul
PING yeself-db.service.consul (10.0.11.8) 56(84) bytes of data.
64 bytes from 10.0.11.8: icmp_seq=1 ttl=64 time=0.062 ms

But what is weird is, that although port 11210 is opened and listening

root@55cbc0433fdf:/tmp# netstat -ntl|grep 11210
tcp        0      0 0.0.0.0:11210           0.0.0.0:*               LISTEN     
tcp6       0      0 :::11210                :::*

doctor’s analysis says it cannot connect to KV service at 11210 - I am pretty sure bucket name is correct, user and password as well. Although I am running version 4.5 without bucket passwords at the moment. My command to start doctor looks like this:

./sdk-doctor-linux diagnose couchbase://yeself-db.service.consul/guide -u admin -p TopSecretPassword

And btw. we use NodeJS SDK.

|====================================================================|
|          ___ ___  _  __   ___   ___   ___ _____ ___  ___           |
|         / __|   \| |/ /__|   \ / _ \ / __|_   _/ _ \| _ \          |
|         \__ \ |) | ' <___| |) | (_) | (__  | || (_) |   /          |
|         |___/___/|_|\_\  |___/ \___/ \___| |_| \___/|_|_\          |
|                                                                    |
|====================================================================|

Note: Diagnostics can only provide accurate results when your cluster
 is in a stable state.  Active rebalancing and other cluster configuration
 changes can cause the output of the doctor to be inconsistent or in the
 worst cases, completely incorrect.

08:05:54.865 INFO ▶ Parsing connection string `couchbase://yeself-db.service.consul/guide`
08:05:55.065 INFO ▶ Connection string identifies the following CCCP endpoints:
08:05:55.065 INFO ▶   1. yeself-db.service.consul:11210
08:05:55.065 INFO ▶ Connection string identifies the following HTTP endpoints:
08:05:55.065 INFO ▶   1. yeself-db.service.consul:8091
08:05:55.065 INFO ▶ Connection string specifies bucket `guide`
08:05:55.065 WARN ▶ Your connection string specifies only a single host.  You should consider adding additional static nodes from your cluster to this list to improve your applications fault-tolerance
08:05:55.065 INFO ▶ Performing DNS lookup for host `yeself-db.service.consul`
08:05:55.067 INFO ▶ Bootstrap host `yeself-db.service.consul` refers to a server with the address `10.0.11.8`
08:05:55.067 INFO ▶ Attempting to connect to cluster via CCCP
08:05:55.068 INFO ▶ Attempting to fetch config via cccp from `yeself-db.service.consul:11210`
08:05:55.074 ERRO ▶ Failed to fetch configuration via cccp from `yeself-db.service.consul:11210` (error: invalid bucket name/password)
08:05:55.074 INFO ▶ Attempting to connect to cluster via HTTP (Terse)
08:05:55.074 INFO ▶ Attempting to fetch terse config via http from `yeself-db.service.consul:8091`
08:05:55.086 INFO ▶ Identified the following nodes:
08:05:55.086 INFO ▶   [0] yeself-db.service.consul
08:05:55.086 INFO ▶                      fts:  8094,            indexScan:  9101,   indexStreamCatchup:  9104
08:05:55.086 INFO ▶         indexStreamMaint:  9105,                 capi:  8092,            projector:  9999
08:05:55.086 INFO ▶                     mgmt:  8091,           indexAdmin:  9100,            indexHttp:  9102
08:05:55.086 INFO ▶          indexStreamInit:  9103,                   kv: 11210,                 moxi: 11211
08:05:55.086 INFO ▶                     n1ql:  8093
08:05:55.086 WARN ▶ Your configuration was fetched via a non-optimal path, you should update your connection string and/or cluster configuration to allow CCCP config fetch
08:05:55.086 INFO ▶ Fetching config from `yeself-db.service.consul:8091`
08:05:55.104 INFO ▶ Received cluster configuration, nodes list:
[
  {
    "clusterCompatibility": 262149,
    "clusterMembership": "active",
    "couchApiBase": "http://yeself-db.service.consul:8092/",
    "hostname": "yeself-db.service.consul:8091",
    "interestingStats": {
      "cmd_get": 0,
      "couch_docs_actual_disk_size": 5.3653476e+07,
      "couch_docs_data_size": 5.3309869e+07,
      "couch_spatial_data_size": 0,
      "couch_spatial_disk_size": 0,
      "couch_views_actual_disk_size": 1.8277813e+07,
      "couch_views_data_size": 1.8248727e+07,
      "curr_items": 378,
      "curr_items_tot": 378,
      "ep_bg_fetched": 0,
      "get_hits": 0,
      "mem_used": 1.6401404e+08,
      "ops": 0,
      "vb_replica_curr_items": 0
    },
    "mcdMemoryAllocated": 23559,
    "mcdMemoryReserved": 23559,
    "memoryFree": 1.577699328e+10,
    "memoryTotal": 3.0880018432e+10,
    "os": "x86_64-unknown-linux-gnu",
    "otpNode": "ns_1@yeself-db.service.consul",
    "ports": {
      "direct": 11210,
      "proxy": 11211
    },
    "recoveryType": "none",
    "services": [
      "fts",
      "index",
      "kv",
      "n1ql"
    ],
    "status": "healthy",
    "systemStats": {
      "cpu_utilization_rate": 78.1725888324873,
      "mem_free": 1.577699328e+10,
      "mem_total": 3.0880018432e+10,
      "swap_total": 4.2949632e+09,
      "swap_used": 1.24063744e+09
    },
    "thisNode": true,
    "uptime": "252960",
    "version": "4.5.1-2844-community"
  }
]
08:05:55.107 ERRO ▶ Failed to connect to KV service at `yeself-db.service.consul:11210` (error: invalid bucket name/password)
08:05:55.118 INFO ▶ Successfully connected to MGMT service at `yeself-db.service.consul:8091`
08:05:55.122 INFO ▶ Successfully connected to CAPI service at `yeself-db.service.consul:8092`
08:05:55.130 INFO ▶ Successfully connected to N1QL service at `yeself-db.service.consul:8093`
08:05:55.132 INFO ▶ Successfully connected to FTS service at `yeself-db.service.consul:8094`
08:05:55.134 WARN ▶ Failed to perform KV connection performance analysis on `yeself-db.service.consul:11210` (error: %!d(string=invalid bucket name/password))
08:05:55.134 INFO ▶ Diagnostics completed

Summary:
[WARN] Your connection string specifies only a single host.  You should consider adding additional static nodes from your cluster to this list to improve your applications fault-tolerance
[WARN] Your configuration was fetched via a non-optimal path, you should update your connection string and/or cluster configuration to allow CCCP config fetch
[WARN] Failed to perform KV connection performance analysis on `yeself-db.service.consul:11210` (error: %!d(string=invalid bucket name/password))
[ERRO] Failed to fetch configuration via cccp from `yeself-db.service.consul:11210` (error: invalid bucket name/password)
[ERRO] Failed to connect to KV service at `yeself-db.service.consul:11210` (error: invalid bucket name/password)

Found multiple issues, see listing above.