Retries occur on roughly every 10th request to the DB

Hi!

I’m using Java 11 with Couchbase Java SDK version 3.4.1.

The infrastructure looks like API Gateway + Lambda + Couchbase server on an EC2 instance.

Every 6th to 10th request to the Lambda triggers retries with “SERVICE_NOT_AVAILABLE” and “NODE_NOT_AVAILABLE” as the reasons.



Then a WARN message appears: “Initializing the global config failed: UNKNOWN”

Code:

Cluster environment configuration: 

        Duration timeoutDuration = Duration.ofSeconds(30);

        Consumer<TimeoutConfig.Builder> timeoutConfig = item -> item
                .kvDurableTimeout(timeoutDuration)
                .kvTimeout(timeoutDuration)
                .queryTimeout(timeoutDuration)
                .searchTimeout(timeoutDuration)
                .connectTimeout(timeoutDuration)
                .analyticsTimeout(timeoutDuration)
                .disconnectTimeout(timeoutDuration);

Request to Couchbase:

 String queryAsString = "...";
 cluster.query(queryAsString);

Will appreciate your help!

It looks like the lambda does not have access to the cluster. Please make sure you’ve exposed all relevant ports (Couchbase Server Ports | Couchbase Docs) and can contact them externally.

I read that this could be the cause, so I exposed all the ports.
Here is a screenshot of the inbound rules for the Lambda.

Those are only the non-ssl ports. For ssl, the ssl ports also need to be accessible.

You can use curl to check that the ports are accessible. A timeout means the port is not exposed; anything else, including an error, means it is exposed (curl reports errors when it reaches a port that does not speak HTTP/S).
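For anyone who would rather run the same check from inside the Lambda's JVM than with curl, here is a minimal sketch using only the Java standard library. The class name `PortCheck` and the `State` enum are made up for illustration; the timeout-versus-anything-else distinction follows the curl advice above, and the mapping of Java exceptions to "filtered" or "reachable" is an approximation, not a guarantee.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical helper, not part of the Couchbase SDK or sdk-doctor.
final class PortCheck {
    enum State { OPEN, REFUSED, TIMED_OUT, ERROR }

    /** Tries one plain TCP connect to host:port and classifies the outcome. */
    static State probe(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return State.OPEN;       // something accepted the connection
        } catch (SocketTimeoutException e) {
            return State.TIMED_OUT;  // no answer at all: the port is probably filtered
        } catch (ConnectException e) {
            return State.REFUSED;    // usually the host answered but nothing is listening
        } catch (IOException e) {
            return State.ERROR;      // e.g. unknown host or no route to host
        }
    }
}
```

Calling `PortCheck.probe("HOST", 11210, 2000)` for each of the non-SSL ports (8091 to 8094 and 11210) from the Lambda should show whether any of them comes back as `TIMED_OUT`.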

I’m not using an SSL connection, and I’m running Couchbase Community Edition.

Here is the line of connection to the cluster

 cluster = Cluster.connect("connection string", clusterOptions);

Are the unopened SSL ports still relevant in my case?

Presumably “connection string” is not your actual connection string :grinning:

No, the SSL ports shouldn’t be used if you’re not using SSL.

But this is still likely to be a basic connectivity/ports issue. You’ve shown the AWS lambda inbound rules, but where is the cluster hosted - what about the inbound & outbound rules configured there? Can you connect to the cluster from your laptop? sdk-doctor (GitHub - couchbaselabs/sdk-doctor: Application-server-side cluster connection diagnostics.) can be useful for identifying connectivity problems.

The cluster is hosted on EC2 under the same account and in the same region as the Lambda that uses Couchbase. The cluster has exactly the same security group as the Lambda (see the screenshot of the Lambda inbound rules above).

Thanks for helping, I will try sdk-doctor.

I tested the cluster connection with sdk-doctor and didn’t notice any errors or issues.

Any ideas about possible reasons and ways how to diagnose this problem?

Will appreciate your help!

Every 6th to 10th request to the Lambda triggers retries with

But they eventually succeed? If they fail can you show the stack trace?

The messages, especially the DEBUG message that says “ignored on purpose”, are not fatal errors. To get the configuration, the SDK will try the management service if it fails to get the config from the kv service, and it will also retry. The operation may time out, but it won’t fail just because the SDK temporarily has no connection.
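To make the fallback described above concrete, here is a rough sketch of "try kv, then the management service, then retry". This is not the SDK's actual code; the class name, the `Supplier`-based sources, and the fixed number of rounds are all assumptions made for illustration.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Supplier;

// Illustrative only: mimics the behaviour described in the post, not the SDK internals.
final class ConfigFetchSketch {
    /**
     * Asks each source in order (e.g. kv first, then management) and returns the
     * first config found. If every source fails, starts another round, up to maxRounds.
     */
    static <T> Optional<T> fetchWithFallback(List<Supplier<Optional<T>>> sources, int maxRounds) {
        for (int round = 0; round < maxRounds; round++) {
            for (Supplier<Optional<T>> source : sources) {
                try {
                    Optional<T> config = source.get();
                    if (config.isPresent()) {
                        return config;
                    }
                } catch (RuntimeException ignored) {
                    // A failing source is not fatal: fall through to the next one.
                }
            }
        }
        return Optional.empty(); // nothing worked; the caller decides whether to time out
    }
}
```

The point of the sketch is that one failed source only produces a log line and a fallback, which is why such messages on their own do not mean an operation has failed.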

Can you show the connection string you are using? Can you show the SDK Doctor command and all the output from it?

I’m using Java 11 with Couchbase Java SDK version 3.4.1

Without any particular change in mind, the newer versions tend to work better. Could you try 3.4.6?

They all eventually succeed. The problem is the request time to Couchbase: it can take longer than a minute for a simple SELECT query, and in that case API Gateway fails with a timeout response.

Logs from SDK-Doctor:

19:21:33.010 INFO ▶ Parsing connection string `couchbase://HOST/BUCKET`
19:21:33.010 INFO ▶ Connection string identifies the following CCCP endpoints:
19:21:33.010 INFO ▶   1. HOST:11210
19:21:33.010 INFO ▶ Connection string identifies the following HTTP endpoints:
19:21:33.010 INFO ▶   1. HOST:8091
19:21:33.010 INFO ▶ Connection string specifies bucket `BUCKET`
19:21:33.010 WARN ▶ Your connection string specifies only a single host.  You should consider adding additional static nodes from your cluster to this list to improve your applications fault-tolerance
19:21:33.010 INFO ▶ Performing DNS lookup for host `HOST`
19:21:33.010 INFO ▶ Attempting to connect to cluster via CCCP
19:21:33.010 INFO ▶ Attempting to fetch config via cccp from `HOST:11210`
19:21:33.018 INFO ▶ Selected the following network type: default
19:21:33.018 INFO ▶ Identified the following nodes:
19:21:33.018 INFO ▶   [0] HOST
19:21:33.018 INFO ▶                indexHttp:  9102,   indexStreamCatchup:  9104,      indexStreamInit:  9103
19:21:33.018 INFO ▶                       kv: 11210,                 mgmt:  8091,                 n1ql:  8093
19:21:33.018 INFO ▶                     capi:  8092,                  fts:  8094,              ftsGRPC:  9130
19:21:33.018 INFO ▶               indexAdmin:  9100,            indexScan:  9101,     indexStreamMaint:  9105
19:21:33.018 INFO ▶                projector:  9999
19:21:33.018 INFO ▶ Fetching config from `http://HOST:8091`
19:21:33.026 INFO ▶ Received cluster configuration, nodes list:
[
{
"addressFamily": "inet",
"addressFamilyOnly": false,
"clusterCompatibility": 458753,
"clusterMembership": "active",
"configuredHostname": "127.0.0.1:8091",
"couchApiBase": "http://HOST:8092/",
"cpuCount": 2,
"externalListeners": [
{
"afamily": "inet",
"nodeEncryption": false
}
],
"hostname": "HOST:8091",
"interestingStats": {
"cmd_get": 0,
"couch_docs_actual_disk_size": 10066127,
"couch_docs_data_size": 142759,
"couch_spatial_data_size": 0,
"couch_spatial_disk_size": 0,
"couch_views_actual_disk_size": 0,
"couch_views_data_size": 0,
"curr_items": 50,
"curr_items_tot": 50,
"ep_bg_fetched": 0,
"get_hits": 0,
"index_data_size": 166785,
"index_disk_size": 3702784,
"mem_used": 19090064,
"ops": 0,
"vb_active_num_non_resident": 0,
"vb_replica_curr_items": 0
},
"mcdMemoryAllocated": 6357,
"mcdMemoryReserved": 6357,
"memoryFree": 7498797056,
"memoryTotal": 8333029376,
"nodeEncryption": false,
"nodeHash": 121769150,
"nodeUUID": "ac79d98f916c4fda5362a23ba51aaef5",
"os": "x86_64-pc-linux-gnu",
"otpNode": "ns_1@127.0.0.1",
"ports": {
"direct": 11210,
"distTCP": 21100,
"distTLS": 21150
},
"recoveryType": "none",
"services": [
"fts",
"index",
"kv",
"n1ql"
],
"status": "healthy",
"systemStats": {
"allocstall": 0,
"cpu_cores_available": 2,
"cpu_stolen_rate": 0.2023267577137076,
"cpu_utilization_rate": 2.470126375046816,
"mem_free": 7498797056,
"mem_limit": 8333029376,
"mem_total": 8333029376,
"swap_total": 0,
"swap_used": 0
},
"thisNode": true,
"uptime": "274458",
"version": "7.1.1-3175-community"
}
]
19:21:33.031 INFO ▶ Successfully connected to Key Value service at `HOST:11210`
19:21:33.086 INFO ▶ Successfully connected to Management service at `HOST:8091`
19:21:33.090 INFO ▶ Successfully connected to Views service at `HOST:8092`
19:21:33.093 INFO ▶ Successfully connected to Query service at `HOST:8093`
19:21:33.096 INFO ▶ Successfully connected to Search service at `HOST:8094`
19:21:33.096 WARN ▶ Could not test Analytics service on `HOST` as it was not in the config
19:21:33.111 INFO ▶ Memd Nop Pinged `HOST:11210` 10 times, 0 errors, 0ms min, 1ms max, 1ms mean

I tried version 3.4.6 and the issue still occurs.

Here are also the sdk-doctor logs from a run against the Couchbase EC2 public IP:

.\sdk-doctor-windows.exe diagnose couchbases://HOST/BUCKET -u USERNAME -p PASSWORD
|====================================================================|
|          ___ ___  _  __   ___   ___   ___ _____ ___  ___           |
|         / __|   \| |/ /__|   \ / _ \ / __|_   _/ _ \| _ \          |
|         \__ \ |) | ' <___| |) | (_) | (__  | || (_) |   /          |
|         |___/___/|_|\_\  |___/ \___/ \___| |_| \___/|_|_\          |
|                                                                    |
|====================================================================|

Note: Diagnostics can only provide accurate results when your cluster
 is in a stable state.  Active rebalancing and other cluster configuration
 changes can cause the output of the doctor to be inconsistent or in the
 worst cases, completely incorrect.

18:22:48.509 INFO ▶ Parsing connection string `couchbases://HOST/BUCKET`
18:22:48.510 INFO ▶ Connection string specifies to use secured connections
18:22:48.510 INFO ▶ Connection string identifies the following CCCP endpoints:
18:22:48.510 INFO ▶   1. HOST:11207
18:22:48.510 INFO ▶ Connection string identifies the following HTTP endpoints:
18:22:48.510 INFO ▶   1. HOST:18091
18:22:48.511 INFO ▶ Connection string specifies bucket `BUCKET`
18:22:48.511 WARN ▶ No certificate authority file specified (--tls-ca), skipping server certificate verification for this run.
18:22:48.513 WARN ▶ Your connection string specifies only a single host.  You should consider adding additional static nodes from your cluster to this list to improve your applications fault-tolerance
18:22:48.514 INFO ▶ Performing DNS lookup for host `3.234.167.190`
18:22:48.514 INFO ▶ Attempting to connect to cluster via CCCP
18:22:48.515 INFO ▶ Attempting to fetch config via cccp from `HOST:11207`
18:22:50.515 ERRO ▶ Failed to fetch configuration via cccp from `HOST:11207` (error: dial tcp 3.234.167.190:11207: i/o timeout)
18:22:50.515 INFO ▶ Attempting to connect to cluster via HTTP (Terse)
18:22:50.516 INFO ▶ Attempting to fetch terse config via http from `HOST:18091`
18:22:52.519 ERRO ▶ Failed to fetch terse configuration via http from `HOST:18091` (error: Get "http://3.234.167.190:18091/pools/default/b/BUCKET": context deadline exceeded (Client.Timeout exceeded while awaiting headers))
18:22:52.519 INFO ▶ Attempting to connect to cluster via HTTP (Full)
18:22:52.520 INFO ▶ Failed to connect via HTTP (Full), as it is not yet supported by the doctor
18:22:52.521 INFO ▶ Selected the following network type:
18:22:52.523 ERRO ▶ All endpoints specified by your connection string were unreachable, further cluster diagnostics are not possible
18:22:52.524 INFO ▶ Diagnostics completed

Summary:
[WARN] No certificate authority file specified (--tls-ca), skipping server certificate verification for this run.
[WARN] Your connection string specifies only a single host.  You should consider adding additional static nodes from your cluster to this list to improve your applications fault-tolerance
[ERRO] Failed to fetch configuration via cccp from `HOST:11207` (error: dial tcp 3.234.167.190:11207: i/o timeout)
[ERRO] Failed to fetch terse configuration via http from `HOST:18091` (error: Get "http://HOST:18091/pools/default/b/BUCKET": context deadline exceeded (Client.Timeout exceeded while awaiting headers))
[ERRO] All endpoints specified by your connection string were unreachable, further cluster diagnostics are not possible

Found multiple issues, see listing above.

Maybe it will help

Ok. Query execution is mostly independent of getting the configuration. Fetching the config does need to succeed once so that the SDK is aware of the query nodes, but after that it is (mostly) not needed. To get more information on why the queries are taking so long to execute, examine or print the metrics from the query result, enable Threshold reporting, or both. Looking in the logs on the query nodes may also help.

In the second SDK Doctor run you specified couchbases://…, which indicates TLS, but you’re not using TLS (according to the output from the first SDK Doctor run), so that’s not going to work.
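The two SDK Doctor runs above show how the scheme alone changes which ports are contacted: `couchbase://` went to 11210 and 8091, while `couchbases://` went to 11207 and 18091. A small sketch of that mapping follows; the class and field names are hypothetical, and rejecting a connection string without a scheme is a simplification of this sketch rather than a statement about how the SDK behaves.

```java
// Hypothetical helper that mirrors the endpoints printed by the two sdk-doctor runs.
final class BootstrapPorts {
    final boolean tls;
    final int kvPort;
    final int mgmtPort;

    private BootstrapPorts(boolean tls, int kvPort, int mgmtPort) {
        this.tls = tls;
        this.kvPort = kvPort;
        this.mgmtPort = mgmtPort;
    }

    /** Derives the bootstrap ports from the scheme of a connection string. */
    static BootstrapPorts forConnectionString(String connectionString) {
        int schemeEnd = connectionString.indexOf("://");
        String scheme = schemeEnd < 0 ? "" : connectionString.substring(0, schemeEnd);
        switch (scheme) {
            case "couchbase":
                return new BootstrapPorts(false, 11210, 8091);   // plain-text kv and management
            case "couchbases":
                return new BootstrapPorts(true, 11207, 18091);   // TLS kv and management
            default:
                throw new IllegalArgumentException("Unsupported scheme in: " + connectionString);
        }
    }
}
```

With only the non-SSL ports open in the security group, `BootstrapPorts.forConnectionString("couchbases://HOST/BUCKET")` points at ports that are not exposed, which is consistent with the i/o timeouts in the second run.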

What is the “simple select query”? How long does it take to execute in the web console?

  • Mike

Mike, thanks for your response. I also have the same issue in other Lambda functions, with key-value operations and upserts, not only with queries.

Query which I’m using looks like:

DELETE FROM `BUCKET` AS s WHERE s.id = 'ID' AND s.`type` = 'TYPE'

In web console it takes 2.1ms.

Previously you said that

couchbase will try the management service if it fails to get the config from the kv service. And it will also retry

What are the possible reasons for this behavior?

If the SDK has successfully retrieved the config once, either from the kv service or from the management (http) service, then failing to retrieve it again will not cause any delays in using the query service, unless all the query nodes in the config it holds have become inoperable. Is that a possibility? If it’s not, then this isn’t the issue. Getting the metrics or Threshold reporting would help determine whether getting the configuration might be an issue.

What are the possible reasons for this behavior?

It could be anything from the kv port not being accessible to the client to the kv node having gone up in flames. But first it should be determined that this is the cause of the queries being slow. The query metrics and/or the Threshold logging will help.
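Alongside the query metrics and Threshold logging, it can help to measure the wall-clock time of each call on the client and compare it with the 2.1ms seen in the web console. Below is a minimal, hand-rolled sketch that uses only the standard library. It is a stand-in for illustration and not the SDK's own Threshold logging; `SlowCallTimer` and `Timed` are made-up names.

```java
import java.time.Duration;
import java.util.function.Supplier;

// Hypothetical client-side timer; not a Couchbase SDK class.
final class SlowCallTimer {
    /** Result of one timed call: the returned value, how long it took, and a slow flag. */
    static final class Timed<T> {
        final T value;
        final Duration elapsed;
        final boolean overThreshold;

        Timed(T value, Duration elapsed, boolean overThreshold) {
            this.value = value;
            this.elapsed = elapsed;
            this.overThreshold = overThreshold;
        }
    }

    /** Runs the call once and flags it if the wall-clock time exceeds the threshold. */
    static <T> Timed<T> time(Supplier<T> call, Duration threshold) {
        long start = System.nanoTime();
        T value = call.get();
        Duration elapsed = Duration.ofNanos(System.nanoTime() - start);
        return new Timed<>(value, elapsed, elapsed.compareTo(threshold) > 0);
    }
}
```

In the Lambda this could wrap the existing call, for example `SlowCallTimer.time(() -> cluster.query(queryAsString), Duration.ofSeconds(1))`, logging `elapsed` whenever `overThreshold` is true. A large gap between this number and the server-side execution time would point at connectivity or retries rather than at the query itself.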
