Unable to connect to server since client 2.7.3

Hi,
situation:

  • 3-node cluster with server 2.1.1
  • client libcouchbase 2.7.3 (and 2.7.4)

simplified the tests to using “cbc cat”, on the master node. The apps that use the libraries behave the same way.

#case 1: FAIL
When calling 2.7.3 (2.7.4 gives strictly comparable output):

# cbc-cat MY_KEY -U "couchbase://couch2.mydomain.com/session?dnssrv=off&detailed_errcodes=on&bootstrap_on=http" -v
0ms [I0] {23898} [INFO] (instance - L:402) Version=2.7.3, Changeset=095afbb1a83bfef8def6d50bf50ee494b6c3c67f
0ms [I0] {23898} [INFO] (instance - L:403) Effective connection string: couchbase://couch2.mydomain.com/session?dnssrv=off&detailed_errcodes=on&bootstrap_on=http&console_log_level=2&. Bucket=session
0ms [I0] {23898} [INFO] (connection - L:458) <couch2.mydomain.com:8091> (SOCK=0x17711c0) Starting. Timeout=2000000us
3ms [I0] {23898} [INFO] (connection - L:130) <couch2.mydomain.com:8091> (SOCK=0x17711c0) Connected established
4ms [I0] {23898} [WARN] (htconfig - L:128) <couch2.mydomain.com:8091> Got 404 on config stream. Assuming terse URI not supported on cluster
36ms [I0] {23898} [INFO] (confmon - L:153) Setting new configuration. Received via HTTP
36ms [I0] {23898} [INFO] (connection - L:458) <couch2.mydomain.com:11210> (SOCK=0x178aef0) Starting. Timeout=2500000us
37ms [I0] {23898} [INFO] (connection - L:130) <couch2.mydomain.com:11210> (SOCK=0x178aef0) Connected established
38ms [I0] {23898} [ERROR] (negotiation - L:131) <couch2.mydomain.com:11210> (SASLREQ=0x178b000) Error: 0x2d, IO Error
38ms [I0] {23898} [ERROR] (server - L:467) <NOHOST:NOPORT> (SRV=0x17738f0,IX=0) Connection attempt failed. Received LCB_ESOCKSHUTDOWN (0x2D) from libcouchbase, received 0 from operating system
38ms [I0] {23898} [INFO] (bootstrap - L:164) Not requesting a config refresh because of throttling parameters. Next refresh possible in 9998ms or 99 errors. See LCB_CNTL_CONFDELAY_THRESH and LCB_CNTL_CONFERRTHRESH to modify the throttling settings
53ms [I0] {23898} [INFO] (connection - L:458) <couch2.mydomain.com:11210> (SOCK=0x178aef0) Starting. Timeout=2500000us
54ms [I0] {23898} [INFO] (connection - L:130) <couch2.mydomain.com:11210> (SOCK=0x178aef0) Connected established
54ms [I0] {23898} [ERROR] (negotiation - L:131) <couch2.mydomain.com:11210> (SASLREQ=0x178afb0) Error: 0x2d, IO Error
54ms [I0] {23898} [ERROR] (server - L:467) <NOHOST:NOPORT> (SRV=0x17738f0,IX=0) Connection attempt failed. Received LCB_ESOCKSHUTDOWN (0x2D) from libcouchbase, received 0 from operating system
54ms [I0] {23898} [INFO] (bootstrap - L:164) Not requesting a config refresh because of throttling parameters. Next refresh possible in 9981ms or 98 errors. See LCB_CNTL_CONFDELAY_THRESH and LCB_CNTL_CONFERRTHRESH to modify the throttling settings
84ms [I0] {23898} [INFO] (connection - L:458) <couch2.mydomain.com:11210> (SOCK=0x178aef0) Starting. Timeout=2500000us
85ms [I0] {23898} [INFO] (connection - L:130) <couch2.mydomain.com:11210> (SOCK=0x178aef0) Connected established
86ms [I0] {23898} [ERROR] (negotiation - L:131) <couch2.mydomain.com:11210> (SASLREQ=0x17731a0) Error: 0x2d, IO Error
86ms [I0] {23898} [ERROR] (server - L:467) <NOHOST:NOPORT> (SRV=0x17738f0,IX=0) Connection attempt failed. Received LCB_ESOCKSHUTDOWN (0x2D) from libcouchbase, received 0 from operating system
86ms [I0] {23898} [INFO] (bootstrap - L:164) Not requesting a config refresh because of throttling parameters. Next refresh possible in 9950ms or 97 errors. See LCB_CNTL_CONFDELAY_THRESH and LCB_CNTL_CONFERRTHRESH to modify the throttling settings
131ms [I0] {23898} [INFO] (connection - L:458) <couch2.mydomain.com:11210> (SOCK=0x178aef0) Starting. Timeout=2500000us
...
2373ms [I0] {23898} [INFO] (bootstrap - L:164) Not requesting a config refresh because of throttling parameters. Next refresh possible in 7662ms or 82 errors. See LCB_CNTL_CONFDELAY_THRESH and LCB_CNTL_CONFERRTHRESH to modify the throttling settings
2536ms [I0] {23898} [WARN] (retryq - L:142) Failing command (seq=0) from retry queue with error code 0x2d
MY_KEY The remote host closed the connection (0x2d)

case 2: OK

when using 2.7.1, the connection works (same cluster, same key, removed the non supported options in the connection string):

# cbc-cat MY_KEY -U "couchbase://couch2.mydomain.com/session?bootstrap_on=http" -v
0ms [I0] {1570} [INFO] (instance - L:401) Version=2.7.1, Changeset=63c54b1b63db427ab3cf9a131c08ad591044b798
0ms [I0] {1570} [INFO] (instance - L:402) Effective connection string: couchbase://couch2.mydomain.com/session?bootstrap_on=http&console_log_level=2&. Bucket=session
4ms [I0] {1570} [INFO] (instance - L:135) DNS SRV lookup failed: DNS/Hostname lookup failed
4ms [I0] {1570} [INFO] (connection - L:450) <couch2.mydomain.com:8091> (SOCK=0x79bbb0) Starting. Timeout=2000000us
6ms [I0] {1570} [INFO] (connection - L:116) <couch2.mydomain.com:8091> (SOCK=0x79bbb0) Connected
184ms [I0] {1570} [WARN] (htconfig - L:128) <couch2.mydomain.com:8091> Got 404 on config stream. Assuming terse URI not supported on cluster
216ms [I0] {1570} [INFO] (confmon - L:153) Setting new configuration. Received via HTTP
216ms [I0] {1570} [INFO] (connection - L:450) <couch2.mydomain.com:11210> (SOCK=0x7af3c0) Starting. Timeout=2500000us
217ms [I0] {1570} [INFO] (connection - L:116) <couch2.mydomain.com:11210> (SOCK=0x7af3c0) Connected
MY_KEY CAS=0x17e673a1ea37234, Flags=0x3000002. Size=3
bla

#Case 3: OK
When using 2.7.3 against a single-server instance (also 2.1.1), that uses IP address configured on the server instance and not a full domain name, the connection works:

#cbc-cat MY_KEY -U "couchbase://127.0.0.1/default?dnssrv=off&detailed_errcodes=on&bootstrap_on=http" -v
0ms [I0] {10566} [INFO] (instance - L:402) Version=2.7.3, Changeset=095afbb1a83bfef8def6d50bf50ee494b6c3c67f
0ms [I0] {10566} [INFO] (instance - L:403) Effective connection string: couchbase://127.0.0.1/default?dnssrv=off&detailed_errcodes=on&bootstrap_on=http&console_log_level=2&. Bucket=default
0ms [I0] {10566} [INFO] (connection - L:458) <127.0.0.1:8091> (SOCK=0x1919120) Starting. Timeout=2000000us
0ms [I0] {10566} [INFO] (connection - L:130) <127.0.0.1:8091> (SOCK=0x1919120) Connected established
1ms [I0] {10566} [WARN] (htconfig - L:128) <127.0.0.1:8091> Got 404 on config stream. Assuming terse URI not supported on cluster
42ms [I0] {10566} [INFO] (confmon - L:153) Setting new configuration. Received via HTTP
42ms [I0] {10566} [INFO] (connection - L:458) <127.0.0.1:11210> (SOCK=0x1930930) Starting. Timeout=2500000us
42ms [I0] {10566} [INFO] (connection - L:130) <127.0.0.1:11210> (SOCK=0x1930930) Connected established
MY_KEY CAS=0x2a29017b986, Flags=0x0. Size=3
bla

I MUST use “dnssrv=off” here, or the bootstrap fails.

remarks:

For case 1:

  • “dnssrv=off” makes no difference

  • using an IP address in the connection string makes no difference. Once bootstrapped, a full DNS name is used.

  • removing the options to be like case 2 lowers the amount of info printed, but does not make the connection work.

  • forcing dnssrv (couchbase+dnssrv://…, and removing dsnsrv=off) does not work, as we get the message “Not scheduling HTTP provider since no nodes have been configured for HTTP bootstrap”, which is strange, as without that, the bootstrap works on http.

  • and sorry, I really prefer to avoid moving up from 2.1.1 right now.

#preliminary conclusion:
The cause is maybe in the DNS lookups.

Interesting. Can you increase the log level even more (put an extra 3-4 v's there)?

How we do DNS lookups hasn’t really changed other than dnssrv=off. BTW - is the master node (in case 1) also really localhost?

The client is sending the server something it doesn’t like, and the server is disconnecting us. I’m guessing this is due to sending HELLO before authentication. In your case, I can try to disable the entire HELLO subsystem because it’s not supported on 2.1 anyway. You can tweak src/mcserver/negotiate.cc a bit and try to do printf debugging to see where it fails.

Specifically, I believe in 2.7.3, we started sending a bunch of commands before auth. These commands should work just fine, but I’m guessing that maybe in this case the server is not prepared to handle any input unless the user is first authorized. This doesn’t explain why it does work when you pass in the IP rather than the hostname. You should be connected to the same host, after all.

You probably know this already, but the server version you’re using is no longer supported. That being said, I’d like to know why the latest changes break connectivity here.

Quick reply before firing up a dedicated lab machine for these tests versions:
Case 1 and 2 were on a remote client and a full cluster. Case 3 was on a separate, single server install locally, and without any DNS.

Tomorrow I can try vvvv or tcpdump, the latter being more authoritative probably.

Hans

Just want to emphasise this a bit more - 2.1.1 went End of Life over 2 years ago in December 2014 (Support Policy | Couchbase). As such, there’s absolutely no guarantee that newer libcouchbase versions will work with it (when 2.1.1 was released - July 2013 - the latest libcouchbase was 2.0.7 (!).

Indeed, have become painfully aware. Until now that cluster was rock solid. Installed, and only needed to take it into maintenance once in 4 years to upgrade the disks. Will look at upgrading, but that will take some time as migrating from 2.1 directly to 4.x is not possible. If I recall well, I have to go through 2 intermediate versions, plus forced OS upgrades, and we might need another disk upgrade, as 3.x+ has shown to be having an approximately 3 times higher I/O load. And it all has to go through QA, and it all must be done without downtime, under load/live, and without data loss. In the meantime, I will try to debug this situation, will give feedback, and stick to 2.7.0 as client.

Results of tests follow below.

TLDR;
The suspicion that the HELLO is making it break is correct.

Here is the output of one of the connect fails of cbc-cat on console log level 9, and using 2.7.4.

37ms [I0] {30457} [DEBUG] (bootstrap - L:56) Instance configured!
38ms [I0] {30457} [DEBUG] (lcbio_mgr - L:413) <couch2.mydomain.com:11210> (HE=0xe1a540) Creating new connection because none are available in the pool
38ms [I0] {30457} [TRACE] (lcbio_mgr - L:322) <couch2.mydomain.com:11210> (HE=0xe1a540) New pool entry: I=0xe03050
38ms [I0] {30457} [INFO] (connection - L:458) <couch2.mydomain.com:11210> (SOCK=0xe031f0) Starting. Timeout=2500000us
39ms [I0] {30457} [DEBUG] (connection - L:223) <couch2.mydomain.com:11210> (SOCK=0xe031f0) Created new socket with FD=4
39ms [I0] {30457} [TRACE] (connection - L:325) <couch2.mydomain.com:11210> (SOCK=0xe031f0) Scheduling I/O watcher for asynchronous connection completion.
39ms [I0] {30457} [INFO] (connection - L:130) <couch2.mydomain.com:11210> (SOCK=0xe031f0) Connected established
39ms [I0] {30457} [DEBUG] (connection - L:137) <couch2.mydomain.com:11210> (SOCK=0xe031f0) Successfuly set TCP_NODELAY
39ms [I0] {30457} [DEBUG] (lcbio_mgr - L:281) <couch2.mydomain.com:11210> (HE=0xe1a540) Received result for I=0xe03050,C=(nil); E=0x0
39ms [I0] {30457} [DEBUG] (lcbio_mgr - L:241) <couch2.mydomain.com:11210> (HE=0xe1a540) Assigning R=0xe71e40 SOCKET=0xe031f0
39ms [I0] {30457} [TRACE] (server - L:595) <couch2.mydomain.com:11210> (SRV=0xe03830) Session not yet negotiated. Negotiating
39ms [I0] {30457} [DEBUG] (ioctx - L:101) <couch2.mydomain.com:11210> (CTX=0xe1acd0,unknown) Pairing with SOCK=0xe031f0
40ms [I0] {30457} [ERROR] (negotiation - L:131) <couch2.mydomain.com:11210> (SASLREQ=0xe1ab20) Error: 0x2d, IO Error
40ms [I0] {30457} [ERROR] (server - L:583) <NOHOST:NOPORT> (SRV=0xe03830,IX=0) Connection attempt failed. Received LCB_ESOCKSHUTDOWN (0x2D) from libcouchbase, received 0 from operating system
40ms [I0] {30457} [DEBUG] (retryq - L:312) Adding PKT=0xe03160 to retry queue. Try count=1
40ms [I0] {30457} [TRACE] (retryq - L:179) Next tick in 14 ms
40ms [I0] {30457} [INFO] (bootstrap - L:164) Not requesting a config refresh because of throttling parameters. Next refresh possible in 9997ms or 99 errors. See LCB_CNTL_CONFDELAY_THRESH and LCB_CNTL_CONFERRTHRESH to modify the throttling settings
40ms [I0] {30457} [DEBUG] (ioctx - L:151) <couch2.mydomain.com:11210> (CTX=0xe1acd0,sasl) Destroying context. Pending Writes=0, Entered=false, Socket Refcount=1

Here is the dump from the network traffic.

After the hello, the server replies with a simple ACK (as it should), and then a FIN.

Conclusion: The server decides to close the connection upon receiving the hello.

When I adapt the following code:

diff --git a/src/mcserver/negotiate.cc b/src/mcserver/negotiate.cc
index 2c986b1..3e5647a 100644
--- a/src/mcserver/negotiate.cc
+++ b/src/mcserver/negotiate.cc
@@ -627,7 +627,9 @@ SessionRequestImpl::start(lcbio_SOCKET *sock) {
         return;
     }
 
-    send_hello();
+    lcb::MemcachedRequest hdr(PROTOCOL_BINARY_CMD_SASL_LIST_MECHS);
+    lcbio_ctx_put(ctx, hdr.data(), hdr.size());
+    //send_hello();
     LCBIO_CTX_RSCHEDULE(ctx, 24);
 }

The connection and the cbc-cat work OK.
(I know that this is not a definitive solution, as the hello has been removed completely now, and might be needed for other situations)

If you want to make this work on older servers (and I have a strong suspicion I am not the only one who is using older versions), I suggest testing before sending hello. I can help refactoring and testing this if you want.

I think I’m going to add a server_version connection string parameter which would be shorthand for a bunch of optimizations or fixes suited for a specific server release.

For example, in this case I can simply provide a hello=false option in the connection string (2.x doesn’t support it anyway!).

Great!
Wouldn’t it be possible to pick up from the bootstrap?
And how would this work in mixed environments? Would any of the optimized-out options for one version be mandatory for another version?
I have seen a cluster made up of 2.x and 3.x and 4.x servers, all together (not that that would be recommended or stable I imagine, but it could happen while doing a live migration)

I would really avoid heuristics if possible. The fewer implicit assumptions a piece of software makes, the more stable it is, IMO.

There is also a CouchbaseMock java-based test server, that in some respects behaves like the upcoming 5.0 release, and in others still uses a 1.x API. Yes, I could change the mock to only expose the ‘newer’ APIs, but I believe a piece of client software should be versatile and flexible as possible.

There is a bucketCapabilities array in the bootstrap response, however that tells us what each bucket supports, not what each node supports.

EDIT: I see what you mean though… I essentially ‘broke’ compat when I made HELLO take place before authentication, so trying to restore this compat may take higher precedence, even if it’s done heuristically. I will search for a better solution.

EDIT2: I guess the question is whether we should support an outdated environment or a non-standard environment. The former is where HELLO causes a disconnect, the latter is where our heuristics tell us that HELLO is not supported, but it in fact is.