Hello all,
We are experiencing intermittent Couchbase time-outs in our Node.js application. Our Node.js Couchbase module version is 2.0.5, and the server runs Couchbase 2.2.0. We have two Node.js servers connecting to a cluster of two Couchbase servers. Time-outs occur in both servers, under minimal or even no load at all.
Our API connects to Couchbase like this:
var cluster = new couchbase.Cluster('couchbase://dbserver1.local,dbserver2.local');
var bucket = cluster.openBucket('some-bucket', function (err) {
// Say something if there's an error, otherwise we are cool
});
I managed to get some logs by setting LCB_LOGLEVEL=5 and LCB_CNTL_DETAILED_ERRCODES=1.
This is what happens during a timeout:
245730974ms [I0] {22131} [DEBUG] (server - L:461) <10.99.24.24:11210> (SRV=0x23e3e70,IX=0) Scheduling next timeout for 0 ms
245730974ms [I0] {22131} [DEBUG] (server - L:461) <10.99.24.24:11210> (SRV=0x23e3e70,IX=0) Scheduling next timeout for 0 ms
245730974ms [I0] {22131} [DEBUG] (server - L:461) <10.99.24.24:11210> (SRV=0x23e3e70,IX=0) Scheduling next timeout for 0 ms
### These are the lines logged rigth when the error occurs
245730974ms [I0] {22131} [WARN] (server - L:374) <10.99.24.24:11210> (SRV=0x23e3e70,IX=0) Failing command (pkt=0x22ad1b0, opaque=15834, opcode=0x0) with error 0x17
245730976ms [I0] {22131} [TRACE] (confmon - L:292) Start refresh requested
245730977ms [I0] {22131} [ERROR] (server - L:457) <10.99.24.24:11210> (SRV=0x23e3e70,IX=0) Server timed out. Some commands have failed
245730977ms [I0] {22131} [DEBUG] (server - L:461) <10.99.24.24:11210> (SRV=0x23e3e70,IX=0) Scheduling next timeout for 2500 ms
245730977ms [I0] {22131} [INFO] (confmon - L:166) Not applying configuration received via HTTP. No changes detected. A.rev=-1, B.rev=-1
245730977ms [I0] {22131} [TRACE] (confmon - L:271) Current provider is HTTP
245731502ms [I1] {22133} [DEBUG] (server - L:461) <10.99.24.25:11210> (SRV=0x1e688a0,IX=0) Scheduling next timeout for 2500 ms
245731502ms [I0] {22133} [DEBUG] (server - L:461) <10.99.24.24:11210> (SRV=0x1cc17b0,IX=1) Scheduling next timeout for 2500 ms
245731502ms [I0] {22133} [DEBUG] (server - L:461) <10.99.24.25:11210> (SRV=0x1cb5f90,IX=0) Scheduling next timeout for 2500 ms
245731502ms [I1] {22133} [DEBUG] (server - L:461) <10.99.24.24:11210> (SRV=0x1cb6ff0,IX=1) Scheduling next timeout for 2500 ms
245731659ms [I0] {22131} [DEBUG] (server - L:461) <10.99.24.25:11210> (SRV=0x23fab70,IX=1) Scheduling next timeout for 2500 ms
245731659ms [I1] {22131} [DEBUG] (server - L:461) <10.99.24.25:11210> (SRV=0x2402d80,IX=1) Scheduling next timeout for 2500 ms
245731659ms [I1] {22131} [DEBUG] (server - L:461) <10.99.24.24:11210> (SRV=0x23fda10,IX=0) Scheduling next timeout for 2500 ms
245732222ms [I1] {22137} [DEBUG] (server - L:461) <10.99.24.24:11210> (SRV=0x1a9aee0,IX=0) Scheduling next timeout for 2500 ms
245732882ms [I0] {22135} [DEBUG] (server - L:461) <10.99.24.24:11210> (SRV=0x1de7d70,IX=0) Scheduling next timeout for 2500 ms
245732882ms [I1] {22135} [DEBUG] (server - L:461) <10.99.24.24:11210> (SRV=0x1ea5e60,IX=0) Scheduling next timeout for 2500 ms
245732882ms [I1] {22135} [DEBUG] (server - L:461) <10.99.24.25:11210> (SRV=0x1f0f920,IX=1) Scheduling next timeout for 2500 ms
245732882ms [I0] {22135} [DEBUG] (server - L:461) <10.99.24.25:11210> (SRV=0x1ea4fd0,IX=1) Scheduling next timeout for 2500 ms
245732887ms [I1] {22137} [DEBUG] (server - L:461) <10.99.24.25:11210> (SRV=0x1aa01a0,IX=1) Scheduling next timeout for 2500 ms
### And these happen slightly later after the time-out error
245732976ms [I0] {22131} [ERROR] (htconfig - L:375) <10.99.24.24:8091> HTTP Provider timed out waiting for I/O
245732976ms [I0] {22131} [DEBUG] (ioctx - L:151) <10.99.24.24:8091> (CTX=0x109d9750,bc_http) Destroying. PND=1,ENT=0,SORC=1
245732976ms [I0] {22131} [INFO] (connection - L:441) <10.99.24.25:8091> (SOCK=0x10c20c90) Starting. Timeout=2000000us
245732978ms [I0] {22131} [TRACE] (connection - L:335) <10.99.24.25:8091> (SOCK=0x10c20c90) Received completion handler. Status=0. errno=0
245732978ms [I0] {22131} [INFO] (connection - L:116) <10.99.24.25:8091> (SOCK=0x10c20c90) Connected
245732978ms [I0] {22131} [DEBUG] (htconfig - L:346) Successfuly connected to REST API 10.99.24.25:8091
245732978ms [I0] {22131} [DEBUG] (ioctx - L:101) <10.99.24.25:8091> (CTX=0x10c21d30,unknown) Pairing with SOCK=0x10c20c90
245732980ms [I0] {22131} [TRACE] (htconfig - L:243) <10.99.24.25:8091> Received 198 bytes on HTTP stream
245732980ms [I0] {22131} [WARN] (htconfig - L:155) <10.99.24.25:8091> Got 404 on config stream. Assuming terse URI not supported on cluster
245732981ms [I0] {22131} [TRACE] (htconfig - L:243) <10.99.24.25:8091> Received 10 bytes on HTTP stream
245732982ms [I0] {22131} [TRACE] (htconfig - L:243) <10.99.24.25:8091> Received 224 bytes on HTTP stream
245732990ms [I0] {22131} [TRACE] (htconfig - L:243) <10.99.24.25:8091> Received 5792 bytes on HTTP stream
245732990ms [I0] {22131} [TRACE] (htconfig - L:243) <10.99.24.25:8091> Received 1448 bytes on HTTP stream
245732991ms [I0] {22131} [TRACE] (htconfig - L:243) <10.99.24.25:8091> Received 890 bytes on HTTP stream
245732991ms [I0] {22131} [DEBUG] (htconfig - L:260) <10.99.24.25:8091> Generation 2 -> 3
245732991ms [I0] {22131} [INFO] (confmon - L:174) Setting new configuration. Received via HTTP
245732991ms [I0] {22131} [INFO] (bootstrap - L:152) Got async step callback..
245732992ms [I0] {22131} [DEBUG] (bootstrap - L:55) Instance configured!
245732992ms [I0] {22131} [INFO] (newconfig - L:107) Config Diff: [ vBuckets Modified=1024 ], [Sequence Changed=1]
245732992ms [I0] {22131} [INFO] (newconfig - L:206) Reusing server 10.99.24.24:11210 (0x23e3e70). OldIndex=0. NewIndex=1
245732992ms [I0] {22131} [INFO] (newconfig - L:206) Reusing server 10.99.24.25:11210 (0x23fab70). OldIndex=1. NewIndex=0
And this is what occurs when things, suddenly, start working again:
822938ms [I1] {6526} [DEBUG] (server - L:461) <10.99.24.25:11210> (SRV=0x2bbe420,IX=1) Scheduling next timeout for 2500 ms
822938ms [I1] {6526} [DEBUG] (server - L:461) <10.99.24.24:11210> (SRV=0x2bbf010,IX=0) Scheduling next timeout for 2500 ms
781071ms [I1] {6583} [DEBUG] (server - L:461) <10.99.24.24:11210> (SRV=0x16816b0,IX=0) Scheduling next timeout for 2500 ms
781255ms [I1] {6583} [DEBUG] (server - L:461) <10.99.24.25:11210> (SRV=0x1686170,IX=1) Scheduling next timeout for 2500 ms
247330599ms [I1] {22137} [DEBUG] (server - L:461) <10.99.24.24:11210> (SRV=0x1a9aee0,IX=0) Scheduling next timeout for 2500 ms
247330854ms [I1] {22137} [DEBUG] (server - L:461) <10.99.24.25:11210> (SRV=0x1aa01a0,IX=1) Scheduling next timeout for 2500 ms
247331205ms [I0] {22137} [DEBUG] (server - L:461) <10.99.24.24:11210> (SRV=0x1aae750,IX=1) Scheduling next timeout for 2500 ms
247331647ms [I0] {22137} [DEBUG] (server - L:461) <10.99.24.25:11210> (SRV=0x1a97ee0,IX=0) Scheduling next timeout for 2500 ms
247331784ms [I0] {22131} [DEBUG] (server - L:461) <10.99.24.24:11210> (SRV=0x23e3e70,IX=0) Scheduling next timeout for 2500 ms
825255ms [I0] {6526} [DEBUG] (lcbio_mgr - L:416) <10.99.24.24:11210> (HE=0x33e6440) Creating new connection because none are available in the pool
825255ms [I0] {6526} [DEBUG] (lcbio_mgr - L:321) <10.99.24.24:11210> (HE=0x33e6440) Starting connection on I=0x2f6b000
825255ms [I0] {6526} [INFO] (connection - L:441) <10.99.24.24:11210> (SOCK=0x2f6b160) Starting. Timeout=2500000us
825256ms [I0] {6526} [TRACE] (connection - L:335) <10.99.24.24:11210> (SOCK=0x2f6b160) Received completion handler. Status=0. errno=0
825256ms [I0] {6526} [INFO] (connection - L:116) <10.99.24.24:11210> (SOCK=0x2f6b160) Connected
825256ms [I0] {6526} [DEBUG] (lcbio_mgr - L:271) <10.99.24.24:11210> (HE=0x33e6440) Received result for I=0x2f6b000,C=0x2f6b160; E=0x0
825256ms [I0] {6526} [DEBUG] (lcbio_mgr - L:223) <10.99.24.24:11210> (HE=0x33e6440) Assigning R=0x342c5b0 SOCKET=0x2f6b160
825256ms [I0] {6526} [TRACE] (server - L:487) <10.99.24.24:11210> (SRV=0x2bbeab0) Session not yet negotiated. Negotiating
825256ms [I0] {6526} [DEBUG] (ioctx - L:101) <10.99.24.24:11210> (CTX=0x2f3f060,unknown) Pairing with SOCK=0x2f6b160
825259ms [I0] {6526} [DEBUG] (ioctx - L:151) <10.99.24.24:11210> (CTX=0x2f3f060,sasl) Destroying. PND=0,ENT=1,SORC=1
825259ms [I0] {6526} [DEBUG] (ioctx - L:101) <10.99.24.24:11210> (CTX=0x2f18110,unknown) Pairing with SOCK=0x2f6b160
825259ms [I0] {6526} [DEBUG] (server - L:506) <10.99.24.24:11210> (SRV=0x2bbeab0,IX=0) Setting initial timeout=2495ms
825266ms [I0] {6526} [DEBUG] (lcbio_mgr - L:416) <10.99.24.25:11210> (HE=0x3095700) Creating new connection because none are available in the pool
825266ms [I0] {6526} [DEBUG] (lcbio_mgr - L:321) <10.99.24.25:11210> (HE=0x3095700) Starting connection on I=0x3095c30
825266ms [I0] {6526} [INFO] (connection - L:441) <10.99.24.25:11210> (SOCK=0x2fb8960) Starting. Timeout=2500000us
825268ms [I0] {6526} [TRACE] (connection - L:335) <10.99.24.25:11210> (SOCK=0x2fb8960) Received completion handler. Status=0. errno=0
825268ms [I0] {6526} [INFO] (connection - L:116) <10.99.24.25:11210> (SOCK=0x2fb8960) Connected
825268ms [I0] {6526} [DEBUG] (lcbio_mgr - L:271) <10.99.24.25:11210> (HE=0x3095700) Received result for I=0x3095c30,C=0x2fb8960; E=0x0
825268ms [I0] {6526} [DEBUG] (lcbio_mgr - L:223) <10.99.24.25:11210> (HE=0x3095700) Assigning R=0x3350de0 SOCKET=0x2fb8960
825268ms [I0] {6526} [TRACE] (server - L:487) <10.99.24.25:11210> (SRV=0x2d19310) Session not yet negotiated. Negotiating
825268ms [I0] {6526} [DEBUG] (ioctx - L:101) <10.99.24.25:11210> (CTX=0x2f51b20,unknown) Pairing with SOCK=0x2fb8960
825272ms [I0] {6526} [DEBUG] (ioctx - L:151) <10.99.24.25:11210> (CTX=0x2f51b20,sasl) Destroying. PND=0,ENT=1,SORC=1
825272ms [I0] {6526} [DEBUG] (ioctx - L:101) <10.99.24.25:11210> (CTX=0x2f184b0,unknown) Pairing with SOCK=0x2fb8960
825272ms [I0] {6526} [DEBUG] (server - L:506) <10.99.24.25:11210> (SRV=0x2d19310,IX=1) Setting initial timeout=2494ms
Following advice from other posts, we switched to the Couchbase protocol (from http://server:8091 to couchbase://server), but we also noticed that it fails and falls back to HTTP. From the logs:
12ms [I0] {8937} [INFO] (connection - L:116) <jmbadb06.jmba.local:11210> (SOCK=0x18b0ad0) Connected
12ms [I0] {8937} [DEBUG] (lcbio_mgr - L:271) <jmbadb06.jmba.local:11210> (HE=0x187ced0) Received result for I=0x1878a40,C=0x18b0ad0; E=0x0
12ms [I0] {8937} [DEBUG] (lcbio_mgr - L:223) <jmbadb06.jmba.local:11210> (HE=0x187ced0) Assigning R=0x187c520 SOCKET=0x18b0ad0
12ms [I0] {8937} [DEBUG] (ioctx - L:101) <jmbadb06.jmba.local:11210> (CTX=0x187cd30,unknown) Pairing with SOCK=0x18b0ad0
15ms [I0] {8937} [DEBUG] (ioctx - L:151) <jmbadb06.jmba.local:11210> (CTX=0x187cd30,sasl) Destroying. PND=0,ENT=1,SORC=1
15ms [I0] {8937} [DEBUG] (ioctx - L:101) <jmbadb06.jmba.local:11210> (CTX=0x186f250,unknown) Pairing with SOCK=0x18b0ad0
16ms [I0] {8937} [WARN] (cccp - L:392) <jmbadb06.jmba.local:11210> CCCP Packet responded with 0x81; nkey=0, nbytes=0, cmd=0xb5, seq=0xf00d
16ms [I0] {8937} [DEBUG] (ioctx - L:151) <jmbadb06.jmba.local:11210> (CTX=0x186f250,bc_cccp) Destroying. PND=0,ENT=1,SORC=1
16ms [I0] {8937} [INFO] (lcbio_mgr - L:491) <jmbadb06.jmba.local:11210> (HE=0x187ced0) Placing socket back into the pool. I=0x1878a40,C=0x18b0ad0
16ms [I0] {8937} [INFO] (confmon - L:202) Provider 'CCCP' failed
16ms [I0] {8937} [DEBUG] (confmon - L:236) Will try next provider in 0us
16ms [I0] {8937} [TRACE] (confmon - L:271) Current provider is HTTP
Any idea or suggestion will be more than welcome. Thanks!