Couchbase intermittent connections

Can anyone give me some insight into why Couchbase Server has intermittent connection problems on local machines?

Ports 8091 and 18091

If your browser cannot connect to those ports, that would indicate nothing is listening on them (i.e. the Couchbase management server is not running). If you are confident that it should be running but you can’t connect to it, check the server logs.

All other services work except for the service on ports 8091 and 18091.

Right. It could be just the management server. You could also try accessing it with curl to check that the browser hasn’t cached a bad response.

I’ve cleared the cache and also tried several different browsers.
What would be the command for curl?

curl http://hostname:8091
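For anyone who would rather script this than run curl by hand, here is a minimal Python sketch of the same check using only the standard library. It sends GET / and, unlike a browser, does not follow redirects. A running management server should therefore come back as a 301 whose Location points at /ui/index.html, matching the curl output later in the thread. The function name management_status and its defaults are illustrative only, not part of any Couchbase tooling.

```python
import http.client
from typing import Optional, Tuple


def management_status(host: str = "localhost", port: int = 8091,
                      timeout: float = 5.0) -> Tuple[int, Optional[str]]:
    """Send GET / and return (status code, Location header).

    http.client never follows redirects, so a running management server
    should show up as a 301 rather than the final UI page.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", "/")
        resp = conn.getresponse()
        resp.read()  # drain the body before closing the connection
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


# Example:
#   try:
#       print(management_status())
#   except (OSError, http.client.HTTPException) as exc:
#       print("no HTTP response:", exc)  # refused, timed out, reset, ...
```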

I’m also facing the same issue. Couchbase works fine most of the time, but both ports 8091 and 18091 go down suddenly and randomly. All other ports, like 8092, remain accessible. I’ve tested this using curl, netstat, etc.

The server doesn’t go down, but the management server gets blocked. Is there any alternative to using the management server? Maybe a desktop client instead of a browser?

This is a local installation of couchbase-server-enterprise_7.2.0-windows_amd64 on Windows 11 in a corporate network. I’m trying to access it from the same machine. Everyone on my team is facing the same issue.

Does it hang/time out? Or does it give an error (“Connection refused” or similar)?
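One way to answer this without guessing is a small Python sketch (standard library only). It attempts a plain TCP connect and reports whether the attempt succeeded, was refused, or timed out, along with how long it took. The helper name probe and the 5-second default timeout are assumptions for illustration. A refusal usually means nothing accepted the connection. A timeout suggests something on the path, such as a local firewall, is silently dropping it. On Windows, a refused connect can take around two seconds because the TCP stack retries before reporting failure. That may explain the ~2.2 s figures in the curl output further down, though treat it as a likely explanation rather than a confirmed one.

```python
import socket
import time
from typing import Tuple


def probe(host: str, port: int, timeout: float = 5.0) -> Tuple[str, float]:
    """Attempt one TCP connect and classify the result.

    Returns (outcome, elapsed_ms), where outcome is "open", "refused",
    "timeout" or "error".
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            outcome = "open"      # something accepted the connection
    except ConnectionRefusedError:
        outcome = "refused"       # actively rejected (usually nothing listening)
    except socket.timeout:
        outcome = "timeout"       # no answer at all within `timeout` seconds
    except OSError:
        outcome = "error"         # e.g. name resolution or other network errors
    elapsed_ms = (time.monotonic() - start) * 1000.0
    return outcome, elapsed_ms


# Example:
#   for port in (8091, 18091):
#       outcome, ms = probe("localhost", port)
#       print(f"localhost:{port}: {outcome} after {ms:.0f} ms")
```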

The server logs (on my machine, anyway) are under “Library/Application Support/Couchbase/var/lib/couchbase/logs”. If the management process is exiting and restarting there might be information in babysitter.log.

Apologies for the delay in response.

These are the curl responses, run at various times after the admin console became unavailable:

c:>curl http://localhost:8091
<-!DOCTYPE HTML PUBLIC “-//IETF//DTD HTML 2.0//EN”><-html><-title>301 Moved Permanently<-h1>Moved Permanently<-p>The document has moved <-a href="http://localhost:8091/ui/index.html>here.

c:>curl http://localhost:8091/
<-!DOCTYPE HTML PUBLIC “-//IETF//DTD HTML 2.0//EN”><-html><-head><-title>301 Moved Permanently<-body><-h1>Moved Permanently<-p>The document has moved <-a href="http://localhost:8091/ui/index.html>here.

c:>curl https://localhost:18091/
curl: (7) Failed to connect to localhost port 18091 after 2260 ms: Couldn’t connect to server

c:>curl https://localhost:18091/
curl: (7) Failed to connect to localhost port 18091 after 2266 ms: Couldn’t connect to server

c:>curl http://localhost:8091/
curl: (7) Failed to connect to localhost port 8091 after 2231 ms: Couldn’t connect to server

[I’ve edited the HTML tags in the responses above so that they appear as plain text.]

I also tried accessing http://localhost:8091/ui/index.html directly instead of relying on http://localhost:8091 to redirect me there. It made no difference.

When it stops working, the browser shows either ERR_CONNECTION_REFUSED or, at best, the login page, which keeps reloading itself whenever I try to get past it by entering my credentials.

I tried to make sense of babysitter.log and the other logs, but the content is overwhelming for a user to process.

These results are from a Windows 11 machine with couchbase-server-community_7.1.1-windows_amd64.msi.

Earlier I tried couchbase-server-enterprise_7.2.0-windows_amd64.msi and couchbase-server-enterprise_7.2.3-windows_amd64.msi, and both behaved the same way.

It should give the 301 “Moved” response if the couchbase management server is running.

If the Couchbase management server is not running, it should immediately give “Failed to connect”. I’m not sure why yours takes over two seconds. For comparison, mine fails after a few milliseconds:

curl: (7) Failed to connect to localhost port 8091 after 5 ms: Couldn’t connect to server

Can you zip up your logs directory and try posting it here?

Because my setup is on a corporate device, I’m not comfortable sharing the logs on a public forum.

If someone can suggest troubleshooting steps, I can try them on my own.

I’ve never seen this behavior before, so I’m in the same boat as you when it comes to troubleshooting. Since you (@getzafar) and @Jaccob are the only ones having this issue, maybe you could share and compare your environments.

Given that the logs have timestamps, and the issue is not present at one moment and present at another, anything logged when the change occurs would have a timestamp between the ‘working’ time and ‘not working’ time.
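Here is a rough Python sketch of that search. It assumes each entry starts with a bracketed header carrying an ISO 8601 timestamp, like the [error_logger:error,2024-02-21T14:27:57.735-06:00,...] line quoted later in this thread. It also assumes that lines without such a header (for example the body of a CRASH REPORT) belong to the entry above them. The names entries and between are hypothetical helpers. Both time bounds must include a UTC offset, because Python refuses to compare offset-aware and naive datetimes.

```python
import re
from datetime import datetime
from typing import Iterable, Iterator, List, Optional, Tuple

# Assumed header shape, taken from the babysitter.log excerpt in this thread:
# [error_logger:error,2024-02-21T14:27:57.735-06:00,babysitter_of_ns_1@cb.local:...]
HEADER = re.compile(
    r"^\[[^,\]]*,(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}[+-]\d{2}:\d{2}),"
)

Entry = Tuple[datetime, List[str]]


def entries(lines: Iterable[str]) -> Iterator[Entry]:
    """Group raw log lines into (timestamp, lines) entries.

    Lines without a timestamped header are treated as continuations of the
    previous entry; lines before the first header are skipped.
    """
    current: Optional[Entry] = None
    for line in lines:
        m = HEADER.match(line)
        if m:
            if current is not None:
                yield current
            current = (datetime.fromisoformat(m.group(1)), [line])
        elif current is not None:
            current[1].append(line)
    if current is not None:
        yield current


def between(lines: Iterable[str], start: datetime, end: datetime) -> Iterator[Entry]:
    """Yield only the entries whose header timestamp lies in [start, end]."""
    for ts, body in entries(lines):
        if start <= ts <= end:
            yield ts, body


# Example:
#   start = datetime.fromisoformat("2024-02-21T14:20:00.000-06:00")  # last "working"
#   end = datetime.fromisoformat("2024-02-21T14:30:00.000-06:00")    # first "not working"
#   with open("babysitter.log", encoding="utf-8", errors="replace") as f:
#       for _, body in between(f, start, end):
#           print("".join(body), end="")
```

Running this over babysitter.log, error.log and the other files for the same window keeps the amount of text to read manageable.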

Edit: look for the specific message “Can’t start process error” in babysitter.log

You could use an operating system command (like lsof on Unix) to see which process is listening on port 8091 while it is working. When it stops working, check whether that process is still listening. If it isn’t, check whether the process still exists (ps -lfp 63120, where 63120 is the process ID shown by lsof). If it is gone, it has exited. If it is still there, check again whether it is listening on port 8091 (same as before).

 % lsof -P -n | egrep -i '8091.*LISTEN'
beam.smp  63120 michaelreiche   78u     IPv4 0x1566aaabc1737b21         0t0                 TCP *:8091 (LISTEN)
 % ps -lfp 63120
  UID   PID  PPID        F CPU PRI NI       SZ    RSS WCHAN     S             ADDR TTY           TIME CMD              STIME
  502 63120 63112 40004004   0  31  0 35791744  54356 -      Ss                  0 ??        16:34.06 /Applications/Co Tue09AM
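You can approximate the same lsof/ps idea on Windows. netstat -ano lists listening sockets together with the owning PID, and tasklist /FI "PID eq <pid>" then shows whether that process still exists. The Python sketch below only parses the netstat table. It assumes English-language netstat output, since the LISTENING state column may be localized on some Windows installations. The names listening_ports and run_netstat are hypothetical helpers.

```python
import subprocess
from typing import List, Tuple


def listening_ports(netstat_output: str) -> List[Tuple[str, int, int]]:
    """Return (local address, port, PID) for every TCP row in LISTENING state.

    Expects the column layout of Windows `netstat -ano`:
      Proto  Local Address  Foreign Address  State  PID
    """
    rows = []
    for line in netstat_output.splitlines():
        parts = line.split()
        if len(parts) == 5 and parts[0] == "TCP" and parts[3] == "LISTENING":
            # rpartition keeps IPv6 addresses such as "[::]:8091" intact
            host, _, port = parts[1].rpartition(":")
            rows.append((host, int(port), int(parts[4])))
    return rows


def run_netstat() -> str:
    """Windows only: -a lists listening ports, -n skips name lookups, -o adds the PID."""
    return subprocess.run(["netstat", "-ano"], capture_output=True,
                          text=True, check=True).stdout


# Example (Windows):
#   for host, port, pid in listening_ports(run_netstat()):
#       if port in (8091, 18091):
#           print(f"{host}:{port} is owned by PID {pid}; "
#                 f'check it with: tasklist /FI "PID eq {pid}"')
```

Unlike the -b switch, -o does not need an elevated prompt, so this can be run repeatedly while waiting for the problem to reappear.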

We do not write sensitive information to the logs, but if you want, you could zip up your logs with encryption, post them, and then send me the key separately.

Thanks for guiding @mreiche

It is important to highlight that @Jaccob and I work together on the same network. Also, we’re not the only ones: everyone using Couchbase here has started facing this issue. This wasn’t a problem until around November last year, but after that something changed, either in our VPN, network proxy, or firewall, or possibly (though less likely) in Couchbase.

We’ve switched to using Capella instead, which is an acceptable workaround.

With that cleared up, I searched all the logs for “Can’t start process error” but couldn’t find any match.

Since ours is a Windows workspace, I can’t run the provided Unix commands directly.

I’ve been relying on this Windows command instead:

c:> netstat -abp tcp | findstr /c:"809"

It lists all the 809x ports currently in use. When Couchbase is working, the output includes 8091/18091 in addition to 8092, 8093, etc.

When the Couchbase admin console stops being accessible, 8091 and 18091 disappear from the output.
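Because the problem is intermittent, it is easy to miss the exact moment ports 8091 and 18091 disappear when you are watching netstat by hand. The minimal Python poller below assumes a plain TCP connect to 127.0.0.1 is a good enough liveness signal. Whenever a port changes state, it prints a local timestamp with UTC offset, in the same shape as the Couchbase log headers, which makes it easier to narrow the window to search in the logs. The names watch and is_listening are hypothetical helpers, and the first observation of each port is always recorded as a change.

```python
import socket
import time
from datetime import datetime
from typing import Callable, Dict, List, Sequence, Tuple


def is_listening(port: int, host: str = "127.0.0.1", timeout: float = 3.0) -> bool:
    """True if a TCP connect to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def watch(ports: Sequence[int], interval: float = 10.0, rounds: int = 0,
          check: Callable[[int], bool] = is_listening) -> List[Tuple[str, int, bool]]:
    """Poll `ports` and record each state change as (timestamp, port, is_up).

    rounds=0 polls until interrupted. Timestamps are local time with a UTC
    offset, e.g. 2024-02-21T14:27:57.735-06:00.
    """
    last: Dict[int, bool] = {}
    changes: List[Tuple[str, int, bool]] = []
    done = 0
    while rounds == 0 or done < rounds:
        for port in ports:
            up = check(port)
            if last.get(port) != up:  # first sighting counts as a change
                stamp = datetime.now().astimezone().isoformat(timespec="milliseconds")
                changes.append((stamp, port, up))
                print(f"{stamp} port {port} {'UP' if up else 'DOWN'}", flush=True)
                last[port] = up
        done += 1
        if rounds == 0 or done < rounds:
            time.sleep(interval)
    return changes


# Example: watch([8091, 18091], interval=10.0) runs until Ctrl+C.
```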

Can you check for the existence of an audit.bak file in lib/couchbase/config? (It’s a sibling directory of lib/couchbase/data, where the actual data files live.) If you find one, delete it and try again.

Also search the log files for ‘crasher’

Right now my Couchbase port 8091 is inaccessible. There is no audit.bak file under lib/couchbase/config, though I can see audit.json. I tried removing that too, but it didn’t change the state of 8091.

Crasher search results:

Search “crasher” (3580 hits in 4 files of 78 searched) [Normal]
C:\Program Files\Couchbase\Server\var\lib\couchbase\logs\babysitter.log (534 hits)
C:\Program Files\Couchbase\Server\var\lib\couchbase\logs\debug.log (1848 hits)
C:\Program Files\Couchbase\Server\var\lib\couchbase\logs\ns_couchdb.log (776 hits)
C:\Program Files\Couchbase\Server\var\lib\couchbase\logs\reports.log (422 hits)

Ok - there might be some details in those logs. (?)

Here is a random crasher entry from babysitter.log:

=========================CRASH REPORT=========================
crasher:
initial call: supervisor_cushion:init/1
pid: <0.12377.7>
registered_name:
exception exit: {abnormal,1}
in function gen_server:handle_common_reply/8 (gen_server.erl, line 811)
ancestors: [<0.12376.7>,ns_child_ports_sup,ns_babysitter_sup,<0.118.0>]
message_queue_len: 0
messages:
links: [<0.12376.7>]
dictionary:
trap_exit: true
status: running
heap_size: 4185
stack_size: 29
reductions: 13049
neighbours:

[error_logger:error,2024-02-21T14:27:57.735-06:00,babysitter_of_ns_1@cb.local:<0.12376.7>:ale_error_logger_handler:do_log:101]
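Since the logs were described as overwhelming, it may help to tally crash reports instead of reading them one by one. The Python sketch below counts (initial call, exception) pairs across CRASH REPORT blocks. It assumes the field layout shown in the excerpt above: an "initial call:" line followed later by an "exception exit:" line, with "exception error:" and "exception throw:" handled the same way. summarize_crashes is a hypothetical helper. Start with any pair that dominates the counts or spikes around the time 8091 drops.

```python
import re
from collections import Counter
from typing import Iterable, Optional

# Field lines inside an Erlang crash report, as in the excerpt above.
INITIAL_CALL = re.compile(r"^\s*initial call:\s*(\S+)")
EXCEPTION = re.compile(r"^\s*exception (?:exit|error|throw):\s*(.+?)\s*$")


def summarize_crashes(lines: Iterable[str]) -> Counter:
    """Count (initial call, exception) pairs across CRASH REPORT blocks."""
    counts: Counter = Counter()
    call: Optional[str] = None
    for line in lines:
        if "CRASH REPORT" in line:
            call = None  # a new report starts; forget the previous one
            continue
        m = INITIAL_CALL.match(line)
        if m:
            call = m.group(1)
            continue
        m = EXCEPTION.match(line)
        if m and call is not None:
            counts[(call, m.group(1))] += 1
            call = None
    return counts


# Example:
#   logs = r"C:\Program Files\Couchbase\Server\var\lib\couchbase\logs"
#   with open(logs + r"\babysitter.log", encoding="utf-8", errors="replace") as f:
#       for (call, exc), n in summarize_crashes(f).most_common(10):
#           print(n, call, exc)
```

Running it over each of the four files from the search results above gives a per-file tally that can be compared against the times when 8091 was and wasn’t reachable.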

Look for logging that occurs between when 8091 was accessible and when it was not accessible. Another post suggests looking in error.log.