we are running a 5 node couchbase cluster in kubernetes environment. We have a persistent volume attached to each node through an external GlusterFs storage server. After 2 days of load run, one of the couchbase server pod has restarted. From analysing the logs of the crash time, it looks to be due to ‘memcached’ restart. But I am not able to figure out why memcached would have restarted. The relevant log snippets are as below:
couchbase-0 pod got restarted on 23-Apr-2021, 18:39 UTC. On analysing the logs at that time, I infer that the memcached has crashed which should have triggerd the pod to restart. Some log snippets as below:
debug.log
149917 [ns_server:debug,2021-04-23T18:39:49.854Z,ns_1@cas-couchbase-0.cas-couchbase-service:memcached_config_mgr <0.903.0>:memcached_config_mgr:init:48]ns_ports_setup seems to be ready
149918 [user:debug,2021-04-23T18:39:49.896Z,ns_1@cas-couchbase-0.cas-couchbase-service: <0.629.0>:ns_log:crash_consumption_loop:70] Service ‘memcached’ exited with status 0. Restarting. Messages:
149919 2021-04-23T18:39:49.241725+00:00 WARNING 199: Invalid password specified for [@ns_server] UUID:[70ec218d-dc1a-470b-72c1-1aa8b8fb0ae4]
149920 2021-04-23T18:39:49.343462+00:00 WARNING 200: Invalid password specified for [@ns_server] UUID:[56b4b926-72a6-431f-b2e0-e224a5d0aa13]
149921 2021-04-23T18:39:49.352313+00:00 WARNING 201: Invalid password specified for [@ns_server] UUID:[45558727-8cf3-405b-6b0b-9b64f8034559]
149922 2021-04-23T18:39:49.383944+00:00 WARNING 202: Invalid password specified for [@ns_server] UUID:[e8036490-b088-4558-3002-139dc1153bef]
149923 2021-04-23T18:39:49.555734+00:00 WARNING 203: Invalid password specified for [@ns_server] UUID:[13928919-66de-4d23-37dd-3b9d9320b88f]
149924 2021-04-23T18:39:49.669401+00:00 WARNING 203: Invalid password specified for [@ns_server] UUID:[9011fbe2-925e-4578-cc55-9b632407b534]
149925 2021-04-23T18:39:49.785833+00:00 WARNING 204: Invalid password specified for [@ns_server] UUID:[fa6c7f21-3377-40c9-1aab-3730709ed6e9]
149926 EOL on stdin. Initiating shutdown
reports.log
7449 [error_logger:error,2021-04-23T18:39:45.984Z,ns_1@cas-couchbase-0.cas-couchbase-service:error_logger <0.6.0>:ale_error_logger_handler:do_log:203]
7450 =========================CRASH REPORT=========================
7451 crasher:
7452 initial call: gen_event:init_it/6
7453 pid: <0.287.0>
7454 registered_name: bucket_info_cache_invalidations
7455 exception exit: killed
7456 in function gen_event:terminate_server/4 (gen_event.erl, line 320)
7457 ancestors: [bucket_info_cache,ns_server_sup,ns_server_nodes_sup,
7458 <0.168.0>,ns_server_cluster_sup,<0.89.0>]
7459 messages:
7460 links:
7461 dictionary:
7462 trap_exit: true
7463 status: running
7464 heap_size: 376
7465 stack_size: 27
7466 reductions: 133
7467 neighbours:
debug.log
114651 [ns_server:warn,2021-04-23T18:39:10.789Z,nonode@nohost:dist_manager<0.141.0>:dist_manager:wait_for_address:118] Could not resolve address cas-couchbase-0.cas-couchbase-service
: nxdomain
145240 [ns_server:warn,2021-04-23T18:39:40.630Z,ns_1@cas-couchbase-0.cas-couchbase-service:memcached_refresh <0.173.0>:ns_memcached:connect:1227] Unable to connect: {error,{badmatch,{error,econnrefused}}}.
Indexer.log
52870 2021-04-23T18:39:43.111+00:00 [Warn] Indexer Failure to Init Get http://127.0.0.1:8091/_metakv/indexing/settings/config: Unable to find given hostport in cbauth database: `127.0.0.1:8091’
52871 2021-04-23T18:39:43.111+00:00 [Info] Indexer exiting normally
Info.log
292399 [ns_server:warn,2021-04-23T18:39:38.625Z,ns_1@cas-couchbase-0.cas-couchbase-service:memcached_refresh <0.173.0>:ns_memcached:connect:1227]Unable to connect: {error,{badmatch,{error,econnrefused}}}.
query.log
134060 _time=2021-04-23T18:39:48.572+00:00 _level=ERROR _msg=Cannot connect url http://127.0.0.1:8091 - cause: Get http://127.0.0.1:8091/pools: dial tcp 127.0.0.1:8091: getsockopt: connection refused
134061 _time=2021-04-23T18:39:48.572+00:00 _level=ERROR _msg=Shutting down.
134062 [goport(/opt/couchbase/bin/cbq-engine)] 2021/04/23 18:39:48 child process exited with status 1
Can you please check the above log snippets and suggest, what could have gone wrong?
Thanks,
Ganesh.