After upgrade to Community 7.2 - the indexer restarts frequently and becomes unavailable

We followed the upgrade procedure from 6.0 to 6.5, then from 6.5 to 7.2.
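
One thing worth confirming after a multi-step upgrade is that every node actually reports the expected version and cluster compatibility. Below is a minimal sketch (not the official procedure) that reads the /pools/default REST endpoint; the host and the Administrator/password credentials are placeholders for your own cluster:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// Minimal sketch: print the version and cluster compatibility that each
// node reports via /pools/default. Host and credentials are placeholders.
func main() {
	req, err := http.NewRequest("GET", "http://localhost:8091/pools/default", nil)
	if err != nil {
		panic(err)
	}
	req.SetBasicAuth("Administrator", "password") // placeholder credentials

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var payload struct {
		Nodes []struct {
			Hostname             string `json:"hostname"`
			Version              string `json:"version"`
			ClusterCompatibility int    `json:"clusterCompatibility"`
		} `json:"nodes"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&payload); err != nil {
		panic(err)
	}
	for _, n := range payload.Nodes {
		fmt.Printf("%s version=%s compatibility=%d\n",
			n.Hostname, n.Version, n.ClusterCompatibility)
	}
}
```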

The problem is that the indexer keeps restarting, and sometimes all indexes become stuck in the warmup state. We tried dropping all indexes and recreating them; the indexer stayed stable for a few hours before it started restarting again.
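
Rather than dropping and recreating everything, it may help to watch which indexes stay stuck and in which state. Here is a minimal sketch that lists each index and its state from system:indexes via the query service; it assumes the query service is reachable on localhost:9499 -- sorry, on localhost:8093 -- and the credentials are placeholders:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// Minimal sketch: list every index with its current state (online,
// building, deferred, ...) by querying system:indexes through the
// query service REST API. Host and credentials are placeholders.
func main() {
	form := url.Values{"statement": {"SELECT name, state, keyspace_id FROM system:indexes"}}
	req, err := http.NewRequest("POST", "http://localhost:8093/query/service",
		strings.NewReader(form.Encode()))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
	req.SetBasicAuth("Administrator", "password") // placeholder credentials

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var payload struct {
		Results []struct {
			Name       string `json:"name"`
			State      string `json:"state"`
			KeyspaceID string `json:"keyspace_id"`
		} `json:"results"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&payload); err != nil {
		panic(err)
	}
	for _, r := range payload.Results {
		fmt.Printf("%s.%s state=%s\n", r.KeyspaceID, r.Name, r.State)
	}
}
```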

Here is the error log we're getting:

Service 'index' exited with status 2. Restarting. Messages:
runtime.gopark(0xc0093b1f90?, 0x2?, 0x1?, 0x0?, 0xc0093b1f34?)
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.20/go/src/runtime/proc.go:381 +0xd6 fp=0xc0093b1db0 sp=0xc0093b1d90 pc=0x43fbf6
runtime.selectgo(0xc0093b1f90, 0xc0093b1f30, 0xc016ef3498?, 0x0, 0x0?, 0x1)
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.20/go/src/runtime/select.go:327 +0x7be fp=0xc0093b1ef0 sp=0xc0093b1db0 pc=0x44fd9e
net/http.(*persistConn).writeLoop(0xc0277f0120)
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.20/go/src/net/http/transport.go:2410 +0xf2 fp=0xc0093b1fc8 sp=0xc0093b1ef0 pc=0x6f2932
net/http.(*Transport).dialConn.func6()
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.20/go/src/net/http/transport.go:1766 +0x26 fp=0xc0093b1fe0 sp=0xc0093b1fc8 pc=0x6ef3a6
runtime.goexit()
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.20/go/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc0093b1fe8 sp=0xc0093b1fe0 pc=0x472221
created by net/http.(*Transport).dialConn
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.20/go/src/net/http/transport.go:1766 +0x173d

And the indexer log:


2023-07-18T09:57:44.893+00:00 [Info] Repo.upgradeAndOpenDBFile(/opt/couchbase/var/lib/couchbase/data/@2i/MetadataStore): Opened with COMPACT_AUTO mode
2023-07-18T09:57:44.893+00:00 [Info] EmbeddedServer.runOnce() : Start Running Server
2023-07-18T09:57:44.902+00:00 [Info] indexer:: Staring http server : :9102
2023-07-18T09:57:44.902+00:00 [Info] Indexer::indexer version 6
2023-07-18T09:57:44.902+00:00 [Info] ClustMgr:handleGetLocalValue Key IndexerId
2023-07-18T09:57:44.902+00:00 [Info] Indexer Id 881cd9ac9655b7a2dfa6c2e479ebb286
2023-07-18T09:57:44.902+00:00 [Info] ClustMgr:handleGetLocalValue Key RebalanceRunning
2023-07-18T09:57:44.902+00:00 [Info] ClustMgr:handleGetLocalValue Key RebalanceToken
2023-07-18T09:57:44.902+00:00 [Info] Indexer::recoverRebalanceState RebalanceRunning false RebalanceToken <nil>
2023-07-18T09:57:44.902+00:00 [Info] ClustMgr:handleGetGlobalTopology indexInstMap :
2023-07-18T09:57:44.907+00:00 [Info] internalVersionMonitor:monitor starting. Term versions (6:7.0.4)
2023-07-18T09:57:44.907+00:00 [Info] internalVersionMonitor:ticker starting ...
2023-07-18T09:57:44.907+00:00 [Info] internalVersionMonitor:notifier starting ...
2023-07-18T09:57:44.908+00:00 [Info] internalVersionMonitor:monitor terminate. Cluster version reached 6
2023-07-18T09:57:44.908+00:00 [Info] internalVersionMonitor:ticker stopping ...
2023-07-18T09:57:44.938+00:00 [Info] Indexer::initFromPersistedState Recovered IndexInstMap
2023-07-18T09:57:44.939+00:00 [Info] DDLServiceMgr: intialized. Local nodeUUID 881cd9ac9655b7a2dfa6c2e479ebb286
2023-07-18T09:57:44.939+00:00 [Info] NewClusterInfoCacheLiteClient started new cicl client for schedIndexCreator
2023-07-18T09:57:44.939+00:00 [Info] schedIndexCreator: intialized.
2023-07-18T09:57:44.939+00:00 [Info] RebalanceServiceManager::NewRebalanceServiceManager false <nil>
2023-07-18T09:57:44.939+00:00 [Info] DDLServiceMgr::runTokenCleaner: Starting with period 5m0s
2023-07-18T09:57:44.939+00:00 [Info] RebalanceServiceManager::initService Init
2023-07-18T09:57:44.939+00:00 [Info] KVSender::sendShutdownTopic Projector 127.0.0.1:9999 Topic MAINT_STREAM_TOPIC_881cd9ac9655b7a2dfa6c2e479ebb286
2023-07-18T09:57:44.939+00:00 [Info] MasterServiceManager::registerWithServer: *indexer.MasterServiceManager implements service.AutofailoverManager; *indexer.MasterServiceManager implements service.Manager
2023-07-18T09:57:44.939+00:00 [Info] requestHandlerContext::getCachedIndexerNodeUUIDs: Returning 1 NodeUUIDs: [881cd9ac9655b7a2dfa6c2e479ebb286]
2023-07-18T09:57:44.939+00:00 [Info] RebalanceServiceManager::getCachedIndexerNodeUUIDs: Returning 1 NodeUUIDs: [881cd9ac9655b7a2dfa6c2e479ebb286]
2023-07-18T09:57:44.939+00:00 [Info] RebalanceServiceManager::updateNodeList: Initialized with 1 NodeUUIDs: [881cd9ac9655b7a2dfa6c2e479ebb286]
2023-07-18T09:57:44.940+00:00 [Info] RebalanceServiceManager::GetTaskList []
2023-07-18T09:57:44.940+00:00 [Info] RebalanceServiceManager::GetTaskList returns &{Rev:[0 0 0 0 0 0 0 1] Tasks:[]}
2023-07-18T09:57:44.941+00:00 [Info] RebalanceServiceManager::GetCurrentTopology []
2023-07-18T09:57:44.941+00:00 [Info] RebalanceServiceManager::GetCurrentTopology returns &{Rev:[0 0 0 0 0 0 0 1] Nodes:[881cd9ac9655b7a2dfa6c2e479ebb286] IsBalanced:true Messages:[]}
2023-07-18T09:57:44.941+00:00 [Info] RebalanceServiceManager::GetTaskList [0 0 0 0 0 0 0 1]
2023-07-18T09:57:44.941+00:00 [Info] RebalanceServiceManager::GetCurrentTopology [0 0 0 0 0 0 0 1]
2023-07-18T09:57:44.949+00:00 [Error] KVSender::sendShutdownTopic Unexpected Error During Shutdown Projector 127.0.0.1:9999 Topic MAINT_STREAM_TOPIC_881cd9ac9655b7a2dfa6c2e479ebb286. Err genServer.closed
2023-07-18T09:57:44.949+00:00 [Error] KVSender::closeMutationStream MAINT_STREAM  Error Received genServer.closed from 127.0.0.1:9999
2023-07-18T09:57:44.949+00:00 [Info] KVSender::closeMutationStream MAINT_STREAM  Treating genServer.closed As Success
2023-07-18T09:57:44.949+00:00 [Info] KVSender::sendShutdownTopic Projector 127.0.0.1:9999 Topic INIT_STREAM_TOPIC_881cd9ac9655b7a2dfa6c2e479ebb286
2023-07-18T09:57:44.950+00:00 [Error] KVSender::sendShutdownTopic Unexpected Error During Shutdown Projector 127.0.0.1:9999 Topic INIT_STREAM_TOPIC_881cd9ac9655b7a2dfa6c2e479ebb286. Err projector.topicMissing
2023-07-18T09:57:44.950+00:00 [Error] KVSender::closeMutationStream INIT_STREAM  Error Received projector.topicMissing from 127.0.0.1:9999

I have no idea what's happening. Do you have any idea what the problem might be and how to solve it?

@varun.velamuri I see the log has some projector errors. If you can take a quick look, that would be great.

Thanks for your reply. Unfortunately, we rolled back to 6.5 because we needed to make the server available again, so I no longer have access to those logs.

What is strange is that we tested this version on our pre-production environment without any issues; the problem only occurred in our production environment.
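
Since pre-production was fine, it might be worth diffing the index service settings between the two environments (storage mode, memory quota, thread counts and so on). A minimal sketch under those assumptions, dumping /settings/indexes from both clusters so the output can be compared side by side; the hostnames and credentials are placeholders:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// Minimal sketch: dump the index service settings of a cluster so two
// environments can be compared. Hostnames and credentials are placeholders.
func dumpIndexSettings(host string) {
	req, err := http.NewRequest("GET", "http://"+host+":8091/settings/indexes", nil)
	if err != nil {
		panic(err)
	}
	req.SetBasicAuth("Administrator", "password") // placeholder credentials

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var settings map[string]interface{}
	if err := json.NewDecoder(resp.Body).Decode(&settings); err != nil {
		panic(err)
	}
	fmt.Printf("--- %s ---\n", host)
	for k, v := range settings {
		fmt.Printf("%s = %v\n", k, v)
	}
}

func main() {
	dumpIndexSettings("preprod-node")    // placeholder hostname
	dumpIndexSettings("production-node") // placeholder hostname
}
```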