Index mutations keep getting stuck

In our test environment we keep seeing index mutations become stuck, particularly on one bucket (ticket_bucket), but as you can see it's also starting to occur on other buckets:

It's a single node running Couchbase 6.5.1, deployed with the Autonomous Operator on an Azure Kubernetes cluster.

Relevant logs:
logs.zip (3.9 MB)

This is a quiet, low-volume system, so it isn't a performance issue. It's the second time we've seen it (the first time, we blew the environment away and started again). Any help would be greatly appreciated.

I deleted the ticket_bucket primary index and tried to recreate it, but got this:
“GSI CreatePrimaryIndex() - cause: Create index or Alter replica cannot proceed due to rebalance in progress, another concurrent create index request, network partition, node failover, indexer failure, or presence of duplicate index name.”
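
For reference, the drop/recreate was along these lines (a sketch using the Query REST API on port 8093; <username>, <password>, and <query_node> are placeholders, not our real values):

# Drop and recreate the primary index on ticket_bucket via the Query service
curl -u <username>:<password> http://<query_node>:8093/query/service --data-urlencode 'statement=DROP PRIMARY INDEX ON `ticket_bucket`'
curl -u <username>:<password> http://<query_node>:8093/query/service --data-urlencode 'statement=CREATE PRIMARY INDEX ON `ticket_bucket`'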

@pc, thanks for sharing the logs. It looks like one of the projectors (the process that forwards mutations from the data service to the index service) is stuck and not responding. You can kill the projector process to unblock it (a sketch follows the log excerpt below). Also, if you can share the projector log file, we can check what the issue is with the projector.

2021-06-04T10:25:41.221+00:00 [Warn] Slow/Hung Operation: KVSender::sendMutationTopicRequest did not respond for 68h36m33.258388772s for projector cb-0000.cb.test-tdm.svc:9999 topic MAINT_STREAM_TOPIC_f55e55d41c45ea1ff7ff9824dad0b3f2 bucket location_bucket
2021-06-04T10:25:41.221+00:00 [Warn] Slow/Hung Operation: KVSender::sendMutationTopicRequest did not respond for 68h36m32.640886572s for projector cb-0000.cb.test-tdm.svc:9999 topic MAINT_STREAM_TOPIC_f55e55d41c45ea1ff7ff9824dad0b3f2 bucket prediction_bucket
2021-06-04T10:25:42.221+00:00 [Warn] Slow/Hung Operation: KVSender::sendMutationTopicRequest did not respond for 68h36m33.047651308s for projector cb-0000.cb.test-tdm.svc:9999 topic MAINT_STREAM_TOPIC_f55e55d41c45ea1ff7ff9824dad0b3f2 bucket weather_bucket
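
For example, something along these lines should restart just the projector (a sketch; the pod name cb-0000 and namespace test-tdm are taken from the hostnames in the logs above, and it assumes pkill is available in the container image):

# Kill only the projector process inside the Couchbase container;
# Couchbase should restart it automatically
kubectl exec -n test-tdm cb-0000 -- pkill -f projector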

Thanks for your response.

I've killed the projector a couple of times and left it for about 10 minutes, but it hasn't made any difference to the outstanding index mutations.

The projector log which includes one of the projector restarts is here:
projector.zip (1.3 MB)

If I kill the whole Kubernetes pod and wait for it to come back, it processes all the mutations before becoming stuck again.
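
For completeness, this is roughly what I'm doing (same assumed pod/namespace names as in the logs above; the Autonomous Operator recreates the pod):

# Delete the pod and let the Operator bring it back
kubectl delete pod cb-0000 -n test-tdm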

@pc, after the restart I don't see pending mutations in the projector logs:

2021-06-15T08:38:46.208+00:00 [Info] KVDT[<-ticket_bucket<-127.0.0.1:8091 #MAINT_STREAM_TOPIC_f55e55d41c45ea1ff7ff9824dad0b3f2] ##1d stats: {"numDocsPending":0}
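
(If you want to check this yourself, a quick filter over the projector log works; the path below assumes a default in-container install:)

# Show recent pending-mutation stats from the projector log
grep numDocsPending /opt/couchbase/var/lib/couchbase/logs/projector.log | tail -n 20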

If you can capture both the projector and indexer log files when you see the pending mutations in the UI, we can correlate them and try to figure out the problem.

logs.zip (2.7 MB)

Here are the logs for both the projector and the indexer. I killed the projector at around 8:57.

Mutations in the UI:

@pc ,

It appears that you have run into a known bug. From the projector logs, the control channel is full. This can happen when there are more than 10 buckets in the cluster:

2021-06-17T08:58:25.310+00:00 [Warn] FEED[<=>MAINT_STREAM_TOPIC_f55e55d41c45ea1ff7ff9824dad0b3f2(127.0.0.1:8091)] ##15 control channel has 10000 messages

This issue has been fixed in 6.6.0 and later versions. As a workaround, please change the setting "projector.backChanSize" to 50000, e.g.:

curl -u <username>:<password> http://<indexer_ip>:9102/settings -X POST -d '{"projector.backChanSize":50000}'
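
To confirm the change took effect, you can read the settings back from the same endpoint (same placeholder credentials and address as above):

# Read back indexer settings and pick out the projector.backChanSize value
curl -s -u <username>:<password> http://<indexer_ip>:9102/settings | grep -o '"projector.backChanSize":[0-9]*'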

After the setting change, please restart the projector. After the restart, you should see the message below in the projector logs:

settings projector.backChanSize will updated to 50000

Thanks,
Varun

That did the trick! Thanks.

Will this setting persist? And if I were to scale out my Couchbase deployment, would the other servers also have this new setting?

@pc,

Apologies for the delayed reply. I somehow missed this notification.

Will this setting persist?

Yes, this setting would persist.

If I were to scale out my Couchbase deployment, would the other servers also have this new setting?

Yes. New servers would also have this new setting.

Thanks,
Varun
