@zoltan.zvara Thank you for the information on the use case. I shall continue to look at logs and see if there is something more to the issue.
After the release of 7.0.0, we still see indexer errors (crashes). I uploaded two batches of logs, one batch 1-2 weeks ago and now another batch from our development 7.0.0 server. Hopefully, it will be of use to you @yogendra.acharya.
Hi @zoltan.zvara ,
Sorry for delayed reply.
I checked uploaded logs (both uploads) and I have noticed few things
- there are few forestdb related crash and indexer restarts but these are recoverable errors and indexer does recover after the restart, we will try to investigate the reason for it but that is not a cause of problems.
- There are lot of node status “unhealthy” messages. This looks to be the cause of problem. These messages are not for any particular node but for different node at different time. . We are looking at if these errors can be due to some issue in one of the components but in the logs we also see related connection/call time out errors, connection closed errors. Hence we suspect these could be related to cluster networking setup. I would request you to check if the cluster has stable network.
Thanks @yogendra.acharya we are going to check out cluster networking and whether there are any problems relating to that.
@yogendra.acharya we thoroughly checked our network stack by introducing low-level monitoring using Netdata. We also tuned the network stack not to run out of certain budgets.
(Below is Couchbase 7.0.1 that was upgraded from Couchbase 7.0.0.)
However, we see Couchbase Internal Server Errors from the API frequently. While this happens, the node is underutilized in terms of CPU, memory, and network. At least 50% is available from all resources at any given moment.
Our logs are current under uploading with user zoltan.zvara
.
To recap, we see the following:
-
Service 'indexer' exited with status 134.
(different nodes get 134 at the same time) Service 'indexer' exited with status 2.
Status 134 looks like an OOM, however, resources reported are close to these at all times:
In Kubernetes:
Resource limits are double:
dmesg -T | egrep -i 'killed process'
shows no signal to killing any process at the day of 134 status code.
Therefore, I think I can confirm that no one kills the indexer
and that there are plenty of resources left.
Below is a typical node utilization while the index
exits:
On this 4-node deployment, there are around 500 indexes, 500 collections, around 10 scopes, and 4 buckets.
Let me know if anything that I submitted is useful or not.
Thanks Zoltan, I’ll take a look at logs and get back.
Hi @zoltan.zvara , I am analyzing the OOMs as well as we have observed some forestdb crashes, I am involving our expert on forestdb to look at those. We will need some more time before we can get back with next update.
@yogendra.acharya sounds great, I hope I could help. We also noticed that right after a full-cluster restart, the cluster seems to be stable for a while, but only for a day.
Hi @zoltan.zvara, based on analysis of logs we are seeing some in-memory corruption to the actual index data as its stored in forestdb, this is only an in-memory corruption and there is no on-disk corruption observed also no corruption into index metadata has been observed. We still do not know the RCA of corruption. We are looking into few possibilities such as possibility of OOMs affecting memory allocation on forestdb and if that can lead to these issues, adding some instrumentation into code etc. Meanwhile is it possible for you try out Enterprise Edition (EE)? some advantages of trying out EE would be - it uses different storage engine plasma, will allow support for replicas and you can actively engage with couchbase support. So I would suggest looking into evaluating Enterprise Edition and see if that suits your needs better.
@yogendra.acharya thanks so much for your follow-up.
We upgraded CE 7.0.1 to EE 7.0.1 simply by changing the image tag on the Kubernetes StatefulSet
.
(We have a custom Helm chart that we use for deployment.)
image:
repository: "couchbase"
tag: enterprise-7.0.1
This initiated a rolling upgrade on Kubernetes, where each node had been upgraded to EE 7.0.1.
However, we see the same problem when we run our tests on this newly upgraded development 4-node cluster of EE. In our tests, we attempt to rerun failing queries for 300 seconds, after the indexing service goes down, but the query/indexing service does not recover in 300 seconds - at least when we keep issuing the same query, the Java SDK still reports Internal Couchbase Error.
I think it could be useful to note that after the upgrade, none of the nodes’ UI show any enterprisee feature to be present, but the UI writes EE everywhere. (I have been told that CE can be upgraded directly to EE, hopefully, it went OK.)
I can upload the logs if that helps.
Hi @zoltan.zvara, First of all sorry for the delayed reply,
I see that you have already opened another post for Java SDK. Also the issue of EE features not available is it resolved?
Other issue you mentioned indexing service is still going down ( I am assuming with similar frequency as was observed with CE ) please upload the logs. Our initial analysis and suggestion to upgrade to EE was based on the fact of crashes observed in forestdb storage used by CE. So it will be interesting to see if we can find something else. We would really like to get to the root cause of the issue and appreciate all the help you are providing.
@yogendra.acharya I uploaded the logs as your request. Please note that this is a CE cluster that has been updated to EE.
The issue that after the upgrade EE features are not available is not solved. Probably the cluster is sill running CE, but somehow it reports EE.
@zoltan.zvara , Thanks for uploading logs, I checked the logs and I see that all of your indexes are still using forestdb storage which is why you are still facing the same crash issue. One of the step in upgrade process is to re-create indexes on new EE node. This is due to the fact that “storage mode of indexer nodes can not be changed if there is an existing index on that node.”
Which leads to upgrade process considerations - that changing the image from CE to EE is not sufficient to completely upgrade to EE. In nutshell you will need to
a) add (an extra) new EE node to cluster
b) re-create the indexes from one of the old nodes on this newly added EE node - so that indexes which get re-created on this node, will use new storage mode available to EE.
c) drop the indexes on old node
d) remove the old node from cluster
e) upgrade image on the old node from CE to EE (you may not need this as your nodes are already upgraded to EE)
e) add the upgraded node back to cluster.
f) Repeat the process for all remaining nodes.
This will ensure all of your indexes are migrated to new storage format available to EE as well as there will be no downtime as indexes are first migrated to EE node and then corresponding CE node is taken out of cluster - one node at a time.
The details on how to upgrade from CE to EE can be found at this blogpost Upgrading Couchbase Community Edition | The Couchbase Blog
Thanks, @yogendra.acharya that was very useful. Because we are a start-up project, we will continue using CE for now and upgrade to EE when the project grows up and scales. In order to be able to catch and mitigate errors, we will keep using CE in our development environment as well, for now.