Hi. Here is the scenario:
It is Couchbase Server 6.0.0 CE, running as a StatefulSet in k8s. The first pod is the master and runs the index, query, and data services. The other 3 nodes are data-only nodes.
I was testing a single point of failure where the node with the indexer goes down briefly.
The steps were:
- Delete the pod with the indexer and make sure it does not start again
- Wait for 15 seconds (autofailover is off)
- Start the pod with the indexer again.
- The bucket loaded from that node without any problem and the rebalance was OK. But the indexes are down.
- Syncgw reports an error that the indexer is warming up. After 80 minutes there is still no success.
Indexer log is here: https://gist.github.com/erikbotta/dbb726da48d5a570e8c911ce0ebd6b06
EDIT: after restarting the pod, the indexer is working now. But it seems to be unstable.
During indexer warmup, the indexer initialises the indexes one by one. During index initialisation, the persisted index data is read from disk. As per the attached logs, the first 11 indexes (out of 15) were initialised successfully. But the initialisation of the twelfth index “idx_owner_id_model_type_0” seems to be stuck, and forestdb also logged an error. I suspect the initialisation is stuck, rather than failed, because the indexer has not logged any error returned by forestdb.
2019-06-04T11:52:57.533+00:00 [Info] Indexer::initPartnInstance Initialized Partition:
Index: 5349895838612919745 Partition: PartitionId: 0 Endpoints: [:9105]
2019-06-04T11:52:57.586+00:00 [INFO][FDB] Forestdb opened database file /opt/couchbase/var/lib/couchbase/data/@2i/health-checker_idx_owner_id_model_type_0_5349895838612919745_0.index/data.fdb.1
2019-06-04T11:52:59.897+00:00 [ERRO][FDB] Crash Detected: Last Block not DBHEADER dd in a database file '/opt/couchbase/var/lib/couchbase/data/@2i/health-checker_idx_owner_id_model_type_0_5349895838612919745_0.index/data.fdb.2'
Later (after the second indexer restart), forestdb recovers from this situation on its own.
2019-06-04T13:09:07.244+00:00 [INFO][FDB] Forestdb opened database file /opt/couchbase/var/lib/couchbase/data/@2i/health-checker_idx_owner_id_model_type_0_5349895838612919745_0.index/data.fdb.1
2019-06-04T13:09:07.429+00:00 [ERRO][FDB] Crash Detected: Last Block not DBHEADER dd in a database file '/opt/couchbase/var/lib/couchbase/data/@2i/health-checker_idx_owner_id_model_type_0_5349895838612919745_0.index/data.fdb.2'
After more than 15 minutes, the recovery completes.
2019-06-04T13:25:52.477+00:00 [INFO][FDB] Forestdb opened database file /opt/couchbase/var/lib/couchbase/data/@2i/health-checker_idx_owner_id_model_type_0_5349895838612919745_0.index/data.fdb.2
2019-06-04T13:28:06.364+00:00 [INFO][FDB] Partially compacted file '/opt/couchbase/var/lib/couchbase/data/@2i/health-checker_idx_owner_id_model_type_0_5349895838612919745_0.index/data.fdb.2' could not be used for recovery. Using old file /opt/couchbase/var/lib/couchbase/data/@2i/health-checker_idx_owner_id_model_type_0_5349895838612919745_0.index/data.fdb.1.
2019-06-04T13:28:06.364+00:00 [INFO][FDB] Forestdb closed database file /opt/couchbase/var/lib/couchbase/data/@2i/health-checker_idx_owner_id_model_type_0_5349895838612919745_0.index/data.fdb.2
2019-06-04T13:28:06.370+00:00 [Info] ForestDBSlice:NewForestDBSlice Created New Slice Id 0 IndexInstId 5349895838612919745 WriterThreads 1
2019-06-04T13:28:06.372+00:00 [Info] Indexer::initPartnInstance Initialized Slice:
Index: 5349895838612919745 Slice: SliceId: 0 File: /opt/couchbase/var/lib/couchbase/data/@2i/health-checker_idx_owner_id_model_type_0_5349895838612919745_0.index Index: 5349895838612919745
2019-06-04T13:28:06.373+00:00 [Info] Indexer::initPartnInstance Initialized Partition:
Index: 17397300330000381276 Partition: PartitionId: 0 Endpoints: [:9105]
2019-06-04T13:28:06.431+00:00 [INFO][FDB] Forestdb opened database file /opt/couchbase/var/lib/couchbase/data/@2i/health-checker_idx_owner_id_0_17397300330000381276_0.index/data.fdb.2
Assuming that there was no forestdb file corruption/truncation, forestdb is expected to recover from a partial compaction. But I am not sure whether forestdb should take this long to recover.
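To put a number on the recovery time, the gap between the "Crash Detected" line and the later "could not be used for recovery" line can be measured directly from the log timestamps. The following is only a rough sketch: the function names are made up for illustration, and the two sample lines are copied from the log above with the file paths shortened to "...". For these two lines the gap comes out at roughly 19 minutes.

```python
from datetime import datetime
from typing import Iterable, Optional

# Sample lines taken from the indexer log above.
# The long file paths are shortened to "..." here for readability only.
SAMPLE_LINES = [
    "2019-06-04T13:09:07.429+00:00 [ERRO][FDB] Crash Detected: "
    "Last Block not DBHEADER dd in a database file '.../data.fdb.2'",
    "2019-06-04T13:28:06.364+00:00 [INFO][FDB] Partially compacted file "
    "'.../data.fdb.2' could not be used for recovery. Using old file .../data.fdb.1.",
]


def log_timestamp(line: str) -> datetime:
    """Parse the ISO-8601 timestamp that starts each indexer log line."""
    return datetime.fromisoformat(line.split(" ", 1)[0])


def recovery_seconds(lines: Iterable[str]) -> Optional[float]:
    """Seconds between the latest crash detection and the recovery message.

    Returns None if no crash line is followed by a recovery line.
    """
    crash_time = None
    for line in lines:
        if "Crash Detected" in line:
            crash_time = log_timestamp(line)
        elif crash_time is not None and "could not be used for recovery" in line:
            return (log_timestamp(line) - crash_time).total_seconds()
    return None


if __name__ == "__main__":
    seconds = recovery_seconds(SAMPLE_LINES)
    print(f"forestdb recovery took {seconds / 60:.1f} minutes")
```

Running the same function over the full indexer.log (one line per list element) would show whether every stuck warmup follows this pattern.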
Hi, as I wrote in the EDIT, the situation was fixed by restarting the pod. But this wasn’t the first time the indexer got stuck. The storage for Couchbase and the indexes is on a Private Volume on the Azure AKS service (1TB premium disks), so I have no idea how the forestdb file could have been corrupted.
@Erik_Botta, what is the compaction setting for index? It is recommended to use “circular write mode” rather than “append-only write”. You can check under UI->Settings->Auto-compaction.
Also, for HA purposes, it is always advisable to have 2 index service nodes with an equivalent index on the second node (a system-managed replica index in EE). In the event of failure of one index node, this would ensure no downtime.
Hello. Thank you for the advice.
I checked that the auto-compaction mode for the index was correctly set to circular.
For now we are testing with only one index node, because the CE version allows only 2 options: the first is the one we are using now, and the second is where all nodes run all services (which is not best for performance and wasn’t very stable).
Frankly, we haven’t been able to find a stable solution so far with CE-licensed Couchbase (6.x.x) + syncgw (2.5.x) on k8s on Azure AKS :-/ .
When the indexer fails, syncgw is not able to serve bulk docs for us and we have issues with our tests.
If it is feasible, please try out EE. The storage engine is much more robust. If CE is the only option, you can have one more index+query+data node for better redundancy.
@deepkaran.salooja Thx, but if I need 4 nodes, the only 2 combinations allowed in CE by the API are:
- 1x index+query+data node + 3x data nodes
- 4x index+query+data nodes (all services on every node)
The combination 2x index+data+query nodes + 2x data nodes isn’t allowed.
Am I right? This is what the REST API of Couchbase CE reported.
You can try with option 2 and while creating the indexes you always have the ability to specify its location using the WITH clause. So just place the indexes on 2 index nodes only. Index service on other 2 nodes would have close to 0 resource utilization. Query service is stateless but you’ll need to make sure there is enough cpu/memory depending on the query workload so it doesn’t interfere with data service.
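As a rough illustration of that placement, the helper below only builds the N1QL text for one equivalent index per chosen index node, using `WITH {"nodes": [...]}`. It is a sketch under assumptions: the function name, the node host names, the `_1`/`_2` name suffixes, and the indexed fields (`owner_id`, `model_type`) are hypothetical and should be replaced with the real definitions. It does not connect to the cluster; the statements would still need to be run through cbq or the query workbench.

```python
import json
from typing import List


def equivalent_index_statements(
    base_name: str, bucket: str, fields: List[str], nodes: List[str]
) -> List[str]:
    """Build one CREATE INDEX statement per node.

    Each index gets a distinct name (base_name_1, base_name_2, ...) and is
    pinned to a single node through the WITH {"nodes": [...]} clause.
    """
    field_list = ", ".join(f"`{field}`" for field in fields)
    statements = []
    for position, node in enumerate(nodes, start=1):
        with_clause = json.dumps({"nodes": [node]})
        statements.append(
            f"CREATE INDEX `{base_name}_{position}` "
            f"ON `{bucket}`({field_list}) WITH {with_clause}"
        )
    return statements


if __name__ == "__main__":
    # Hypothetical host names and fields, for illustration only.
    for statement in equivalent_index_statements(
        "idx_owner_id_model_type",
        "health-checker",
        ["owner_id", "model_type"],
        ["couchbase-0.example:8091", "couchbase-1.example:8091"],
    ):
        print(statement + ";")
```

With two such indexes on the same keys, the query service can use whichever one is online, so losing one index node should not take the index away completely.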