Cbbackupmgr not working since indexer issue - "Fail to retrieve local metadata"

We are running Couchbase Community Edition. An indexer issue happened on one of our production nodes the other day; we only noticed it when disk space dropped to 0 many hours later. After increasing disk space, we had to remove the node from the cluster, rebalance, bring the node back in, rebalance again, and recreate all the indexes on that node. Since then, however, cbbackupmgr has been failing. The error is:

Error backing up cluster: internal server error executing 'GET' request to '/api/v1/bucket/halix2/backup': 
{
    "code": "error",
    "error": "Fail to retrieve local metadata from url 172.31.21.193:9102.",
    "result": {}
}

So it appears we haven’t fully recovered from what happened, but we are unsure what to try next. Any suggestions?

The indexer issue had also happened five days earlier, twice in a short time, but recovered on its own. Here is a sampling from the error log:

[ns_server:error,2023-09-06T19:24:31.082Z,ns_1@172.31.21.193:service_agent-index<0.12227.0>:service_agent:handle_info:285]Lost json rpc connection for service index, reason shutdown. Terminating.
[ns_server:error,2023-09-06T19:24:31.082Z,ns_1@172.31.21.193:service_agent-index<0.12227.0>:service_agent:terminate:306]Terminating abnormally
[ns_server:error,2023-09-06T22:22:50.997Z,ns_1@172.31.21.193:service_agent-index<0.9069.333>:service_agent:handle_info:285]Lost json rpc connection for service index, reason shutdown. Terminating.
[ns_server:error,2023-09-06T22:22:50.997Z,ns_1@172.31.21.193:service_agent-index<0.9069.333>:service_agent:terminate:306]Terminating abnormally
[ns_server:error,2023-09-11T22:03:51.512Z,ns_1@172.31.21.193:service_agent-index<0.15039.356>:service_agent:handle_info:285]Lost json rpc connection for service index, reason shutdown. Terminating.
[ns_server:error,2023-09-11T22:03:51.512Z,ns_1@172.31.21.193:service_agent-index<0.15039.356>:service_agent:terminate:306]Terminating abnormally
[ns_server:error,2023-09-11T22:03:51.512Z,ns_1@172.31.21.193:service_status_keeper_worker<0.1079.0>:rest_utils:get_json:57]Request to (indexer) getIndexStatus with headers [{"If-None-Match", "21058684869cb89c"}] failed: {error,{econnrefused, [{lhttpc_client,send_request,1,[{file, "/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/lhttpc/lhttpc_client.erl"},                                                                                     {line, 220}]},{lhttpc_client,execute,9,[{file,"/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/lhttpc/lhttpc_client.erl"},{line, 169}]}, {lhttpc_client, request,9,[{file,"/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/lhttpc/lhttpc_client.erl"},{line,93}]}]}}

The first sign of running out of disk space was:

[user:error,2023-09-12T15:26:08.155Z,ns_1@172.31.21.193:<0.764.1433>:compaction_daemon:ensure_can_db_compact:1028]Cannot compact database `halixpos/4`: the estimated necessary disk space is about 8137554 bytes but the currently available disk space is 0 bytes.

So the indexer started having issues and then the node ran out of disk space 17 hours later. The server has 60 GB of disk and was only using about 40% of it at the time. After our recovery efforts got the node up and running again, disk usage dropped back to roughly that 40% level on its own, without us deleting anything.
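
For anyone following along, this is roughly how we have been watching disk usage on that node (just df and du against the default Community Edition data path, which matches the path in the indexer log further down; adjust if your layout differs):

# Free space on the filesystem holding the Couchbase data directory
df -h /opt/couchbase/var/lib/couchbase/data

# Per-directory usage inside the data directory, including the index storage dir (@2i)
du -sh /opt/couchbase/var/lib/couchbase/data/* | sort -h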

We need to get the backups functioning again, so any debugging ideas would be greatly appreciated. There’s almost no mention of “Fail to retrieve local metadata” on the web in relation to cbbackupmgr.

Thank you,
~Brandon Williams

The backup script runs these two commands:

/opt/couchbase/bin/cbbackupmgr config -a /usr/local/halix/backup -r prod
/opt/couchbase/bin/cbbackupmgr backup -a /usr/local/halix/backup -r prod -c http://$DB_URI:8091 -u $DB_USERNAME -p $DB_PASSWORD --full-backup

Here is the error from backup-0.log:

2023-09-14T00:01:02.852+00:00 (Plan) (Index) Transferring GSI index definitions for bucket 'halixpos'
2023-09-14T00:01:02.852+00:00 (REST) (Attempt 1) (GET) Dispatching request to 'http://172.31.20.1:9102/api/v1/bucket/halixpos/backup'
2023-09-14T00:01:02.943+00:00 (REST) (Attempt 1) (GET) (500) Received response from 'http://172.31.20.1:9102/api/v1/bucket/halixpos/backup'
2023-09-14T00:01:02.944+00:00 WARN: (REST) (Attempt 1) (GET) Request to endpoint '/api/v1/bucket/halixpos/backup' failed with status code 500 -- rest.(*Request).Execute() at request.go:166
2023-09-14T00:01:02.944+00:00 (Cmd) Error backing up cluster: failed to execute cluster operations: failed to execute bucket operation for bucket 'halixpos': failed to transfer index definitions for bucket 'halixpos': failed to transfer GSI indexes: failed to get GSI index defintions: failed to get GSI index definitions: failed to execute request: internal server error executing 'GET' request to '/api/v1/bucket/halixpos/backup': {"code":"error","error":"Fail to retrieve local metadata from url 172.31.21.193:9102.","result":{}}

So I have worked around this issue by adding --disable-gsi-indexes to the config line:

/opt/couchbase/bin/cbbackupmgr config -a /usr/local/halix/backup -r prod --disable-gsi-indexes

We don’t restore the GSI indexes anyway, and backup-0.log points to the problem being the retrieval of GSI index metadata for the halixpos bucket. Skipping them allows the data to back up properly.

I’d still love to know how I can fix whatever is wrong with the GSI metadata on that specific node. What sort of CLI commands are available to debug indexer service metadata issues?

@hyunjuV - maybe you can help?

Hi @bwilliams,

2023-09-14T00:01:02.944+00:00 (Cmd) Error backing up cluster: failed to execute cluster operations: failed to execute bucket operation for bucket 'halixpos': failed to transfer index definitions for bucket 'halixpos': failed to transfer GSI indexes: failed to get GSI index defintions: failed to get GSI index definitions: failed to execute request: internal server error executing 'GET' request to '/api/v1/bucket/halixpos/backup': {"code":"error","error":"Fail to retrieve local metadata from url 172.31.21.193:9102.","result":{}}

As part of the backup, the indexer needs to collect metadata from all index nodes; the indexer running on 172.31.20.1 performs this collection. When it attempts to read the local index metadata from node 172.31.21.193, it receives an error. Unfortunately, not all errors get logged.
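
You can also try reproducing the request that cbbackupmgr makes by calling the same indexer REST endpoint from your backup-0.log directly (a sketch, assuming the default indexer port 9102 and that your credentials have the required permissions):

# Same endpoint cbbackupmgr calls; a 500 response here reproduces the failure outside the backup tool
curl -s -u <username>:<password> http://172.31.20.1:9102/api/v1/bucket/halixpos/backup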

Please check the indexer.log files on nodes 172.31.21.193 and 172.31.20.1 to see if any error messages are logged there.

If the problem persists, please try making a REST call to “:9102/getLocalIndexMetadata” on the indexer running on 172.31.21.193 and check the response.
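
For example, something like this (again assuming the default indexer port and suitable credentials):

# Ask the indexer on 172.31.21.193 for its locally stored index metadata
curl -s -u <username>:<password> http://172.31.21.193:9102/getLocalIndexMetadata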

The indexer.logs are very busy, but not with errors. The oldest available, indexer.log.9, only goes back two days, so unfortunately it doesn’t cover when the issue started or when we were still trying to retrieve GSI indexes during the backup. There is an interesting error logged in the oldest log, though:

2023-09-20T00:29:17.136+00:00 [ERRO][FDB] Error in reading the doc length metadata with offset 205145 from a database file '/opt/couchbase/var/lib/couchbase/data/@2i/MetadataStore.2'
2023-09-20T00:29:18.313+00:00 [ERRO][FDB] doc_length key checksum mismatch error in a database file '/opt/couchbase/var/lib/couchbase/data/@2i/MetadataStore.2' crc 18 != 7e (crc in doc) keylen 14462 metalen 19832 bodylen 1661337600 bodylen_ondisk 813588844 offset 20690
2023-09-20T00:29:18.313+00:00 [ERRO][FDB] docio_read_doc_key error: read failure on offset 20690 in a database file '/opt/couchbase/var/lib/couchbase/data/@2i/MetadataStore.2' : FDB status -15, lastbid 0x5, curblock 0xc6, curpos 0x43

It does not appear in any of the newer logs though. So I’m not sure if it’s telling us anything interesting.

requested-rest-call.zip (4.0 KB)
Attached is the output of the REST call you requested. It seems to have completed successfully. I tried it locally from .193, as well as from one of the other nodes, so it seems the node can reply with its index metadata. One interesting thing worth noting: I don’t see the halixpos bucket mentioned in the output at all. Looking at the balance of the indexes across our nodes, all of the halixpos indexes are on other nodes; none of them are on .193. However, cbbackupmgr is complaining specifically about retrieving metadata for the halixpos bucket:
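
For reference, this is roughly how I compared the nodes (same $DB_USERNAME/$DB_PASSWORD as in our backup script; as far as I understand, getLocalIndexMetadata only reports indexes hosted on the node you query, which would explain why halixpos doesn’t show up on .193; add your remaining index nodes as needed):

# Check whether each node's local index metadata mentions the halixpos bucket at all
for node in 172.31.21.193 172.31.20.1; do
  echo -n "$node: "
  curl -s -u "$DB_USERNAME:$DB_PASSWORD" "http://$node:9102/getLocalIndexMetadata" | grep -o halixpos | wc -l
done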

2023-09-14T00:01:02.852+00:00 (Plan) (Index) Transferring GSI index definitions for bucket 'halixpos'
2023-09-14T00:01:02.852+00:00 (REST) (Attempt 1) (GET) Dispatching request to 'http://172.31.20.1:9102/api/v1/bucket/halixpos/backup'
2023-09-14T00:01:02.943+00:00 (REST) (Attempt 1) (GET) (500) Received response from 'http://172.31.20.1:9102/api/v1/bucket/halixpos/backup'
2023-09-14T00:01:02.944+00:00 WARN: (REST) (Attempt 1) (GET) Request to endpoint '/api/v1/bucket/halixpos/backup' failed with status code 500 -- rest.(*Request).Execute() at request.go:166
2023-09-14T00:01:02.944+00:00 (Cmd) Error backing up cluster: failed to execute cluster operations: failed to execute bucket operation for bucket 'halixpos': failed to transfer index definitions for bucket 'halixpos': failed to transfer GSI indexes: failed to get GSI index defintions: failed to get GSI index definitions: failed to execute request: internal server error executing 'GET' request to '/api/v1/bucket/halixpos/backup': {"code":"error","error":"Fail to retrieve local metadata from url 172.31.21.193:9102.","result":{}}

Could it be that the backup is failing simply because the node it queried for metadata doesn’t host any GSI indexes for that bucket? Maybe it’s related to the out-of-disk-space indexer crash, combined with the fact that we never rebuilt any indexes for that bucket on that node afterwards, leaving it in a funky state for that specific bucket?

Thanks for your time and consideration.

~Brandon