Cbbackupmgr inconsistent results

cbbackupmgr version 7.0.2-6703
OS: linux Version: 5.10.178-162.673.amzn2.x86_64
Arch: amd64 vCPU: 2 Memory: 4038017024 (3.76GiB)

I have a simple docker container running the following statements, which is scheduled by AWS to run once a day.

cd /usr/local/halix/backup
/opt/couchbase/bin/cbbackupmgr config -a /usr/local/halix/backup -r prod
/opt/couchbase/bin/cbbackupmgr backup -a /usr/local/halix/backup -r prod -c http://$DB_URI:8091 -u $DB_USERNAME -p $DB_PASSWORD --full-backup
zip -r backup.zip .
/usr/local/halix/s3put -k $AWS_ACCESS_KEY -s $AWS_ACCESS_SECRET -b $S3_URL put backup.zip

The job successfully uploads a zip every night, but the results are inconsistent: some backups are 2.6GB and others are 3.3GB. The difference appears to be that data in the third (last) bucket is missing or cut off. Also, backup-0.log appears to stop abruptly with no error produced.

In the 3.3GB backups, the log ends properly with:
2023-05-15T00:17:11.799+00:00 (Plan) Transfer for cluster complete
2023-05-15T00:17:11.799+00:00 (Plan) Transfer of all data complete
2023-05-15T00:17:11.800+00:00 (Cmd) Backup completed successfully
2023-05-15T00:17:11.800+00:00 (Stats) Stopping stat collection

In the 2.6GB backups, the log just ends mid-stream with no error at all. Example:

2023-05-17T00:11:23.490+00:00 (DCP) (usage) (vb 1000) Creating DCP stream | {"uuid":0,"start_seqno":0,"end_seqno":5824,"snap_start":0,"snap_end":0,"retries":0}
2023-05-17T00:11:23.492+00:00 (DCP) (usage) (vb 357) Creating DCP stream | {"uuid":0,"start_seqno":0,"end_seqno":6139,"snap_start":0,"snap_end":0,"retries":0}

Since the zip command runs after the cbbackupmgr command returns, I know the backup command is finishing and not crashing the Docker container. Is it possible for this tool to return as complete while it is still actually doing work? Do I need to add a pause before zipping the backup directory to give the system time to finish writing backup files? I’m baffled and need to explain these inconsistencies before we can fully retire the old backup tools from our production environments. Any debugging ideas or suggestions as to what might cause this would be greatly appreciated. I’d upload the full logs, but it doesn’t appear my account is allowed to attach files. Thanks!
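
In the meantime, I’m thinking of at least gating the zip and upload on the backup command’s exit status, so a killed or failed run doesn’t get shipped to S3. A rough sketch of what that would look like (same paths and variables as above; the status check and the config fallback are my additions, not something I’ve tested yet):

#!/bin/sh
set -u

cd /usr/local/halix/backup

# config fails if the repo already exists, which is fine on re-runs
/opt/couchbase/bin/cbbackupmgr config -a /usr/local/halix/backup -r prod || true

/opt/couchbase/bin/cbbackupmgr backup -a /usr/local/halix/backup -r prod \
  -c http://"$DB_URI":8091 -u "$DB_USERNAME" -p "$DB_PASSWORD" --full-backup
status=$?
if [ "$status" -ne 0 ]; then
  # a status of 137 (128 + SIGKILL) would mean something killed the process
  echo "cbbackupmgr exited with status $status - skipping zip/upload" >&2
  exit "$status"
fi

zip -r backup.zip .
/usr/local/halix/s3put -k "$AWS_ACCESS_KEY" -s "$AWS_ACCESS_SECRET" -b "$S3_URL" put backup.zip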

I went ahead and unzipped each backup and ran cbbackupmgr info on each. All the 2.6GB backups are flagged as complete=false and don’t report a proper size. So now it is clear these are invalid/corrupt backups. My follow-up question is what would cause 5 out of every 7 backups to not complete. As mentioned, the backup-0.log file just ends. It doesn’t list any specific error occurring.
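
For what it’s worth, once the root cause is sorted I’ll probably add that same info check to the container as a second gate alongside the exit-status check above, so an incomplete backup never reaches S3. A rough sketch, assuming the info subcommand’s --json output exposes the same complete flag I saw when running it by hand, and that jq is available in the image:

# only upload if every backup in the repo reports complete=true
# (the --json flag and the .backups[].complete path are assumptions based on the info output above)
if /opt/couchbase/bin/cbbackupmgr info -a /usr/local/halix/backup -r prod --json \
    | jq -e '[.backups[].complete] | all' > /dev/null; then
  zip -r backup.zip .
  /usr/local/halix/s3put -k "$AWS_ACCESS_KEY" -s "$AWS_ACCESS_SECRET" -b "$S3_URL" put backup.zip
else
  echo "repo contains incomplete backups - not uploading" >&2
  exit 1
fi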

I also went to our AWS CloudWatch logs for the instance that spins up to run these backups and it just reports “killed”:

2023-05-15T20:11:49.199-04:00 Transferring key… at 9.40MiB/s (about 5m22s remaining) 6216698 items / 5.85GiB

2023-05-15T20:11:49.199-04:00 [=============================================== ] 66.43%
2023-05-15T20:11:51.499-04:00 Killed

So it got about two-thirds of the way through and was then terminated. Is anyone aware of a reason this might occur? What are the RAM requirements for the tool? I gave this instance the same resources the old cbbackup tool is currently using for our production backups, but maybe cbbackupmgr is more aggressive with memory use? I’m going to try bumping up the RAM we give the instance and see what happens.
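
If I can get a shell on the underlying host (via SSM or SSH), I’ll also check the kernel log, since a memory-related kill should leave a trace there. Something like:

# look for OOM-killer activity around the time the backup died; an OOM kill
# typically logs a line like "Out of memory: Killed process <pid> (cbbackupmgr)"
dmesg -T | grep -iE 'out of memory|oom-kill|killed process'

# or via the journal on systemd hosts
journalctl -k --since today | grep -iE 'oom|killed process'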

Did they exceed a resource limit?

That’s what we’re thinking. I’m seeing that AWS may kill a container if it exceeds its allotted resources. What we’re running into, though, isn’t the whole container being killed: just the cbbackupmgr command terminates early, and then the container carries on to zip up whatever it managed to back up and ship it off to the S3 backup storage. I would have expected the container to stop entirely, but maybe AWS is monitoring and killing off the single process/thread that exceeds the limit while letting the container continue. I’m barely intermediate level with AWS.

I’ve bumped up the resources (2x CPU and 4x memory) and we’ll see how it runs tonight. I’ll also try to find out whether AWS separately tracks/logs when resource limits are exceeded. The web says we’d see something like "OutOfMemoryError: Container killed due to memory usage", but of course in this case the container itself isn’t being killed, so maybe the message is less verbose when only a single process is terminated.

Looks like we could also configure the container to "swap" memory to disk (paging) as a fallback; hopefully the resource bump will be enough, though. Also, from more web searching, I get the feeling the generic "Killed" message means it isn’t AWS doing this, but the low-level Linux OOM killer terminating the process… probably before AWS itself has detected anything and shut down the container. So I’m still leaning towards a resource issue, CPU or (more likely) memory.
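
One way I might confirm this locally is to run the same image under Docker with a hard memory cap and watch what happens to the cbbackupmgr process; a rough sketch (the image name here is just a placeholder for our backup image):

# hard-cap memory at roughly what the AWS task gets, with swap disabled
# if the kernel OOM-kills cbbackupmgr inside the container, the wrapper script carries on
# and the backup command exits with 137 (128 + SIGKILL) while the container keeps running,
# which matches what we see in AWS where the zip and upload still happen afterwards
docker run --rm \
  --memory=4g --memory-swap=4g \
  halix-backup:latest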

I went and looked at the logs for the older cbbackup tool, which is what production currently relies on. It turns out that while it doesn’t get killed, it takes nearly 2 hours to back up the database, which is rather small. So it seems pretty obvious we were never giving these processes enough resources, and I had just copied the old allocation when setting up the new one.

Last night, with the increased resources, the new process finished the backup, zipped it, and sent it to S3 in less than 6 minutes. I’ll let it run a few more days to confirm it’s stable before making it our primary backup tool, but I feel pretty confident, so I’m closing this out. The answer is that we were simply giving the cbbackupmgr process insufficient resources.