Cbbackup is stuck

airween · July 26, 2020, 8:48am

Hi,

We have a test environment with a Couchbase cluster. The version is 5.0.1 (Community).

The backup strategy is a bit complex: the nodes run on two virtual machines, and each VM runs on a different physical machine. There is a backup cron job on each physical machine, which starts every hour. The backup script checks which machine it is running on, and if it is on server2, it delays the backup by 1800 seconds, so the two backup tasks on the nodes are staggered.
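The stagger described above could be sketched roughly like this. The helper name `backup_delay_seconds` and the assumption that the second physical machine reports the short hostname `server2` are illustrative only; the real check in the script is not shown here.

```shell
# Hypothetical helper: print how many seconds to wait before starting the backup.
# Assumption: the machine that must be delayed is identified by the name "server2".
backup_delay_seconds() {
    case "$1" in
        server2) echo 1800 ;;  # shift the backup on server2 by 30 minutes
        *)       echo 0 ;;     # every other machine starts immediately
    esac
}

# Possible use inside the backup script (commented out so the sketch does not sleep):
# sleep "$(backup_delay_seconds "$(hostname -s)")"
```

Keeping the delay in one small function makes it easy to check the offset without actually waiting for it.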

Once a day, in the morning, the cron job passes the argument “daily”. All other backups are “hourly”. The only difference between the two strategies is the backup location and the backup name.

So, here is the relevant part of the backup script:

# Default to a daily backup; switch to hourly when the first argument is "hourly".
FREQUENCY="daily"
DATE=$(date +"%Y%m%d")
if [[ "$1" == "hourly" ]] ; then
    FREQUENCY="hourly"
    DATE=$(date +"%Y%m%d-%H")
fi

BACKUP_DIR="$BACKUP_BASE_DIR/$FREQUENCY"
REMOTE_BACKUP_DIR_DATE="$REMOTE_BACKUP_DIR/$DATE"
BACKUP_FILENAME="couchbase.$REMOTE_HOST.$DATE.tgz"
...
ssh -i /home/$REMOTE_USER/.ssh/id_rsa $REMOTE_USER@$REMOTE_HOST "/opt/couchbase/bin/cbbackup --username=$COUCHBASE_ADMIN --password=$COUCHBASE_PASSWD http://localhost:8091 $REMOTE_BACKUP_DIR_DATE" 2>/dev/null
...

As you can see, there isn’t any relevant difference between the daily and hourly backups.
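One thing that stands out in the script is that `2>/dev/null` throws away anything `ssh` or `cbbackup` writes to stderr. A minimal sketch of a wrapper that keeps stderr and adds a time limit is below. It assumes GNU coreutils `timeout` is available; the function name `run_logged`, the time limit, and the log path are hypothetical and not part of the original script.

```shell
# Hypothetical wrapper: run a command with a time limit and append its stderr to a log.
# Assumption: GNU coreutils `timeout` is installed (it exits with 124 on timeout).
run_logged() {
    rl_limit="$1"
    rl_log="$2"
    shift 2
    timeout "$rl_limit" "$@" 2>>"$rl_log"
    rl_status=$?
    if [ "$rl_status" -eq 124 ]; then
        # Record the timeout so a stuck run is visible afterwards.
        echo "timed out after $rl_limit: $*" >>"$rl_log"
    fi
    return "$rl_status"
}
```

It could be used as `run_logged 2h /tmp/cbbackup.err ssh ...` in place of the bare `ssh ... 2>/dev/null` call. Note that the time limit only stops the local `ssh` client; the remote `cbbackup` process may keep running and might still have to be stopped on the node.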

But for a few weeks now (the first issue was on the 4th of July), the daily (and only the daily) backup on node2 (which is started from server2; only node2 is affected) gets stuck. Since the first issue, only 1 hourly backup has gotten stuck, while about 4-5 daily backups have finished successfully. Before that time, all backups worked well.

There isn’t any other backup or job running at that time on the physical machine.

On node1, all backups finish successfully.

When I log in to the system, I see this process:

28779 ? Ssl 341:18 python /opt/couchbase/lib/python/cbbackup --username=Administrator --password=SECRET http://localhost:8091 /opt/couchbase_data/backup/20200726

It has been running since 4:50 AM. I checked it with strace:

[pid 28791] recvfrom(4, "", 4096, 0, NULL, NULL) = 0
[pid 28791] select(5, [4], [], [], {0, 250000}) = 1 (in [4], left {0, 249998})
[pid 28791] recvfrom(4, "", 4096, 0, NULL, NULL) = 0

These lines repeat continuously.
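If file descriptor 4 is a stream socket, a `recvfrom` that returns 0 bytes normally means the peer has closed the connection, and `select` keeps reporting such a socket as readable, so the trace looks consistent with a thread polling a connection that is already closed. A first step could be to find out what descriptor 4 actually refers to. A minimal sketch, assuming Linux with `/proc` mounted; the helper name `fd_target` is hypothetical:

```shell
# Hypothetical helper: print what a file descriptor of a process points to.
# Assumption: Linux with /proc mounted, and permission to read that process's fd directory.
fd_target() {
    # $1 = PID (or "self"), $2 = file descriptor number
    readlink "/proc/$1/fd/$2"
}
```

For the stuck process this would be `fd_target 28779 4`, run as root or as the user that owns the process. A socket shows up as `socket:[inode]`, and the inode can then be matched against the output of a tool such as `ss` or `lsof` to see which connection it belongs to.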

What can I do, or how can I debug the cause of the problem?

Thanks,

a.

dereklai · July 6, 2021, 6:39pm

I’m having a pretty similar problem after upgrading to CE 6.5.1 build 6299.

I’m running a 7-node cluster. It seems that 4 nodes back up successfully (guessing from the size of the backup, 7.2 GB) and the other 3 consistently hang. Is there any good way to troubleshoot the backup?

Thanks,

Derek
