Cbbackup is stuck

Hi,

We have a test environment with a Couchbase cluster. The version is 5.0.1 (Community).

Our backup strategy is a bit complex: the cluster has two nodes, each running in its own virtual machine, and each VM runs on a different physical machine. A cron job on each physical machine starts the backup every hour. The backup script checks which machine it is running on, and if that is server2, it delays the backup by 1800 seconds, so the backup tasks of the two nodes are staggered.
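To illustrate the staggering, here is a minimal sketch of that check. It is not the real script: the helper name `backup_delay` is hypothetical, and it assumes the host name is literally `server2`.

```shell
#!/bin/sh
# Hypothetical helper: print the start delay in seconds for a given host name.
# Assumption: the second physical machine is named exactly "server2".
backup_delay() {
    case "$1" in
        server2) echo 1800 ;;  # shift this host's backup by 30 minutes
        *)       echo 0 ;;     # every other host starts immediately
    esac
}

# Usage inside the backup script (commented out so the sketch has no side effects):
# sleep "$(backup_delay "$(hostname -s)")"
```

With this, the cron entry can be identical on both machines, and only the script decides whether to wait.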

Once a day, in the morning, the cron job passes the argument “daily”; all other backups are “hourly”. The only differences between the two are the backup location and the backup name.

Here is the relevant part of the backup script:

FREQUENCY="daily"
DATE=`date +"%Y%m%d"`
if [[ "$1" == "hourly" ]] ; then
    FREQUENCY="hourly"
    DATE=`date +"%Y%m%d-%H"`
fi

BACKUP_DIR="$BACKUP_BASE_DIR/$FREQUENCY"
REMOTE_BACKUP_DIR_DATE="$REMOTE_BACKUP_DIR/$DATE"
BACKUP_FILENAME="couchbase.$REMOTE_HOST.$DATE.tgz"
...
ssh -i /home/$REMOTE_USER/.ssh/id_rsa $REMOTE_USER@$REMOTE_HOST "/opt/couchbase/bin/cbbackup --username=$COUCHBASE_ADMIN --password=$COUCHBASE_PASSWD http://localhost:8091 $REMOTE_BACKUP_DIR_DATE" 2>/dev/null
...

As you can see, there is no relevant difference between the daily and hourly backups.

But for the past few weeks (the first issue was on the 4th of July), the daily backup of node2 (the one started from server2) has been getting stuck. It is almost always the daily backup, and only node2. Since the first issue, only 1 hourly backup has got stuck, while about 4-5 daily backups completed successfully. Before that, all backups worked fine.

No other backup or job runs on the physical machine at that time.

On node1, all backups finish successfully.

When I log in to the system, I see this process:

28779 ? Ssl 341:18 python /opt/couchbase/lib/python/cbbackup --username=Administrator --password=SECRET http://localhost:8091 /opt/couchbase_data/backup/20200726

It has been running since 4:50 AM. I checked it with strace:

[pid 28791] recvfrom(4, "", 4096, 0, NULL, NULL) = 0
[pid 28791] select(5, [4], [], [], {0, 250000}) = 1 (in [4], left {0, 249998})
[pid 28791] recvfrom(4, "", 4096, 0, NULL, NULL) = 0

These lines repeat continuously.
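On a stream socket, a `recvfrom` that returns 0 normally means the peer has closed the connection, so the output above looks like the process repeatedly polling a socket that is already at end of file. A minimal sketch for finding out what file descriptor 4 actually is, assuming Linux with `/proc` available; the helper name `socket_inode` is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: extract the inode number from a /proc/<pid>/fd link
# target of the form "socket:[12345]". Prints nothing for non-socket targets.
socket_inode() {
    printf '%s\n' "$1" | sed -n 's/^socket:\[\([0-9][0-9]*\)\]$/\1/p'
}

# Usage against the stuck process (PID 28779 and fd 4 are taken from the
# ps and strace output above; commented out so the sketch has no side effects):
# target=$(readlink /proc/28779/fd/4)    # e.g. "socket:[12345]"
# inode=$(socket_inode "$target")
# grep -w "$inode" /proc/net/tcp         # row shows local/remote address and state
```

The row in `/proc/net/tcp` for that inode (if it is a TCP socket) gives the remote address and port in hexadecimal, which would show which service the stuck connection was talking to.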

What can I do, or how can I debug the cause of the problem?

Thanks,

a.

I’m having a pretty similar problem after upgrading to CE 6.5.1 build 6299.

I’m running a 7-node cluster. It seems that 4 nodes back up successfully (guessing from the backup size, 7.2 GB) and the other 3 consistently hang. Is there a good way to troubleshoot the backup?

Thanks,

Derek