cbbackup vs. live cluster causes problems
I'm trying to run cbbackup against our 5-node cluster from a machine outside the cluster during a low-usage period. I intentionally chose an underpowered machine to perform the backup, hoping it would not generate too much TAP traffic.
We're only pushing about 700 ops/sec through the cluster at the time of the backup attempt, and the cluster's resident ratio is 19.1%. The nodes are Amazon m2.xlarge instances (17 GB RAM) with only 5.9 GB allocated to Couchbase (in hopes of addressing OOM issues mentioned in other threads here), each backed by a 4-disk RAID array.
Once the backup got about 5% of the way through, my web servers began seeing timeouts (PHP client v1.1.2) when retrieving data from the cluster, and in the admin GUI one of my nodes was flashing yellow PEND. After waiting about a minute and watching the timeout frequency increase, I had to pull the plug. The cluster stabilized within roughly 90 seconds of stopping the backup.
I noticed while this was occurring that my TAP queue was growing at a shocking rate. We don't generally have any TAP traffic, and in this situation it ramped from 0 to 600k over roughly ten minutes. Meanwhile, my drain rate spiked to just under 9k and plateaued. Backfill spiked from 0 to 1.25M before dropping off and settling around 0.75M. My disk write queue, which is generally under 100 with occasional brief spikes to 500, sat at a sustained 400-600. Ejections/sec also spiked to surprising levels while the backup was in progress.
Are there any best practices for preventing cbbackup from impacting cluster performance like this? I need to be able to take backups without downtime, and right now I just can't risk it.
My backup command was simply:
/opt/couchbase/bin/cbbackup couchbase://Administrator:PASSWORD@HOST:8091 /media/couchbase/backup
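For reference, here is a throttled variant I've been considering: scoping the run to a single bucket on a single node and shrinking the transfer batches to reduce backfill pressure. The `-x` transfer options come from the cbtransfer tool family and `BUCKET_NAME` is a placeholder, so please treat the exact option names and values as assumptions rather than a verified recipe:

```shell
# Hypothetical throttled backup (one bucket, one node at a time).
# -b / --single-node are documented cbbackup options; the -x
# batch-size settings are assumed from the cbtransfer family.
/opt/couchbase/bin/cbbackup \
    couchbase://Administrator:PASSWORD@HOST:8091 \
    /media/couchbase/backup \
    -b BUCKET_NAME \
    --single-node \
    -x batch_max_size=100,batch_max_bytes=100000
```

If this is roughly the right direction, what batch sizes (if any) do people use in practice to keep TAP backfill from overwhelming a cluster with a low resident ratio?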