[MB-7149] cbbackup loops infinitely Created: 10/Nov/12 Updated: 18/May/13 |
|
| Status: | Reopened |
| Project: | Couchbase Server |
| Component/s: | tools |
| Affects Version/s: | 2.0-beta-2 |
| Fix Version/s: | 2.0.2 |
| Security Level: | Public |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Mike Wiederhold | Assignee: | Bin Cui |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | 2.0-release-notes | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Linux | ||
| Description |
|
I'm trying to backup my cluster via cbbackup. I start the backup via
/opt/couchbase/bin/cbbackup couchbase://Administrator:password@mymachine:8091 /backups/couchbase_backup_test The backup appears to work fine and a progress bar appears, but then exceeds 100% progress and never stops! Example: ^Cinterrupted.###############################] 210.8% (26141950/12403784 msgs) (I ctrl-C'd to kill it after 200% because it seems that this can't possibly work. Note that both the percent and the number of messages are off). I only have ~12 million items in the bucket, but it went right past that limit when backing up. Help? |
| Comments |
| Comment by Steve Yen [ 12/Nov/12 ] |
|
There's a couple things going on here that probably need documentation...
- cbbackup first contacts the server to get the # of items. But, if the cluster is changing (there are item mutations), that # of items will just be an estimate. - then, cbbackup uses the TAP protocol to perform the backup. But, under some conditions (not all item values are resident in memory), the TAP protocol might actually send duplicate messages. That's why cbbackup reports "msgs" for progress instead of "items" in its numerator, but uses "items" in its denominator. That can lead to >100% in some cases. Whether it leads to >200% is somewhat unexpected, but it depends on the situation and what couchbase server is doing in generating the TAP stream. |
| Comment by Steve Yen [ 12/Nov/12 ] |
| The 2.0.2 filter isn't correct at the moment. Putting this into 2.0.1 for the moment as it'll be revisited again. |
| Comment by Michael L [ 27/Nov/12 ] |
|
(I am the original poster)
Changing the line above to use an IP address rather than hostname seems to have fixed the problem. My backups now run to 100% and then complete as expected. As for the root cause: I don't believe it has anything to do with the cluster changing, since I first encountered this when trying to backup an essentially idle cluster. |
| Comment by Bin Cui [ 06/Dec/12 ] |
| Verified on a multi-node cluster that cbtransfer get the total item number correctly. And fail to reproduce the bug on the idle cluster. |
| Comment by Mike Wiederhold [ 06/Dec/12 ] |
| I asked the user on the forums for more information to reproduce this issue. I will post the information here if and when he responds. |
| Comment by Farshid Ghods [ 10/Dec/12 ] |
| deferring to 2.1 per bug scrub meeting ( Dipti & Farshid -December 7th ) |
| Comment by Paul Janssen [ 04/Jan/13 ] |
|
I have the same problem.
Observed progress over 32000%. The expected msgs to save is far less than the actual msgs saved. Restore will load the same number of msgs as were actually saved. This will impact backup and restore time. This will impact diskspace. |
| Comment by Paul Janssen [ 04/Jan/13 ] |
|
Version info: 2.0.0 community edition (build-1723)
|
| Comment by Paul Janssen [ 04/Jan/13 ] |
| Using ip-address (local,external) or hostname (localhost) does not make any difference, issue remains. |
| Comment by Maria McDuff [ 25/Mar/13 ] |
|
bug scrub: Bin -- have you had a chance to take a look? pls update. thanks. |
| Comment by Bin Cui [ 25/Mar/13 ] |
| Cannot reproduce it in house. |
| Comment by Maria McDuff [ 01/Apr/13 ] |
| per bug scrub: abhinav -- can you please repro in latest 2.0.2 build? thanks. |
| Comment by Abhinav Dangeti [ 01/Apr/13 ] |
|
Cannot reproduce on 2.0.2-749-rel. - 3 nodes, 2 buckets [root@orange-11601 ~]# /opt/couchbase/bin/cbbackup couchbase://Administrator:password@localhost:8091 ~/backup [####################] 100.0% (8557766/8558766 msgs) bucket: default, msgs transferred... : total | last | per sec batch : 10657 | 10657 | 14.7 byte : 1131794345 | 1131794345 | 1562011.5 msg : 8558766 | 8558766 | 11812.1 [####################] 100.0% (2024739/2024739 msgs) bucket: saslbucket, msgs transferred... : total | last | per sec batch : 14775 | 14775 | 86.0 byte : 1367390279 | 1367390279 | 7955075.3 msg : 10583505 | 10583505 | 61571.7 done |
| Comment by Maria McDuff [ 09/Apr/13 ] |
| not reproducible. |
| Comment by Bin Cui [ 17/May/13 ] |
|
First, the error itself is harmless. The tool tried to transfer design docs and the source cluster doesn't contain any. Since 2.0.2, customer can specify --data-only option for cbtransfer/cbback/cbrestore tool. But we still dont know the root cause why there is such a big difference between the initial msgs to be sent and the final msgs that are transferred. |
| Comment by Bin Cui [ 17/May/13 ] |
|
One possible explanation about the deviant number: 1. the estimate number is the total active item number 2. the actual msg tranferred = total(tap_mutations + tap_delete) For the above customer case where they have 2 million item deleted, we will not only the current active items, but also any deleted items. At again, there will more msgs transferred if any key will have repeated set/deletions. |
| Comment by Perry Krug [ 18/May/13 ] |
| Reopening for visibility. Whether the tool is doing the "right" thing or not, there is still a major impact to the user both in terms of disk space being taken up, time being taken and perception of confidence, etc in the backup. |