[MB-7149] [DOC 2.1.0] cbbackup loops infinitely Created: 10/Nov/12  Updated: 24/Jul/13  Resolved: 24/Jul/13

Status: Closed
Project: Couchbase Server
Component/s: tools
Affects Version/s: 2.0-beta-2, 2.1.0
Fix Version/s: 2.1.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Mike Wiederhold Assignee: Bin Cui
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Linux

I'm trying to backup my cluster via cbbackup. I start the backup via
/opt/couchbase/bin/cbbackup couchbase://Administrator:password@mymachine:8091 /backups/couchbase_backup_test
The backup appears to work fine and a progress bar appears, but then exceeds 100% progress and never stops! Example:
^Cinterrupted.###############################] 210.8% (26141950/12403784 msgs)
(I ctrl-C'd to kill it after 200% because it seems that this can't possibly work. Note that both the percent and the number of messages are off). I only have ~12 million items in the bucket, but it went right past that limit when backing up.

Comment by Steve Yen [ 12/Nov/12 ]
There's a couple things going on here that probably need documentation...

- cbbackup first contacts the server to get the # of items. But, if the cluster is changing (there are item mutations), that # of items will just be an estimate.

- then, cbbackup uses the TAP protocol to perform the backup. But, under some conditions (not all item values are resident in memory), the TAP protocol might actually send duplicate messages. That's why cbbackup reports "msgs" for progress instead of "items" in its numerator, but uses "items" in its denominator. That can lead to >100% in some cases.

Whether it leads to >200% is somewhat unexpected, but it depends on the situation and what couchbase server is doing in generating the TAP stream.
Comment by Steve Yen [ 12/Nov/12 ]
The 2.0.2 filter isn't correct at the moment. Putting this into 2.0.1 for the moment as it'll be revisited again.
Comment by Michael L [ 27/Nov/12 ]
(I am the original poster)

Changing the line above to use an IP address rather than hostname seems to have fixed the problem. My backups now run to 100% and then complete as expected.

As for the root cause: I don't believe it has anything to do with the cluster changing, since I first encountered this when trying to backup an essentially idle cluster.
Comment by Bin Cui [ 06/Dec/12 ]
Verified on a multi-node cluster that cbtransfer get the total item number correctly. And fail to reproduce the bug on the idle cluster.
Comment by Mike Wiederhold [ 06/Dec/12 ]
I asked the user on the forums for more information to reproduce this issue. I will post the information here if and when he responds.
Comment by Farshid Ghods (Inactive) [ 10/Dec/12 ]
deferring to 2.1 per bug scrub meeting ( Dipti & Farshid -December 7th )
Comment by Paul Janssen [ 04/Jan/13 ]
I have the same problem.
Observed progress over 32000%.
The expected msgs to save is far less than the actual msgs saved.
Restore will load the same number of msgs as were actually saved.
This will impact backup and restore time.
This will impact diskspace.
Comment by Paul Janssen [ 04/Jan/13 ]
Version info: 2.0.0 community edition (build-1723)
Comment by Paul Janssen [ 04/Jan/13 ]
Using ip-address (local,external) or hostname (localhost) does not make any difference, issue remains.
Comment by Maria McDuff (Inactive) [ 25/Mar/13 ]
bug scrub: Bin -- have you had a chance to take a look? pls update.
Comment by Bin Cui [ 25/Mar/13 ]
Cannot reproduce it in house.
Comment by Maria McDuff (Inactive) [ 01/Apr/13 ]
per bug scrub: abhinav -- can you please repro in latest 2.0.2 build? thanks.
Comment by Abhinav Dangeti [ 01/Apr/13 ]
Cannot reproduce on 2.0.2-749-rel.
- 3 nodes, 2 buckets
[root@orange-11601 ~]# /opt/couchbase/bin/cbbackup couchbase://Administrator:password@localhost:8091 ~/backup
  [####################] 100.0% (8557766/8558766 msgs)
bucket: default, msgs transferred...
       : total | last | per sec
 batch : 10657 | 10657 | 14.7
 byte : 1131794345 | 1131794345 | 1562011.5
 msg : 8558766 | 8558766 | 11812.1
  [####################] 100.0% (2024739/2024739 msgs)
bucket: saslbucket, msgs transferred...
       : total | last | per sec
 batch : 14775 | 14775 | 86.0
 byte : 1367390279 | 1367390279 | 7955075.3
 msg : 10583505 | 10583505 | 61571.7

Comment by Maria McDuff (Inactive) [ 09/Apr/13 ]
not reproducible.
Comment by Bin Cui [ 17/May/13 ]
First, the error itself is harmless. The tool tried to transfer design docs and the source cluster doesn't contain any. Since 2.0.2, customer can specify --data-only option for cbtransfer/cbback/cbrestore tool.

But we still dont know the root cause why there is such a big difference between the initial msgs to be sent and the final msgs that are transferred.
Comment by Bin Cui [ 17/May/13 ]
One possible explanation about the deviant number:

1. the estimate number is the total active item number
2. the actual msg tranferred = total(tap_mutations + tap_delete)

For the above customer case where they have 2 million item deleted, we will transfer not only the current active items, but also any deleted items.
At again, there will be more msgs transferred if any key will have repeated set/deletions.
Comment by Perry Krug [ 18/May/13 ]
Reopening for visibility. Whether the tool is doing the "right" thing or not, there is still a major impact to the user both in terms of disk space being taken up, time being taken and perception of confidence, etc in the backup.
Comment by Maria McDuff (Inactive) [ 20/May/13 ]
Anil to work with Bin on customer use case.
Comment by Anil Kumar [ 20/May/13 ]
Talked to Bin, here's the update.

There are 2 issues here…
1). In case of heavy DGM – tool only captures the 'total active items' in memory and not include the items on disk. Fix for this is to consider also resident-ratio to get the current_item. As per Bin, we have already the stats and this should be low-risk fix. He will be making this fix for 2.0.2.
2). In case of Deletes on items – tool currently only captures the snapshot of 'active items' and doesn't consider any items getting deleted. Hence when it transfers it not only transfers current active items but also any deleted items which is unnecessary. To fix this we require some changes in EP-Engine side to provide stats on deleted items so that tool can smartly ignore those. Considering the timeframe for release this won't make it for 2.0.2 but we will have documentation explaining this to users. [Doc ticket on Karen ]

Comment by Abhinav Dangeti [ 20/May/13 ]
When cbbackup was run on a node with 6104188 items (~45% active resident ratio),

root@plum-009:~# /opt/couchbase/bin/cbbackup http://localhost:8091 /data/backup
  [##############################] 147.6% (9007021/6104188 msgs)
bucket: default, msgs transferred...
       : total | last | per sec
 batch : 29230 | 29230 | 33.1
 byte : 10030064183 | 10030064183 | 11365547.3
 msg : 9007021 | 9007021 | 10206.3
Comment by Bin Cui [ 20/May/13 ]
Comment by Maria McDuff (Inactive) [ 21/May/13 ]
per bug triage, this fix from bin only addresses item 1 (see below):
1). In case of heavy DGM – tool only captures the 'total active items' in memory and not include the items on disk. Fix for this is to consider also resident-ratio to get the current_item. As per Bin, we have already the stats and this should be low-risk fix. He will be making this fix for 2.0.2.
Comment by Anil Kumar [ 30/May/13 ]
created new bug to track it here : http://www.couchbase.com/issues/browse/MB-8377
Comment by Anil Kumar [ 17/Jul/13 ]
Karen: we need to add this to 2.1 manual
Generated at Tue Sep 16 20:07:38 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.