[MB-5095] Failures on Backing up of Data with Large WAL size on 1.8.1 throws error : Database disk image is malformed Created: 16/Apr/12  Updated: 10/Jan/13  Resolved: 03/Oct/12

Status: Closed
Project: Couchbase Server
Component/s: tools
Affects Version/s: 1.8.1
Fix Version/s: 2.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Ketaki Gangal Assignee: Steve Yen
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Operating System : Ubuntu Single Node
Branch : 1.8.1-753


 Description   
Backup fails for data files with a large WAL [5M plus]. Backup succeeds with smaller WAL sizes [default: 1000].

Steps to Reproduce this issue
------------------------------------------------------

Load 1M items: [/opt/couchbase/bin/memcachetest -h localhost:11211 -i 1000000 -M 1024 -K cb_2 -l ]
Run a backup of the data [while the database is online]: sudo -su couchbase /opt/couchbase/bin/cbbackup /opt/couchbase/var/lib/couchbase/data/default-data/default /tmp/rev7

Errors
--------------------------------------------------------
..
Backup of default done, check integrity now
ok
Vacuum of default done
Backup of default-0.mb done, check integrity now
*** in database main ***
On tree page 395901 cell 3: 2nd reference to page 395903
On tree page 426067 cell 0: 2nd reference to page 410862
On tree page 426067 cell 1: 2nd reference to page 410863
.
.
.
.

On tree page 461483 cell 2: 2nd reference to page 462480
On tree page 467250 cell 1: 2nd reference to page 440802
On tree page 472512 cell 1: 2nd reference to page 471560
On tree page 358585 cell 1: 2nd reference to page 358586
On tree page 474942 cell 1: 2nd reference to page 473623
Page 491172: btreeInitPage() returns error code 11
On tree page 485795 cell 89: Child page depth differs
Page 491173: btreeInitPage() returns error code 11
Page 491175: btreeInitPage() returns error code 11
On tree page 336498 cell 0: 2nd reference to page 336497
On tree page 385778 cell 2: 2nd reference to page 385777
On tree page 394903 cell 1: 2nd reference to page 394901
On tree page 427473 cell 1: 2nd reference to page 427474
On tree page 436046 cell 0: 2nd reference to page 436045
On tree page 437909 cell 0: 2nd reference to page 437907
On tree page 444827 cell 1: 2nd reference to page 444046
On tree page 467244 cell 0: 2nd reference to page 471558
On tree page 483494 cell 0: 2nd reference to page 469621
On tree page 376439 cell 1: 2nd reference to page 376440
On tree page 376453 cell 1: 2nd reference to page 376412
On tree page 399713 cell 2: 2nd reference to page 399714
On tree page 411158 cell 1: 2nd reference to page 412526
On tree page 426543 cell 0: 2nd reference to page 362805
Page 491168: btreeInitPage() returns error code 11
On tree page 482558 cell 86: Child page depth differs
Page 491169: btreeInitPage() returns error code 11
Page 491170: btreeInitPage() returns error code 11
Page 491171: btreeInitPage() returns error code 11
On tree page 385413 cell 1: 2nd reference to page 385411
On tree page 407221 cell 1: 2nd reference to page 407222
On tree page 416240 cell 1: 2nd reference to page 416241
On tree page 426539 cell 1: 2nd reference to page 427355
On tree page 430171 cell 2: 2nd reference to page 429186
On tree page 433173 cell 0: 2nd reference to page 433172
On tree page 434095 cell 1: 2nd reference to page 434096
On tree page 439678 cell 0: 2nd reference to page 439677
On tree page 479441 cell 1: 2nd reference to page 479442
Page 491162: btreeInitPage() returns error code 11
On tree page 487719 cell 79: Child page depth differs
Page 491163: btreeInitPage() returns error code 11
Page 491166: btreeInitPage() returns error code 11
On tree page 376307 cell 1: 2nd reference to page 376308
On tree page 385403 cell 1: 2nd reference to page 388156
On tree page 428240 cell 1: 2nd reference to page 428241
On tree page 430165 cell 3: 2nd reference to page 430166
Error: database disk image is malformed
Vacuum of default-3.mb done
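
The log above has the shape of SQLite integrity-check output ("ok" for a healthy file, a list of page errors otherwise). A minimal Python sketch for re-running that check on a backed-up file by hand, assuming the backup files are plain SQLite databases. The helper name integrity_errors is hypothetical and not part of cbbackup; pass the path of a backed-up file as the first argument.

```python
import sqlite3
import sys


def integrity_errors(db_path):
    """Return the problems reported by PRAGMA integrity_check.

    An empty list means SQLite answered "ok". Note that sqlite3.connect()
    creates an empty database when db_path does not exist, so check the
    path first if that matters.
    """
    conn = sqlite3.connect(db_path)
    try:
        rows = [row[0] for row in conn.execute("PRAGMA integrity_check")]
    except sqlite3.DatabaseError as exc:
        # Raised when the file is too damaged to be read at all,
        # e.g. "database disk image is malformed".
        return [str(exc)]
    finally:
        conn.close()
    return [] if rows == ["ok"] else rows


if __name__ == "__main__" and len(sys.argv) > 1:
    problems = integrity_errors(sys.argv[1])
    print("ok" if not problems else "\n".join(problems))
```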


 Comments   
Comment by Dipti Borkar [ 14/Jun/12 ]
Is there a workaround?
Comment by Karan Kumar (Inactive) [ 14/Jun/12 ]
Oops, sorry, I posted the wrong comment.
Comment by Karan Kumar (Inactive) [ 14/Jun/12 ]
@Steve: Not sure whether you have taken a look at this. This bug was wrongly tagged and did not show up in the filter.
Comment by Steve Yen [ 15/Jun/12 ]
Still trying to reproduce. Running single node centos 1.8.1-910, with 3M items created via a concurrent mix of memcachetest and mcsoda...

  /opt/couchbase/bin/memcachetest -h localhost:11211 -i 1000000 -M 1024 -K cb_0 -l -P 95
  ./pytests/performance/mcsoda.py membase://HOST:8091 max-items=200000 ratio-sets=1.0 vbuckets=1024 doc-gen=0

While the client load tools were running, backup took forever...

  /opt/couchbase/bin/cbbackup /opt/couchbase/var/lib/couchbase/data/default-data/default /tmp/backup0

After stopping the client load tools, the backup eventually finished.

WAL sizes were >1MB, but no malformed issues so far...

# ls -al /opt/couchbase/var/lib/couchbase/data/default-data/
total 1443244
drwxr-xr-x 2 couchbase couchbase 4096 2012-06-15 05:11 .
drwxr-xr-x 3 couchbase couchbase 4096 2012-06-14 08:48 ..
-rw-r--r-- 1 couchbase couchbase 52224 2012-06-15 04:57 default
-rw-r--r-- 1 couchbase couchbase 489541632 2012-06-15 05:09 default-0.mb
-rw-r--r-- 1 couchbase couchbase 32768 2012-06-15 05:08 default-0.mb-shm
-rw-r--r-- 1 couchbase couchbase 2618984 2012-06-15 05:09 default-0.mb-wal
-rw-r--r-- 1 couchbase couchbase 326151168 2012-06-15 05:09 default-1.mb
-rw-r--r-- 1 couchbase couchbase 32768 2012-06-15 05:08 default-1.mb-shm
-rw-r--r-- 1 couchbase couchbase 1618144 2012-06-15 05:09 default-1.mb-wal
-rw-r--r-- 1 couchbase couchbase 327179264 2012-06-15 05:09 default-2.mb
-rw-r--r-- 1 couchbase couchbase 32768 2012-06-15 05:08 default-2.mb-shm
-rw-r--r-- 1 couchbase couchbase 1473520 2012-06-15 05:09 default-2.mb-wal
-rw-r--r-- 1 couchbase couchbase 326189056 2012-06-15 05:09 default-3.mb
-rw-r--r-- 1 couchbase couchbase 32768 2012-06-15 05:08 default-3.mb-shm
-rw-r--r-- 1 couchbase couchbase 1768008 2012-06-15 05:09 default-3.mb-wal
-rw-r--r-- 1 couchbase couchbase 32768 2012-06-15 05:11 default-shm
-rw-r--r-- 1 couchbase couchbase 1084712 2012-06-15 05:11 default-wal

Comment by Steve Yen [ 15/Jun/12 ]
When I run memcachetest/mcsoda at full speed, cbbackup appears to make no progress. It's likely that cbbackup cannot acquire the file locks, since ep-engine holds them and isn't letting go.

When I run client-load-tools at a slower ops/second (max-ops-per-sec), then cbbackup does finish...

    ./pytests/performance/mcsoda.py membase://10.3.121.192:8091 max-items=200000 ratio-sets=0.1 vbuckets=1024 doc-gen=0 cur-items=200000 max-ops-per-sec=10

In either case, I haven't reproduced the "image is malformed" issue yet.
Comment by Dipti Borkar [ 21/Jun/12 ]
Is this using the old cbbackup or TAP backup? Is this still current sprint / P0?
Comment by Farshid Ghods (Inactive) [ 21/Jun/12 ]
Wait until the WAL file size is less than 10 MB before running cbbackup.
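
A rough Python sketch of the workaround above: poll the data directory until every *-wal file is below the 10 MB threshold, then start cbbackup. The function names, polling interval, and timeout are assumptions made for illustration; only the 10 MB limit and the *-wal naming come from this ticket.

```python
import glob
import os
import time

WAL_LIMIT_BYTES = 10 * 1024 * 1024  # 10 MB threshold from the comment above


def largest_wal(data_dir):
    """Return (size_in_bytes, path) of the largest *-wal file, or (0, None)."""
    sizes = []
    for path in glob.glob(os.path.join(data_dir, "*-wal")):
        try:
            sizes.append((os.path.getsize(path), path))
        except OSError:
            pass  # the WAL file vanished between listing and stat
    return max(sizes, default=(0, None))


def wait_for_small_wal(data_dir, limit=WAL_LIMIT_BYTES,
                       poll_seconds=5.0, timeout=600.0):
    """Block until all WAL files are smaller than limit.

    Returns True once the condition holds, False if timeout (seconds,
    an assumed default) expires first.
    """
    deadline = time.monotonic() + timeout
    while True:
        size, _ = largest_wal(data_dir)
        if size < limit:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)
```

Usage would be along the lines of calling wait_for_small_wal("/opt/couchbase/var/lib/couchbase/data/default-data") and running cbbackup only when it returns True. Under sustained write load the WAL may never shrink, so the timeout path needs handling.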
Generated at Sun Sep 21 17:36:57 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.