[MB-6711] size of one vBucket is 10 GB - Rebalance exited with reason replicator_died( many retries to notify CouchDB of update : vbid=294 rev=1) Created: 23/Sep/12  Updated: 26/Oct/12  Resolved: 24/Sep/12

Status: Closed
Project: Couchbase Server
Component/s: couchbase-bucket, ns_server
Affects Version/s: 2.0-beta-2
Fix Version/s: 2.0-beta-2
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Andrei Baranouski Assignee: Jin Lim
Resolution: Fixed Votes: 0
Labels: trunk-green-blockers
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: GZip Archive 3a0ad78b-cf67-4e70-8d24-31eba49e4a41-10.3.121.92-diag.txt.gz     GZip Archive 3a0ad78b-cf67-4e70-8d24-31eba49e4a41-10.3.121.93-diag.txt.gz     GZip Archive 3a0ad78b-cf67-4e70-8d24-31eba49e4a41-10.3.121.95-diag.txt.gz     GZip Archive 3a0ad78b-cf67-4e70-8d24-31eba49e4a41-10.3.121.96-diag.txt.gz     GZip Archive 3a0ad78b-cf67-4e70-8d24-31eba49e4a41-10.3.121.97-diag.txt.gz     GZip Archive 3a0ad78b-cf67-4e70-8d24-31eba49e4a41-10.3.121.98-diag.txt.gz     GZip Archive logs_94_1.tar.gz     GZip Archive logs_94.tar.gz    

 Description   
2.0.0-1751-rel

testrunner -i /tmp/rebalance_regression.ini get-logs=True,disabled_consistent_view=False -t swaprebalance.SwapRebalanceFailedTests.test_failed_swap_rebalance,replica=2,num-buckets=1,num-swap=2,swap-orchestrator=False
http://qa.hq.northscale.net/job/centos-64-2.0-rebalance-regressions/47/consoleFull



2012-09-21 06:26:25.276 menelaus_web_alerts_srv:1:info:message(ns_1@10.3.121.94) - Approaching full disk warning. Usage of disk "/" on node "10.3.121.94" is around 91%.
2012-09-21 06:43:15.462 supervisor_cushion:1:warning:port exited too soon after restart(ns_1@10.3.121.94) - Service memcached exited on node 'ns_1@10.3.121.94' in 0.58s

2012-09-21 06:44:30.881 ns_vbucket_mover:0:critical:message(ns_1@10.3.121.92) - <0.17933.33> exited with {exited,
                          {'EXIT',<0.17965.33>,
                           {replicator_died,
                            {'EXIT',<19812.318.4>,downstream_closed}}}}
2012-09-21 06:44:30.892 ns_memcached:2:info:message(ns_1@10.3.121.96) - Shutting down bucket "bucket-0" on 'ns_1@10.3.121.96' for deletion
2012-09-21 06:44:31.274 ns_orchestrator:2:info:message(ns_1@10.3.121.92) - Rebalance exited with reason {exited,
                              {'EXIT',<0.17965.33>,
                               {replicator_died,
                                {'EXIT',<19812.318.4>,downstream_closed}}}}


Logs on VM 10.3.121.94 contain many retries to notify CouchDB of an update (vbid=294 rev=1):

memcached<0.31491.3>: Fri Sep 21 06:12:14.929498 PDT 3: Retry notify CouchDB of update, vbid=294 rev=1
memcached<0.31491.3>: Fri Sep 21 06:12:14.931793 PDT 3: Retry notify CouchDB of update, vbid=294 rev=1
root@ubuntu1104-64:/opt/couchbase/var/lib/couchbase/logs# more info.5
[ns_server:info,2012-09-21T6:06:34.551,ns_1@10.3.121.94:ns_port_memcached:ns_port_server:log:169]memcached<0.31491.3>: Fri Sep 21 06:06:34.350714 PDT 3: Retry notify CouchDB of update, vbid=294 rev=1
memcached<0.31491.3>: Fri Sep 21 06:06:34.352976 PDT 3: Retry notify CouchDB of update, vbid=294 rev=1
memcached<0.31491.3>: Fri Sep 21 06:06:34.355496 PDT 3: Retry notify CouchDB of update, vbid=294 rev=1
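To quantify how often each vBucket is being retried, the log lines can be tallied per (vbid, rev) pair. The following is a minimal sketch, assuming the message format matches the excerpts above; the function name count_retries and the sample list are illustrative only and not part of any Couchbase tool.

```python
import re
from collections import Counter

# Pattern for the retry message exactly as it appears in the log excerpts.
RETRY_RE = re.compile(r"Retry notify CouchDB of update, vbid=(\d+) rev=(\d+)")


def count_retries(lines):
    """Return a Counter keyed by (vbid, rev) for every retry message found."""
    counts = Counter()
    for line in lines:
        for vbid, rev in RETRY_RE.findall(line):
            counts[(int(vbid), int(rev))] += 1
    return counts


# Sample input copied from the log excerpts in this report.
sample = [
    "memcached<0.31491.3>: Fri Sep 21 06:12:14.929498 PDT 3: Retry notify CouchDB of update, vbid=294 rev=1",
    "memcached<0.31491.3>: Fri Sep 21 06:12:14.931793 PDT 3: Retry notify CouchDB of update, vbid=294 rev=1",
    "2012-09-21 06:44:30.892 ns_memcached:2:info:message(ns_1@10.3.121.96) - Shutting down bucket \"bucket-0\" on 'ns_1@10.3.121.96' for deletion",
]

retry_counts = count_retries(sample)
for (vbid, rev), n in retry_counts.most_common():
    print(f"vbid={vbid} rev={rev}: {n} retries")
```

In practice the lines would be read from the info.* log files rather than from an inline list.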


Is this why the file for vBucket 294 (294.couch.1) has grown to 9.3 GB?


root@ubuntu1104-64:/opt/couchbase/var/lib/couchbase/data/bucket-2# ls -la
total 9337196
drwxr-xr-x 2 couchbase couchbase 4096 2012-09-21 06:44 .
drwxr-xr-x 4 couchbase couchbase 4096 2012-09-21 03:54 ..
-rw-r--r-- 1 couchbase couchbase 442459 2012-09-21 04:11 0.couch.16
-rw-r--r-- 1 couchbase couchbase 397403 2012-09-21 04:11 147.couch.14
-rw-r--r-- 1 couchbase couchbase 438363 2012-09-21 06:44 148.couch.14
-rw-r--r-- 1 couchbase couchbase 413787 2012-09-21 04:11 149.couch.14
-rw-r--r-- 1 couchbase couchbase 417883 2012-09-21 06:44 150.couch.14
-rw-r--r-- 1 couchbase couchbase 454747 2012-09-21 06:44 151.couch.13
-rw-r--r-- 1 couchbase couchbase 442459 2012-09-21 04:11 152.couch.13
-rw-r--r-- 1 couchbase couchbase 430171 2012-09-21 04:11 153.couch.13
-rw-r--r-- 1 couchbase couchbase 413787 2012-09-21 06:44 154.couch.13
-rw-r--r-- 1 couchbase couchbase 409691 2012-09-21 04:11 172.couch.12
-rw-r--r-- 1 couchbase couchbase 413787 2012-09-21 04:11 173.couch.12
-rw-r--r-- 1 couchbase couchbase 426075 2012-09-21 04:11 174.couch.12
-rw-r--r-- 1 couchbase couchbase 405595 2012-09-21 04:11 175.couch.12
-rw-r--r-- 1 couchbase couchbase 397403 2012-09-21 04:11 176.couch.12
-rw-r--r-- 1 couchbase couchbase 438363 2012-09-21 04:11 177.couch.12
-rw-r--r-- 1 couchbase couchbase 405595 2012-09-21 04:11 178.couch.11
-rw-r--r-- 1 couchbase couchbase 426075 2012-09-21 04:11 179.couch.11
-rw-r--r-- 1 couchbase couchbase 413787 2012-09-21 04:11 1.couch.16
-rw-r--r-- 1 couchbase couchbase 1355867 2012-09-21 03:54 25.couch.1
-rw-r--r-- 1 couchbase couchbase 9494917305 2012-09-21 06:43 294.couch.1
-rw-r--r-- 1 couchbase couchbase 430171 2012-09-21 04:11 294.couch.11
-rw-r--r-- 1 couchbase couchbase 421979 2012-09-21 04:11 295.couch.11
-rw-r--r-- 1 couchbase couchbase 405595 2012-09-21 04:11 296.couch.11
-rw-r--r-- 1 couchbase couchbase 426075 2012-09-21 04:11 297.couch.10
-rw-r--r-- 1 couchbase couchbase 409691 2012-09-21 04:11 298.couch.10
-rw-r--r-- 1 couchbase couchbase 426075 2012-09-21 04:11 299.couch.10
-rw-r--r-- 1 couchbase couchbase 389211 2012-09-21 04:11 2.couch.16
-rw-r--r-- 1 couchbase couchbase 421979 2012-09-21 04:11 300.couch.10
-rw-r--r-- 1 couchbase couchbase 405595 2012-09-21 04:11 301.couch.9
-rw-r--r-- 1 couchbase couchbase 594011 2012-09-21 03:54 33.couch.1
-rw-r--r-- 1 couchbase couchbase 430171 2012-09-21 04:11 343.couch.9
-rw-r--r-- 1 couchbase couchbase 426075 2012-09-21 04:11 344.couch.9
-rw-r--r-- 1 couchbase couchbase 438363 2012-09-21 04:11 345.couch.8
-rw-r--r-- 1 couchbase couchbase 454747 2012-09-21 04:11 346.couch.8
-rw-r--r-- 1 couchbase couchbase 405595 2012-09-21 04:11 347.couch.8
-rw-r--r-- 1 couchbase couchbase 401499 2012-09-21 04:11 348.couch.8
-rw-r--r-- 1 couchbase couchbase 421979 2012-09-21 04:11 349.couch.7
-rw-r--r-- 1 couchbase couchbase 426075 2012-09-21 04:11 350.couch.7
-rw-r--r-- 1 couchbase couchbase 430171 2012-09-21 04:11 351.couch.7
-rw-r--r-- 1 couchbase couchbase 409691 2012-09-21 04:11 352.couch.7
-rw-r--r-- 1 couchbase couchbase 405595 2012-09-21 04:11 353.couch.7
-rw-r--r-- 1 couchbase couchbase 401499 2012-09-21 04:11 354.couch.6
-rw-r--r-- 1 couchbase couchbase 426075 2012-09-21 04:11 355.couch.6
-rw-r--r-- 1 couchbase couchbase 397403 2012-09-21 04:11 356.couch.6
-rw-r--r-- 1 couchbase couchbase 442459 2012-09-21 04:11 357.couch.5
-rw-r--r-- 1 couchbase couchbase 438363 2012-09-21 04:11 358.couch.5
-rw-r--r-- 1 couchbase couchbase 450651 2012-09-21 04:11 359.couch.5
-rw-r--r-- 1 couchbase couchbase 421979 2012-09-21 04:11 360.couch.5
-rw-r--r-- 1 couchbase couchbase 405595 2012-09-21 04:11 361.couch.5
-rw-r--r-- 1 couchbase couchbase 421979 2012-09-21 04:11 362.couch.5
-rw-r--r-- 1 couchbase couchbase 430171 2012-09-21 04:11 363.couch.4
-rw-r--r-- 1 couchbase couchbase 426075 2012-09-21 04:11 364.couch.4
-rw-r--r-- 1 couchbase couchbase 430171 2012-09-21 04:11 365.couch.4
-rw-r--r-- 1 couchbase couchbase 450651 2012-09-21 04:11 366.couch.4
-rw-r--r-- 1 couchbase couchbase 426075 2012-09-21 04:11 367.couch.4
-rw-r--r-- 1 couchbase couchbase 430171 2012-09-21 04:11 392.couch.4
-rw-r--r-- 1 couchbase couchbase 413787 2012-09-21 04:11 393.couch.4
-rw-r--r-- 1 couchbase couchbase 413787 2012-09-21 04:11 394.couch.4
-rw-r--r-- 1 couchbase couchbase 413787 2012-09-21 04:11 395.couch.4
-rw-r--r-- 1 couchbase couchbase 442459 2012-09-21 04:11 396.couch.4
-rw-r--r-- 1 couchbase couchbase 417883 2012-09-21 04:11 397.couch.4
-rw-r--r-- 1 couchbase couchbase 430171 2012-09-21 04:11 398.couch.4
-rw-r--r-- 1 couchbase couchbase 23777465 2012-09-21 06:44 399.couch.1
-rw-r--r-- 1 couchbase couchbase 421979 2012-09-21 04:11 399.couch.4
-rw-r--r-- 1 couchbase couchbase 426075 2012-09-21 04:11 3.couch.16
-rw-r--r-- 1 couchbase couchbase 999515 2012-09-21 03:54 42.couch.1
-rw-r--r-- 1 couchbase couchbase 450651 2012-09-21 04:11 4.couch.15
-rw-r--r-- 1 couchbase couchbase 442459 2012-09-21 04:11 594.couch.3
-rw-r--r-- 1 couchbase couchbase 450651 2012-09-21 04:11 595.couch.3
-rw-r--r-- 1 couchbase couchbase 426075 2012-09-21 04:11 596.couch.3
-rw-r--r-- 1 couchbase couchbase 401499 2012-09-21 04:11 597.couch.3
-rw-r--r-- 1 couchbase couchbase 430171 2012-09-21 04:11 598.couch.3
-rw-r--r-- 1 couchbase couchbase 421979 2012-09-21 04:11 599.couch.3
-rw-r--r-- 1 couchbase couchbase 401499 2012-09-21 04:11 5.couch.15
-rw-r--r-- 1 couchbase couchbase 454747 2012-09-21 04:11 600.couch.3
-rw-r--r-- 1 couchbase couchbase 413787 2012-09-21 04:11 601.couch.2
-rw-r--r-- 1 couchbase couchbase 409691 2012-09-21 04:11 634.couch.2
-rw-r--r-- 1 couchbase couchbase 417883 2012-09-21 04:11 635.couch.2
-rw-r--r-- 1 couchbase couchbase 405595 2012-09-21 04:11 636.couch.2
-rw-r--r-- 1 couchbase couchbase 442459 2012-09-21 04:11 637.couch.2
-rw-r--r-- 1 couchbase couchbase 426075 2012-09-21 04:11 638.couch.2
-rw-r--r-- 1 couchbase couchbase 421979 2012-09-21 04:11 639.couch.2
-rw-r--r-- 1 couchbase couchbase 426075 2012-09-21 04:11 640.couch.2
-rw-r--r-- 1 couchbase couchbase 434267 2012-09-21 04:11 641.couch.2
-rw-r--r-- 1 couchbase couchbase 413787 2012-09-21 04:11 642.couch.2
-rw-r--r-- 1 couchbase couchbase 430171 2012-09-21 04:11 643.couch.2
-rw-r--r-- 1 couchbase couchbase 434267 2012-09-21 04:11 644.couch.2
-rw-r--r-- 1 couchbase couchbase 401499 2012-09-21 04:11 645.couch.2
-rw-r--r-- 1 couchbase couchbase 438363 2012-09-21 04:11 646.couch.2
-rw-r--r-- 1 couchbase couchbase 446555 2012-09-21 04:11 647.couch.2
-rw-r--r-- 1 couchbase couchbase 413787 2012-09-21 04:11 648.couch.2
-rw-r--r-- 1 couchbase couchbase 446555 2012-09-21 04:11 649.couch.2
-rw-r--r-- 1 couchbase couchbase 417883 2012-09-21 04:11 6.couch.15
-rw-r--r-- 1 couchbase couchbase 426075 2012-09-21 04:11 7.couch.15
-rw-r--r-- 1 couchbase couchbase 417883 2012-09-21 04:11 8.couch.15
-rw-r--r-- 1 couchbase couchbase 4130 2012-09-21 04:11 master.couch.16
-rw-r--r-- 1 couchbase couchbase 18915 2012-09-21 04:29 stats.json
-rw-r--r-- 1 couchbase couchbase 0 2012-09-21 06:44 stats.json.new
-rw-r--r-- 1 couchbase couchbase 18915 2012-09-21 04:28 stats.json.old
root@ubuntu1104-64:/opt/couchbase/var/lib/couchbase/data/bucket-2# ^C








 Comments   
Comment by Aleksey Kondratenko [ 23/Sep/12 ]
This looks like the directory-listing race that we fixed in the capi view manager, but from the perspective of ep-engine.

That is, you can see that vBucket 294 is at file revision 11, while ep-engine apparently tries to write to revision 1.

My guess is that ep-engine did not see 294.couch.11 because of a race between the vBucket file listing and the end of compaction, and started writing revision 1 on the assumption that no vBucket file existed. Perhaps something is then also broken in the retry logic.
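The symptom described here (an old revision file sitting next to a newer one for the same vBucket) can be spotted mechanically from a directory listing. The following is a minimal sketch, assuming file names follow the "<vbid>.couch.<rev>" pattern seen in the listing; find_stale_revisions is a hypothetical helper, not ep-engine or ns_server code, and the sample sizes are copied from the listing in the description.

```python
import re
from collections import defaultdict

# Data files in the listing are named "<vbid>.couch.<rev>".
COUCH_FILE_RE = re.compile(r"^(\d+)\.couch\.(\d+)$")


def find_stale_revisions(entries):
    """entries: iterable of (file_name, size_in_bytes).

    Returns {vbid: [(rev, size), ...]} for every vBucket that has more
    than one revision file on disk, with revisions in ascending order.
    """
    by_vbid = defaultdict(list)
    for name, size in entries:
        match = COUCH_FILE_RE.match(name)
        if match:  # skips master.couch.16, stats.json, and similar
            by_vbid[int(match.group(1))].append((int(match.group(2)), size))
    return {vbid: sorted(revs) for vbid, revs in by_vbid.items() if len(revs) > 1}


# A few entries taken from the bucket-2 listing.
listing = [
    ("294.couch.1", 9494917305),
    ("294.couch.11", 430171),
    ("295.couch.11", 421979),
    ("399.couch.1", 23777465),
    ("399.couch.4", 421979),
    ("master.couch.16", 4130),
    ("stats.json", 18915),
]

stale = find_stale_revisions(listing)
for vbid, revs in sorted(stale.items()):
    print(vbid, revs)
```

On this sample, vBuckets 294 and 399 are reported, each with a revision-1 file that is far larger than its current revision.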
Comment by Farshid Ghods (Inactive) [ 23/Sep/12 ]
I think this is due to persistence not working on the latest builds
Comment by Farshid Ghods (Inactive) [ 23/Sep/12 ]
Assigning to epengine team
Comment by Andrei Baranouski [ 23/Sep/12 ]
Set this bug to Critical because many tests are failing and this issue is causing new bugs to be opened:
MB-6710 Deleting bucket on a cluster gives error "Some nodes are still deleting bucket"
MB-6712 API pools/default/buckets/ doesn't return any buckets but attempt to create bucket gives: Bucket with given name still exists

tests:
1)
http://qa.hq.northscale.net/job/centos-64-2.0-view-query-extended-tests/70/consoleFull

[2012-09-22 17:39:44,219] - [rest_client:96] INFO - existing buckets : []
[2012-09-22 17:39:44,226] - [rest_client:1234] INFO - http://10.2.2.60:8091/pools/default/buckets with param: proxyPort=11211&bucketType=membase&authType=sasl&replicaIndex=1&name=default&saslPassword=&replicaNumber=1&ramQuotaMB=1456
[2012-09-22 17:39:44,236] - [rest_client:582] ERROR - http://10.2.2.60:8091/pools/default/buckets error 503 reason: unknown {"_":"Bucket with given name still exists"}
[2012-09-22 17:39:44,246] - [bucket_helper:124] INFO - deleting existing buckets on [ip:10.2.2.60 port:8091 ssh_username:root, ip:10.2.2.108 port:8091 ssh_username:root, ip:10.2.2.63 port:8091 ssh_username:root, ip:10.2.2.64 port:8091 ssh_username:root, ip:10.2.2.65 port:8091 ssh_username:root]
[2012-09-22 17:39:44,321] - [cluster_helper:199] INFO - rebalancing all nodes in order to remove nodes
[2012-09-22 17:39:44,326] - [rest_client:826] INFO - rebalance params : password=password&ejectedNodes=ns_1%4010.2.2.108&user=Administrator&knownNodes=ns_1%4010.2.2.60%2Cns_1%4010.2.2.108
[2012-09-22 17:39:44,331] - [rest_client:833] INFO - rebalance operation started
[2012-09-22 17:39:44,336] - [rest_client:929] INFO - rebalance percentage : 0 %
[2012-09-22 17:39:46,341] - [rest_client:929] INFO - rebalance percentage : 0 %
[2012-09-22 17:39:48,345] - [rest_client:929] INFO - rebalance percentage : 0 %
[2012-09-22 17:39:50,350] - [rest_client:929] INFO - rebalance percentage : 0 %
[2012-09-22 17:39:52,354] - [rest_client:929] INFO - rebalance percentage : 0 %
[2012-09-22 17:39:54,359] - [rest_client:929] INFO - rebalance percentage : 0 %
[2012-09-22 17:39:56,368] - [rest_client:929] INFO - rebalance percentage : 0 %
[2012-09-22 17:39:58,372] - [rest_client:929] INFO - rebalance percentage : 0 %
[2012-09-22 17:40:00,377] - [rest_client:929] INFO - rebalance percentage : 0 %
[2012-09-22 17:40:02,381] - [rest_client:929] INFO - rebalance percentage : 0 %
[2012-09-22 17:40:04,386] - [rest_client:914] ERROR - {u'status': u'none', u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try rebalance again.'} - rebalance failed
ERROR


This shows that a user can trigger a rebalance (which will fail) while the deleted bucket has not actually been deleted (separate bug?).


2)http://qa.hq.northscale.net/job/centos-64-2.0-new-rebalance/77/consoleFull

The rebalance hangs at the same progress value, disk usage keeps growing, and the rebalance will eventually fail due to lack of space.

andrei ~/repository/testrunner $ scripts/ssh.py -i andrei_rebalance.ini "ls -la /opt/couchbase/var/lib/couchbase/data/default"
10.3.3.91
total 29165028
drwxr-xr-x 2 couchbase couchbase 4096 Sep 22 17:25 .
drwxr-xr-x 4 couchbase couchbase 4096 Sep 22 16:55 ..
-rw-r--r-- 1 couchbase couchbase 82011 Sep 22 16:56 56.couch.2
...
-rw-r--r-- 1 couchbase couchbase 29833224378 Sep 23 01:30 95.couch.1
-rw-r--r-- 1 couchbase couchbase 77915 Sep 22 16:56 95.couch.2
....

ls: /opt/couchbase/var/lib/couchbase/data/default: No such file or directory
10.3.3.99
total 32114992
drwxr-xr-x 2 couchbase couchbase 4096 Sep 22 17:24 .
drwxr-xr-x 4 couchbase couchbase 4096 Sep 22 16:54 ..
-rw-r--r-- 1 couchbase couchbase 77915 Sep 22 16:57 103.couch.2
....
-rw-r--r-- 1 couchbase couchbase 82011 Sep 22 16:57 69.couch.2
-rw-r--r-- 1 couchbase couchbase 32847953966 Sep 23 01:30 70.couch.1
-rw-r--r-- 1 couchbase couchbase 82011 Sep 22 16:57 70.couch.2
.......


10.3.3.82
total 29765164
drwxr-xr-x 2 couchbase couchbase 4096 Sep 22 17:25 .
drwxr-xr-x 4 couchbase couchbase 4096 Sep 22 16:55 ..
-rw-r--r-- 1 couchbase couchbase 77915 Sep 22 16:56 100.couch.2
.....
-rw-r--r-- 1 couchbase couchbase 82011 Sep 22 16:56 94.couch.2
-rw-r--r-- 1 couchbase couchbase 30447755449 Sep 23 01:30 95.couch.1
-rw-r--r-- 1 couchbase couchbase 77915 Sep 22 16:56 95.couch.2
......


10.3.3.93
total 35250148
drwxr-xr-x 2 couchbase couchbase 4096 Sep 22 16:55 .
drwxr-xr-x 4 couchbase couchbase 4096 Sep 22 16:54 ..
-rw-r--r-- 1 couchbase couchbase 82011 Sep 22 16:55 0.couch.3
.....
-rw-r--r-- 1 couchbase couchbase 77915 Sep 22 16:55 5.couch.3
-rw-r--r-- 1 couchbase couchbase 36052365358 Sep 23 01:30 50.couch.1
-rw-r--r-- 1 couchbase couchbase 82011 Sep 22 16:54 50.couch.2
..


Comment by Andrei Baranouski [ 23/Sep/12 ]
Raising it further, to Blocker.
Comment by Chiyoung Seo [ 23/Sep/12 ]
Jin,

I think this is a regression from our recent changes that fixed the Windows issue.
Comment by Farshid Ghods (Inactive) [ 24/Sep/12 ]
http://review.couchbase.org/#/c/21056/

The fix was merged.
Comment by Thuan Nguyen [ 24/Sep/12 ]
Integrated in github-ep-engine-2-0 #433 (See [http://qa.hq.northscale.net/job/github-ep-engine-2-0/433/])
    MB-6711 Do not create new db file with old revision number (Revision cbc03de3b0cda2b7d7bb8bbddcdf161e4b6c0f84)

     Result = SUCCESS
Jin Lim :
Files :
* src/couch-kvstore/couch-kvstore.cc
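
The rule named in the commit message ("Do not create new db file with old revision number") can be illustrated with a short sketch. This is not the code from couch-kvstore.cc, and the actual change may work differently; revision_for_new_file is a hypothetical helper that simply picks a revision higher than any revision already on disk for the given vBucket. It does not address the listing race itself.

```python
import re

# Data files are named "<vbid>.couch.<rev>".
COUCH_FILE_RE = re.compile(r"^(\d+)\.couch\.(\d+)$")


def revision_for_new_file(vbid, file_names):
    """Return a revision number for a new file of `vbid`.

    The result is one higher than the highest revision already present
    in `file_names` for that vBucket, or 1 if no file exists yet.
    """
    existing = []
    for name in file_names:
        match = COUCH_FILE_RE.match(name)
        if match and int(match.group(1)) == vbid:
            existing.append(int(match.group(2)))
    return max(existing, default=0) + 1


# File names taken from the bucket-2 listing in the description.
names = ["294.couch.11", "295.couch.11", "master.couch.16", "stats.json"]
print(revision_for_new_file(294, names))
print(revision_for_new_file(25, names))
```

With these inputs, vBucket 294 would get revision 12 rather than 1, while a vBucket with no file on disk still starts at 1.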
Comment by Farshid Ghods (Inactive) [ 25/Sep/12 ]
Andrei
Please close the issue if it's resolved in the latest builds
Comment by Andrei Baranouski [ 25/Sep/12 ]
Yes, it was fixed in build 1757.
Comment by kzeller [ 26/Oct/12 ]
RN: Replication exited with a replicator_died message after
multiple attempts to update. The problem was caused by using
old revision numbers for new database files. New database
files now use new revision numbers, resolving the problem.