[MB-6711] size of one vBucket is 10 GB - Rebalance exited with reason replicator_died( many retries to notify CouchDB of update : vbid=294 rev=1) Created: 23/Sep/12 Updated: 26/Oct/12 Resolved: 24/Sep/12 |
|
| Status: | Closed |
| Project: | Couchbase Server |
| Component/s: | couchbase-bucket, ns_server |
| Affects Version/s: | 2.0-beta-2 |
| Fix Version/s: | 2.0-beta-2 |
| Security Level: | Public |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Andrei Baranouski | Assignee: | Jin Lim |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | trunk-green-blockers | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Description |
|
2.0.0-1751-rel
testrunner -i /tmp/rebalance_regression.ini get-logs=True,disabled_consistent_view=False -t swaprebalance.SwapRebalanceFailedTests.test_failed_swap_rebalance,replica=2,num-buckets=1,num-swap=2,swap-orchestrator=False
http://qa.hq.northscale.net/job/centos-64-2.0-rebalance-regressions/47/consoleFull
2012-09-21 06:26:25.276 menelaus_web_alerts_srv:1:info:message(ns_1@10.3.121.94) - Approaching full disk warning. Usage of disk "/" on node "10.3.121.94" is around 91%.
2012-09-21 06:43:15.462 supervisor_cushion:1:warning:port exited too soon after restart(ns_1@10.3.121.94) - Service memcached exited on node 'ns_1@10.3.121.94' in 0.58s
2012-09-21 06:44:30.881 ns_vbucket_mover:0:critical:message(ns_1@10.3.121.92) - <0.17933.33> exited with {exited, {'EXIT',<0.17965.33>, {replicator_died, {'EXIT',<19812.318.4>,downstream_closed}}}}
2012-09-21 06:44:30.892 ns_memcached:2:info:message(ns_1@10.3.121.96) - Shutting down bucket "bucket-0" on 'ns_1@10.3.121.96' for deletion
2012-09-21 06:44:31.274 ns_orchestrator:2:info:message(ns_1@10.3.121.92) - Rebalance exited with reason {exited, {'EXIT',<0.17965.33>, {replicator_died, {'EXIT',<19812.318.4>,downstream_closed}}}}
Logs on VM 10.3.121.94 contain many retries to notify CouchDB of an update (vbid=294 rev=1):
memcached<0.31491.3>: Fri Sep 21 06:12:14.929498 PDT 3: Retry notify CouchDB of update, vbid=294 rev=1
memcached<0.31491.3>: Fri Sep 21 06:12:14.931793 PDT 3: Retry notify CouchDB of update, vbid=294 rev=1
root@ubuntu1104-64:/opt/couchbase/var/lib/couchbase/logs# more info.5
[ns_server:info,2012-09-21T6:06:34.551,ns_1@10.3.121.94:ns_port_memcached:ns_port_server:log:169]memcached<0.31491.3>: Fri Sep 21 06:06:34.350714 PDT 3: Retry notify CouchDB of update, vbid=294 rev=1
memcached<0.31491.3>: Fri Sep 21 06:06:34.352976 PDT 3: Retry notify CouchDB of update, vbid=294 rev=1
memcached<0.31491.3>: Fri Sep 21 06:06:34.355496 PDT 3: Retry notify CouchDB of update, vbid=294 rev=1
As a result, the size of vbucket 294 is 9.3 GB?
root@ubuntu1104-64:/opt/couchbase/var/lib/couchbase/data/bucket-2# ls -la
total 9337196
drwxr-xr-x 2 couchbase couchbase 4096 2012-09-21 06:44 .
drwxr-xr-x 4 couchbase couchbase 4096 2012-09-21 03:54 ..
-rw-r--r-- 1 couchbase couchbase 442459 2012-09-21 04:11 0.couch.16
-rw-r--r-- 1 couchbase couchbase 397403 2012-09-21 04:11 147.couch.14
-rw-r--r-- 1 couchbase couchbase 438363 2012-09-21 06:44 148.couch.14
-rw-r--r-- 1 couchbase couchbase 413787 2012-09-21 04:11 149.couch.14
-rw-r--r-- 1 couchbase couchbase 417883 2012-09-21 06:44 150.couch.14
-rw-r--r-- 1 couchbase couchbase 454747 2012-09-21 06:44 151.couch.13
-rw-r--r-- 1 couchbase couchbase 442459 2012-09-21 04:11 152.couch.13
-rw-r--r-- 1 couchbase couchbase 430171 2012-09-21 04:11 153.couch.13
-rw-r--r-- 1 couchbase couchbase 413787 2012-09-21 06:44 154.couch.13
-rw-r--r-- 1 couchbase couchbase 409691 2012-09-21 04:11 172.couch.12
-rw-r--r-- 1 couchbase couchbase 413787 2012-09-21 04:11 173.couch.12
-rw-r--r-- 1 couchbase couchbase 426075 2012-09-21 04:11 174.couch.12
-rw-r--r-- 1 couchbase couchbase 405595 2012-09-21 04:11 175.couch.12
-rw-r--r-- 1 couchbase couchbase 397403 2012-09-21 04:11 176.couch.12
-rw-r--r-- 1 couchbase couchbase 438363 2012-09-21 04:11 177.couch.12
-rw-r--r-- 1 couchbase couchbase 405595 2012-09-21 04:11 178.couch.11
-rw-r--r-- 1 couchbase couchbase 426075 2012-09-21 04:11 179.couch.11
-rw-r--r-- 1 couchbase couchbase 413787 2012-09-21 04:11 1.couch.16
-rw-r--r-- 1 couchbase couchbase 1355867 2012-09-21 03:54 25.couch.1
-rw-r--r-- 1 couchbase couchbase 9494917305 2012-09-21 06:43 294.couch.1
-rw-r--r-- 1 couchbase couchbase 430171 2012-09-21 04:11 294.couch.11
-rw-r--r-- 1 couchbase couchbase 421979 2012-09-21 04:11 295.couch.11
-rw-r--r-- 1 couchbase couchbase 405595 2012-09-21 04:11 296.couch.11
-rw-r--r-- 1 couchbase couchbase 426075 2012-09-21 04:11 297.couch.10
-rw-r--r-- 1 couchbase couchbase 409691 2012-09-21 04:11 298.couch.10
-rw-r--r-- 1 couchbase couchbase 426075 2012-09-21 04:11 299.couch.10
-rw-r--r-- 1 couchbase couchbase 389211 2012-09-21 04:11 2.couch.16
-rw-r--r-- 1 couchbase couchbase 421979 2012-09-21 04:11 300.couch.10
-rw-r--r-- 1 couchbase couchbase 405595 2012-09-21 04:11 301.couch.9
-rw-r--r-- 1 couchbase couchbase 594011 2012-09-21 03:54 33.couch.1
-rw-r--r-- 1 couchbase couchbase 430171 2012-09-21 04:11 343.couch.9
-rw-r--r-- 1 couchbase couchbase 426075 2012-09-21 04:11 344.couch.9
-rw-r--r-- 1 couchbase couchbase 438363 2012-09-21 04:11 345.couch.8
-rw-r--r-- 1 couchbase couchbase 454747 2012-09-21 04:11 346.couch.8
-rw-r--r-- 1 couchbase couchbase 405595 2012-09-21 04:11 347.couch.8
-rw-r--r-- 1 couchbase couchbase 401499 2012-09-21 04:11 348.couch.8
-rw-r--r-- 1 couchbase couchbase 421979 2012-09-21 04:11 349.couch.7
-rw-r--r-- 1 couchbase couchbase 426075 2012-09-21 04:11 350.couch.7
-rw-r--r-- 1 couchbase couchbase 430171 2012-09-21 04:11 351.couch.7
-rw-r--r-- 1 couchbase couchbase 409691 2012-09-21 04:11 352.couch.7
-rw-r--r-- 1 couchbase couchbase 405595 2012-09-21 04:11 353.couch.7
-rw-r--r-- 1 couchbase couchbase 401499 2012-09-21 04:11 354.couch.6
-rw-r--r-- 1 couchbase couchbase 426075 2012-09-21 04:11 355.couch.6
-rw-r--r-- 1 couchbase couchbase 397403 2012-09-21 04:11 356.couch.6
-rw-r--r-- 1 couchbase couchbase 442459 2012-09-21 04:11 357.couch.5
-rw-r--r-- 1 couchbase couchbase 438363 2012-09-21 04:11 358.couch.5
-rw-r--r-- 1 couchbase couchbase 450651 2012-09-21 04:11 359.couch.5
-rw-r--r-- 1 couchbase couchbase 421979 2012-09-21 04:11 360.couch.5
-rw-r--r-- 1 couchbase couchbase 405595 2012-09-21 04:11 361.couch.5
-rw-r--r-- 1 couchbase couchbase 421979 2012-09-21 04:11 362.couch.5
-rw-r--r-- 1 couchbase couchbase 430171 2012-09-21 04:11 363.couch.4
-rw-r--r-- 1 couchbase couchbase 426075 2012-09-21 04:11 364.couch.4
-rw-r--r-- 1 couchbase couchbase 430171 2012-09-21 04:11 365.couch.4
-rw-r--r-- 1 couchbase couchbase 450651 2012-09-21 04:11 366.couch.4
-rw-r--r-- 1 couchbase couchbase 426075 2012-09-21 04:11 367.couch.4
-rw-r--r-- 1 couchbase couchbase 430171 2012-09-21 04:11 392.couch.4
-rw-r--r-- 1 couchbase couchbase 413787 2012-09-21 04:11 393.couch.4
-rw-r--r-- 1 couchbase couchbase 413787 2012-09-21 04:11 394.couch.4
-rw-r--r-- 1 couchbase couchbase 413787 2012-09-21 04:11 395.couch.4
-rw-r--r-- 1 couchbase couchbase 442459 2012-09-21 04:11 396.couch.4
-rw-r--r-- 1 couchbase couchbase 417883 2012-09-21 04:11 397.couch.4
-rw-r--r-- 1 couchbase couchbase 430171 2012-09-21 04:11 398.couch.4
-rw-r--r-- 1 couchbase couchbase 23777465 2012-09-21 06:44 399.couch.1
-rw-r--r-- 1 couchbase couchbase 421979 2012-09-21 04:11 399.couch.4
-rw-r--r-- 1 couchbase couchbase 426075 2012-09-21 04:11 3.couch.16
-rw-r--r-- 1 couchbase couchbase 999515 2012-09-21 03:54 42.couch.1
-rw-r--r-- 1 couchbase couchbase 450651 2012-09-21 04:11 4.couch.15
-rw-r--r-- 1 couchbase couchbase 442459 2012-09-21 04:11 594.couch.3
-rw-r--r-- 1 couchbase couchbase 450651 2012-09-21 04:11 595.couch.3
-rw-r--r-- 1 couchbase couchbase 426075 2012-09-21 04:11 596.couch.3
-rw-r--r-- 1 couchbase couchbase 401499 2012-09-21 04:11 597.couch.3
-rw-r--r-- 1 couchbase couchbase 430171 2012-09-21 04:11 598.couch.3
-rw-r--r-- 1 couchbase couchbase 421979 2012-09-21 04:11 599.couch.3
-rw-r--r-- 1 couchbase couchbase 401499 2012-09-21 04:11 5.couch.15
-rw-r--r-- 1 couchbase couchbase 454747 2012-09-21 04:11 600.couch.3
-rw-r--r-- 1 couchbase couchbase 413787 2012-09-21 04:11 601.couch.2
-rw-r--r-- 1 couchbase couchbase 409691 2012-09-21 04:11 634.couch.2
-rw-r--r-- 1 couchbase couchbase 417883 2012-09-21 04:11 635.couch.2
-rw-r--r-- 1 couchbase couchbase 405595 2012-09-21 04:11 636.couch.2
-rw-r--r-- 1 couchbase couchbase 442459 2012-09-21 04:11 637.couch.2
-rw-r--r-- 1 couchbase couchbase 426075 2012-09-21 04:11 638.couch.2
-rw-r--r-- 1 couchbase couchbase 421979 2012-09-21 04:11 639.couch.2
-rw-r--r-- 1 couchbase couchbase 426075 2012-09-21 04:11 640.couch.2
-rw-r--r-- 1 couchbase couchbase 434267 2012-09-21 04:11 641.couch.2
-rw-r--r-- 1 couchbase couchbase 413787 2012-09-21 04:11 642.couch.2
-rw-r--r-- 1 couchbase couchbase 430171 2012-09-21 04:11 643.couch.2
-rw-r--r-- 1 couchbase couchbase 434267 2012-09-21 04:11 644.couch.2
-rw-r--r-- 1 couchbase couchbase 401499 2012-09-21 04:11 645.couch.2
-rw-r--r-- 1 couchbase couchbase 438363 2012-09-21 04:11 646.couch.2
-rw-r--r-- 1 couchbase couchbase 446555 2012-09-21 04:11 647.couch.2
-rw-r--r-- 1 couchbase couchbase 413787 2012-09-21 04:11 648.couch.2
-rw-r--r-- 1 couchbase couchbase 446555 2012-09-21 04:11 649.couch.2
-rw-r--r-- 1 couchbase couchbase 417883 2012-09-21 04:11 6.couch.15
-rw-r--r-- 1 couchbase couchbase 426075 2012-09-21 04:11 7.couch.15
-rw-r--r-- 1 couchbase couchbase 417883 2012-09-21 04:11 8.couch.15
-rw-r--r-- 1 couchbase couchbase 4130 2012-09-21 04:11 master.couch.16
-rw-r--r-- 1 couchbase couchbase 18915 2012-09-21 04:29 stats.json
-rw-r--r-- 1 couchbase couchbase 0 2012-09-21 06:44 stats.json.new
-rw-r--r-- 1 couchbase couchbase 18915 2012-09-21 04:28 stats.json.old
root@ubuntu1104-64:/opt/couchbase/var/lib/couchbase/data/bucket-2# ^C |
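For anyone triaging a similar listing, the oversized file can be picked out mechanically. The following is a minimal sketch, not part of testrunner or Couchbase: the helper names `parse_ls_line` and `oversized_couch_files` and the "100 times the median" threshold are assumptions chosen for illustration. It only relies on the `<vbid>.couch.<rev>` naming and the `ls -la` column layout visible in the listing above.

```python
import re
from statistics import median

# Matches "<vbid>.couch.<rev>" file names as they appear in the listing.
COUCH_FILE = re.compile(r"^(\d+)\.couch\.(\d+)$")


def parse_ls_line(line):
    """Return (name, size_in_bytes) for a regular-file line of `ls -la`, else None."""
    fields = line.split()
    # Regular files start with "-"; size is the 5th column, name is the last.
    if len(fields) < 8 or not fields[0].startswith("-"):
        return None
    return fields[-1], int(fields[4])


def oversized_couch_files(lines, factor=100):
    """List couch files whose size exceeds `factor` times the median size.

    The factor is an arbitrary illustrative threshold, not a Couchbase setting.
    """
    sizes = {}
    for line in lines:
        parsed = parse_ls_line(line)
        if parsed and COUCH_FILE.match(parsed[0]):
            sizes[parsed[0]] = parsed[1]
    if not sizes:
        return []
    typical = median(sizes.values())
    return sorted(name for name, size in sizes.items() if size > factor * typical)


sample = [
    "-rw-r--r-- 1 couchbase couchbase 442459 2012-09-21 04:11 0.couch.16",
    "-rw-r--r-- 1 couchbase couchbase 9494917305 2012-09-21 06:43 294.couch.1",
    "-rw-r--r-- 1 couchbase couchbase 430171 2012-09-21 04:11 294.couch.11",
    "-rw-r--r-- 1 couchbase couchbase 421979 2012-09-21 04:11 295.couch.11",
]
print(oversized_couch_files(sample))  # ['294.couch.1']
```

The median is used rather than the mean so that a single multi-gigabyte file does not drag the baseline up with it.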
| Comments |
| Comment by Aleksey Kondratenko [ 23/Sep/12 ] |
|
This looks like the directory-listing race that we fixed in the capi view manager, but seen from the perspective of ep-engine. That is, you can see that vbucket 294 is at file revision 11, while ep-engine apparently tries to write to revision 1. My guess is that ep-engine did not see 294.11 because of a race between the vbucket listing and the end of compaction, and started writing revision 1 on the assumption that there was no vbucket file. Perhaps something is then also broken in the retry logic. |
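The symptom described in this comment (a low-numbered file sitting next to a higher revision of the same vbucket) can be checked for from a directory listing. A rough sketch follows; the function names are hypothetical and this is not ep-engine code, it only assumes the `<vbid>.couch.<rev>` file naming seen in the description.

```python
import re
from collections import defaultdict

# "<vbid>.couch.<rev>"; non-numeric names such as master.couch.16 are skipped.
COUCH_FILE = re.compile(r"^(\d+)\.couch\.(\d+)$")


def revisions_by_vbucket(filenames):
    """Map each vbucket id to the sorted list of revisions found on disk."""
    revs = defaultdict(list)
    for name in filenames:
        match = COUCH_FILE.match(name)
        if match:
            revs[int(match.group(1))].append(int(match.group(2)))
    return {vbid: sorted(found) for vbid, found in revs.items()}


def stale_revisions(filenames):
    """Return {vbid: (latest_rev, [older_revs])} for vbuckets with several files."""
    result = {}
    for vbid, found in revisions_by_vbucket(filenames).items():
        if len(found) > 1:
            result[vbid] = (found[-1], found[:-1])
    return result


# File names taken from the listing in the description.
names = [
    "0.couch.16", "294.couch.1", "294.couch.11", "295.couch.11",
    "399.couch.1", "399.couch.4", "master.couch.16", "stats.json",
]
print(stale_revisions(names))  # {294: (11, [1]), 399: (4, [1])}
```

On the listing in the description this flags vbuckets 294 and 399, both of which have a `.couch.1` file alongside a higher revision.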
| Comment by Farshid Ghods [ 23/Sep/12 ] |
| I think this is due to persistence not working on the latest builds |
| Comment by Farshid Ghods [ 23/Sep/12 ] |
| Assigning to epengine team |
| Comment by Andrei Baranouski [ 23/Sep/12 ] |
|
Set this bug to Critical because many tests are failing and this issue is causing new bugs to be opened:
Tests:
1) http://qa.hq.northscale.net/job/centos-64-2.0-view-query-extended-tests/70/consoleFull
[2012-09-22 17:39:44,219] - [rest_client:96] INFO - existing buckets : []
[2012-09-22 17:39:44,226] - [rest_client:1234] INFO - http://10.2.2.60:8091/pools/default/buckets with param: proxyPort=11211&bucketType=membase&authType=sasl&replicaIndex=1&name=default&saslPassword=&replicaNumber=1&ramQuotaMB=1456
[2012-09-22 17:39:44,236] - [rest_client:582] ERROR - http://10.2.2.60:8091/pools/default/buckets error 503 reason: unknown {"_":"Bucket with given name still exists"}
[2012-09-22 17:39:44,246] - [bucket_helper:124] INFO - deleting existing buckets on [ip:10.2.2.60 port:8091 ssh_username:root, ip:10.2.2.108 port:8091 ssh_username:root, ip:10.2.2.63 port:8091 ssh_username:root, ip:10.2.2.64 port:8091 ssh_username:root, ip:10.2.2.65 port:8091 ssh_username:root]
[2012-09-22 17:39:44,321] - [cluster_helper:199] INFO - rebalancing all nodes in order to remove nodes
[2012-09-22 17:39:44,326] - [rest_client:826] INFO - rebalance params : password=password&ejectedNodes=ns_1%4010.2.2.108&user=Administrator&knownNodes=ns_1%4010.2.2.60%2Cns_1%4010.2.2.108
[2012-09-22 17:39:44,331] - [rest_client:833] INFO - rebalance operation started
[2012-09-22 17:39:44,336] - [rest_client:929] INFO - rebalance percentage : 0 %
[2012-09-22 17:39:46,341] - [rest_client:929] INFO - rebalance percentage : 0 %
[2012-09-22 17:39:48,345] - [rest_client:929] INFO - rebalance percentage : 0 %
[2012-09-22 17:39:50,350] - [rest_client:929] INFO - rebalance percentage : 0 %
[2012-09-22 17:39:52,354] - [rest_client:929] INFO - rebalance percentage : 0 %
[2012-09-22 17:39:54,359] - [rest_client:929] INFO - rebalance percentage : 0 %
[2012-09-22 17:39:56,368] - [rest_client:929] INFO - rebalance percentage : 0 %
[2012-09-22 17:39:58,372] - [rest_client:929] INFO - rebalance percentage : 0 %
[2012-09-22 17:40:00,377] - [rest_client:929] INFO - rebalance percentage : 0 %
[2012-09-22 17:40:02,381] - [rest_client:929] INFO - rebalance percentage : 0 %
[2012-09-22 17:40:04,386] - [rest_client:914] ERROR - {u'status': u'none', u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try rebalance again.'} - rebalance failed
This ERROR shows that a user can trigger a rebalance (which will then fail) while a deleted bucket has not actually been deleted (separate bug?).
2) http://qa.hq.northscale.net/job/centos-64-2.0-new-rebalance/77/consoleFull
Rebalance hangs at the same progress value while disk usage keeps growing, so the rebalance will fail for lack of space.
andrei ~/repository/testrunner $ scripts/ssh.py -i andrei_rebalance.ini "ls -la /opt/couchbase/var/lib/couchbase/data/default"
10.3.3.91
total 29165028
drwxr-xr-x 2 couchbase couchbase 4096 Sep 22 17:25 .
drwxr-xr-x 4 couchbase couchbase 4096 Sep 22 16:55 ..
-rw-r--r-- 1 couchbase couchbase 82011 Sep 22 16:56 56.couch.2
...
-rw-r--r-- 1 couchbase couchbase 29833224378 Sep 23 01:30 95.couch.1
-rw-r--r-- 1 couchbase couchbase 77915 Sep 22 16:56 95.couch.2
....
ls: /opt/couchbase/var/lib/couchbase/data/default: No such file or directory
10.3.3.99
total 32114992
drwxr-xr-x 2 couchbase couchbase 4096 Sep 22 17:24 .
drwxr-xr-x 4 couchbase couchbase 4096 Sep 22 16:54 ..
-rw-r--r-- 1 couchbase couchbase 77915 Sep 22 16:57 103.couch.2
....
-rw-r--r-- 1 couchbase couchbase 82011 Sep 22 16:57 69.couch.2
-rw-r--r-- 1 couchbase couchbase 32847953966 Sep 23 01:30 70.couch.1
-rw-r--r-- 1 couchbase couchbase 82011 Sep 22 16:57 70.couch.2
.......
10.3.3.82
total 29765164
drwxr-xr-x 2 couchbase couchbase 4096 Sep 22 17:25 .
drwxr-xr-x 4 couchbase couchbase 4096 Sep 22 16:55 ..
-rw-r--r-- 1 couchbase couchbase 77915 Sep 22 16:56 100.couch.2
.....
-rw-r--r-- 1 couchbase couchbase 82011 Sep 22 16:56 94.couch.2
-rw-r--r-- 1 couchbase couchbase 30447755449 Sep 23 01:30 95.couch.1
-rw-r--r-- 1 couchbase couchbase 77915 Sep 22 16:56 95.couch.2
......
10.3.3.93
total 35250148
drwxr-xr-x 2 couchbase couchbase 4096 Sep 22 16:55 .
drwxr-xr-x 4 couchbase couchbase 4096 Sep 22 16:54 ..
-rw-r--r-- 1 couchbase couchbase 82011 Sep 22 16:55 0.couch.3
.....
-rw-r--r-- 1 couchbase couchbase 77915 Sep 22 16:55 5.couch.3
-rw-r--r-- 1 couchbase couchbase 36052365358 Sep 23 01:30 50.couch.1
-rw-r--r-- 1 couchbase couchbase 82011 Sep 22 16:54 50.couch.2
.. |
| Comment by Andrei Baranouski [ 23/Sep/12 ] |
| Raising it further, to Blocker |
| Comment by Chiyoung Seo [ 23/Sep/12 ] |
|
Jin, I think this is a regression from our recent changes that fixed the Windows issue. |
| Comment by Farshid Ghods [ 24/Sep/12 ] |
|
http://review.couchbase.org/#/c/21056/
The fix was merged. |
| Comment by Thuan Nguyen [ 24/Sep/12 ] |
|
Integrated in github-ep-engine-2-0 #433 (See [http://qa.hq.northscale.net/job/github-ep-engine-2-0/433/])
Result = SUCCESS
Jin Lim :
Files :
* src/couch-kvstore/couch-kvstore.cc |
| Comment by Farshid Ghods [ 25/Sep/12 ] |
|
Andrei,
Please close the issue if it's resolved in the latest builds. |
| Comment by Andrei Baranouski [ 25/Sep/12 ] |
| yes, it was fixed in 1757 |
| Comment by Karen Zeller [ 26/Oct/12 ] |
|
RN: Replication had exited with a replicator_died message after multiple attempts to update. The problem was caused by using old revision numbers for new database files. New database files now use new revision numbers, which resolves the problem. |
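To illustrate what "new database files use new revision numbers" could mean in practice, here is a small sketch. It is an assumption made for illustration only: the actual change lives in src/couch-kvstore/couch-kvstore.cc and is not reproduced here, and both the "highest existing revision plus one" rule and the helper names are hypothetical.

```python
import re

# "<vbid>.couch.<rev>" naming, as seen in the data directory listings.
COUCH_FILE = re.compile(r"^(\d+)\.couch\.(\d+)$")


def next_revision(filenames, vbid):
    """Pick a revision number not already used by any file of this vbucket.

    Assumed rule: one more than the highest revision on disk, or 1 if none.
    """
    existing = []
    for name in filenames:
        match = COUCH_FILE.match(name)
        if match and int(match.group(1)) == vbid:
            existing.append(int(match.group(2)))
    return max(existing, default=0) + 1


def new_file_name(filenames, vbid):
    """Build the name a freshly created file for `vbid` would get."""
    return f"{vbid}.couch.{next_revision(filenames, vbid)}"


on_disk = ["294.couch.11", "295.couch.11"]
print(new_file_name(on_disk, 294))  # 294.couch.12
print(new_file_name(on_disk, 7))    # 7.couch.1
```

Under this rule a new file for vbucket 294 would never reuse revision 1 while 294.couch.11 exists, which is the collision the release note describes.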