Details
Description
Another set of steps:
1. 3 nodes
2. reboot one node and wait until ep_warmup_thread: complete
3. add a 4th node and rebalance
Rebalance fails with the same stack trace.
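Step 2 (waiting for warmup to finish) can be scripted as a simple poll. This is a minimal sketch, not the procedure used in this report: `fetch_stats` is a hypothetical caller-supplied function that returns the node's stats as a dict containing an `ep_warmup_thread` key, and the timeout and poll interval are arbitrary assumptions.

```python
import time
from typing import Callable, Mapping


def wait_for_warmup(fetch_stats: Callable[[], Mapping[str, str]],
                    timeout_s: float = 600.0,
                    poll_interval_s: float = 5.0,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until ep_warmup_thread reports "complete" or the timeout expires.

    fetch_stats is a hypothetical helper supplied by the caller; it is not
    part of any Couchbase tool. Returns True if warmup completed in time.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        # The stat value checked here is the one named in the steps above.
        if fetch_stats().get("ep_warmup_thread") == "complete":
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_interval_s)
```

Only once this returns True would the 4th node be added and the rebalance started.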
- 10.3.121.13-8091-diag.txt.gz, 17/Aug/12 9:33 PM, 13.78 MB, Andrei Baranouski
- 10.3.121.14-8091-diag.txt.gz, 17/Aug/12 9:33 PM, 8.77 MB, Andrei Baranouski
- 10.3.121.15-8091-diag.txt.gz, 17/Aug/12 9:33 PM, 9.02 MB, Andrei Baranouski
- 10.3.121.16-8091-diag.txt.gz, 17/Aug/12 9:33 PM, 3.30 MB, Andrei Baranouski
- cbcollect13.zip, 17/Aug/12 9:33 PM, 9.29 MB, Andrei Baranouski
- cbcollect15.zip, 17/Aug/12 9:33 PM, 16.65 MB, Andrei Baranouski
  - cbcollect_info_20120818-022505/couchbase.log 1.51 MB
  - cbcollect_info_20120818-022505/ns_server.couchdb.log 22.91 MB
  - cbcollect_info_20120818-022505/stats.log 1.33 MB
  - cbcollect_info_20120818-022505/ns_server.error.log 79 kB
  - cbcollect_info_20120818-022505/ns_server.info.log 70.01 MB
  - cbcollect_info_20120818-022505/ns_server.views.log 1.86 MB
  - cbcollect_info_20120818-022505/diag.log 57.04 MB
  - cbcollect_info_20120818-022505/ns_server.debug.log 91.48 MB
- memcachedlogs.zip, 18/Aug/12 8:36 PM, 135 kB, Farshid Ghods
  - 10.3.121.13/log/memcached.log.0.gz 0.0 kB
  - __MACOSX/.../._memcached.log.0.gz 0.7 kB
  - 10.3.121.13/log/memcached.log.1.gz 192 kB
  - __MACOSX/.../._memcached.log.1.gz 0.7 kB
  - 10.3.121.14/log/memcached.log.0.gz 0.0 kB
  - __MACOSX/.../._memcached.log.0.gz 0.7 kB
  - 10.3.121.14/log/memcached.log.1.gz 0.0 kB
  - __MACOSX/.../._memcached.log.1.gz 0.7 kB
  - 10.3.121.14/log/memcached.log.2.gz 0.0 kB
  - __MACOSX/.../._memcached.log.2.gz 0.7 kB
  - 10.3.121.15/log/memcached.log.0.gz 0.0 kB
  - __MACOSX/.../._memcached.log.0.gz 0.7 kB
  - 10.3.121.15/log/memcached.log.1.gz 0.3 kB
  - __MACOSX/.../._memcached.log.1.gz 0.7 kB
  - 10.3.121.15/log/memcached.log.2.gz 0.0 kB
  - __MACOSX/.../._memcached.log.2.gz 0.7 kB
  - 10.3.121.15/log/memcached.log.3.gz 0.0 kB
  - __MACOSX/.../._memcached.log.3.gz 0.7 kB
  - 10.3.121.16/log/memcached.log.0.gz 0.0 kB
  - __MACOSX/.../._memcached.log.0.gz 0.7 kB
  - 10.3.121.16/log/memcached.log.1.gz 0.0 kB
  - __MACOSX/.../._memcached.log.1.gz 0.7 kB
  - 10.3.121.16/log/memcached.log.2.gz 0.0 kB
  - __MACOSX/.../._memcached.log.2.gz 0.7 kB
Activity
Farshid Ghods
added a comment -
diags : http://www.couchbase.com/issues/secure/attachment/14374/10.3.121.16-8091-diag.txt.gz
http://www.couchbase.com/issues/secure/attachment/14373/10.3.121.15-8091-diag.txt.gz
http://www.couchbase.com/issues/secure/attachment/14372/10.3.121.14-8091-diag.txt.gz
http://www.couchbase.com/issues/secure/attachment/14371/10.3.121.13-8091-diag.txt.gz
Andrei Baranouski
added a comment - edited
After the same steps, I tried the rebalance again. It failed the second time but passed on the third.
New diags are attached.
I also noticed that several items had been lost:
sasl bucket: before - 3198589, after - 3198310
default: before - 4401961, after - 4401508
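For reference, the item loss implied by the counts above works out to 279 items in the sasl bucket and 453 in default. A small sketch of that arithmetic, using only the numbers quoted in the comment (the helper name `items_lost` is illustrative):

```python
# Item counts quoted in the comment above: (before, after) per bucket.
counts = {
    "sasl bucket": (3198589, 3198310),
    "default": (4401961, 4401508),
}


def items_lost(before: int, after: int) -> int:
    """Number of items missing after the rebalance attempts."""
    return before - after


for bucket, (before, after) in counts.items():
    lost = items_lost(before, after)
    # Also show the loss as a fraction of the original item count.
    print(f"{bucket}: lost {lost} of {before} items ({lost / before:.4%})")
```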
Peter Wansch
added a comment - Alk, I believe this is the most important one to address next.
Aleksey Kondratenko
added a comment -
I'd like us to start gathering memcached logs:
*) for all bug reports (manually by QE, at least for now)
*) as part of cbcollect_info
Farshid Ghods
added a comment -
Notified the team about this requirement. For this bug, the cbcollect_info output is already uploaded, but I am going to ssh in and see if I can extract memcached.log and attach it separately now.
Farshid Ghods
added a comment -
Rebalance succeeded after three attempts on this cluster.
I grabbed the memcached log files from all nodes anyway.
Aleksey Kondratenko
added a comment -
Sorry folks, I was not in my best shape today. That's a 100% duplicate of the bug Tony filed. You may need 3-4, up to <number of nodes> - 1, rebalances in order to clean up those dead connections.
See MB-6216
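The workaround described above (retry the rebalance up to <number of nodes> - 1 times) could be automated roughly as follows. This is only a sketch under assumptions: `rebalance` is a hypothetical callable that starts a rebalance and returns True on success, and is not a real Couchbase API.

```python
from typing import Callable


def rebalance_with_retries(rebalance: Callable[[], bool], num_nodes: int) -> int:
    """Retry a rebalance until it succeeds, at most num_nodes - 1 times.

    rebalance is a hypothetical caller-supplied function returning True on
    success. Returns the number of attempts used; raises if all attempts fail.
    """
    max_attempts = max(1, num_nodes - 1)
    for attempt in range(1, max_attempts + 1):
        if rebalance():
            return attempt
    raise RuntimeError(f"rebalance still failing after {max_attempts} attempts")
```

With the 4-node cluster from this report the bound is 3 attempts, which matches the observed behaviour of failing twice and passing on the third try.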
Farshid Ghods
added a comment -
Thanks Alk.
Is there another workaround, like running a command or rebooting that node twice instead of once, to kill those dead connections, or is the only way to rebalance again?
Thuan Nguyen
added a comment -
Following these steps:
1. 3 nodes: 10.3.121.23, 24, 25
2. reboot node 24 and wait until ep_warmup_thread: complete
3. add a 4th node (26) and rebalance
Rebalance succeeded on the first run, but it took more than 2 hours 34 minutes to finish. During the rebalance, node 24 became unstable (yellow). I monitored atop on node 24 and saw disk usage at 100%. I then disabled compaction in the UI for both data and views, and node 24 went back to green. Even though compaction was disabled, compaction in view5 production was still running. Whenever view5 production was running compaction, the rebalance process paused (judging by vbucket movement between nodes). That made the rebalance take more than 2 hours to finish.
The cluster has 16 million items with a resident ratio of 55%, 12 dev views, 2 production views, and a load of about 3K ops.
Aleksey Kondratenko
added a comment -
Tony, if you think that slowness is a problem, or something else is a problem, please file a new bug.
Thuan Nguyen
added a comment -
I will try on the latest build, 705. If I can repro this slowness during rebalance, I will file a new bug.
Farshid Ghods
added a comment - all,
please retest with build 709
Ketaki Gangal
added a comment - Verified on 709. Rebalance successful.
Andrei Baranouski
added a comment - build 1717 - rebalance successful.
Andrei Baranouski
added a comment - verified on 1717