Details
Description
Cluster information:
- 8 CentOS 6.2 64-bit servers, each with a 4-core CPU
- Each server has 32 GB RAM and a 400 GB SSD.
- 24.8 GB RAM allocated to Couchbase Server on each node
- SSD formatted as ext4 and mounted on /data
- Each server has its own SSD; no disk is shared with another server.
- Created a cluster of 6 nodes running Couchbase Server 2.0.0-1931
- Link to manifest file http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_x86_64_2.0.0-1931-rel.rpm.manifest.xml
- Cluster has 2 buckets, default and saslbucket (12 GB each, with 1 replica), set up with 64 vbuckets.
- Each bucket has one design doc with 2 views (default: d1, saslbucket: d11)
10.6.2.37
10.6.2.38
10.6.2.44
10.6.2.45
10.6.2.42
10.6.2.43
* Load 20 million items into each bucket. Each key has a size of 1024 bytes.
* After loading is done, wait for initial indexing to complete.
* After initial indexing is done, mutate all items with sizes from 1024 to 1512 bytes.
* Query all 4 views from the 2 design docs.
* Add node 44 and rebalance. Passed.
* Add node 45 and rebalance. Passed.
* Check that auto-failover is enabled on the cluster.
* Turn on the firewall on node 40:
iptables -A INPUT -p tcp -i eth0 --dport 1000:60000 -j REJECT
iptables -A OUTPUT -p tcp -o eth0 --sport 1000:60000 -j REJECT
* Node 40 went down as expected.
* Auto-failover kicked in after one minute.
* Disable the firewall on node 40. The cluster saw node 40 come back up.
* Add node 40 back to the cluster and rebalance. Within a few seconds, the rebalance failed with this error:
[rebalance:error,2012-11-06T0:41:48.498,ns_1@10.6.2.37:<0.4077.2612>:ns_rebalancer:do_wait_buckets_shutdown:204]Failed to wait deletion of some buckets on some nodes: [{'ns_1@10.6.2.40',
{'EXIT',
{old_buckets_shutdown_wait_failed,
["default"]}}}]
[user:info,2012-11-06T0:41:48.500,ns_1@10.6.2.37:<0.14641.0>:ns_orchestrator:handle_info:319]Rebalance exited with reason {buckets_shutdown_wait_failed,
[{'ns_1@10.6.2.40',
{'EXIT',
{old_buckets_shutdown_wait_failed,
["default"]}}}]}
Link to collect info of all nodes https://s3.amazonaws.com/packages.couchbase/collect_info/orange/2_0_0/201211/8nodes-ci-1931-reb-failed-undelete-old-bucket-20121106-121536.tgz
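For anyone reproducing the firewall steps, here is a minimal Python sketch that builds the two iptables commands used above. The report does not say how the firewall was disabled; deleting the same rules with `-D` is one way and is an assumption here. The helper name `firewall_rules` is hypothetical.

```python
def firewall_rules(action, iface="eth0", ports="1000:60000"):
    """Build the two iptables commands from the report.

    action is "-A" to append the REJECT rules (block the node) or
    "-D" to delete the same rules again (assumed way to unblock).
    """
    if action not in ("-A", "-D"):
        raise ValueError("action must be '-A' or '-D'")
    return [
        f"iptables {action} INPUT -p tcp -i {iface} --dport {ports} -j REJECT",
        f"iptables {action} OUTPUT -p tcp -o {iface} --sport {ports} -j REJECT",
    ]


if __name__ == "__main__":
    # Print the commands rather than running them; they need root on the node.
    for cmd in firewall_rules("-A") + firewall_rules("-D"):
        print(cmd)
```

The sketch only prints the commands, so it can be checked without touching a real node.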
After the firewall was disabled, the node quickly discovered that it had been failed over. When this happens, two concurrent things race each other:
* we send the die! signal to memcached so that it exits quickly
* and we start bucket deletions
In this particular case memcached died rather quickly, and we quickly started a fresh instance (without any buckets set up yet).
Then the death of the original memcached caused ns_memcached to die, _and_ be restarted, before we started bucket deletion.
So the restarted ns_memcached re-created the bucket, only to be asked to delete it a few milliseconds later.
There's a known problem in ep-engine: it won't stop a bucket while warmup is in progress. And because we restarted memcached and re-created the buckets, this is exactly what happens here.
The bucket will have to complete warmup, and then we'll be able to complete deletion of the old bucket.
After that, rebalance will work.
So this is probably not a blocker.
If it is, then I can do something about it, but note that there would still be a small race in ep-engine where, for example, a memcached crash just prior to bucket deletion would cause the same issue. So I believe it's best to ignore this race in ns_server and instead make ep-engine's bucket deletion work during warmup.
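The ordering described above can be illustrated with a small toy model in Python. This is only a sketch of the sequence of events, not ns_server or ep-engine code; the class `ToyNode`, the function `run`, and the step names are all hypothetical, and the "deletion is refused during warmup" rule is a simplification of the ep-engine behavior described in this comment.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    """Toy model of one node; bucket state is "warming_up" or "ready"."""
    buckets: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def restart_memcached(self):
        # A fresh memcached instance starts without any buckets set up.
        self.buckets.clear()
        self.log.append("memcached restarted")

    def restart_ns_memcached(self, name):
        # The restarted ns_memcached re-creates the bucket, which starts warmup.
        self.buckets[name] = "warming_up"
        self.log.append(f"{name} re-created, warmup started")

    def finish_warmup(self, name):
        if self.buckets.get(name) == "warming_up":
            self.buckets[name] = "ready"
            self.log.append(f"{name} warmup complete")

    def delete_bucket(self, name):
        # Simplified rule: a bucket cannot be stopped while it is warming up.
        if self.buckets.get(name) == "warming_up":
            self.log.append(f"delete {name} blocked by warmup")
            return False
        self.buckets.pop(name, None)
        self.log.append(f"{name} deleted")
        return True


def run(order):
    """Apply the steps in order; return (deletion succeeded, event log)."""
    node = ToyNode(buckets={"default": "ready"})
    deleted = False
    for step in order:
        if step == "die":
            node.restart_memcached()
        elif step == "recreate":
            node.restart_ns_memcached("default")
        elif step == "warmup_done":
            node.finish_warmup("default")
        elif step == "delete":
            deleted = node.delete_bucket("default")
    return deleted, node.log


if __name__ == "__main__":
    # Ordering from this report: the bucket is re-created just before deletion.
    print(run(["die", "recreate", "delete"]))
    # Deletion goes through once warmup has completed.
    print(run(["die", "recreate", "delete", "warmup_done", "delete"]))
```

In this model the first ordering leaves the deletion blocked, which corresponds to the rebalance failure; retrying after warmup completes succeeds, which matches the expectation that rebalance works afterwards.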