Details
Description
Cluster information:
- 8 CentOS 6.2 64-bit servers, each with a 4-core CPU
- Each server has 32 GB RAM and 400 GB SSD disk.
- 24.8 GB RAM for couchbase server at each node
- SSD disk format ext4 on /data
- Each server has its own SSD drive, no disk sharing with other server.
- Create a cluster of 6 nodes with Couchbase Server 2.0.0-1908 installed
- Link to manifest file http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_x86_64_2.0.0-1908-rel.rpm.manifest.xml
- Cluster has 2 buckets, default (12GB with 2 replicas) and saslbucket (12GB with 1 replica).
- Each bucket has one design doc with 2 views (d1 for default, d11 for saslbucket)
10.6.2.37
10.6.2.38
10.6.2.44
10.6.2.45
10.6.2.42
10.6.2.43
* Load 16 million items to default bucket and 20 million items to saslbucket. Each key has size from 512 bytes to 1024 bytes
* After loading is done, wait for initial indexing to finish. Disable view compaction.
* After initial indexing is done, mutate all items to sizes from 1024 to 1512 bytes.
* Query all 4 views from the 2 design docs.
* Do a swap rebalance: remove nodes 39 and 40, add nodes 44 and 45.
* At the end of rebalancing saslbucket, the rebalance exited with a timeout on node 43.
* Then saw a lot of connection resets to mccouch. Updated bug MB-7046.
* Kill all loads pointing to this cluster. Node 43 did not return to a stable state.
* beam.smp is running but node 43 is still down.
* Kill beam.smp with SIGUSR1 to create an Erlang crash dump.
Link to collect info of all nodes https://s3.amazonaws.com/packages.couchbase/collect_info/orange/2_0_0/201210/orange-ci-1908-node43-down-erl-hang-20121030.tgz
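As a rough sanity check on the data volume described above, the sketch below multiplies item counts by the number of copies (active plus replicas) and an average item size, then compares the result with the bucket quota. It assumes the 12GB bucket quota is per node, that the stated "key" sizes are really item sizes, and that sizes are uniformly distributed. Metadata overhead is ignored. None of these assumptions is confirmed by the report.

```python
# Rough data-volume estimate for the cluster described above.
# Assumptions (not confirmed by the report): bucket quota is per node,
# sizes are uniformly distributed, metadata overhead is ignored.

NODES = 6
GB = 1024 ** 3

buckets = {
    # name: (items, replicas, per-node RAM quota in GB)
    "default": (16_000_000, 2, 12),
    "saslbucket": (20_000_000, 1, 12),
}

def estimate(items, replicas, quota_gb, min_size, max_size, nodes=NODES):
    """Return (total item copies, estimated bytes, cluster-wide quota bytes)."""
    copies = items * (1 + replicas)
    avg_size = (min_size + max_size) / 2
    return copies, copies * avg_size, quota_gb * GB * nodes

def report(min_size, max_size):
    for name, (items, replicas, quota_gb) in buckets.items():
        copies, data, quota = estimate(items, replicas, quota_gb, min_size, max_size)
        print(f"{name}: {copies:,} copies, ~{data / GB:.1f} GB of values, "
              f"{quota / GB:.0f} GB quota ({100 * data / quota:.0f}% of quota)")

if __name__ == "__main__":
    print("After mutation (1024 to 1512 bytes):")
    report(1024, 1512)
```

With 2 replicas, the default bucket holds 48 million item copies across the cluster, and saslbucket holds 40 million with 1 replica.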
As pointed out, we've seen node .43 sitting there doing nothing.
I ssh-ed to this machine and found erlang eating 100% CPU and about 8 gigs (!) of RAM.
Then I sent SIGUSR1 to that process, but it seemed like erlang ignored it. So I concluded it was stuck worse than expected.
I gdb-ed to that process and found all threads but one idle. The one busy thread was seemingly stuck in some bignum code in erlang. I.e. I ran the 'finish' command to wait until it stepped out of some call, and a few seconds later concluded it was stuck there, apparently looping.
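One way to confirm which thread is spinning before attaching a debugger is to compare per-thread CPU time from `/proc`. This is a minimal sketch, assuming Linux and Python 3. The helper names are illustrative, and the field positions follow proc(5), where utime and stime are fields 14 and 15 of `stat`.

```python
import os

def parse_cpu_ticks(stat_line):
    """Return utime + stime (clock ticks) from one /proc/<pid>/task/<tid>/stat line."""
    # The comm field is wrapped in parentheses and may contain spaces,
    # so split on the last ')' and count fields from there.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state); utime is field 14, stime is field 15.
    return int(rest[11]) + int(rest[12])

def thread_cpu_ticks(pid):
    """Map thread id -> accumulated CPU ticks for every thread of pid."""
    ticks = {}
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/stat") as f:
                ticks[int(tid)] = parse_cpu_ticks(f.read())
        except FileNotFoundError:
            pass  # thread exited between listdir and open
    return ticks

def busiest_thread(before, after):
    """Return (tid, delta ticks) for the thread that used the most CPU between two samples."""
    deltas = {tid: after[tid] - before.get(tid, 0) for tid in after}
    tid = max(deltas, key=deltas.get)
    return tid, deltas[tid]
```

Taking two `thread_cpu_ticks` samples a second or so apart and passing them to `busiest_thread` gives a thread id that should match the LWP number gdb prints in `info threads`.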
Then I made the decision to capture the process's state in a core dump. And this was my mistake number 1. I did 'call abort()' in gdb, causing the process to abort and dump core. But because I did that on the thread that was most valuable, the backtrace had _no_ useful information for that thread. Instead, the backtrace of that thread had the stack frames set up by gdb for the abort call. Stupid move :(
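In hindsight, a less destructive capture would be to record every thread's backtrace first and then write a core without killing the process. The sketch below builds the corresponding commands, assuming `gdb` and its companion `gcore` script are installed. The output prefix is a hypothetical path, and this is not what was done in this incident.

```python
import subprocess

def backtrace_cmd(pid):
    """gdb batch command that prints backtraces of all threads, then detaches."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]

def gcore_cmd(pid, prefix):
    """gcore writes <prefix>.<pid> and leaves the process running."""
    return ["gcore", "-o", prefix, str(pid)]

def capture(pid, prefix="/tmp/beam-core"):  # prefix is a hypothetical location
    """Save backtraces to <prefix>.<pid>.bt, then dump a core with gcore."""
    bt = subprocess.run(backtrace_cmd(pid), capture_output=True, text=True)
    with open(f"{prefix}.{pid}.bt", "w") as f:
        f.write(bt.stdout)
    subprocess.run(gcore_cmd(pid, prefix), check=True)
```

Because nothing is called inside the target process, the stuck thread's stack stays as it was when gdb attached.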
So information was lost there.
Also, by examining backtraces in the core dump I found that erlang had actually started writing a crash dump, and I found a partially written crash dump on disk. Unfortunately, that crash dump is too incomplete to draw any conclusions from. So mistake (and lesson) #2: I should have waited longer for erlang to write the crash dump.
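Lesson #2 can be turned into a small helper: send SIGUSR1, then wait until the crash dump file stops growing before touching the process again. This is a minimal sketch, assuming the dump path is known (Erlang's default name is `erl_crash.dump`, overridable via the `ERL_CRASH_DUMP` environment variable). The timeouts and poll counts are arbitrary.

```python
import os
import signal
import time

def wait_until_stable(path, timeout=600.0, interval=5.0, quiet_checks=3):
    """Return True once path exists and its size is unchanged for quiet_checks polls."""
    deadline = time.monotonic() + timeout
    last_size, unchanged = -1, 0
    while time.monotonic() < deadline:
        try:
            size = os.path.getsize(path)
        except OSError:
            size = -1  # dump file not created yet
        if size >= 0 and size == last_size:
            unchanged += 1
            if unchanged >= quiet_checks:
                return True
        else:
            unchanged = 0
        last_size = size
        time.sleep(interval)
    return False

def request_crash_dump(pid, dump_path, **wait_kwargs):
    """Send SIGUSR1 to the Erlang VM and wait for the dump to finish writing."""
    os.kill(pid, signal.SIGUSR1)
    return wait_until_stable(dump_path, **wait_kwargs)
```

A file size that stops changing is only a heuristic: a VM that stalls partway through the dump would look the same, so a `False` or suspiciously small result still needs a manual look.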
Sorry, folks. It appears we'll have to wait until this happens again.