Unbalanced load on one node, crashing
We have a two-node cluster which was expanded to three nodes in an attempt to resolve this issue. The CPU and disk I/O load on one node is always much higher than on the other nodes. We cannot access the web interface on two of the nodes, only on the third one that was recently added.
The crashes produce an erl_crash.dump. An interesting line in there is:
Slogan: eheap_alloc: Cannot allocate 2850821240 bytes of memory (of type "heap").
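For scale, the size in that slogan can be converted to more readable units. This is only unit arithmetic on the number quoted from the dump; it does not say why the Erlang VM requested that much heap.

```python
# Size taken from the "Slogan" line of the erl_crash.dump quoted above.
requested_bytes = 2850821240

requested_mib = requested_bytes / 2**20  # mebibytes
requested_gib = requested_bytes / 2**30  # gibibytes

print(f"{requested_bytes} bytes = {requested_mib:.0f} MiB = {requested_gib:.2f} GiB")
```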
Only one Couchbase bucket actually gets used, "efp_click". Its RAM usage is 212MB/3GB and its disk usage is 51.2MB. We see around 250 operations per second for the whole cluster.
The error logs are populated with massive numbers of errors. They usually look like this:
[error_logger:error] [2013-02-11 10:16:34] [ns_1@10.201.14.43:error_logger:ale_error_logger_handler:log_msg:76] Mnesia('ns_1@10.201.14.43'): ** WARNING ** Mnesia is overloaded: {dump_log,
write_threshold}
[ns_server:error] [2013-02-11 10:16:35] [ns_1@10.201.14.43:'ns_memcached-efp_click':ns_memcached:handle_info:437] handle_info(ensure_bucket,..) took too long: 658558 us
[ns_server:error] [2013-02-11 10:16:37] [ns_1@10.201.14.43:<0.2506.0>:ns_memcached:verify_report_long_call:209] call {stats,<<>>} took too long: 1325716 us
[ns_server:error] [2013-02-11 10:16:44] [ns_1@10.201.14.43:<0.2508.0>:ns_memcached:verify_report_long_call:209] call topkeys took too long: 659986 us
[ns_server:error] [2013-02-11 10:16:48] [ns_1@10.201.14.43:<0.2506.0>:ns_memcached:verify_report_long_call:209] call {stats,<<>>} took too long: 1021585 us
[ns_server:error] [2013-02-11 10:17:15] [ns_1@10.201.14.43:<0.2506.0>:ns_memcached:verify_report_long_call:209] call topkeys took too long: 1042206 us
[ns_server:error] [2013-02-11 10:17:15] [ns_1@10.201.14.43:'ns_memcached-efp_click':ns_memcached:handle_info:437] handle_info(ensure_bucket,..) took too long: 984000 us
I have all logs and an analysis output from a Couchbase core dump available. Please let me know what I can do to help troubleshoot this issue further.
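When the log is this noisy, a short script can help summarize the "took too long" entries per call type. The following is a minimal sketch, not a Couchbase tool: the regular expression is an assumption based only on the six lines quoted above, and `summarize_slow_calls` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Assumed line shape, inferred from the quoted excerpt:
#   [...] [timestamp] [node:process:module:function:line] <call> took too long: <N> us
# The greedy ".*" makes the capture start after the last "] " before the message.
SLOW_CALL = re.compile(r"^.*\] (?P<call>.+?) took too long: (?P<us>\d+) us")

def summarize_slow_calls(lines):
    """Return {call: (count, worst_duration_in_microseconds)}."""
    durations = defaultdict(list)
    for line in lines:
        match = SLOW_CALL.search(line)
        if match:
            durations[match.group("call")].append(int(match.group("us")))
    return {call: (len(us), max(us)) for call, us in durations.items()}

# The six slow-call entries quoted in this post.
sample = [
    "[ns_server:error] [2013-02-11 10:16:35] [ns_1@10.201.14.43:'ns_memcached-efp_click':ns_memcached:handle_info:437] handle_info(ensure_bucket,..) took too long: 658558 us",
    "[ns_server:error] [2013-02-11 10:16:37] [ns_1@10.201.14.43:<0.2506.0>:ns_memcached:verify_report_long_call:209] call {stats,<<>>} took too long: 1325716 us",
    "[ns_server:error] [2013-02-11 10:16:44] [ns_1@10.201.14.43:<0.2508.0>:ns_memcached:verify_report_long_call:209] call topkeys took too long: 659986 us",
    "[ns_server:error] [2013-02-11 10:16:48] [ns_1@10.201.14.43:<0.2506.0>:ns_memcached:verify_report_long_call:209] call {stats,<<>>} took too long: 1021585 us",
    "[ns_server:error] [2013-02-11 10:17:15] [ns_1@10.201.14.43:<0.2506.0>:ns_memcached:verify_report_long_call:209] call topkeys took too long: 1042206 us",
    "[ns_server:error] [2013-02-11 10:17:15] [ns_1@10.201.14.43:'ns_memcached-efp_click':ns_memcached:handle_info:437] handle_info(ensure_bucket,..) took too long: 984000 us",
]

summary = summarize_slow_calls(sample)
for call, (count, worst_us) in sorted(summary.items()):
    print(f"{call}: {count} entries, worst {worst_us / 1000:.0f} ms")
```

To run it against a real log file, pass the open file object instead of `sample`. Multi-line entries such as the Mnesia warning are simply skipped.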
Regards,
Bryan
Thanks for the reply Tug.
Each node is a virtual machine (ESX based) with identical resources:
* SuSE Linux Enterprise Server 11, Patch 2
* x86_64
* 8GB RAM
* 2x vCPU
* 10GB / partition
The issue originally affected the first node about a week ago (10.201.14.42). Now it seems to have moved to the second node (10.201.14.43). The third node that was added recently hasn't crashed.
The operations per second appear to be evenly distributed across all nodes according to the Couchbase console. However, the node that crashes will always have much higher RAM and CPU usage. Checking with "top" on the affected node shows the following two processes using the most CPU time and memory by far:
1. /opt/couchbase/lib/erlang/erts-5.8.4/bin/beam.smp
2. /opt/couchbase/bin/memcached -X /opt/couchbase/lib/memcached/stdin_term_handler.so
beam.smp constantly uses 200% of CPU and 40% of memory. memcached is much lower but always second in resource usage. The other nodes show the same processes at the top of the usage list, but they don't use nearly as much CPU time and memory. The only other process that may add load to these nodes is RabbitMQ, which doesn't appear to add much.
Here's the load, RAM usage, and disk usage for each node:
## Node 1 (hasn't crashed for a while, web console still unusable along with couchbase-cli commands directed at it)
Load: 0.61
Disk: 3.4GB free out of 8GB
RAM: 4676MB free out of 7874MB
## Node 2 (crashes often, web console still unusable along with couchbase-cli commands directed at it)
Load: 2.17
Disk: 5.1GB free out of 8GB
RAM: 3361MB free out of 7874MB
## Node 3 (Works OK since added to cluster)
Load: 0.20
Disk: 5.3GB free out of 8GB
RAM: 5781MB free out of 7874MB
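To compare the three nodes on the same scale, the free/total RAM figures above can be turned into a used percentage. This is a small sketch using only the numbers posted here; `used_ram_percent` is a hypothetical helper name.

```python
# (free_mb, total_mb, load) exactly as posted for each node above.
nodes = {
    "Node 1": (4676, 7874, 0.61),
    "Node 2": (3361, 7874, 2.17),
    "Node 3": (5781, 7874, 0.20),
}

def used_ram_percent(free_mb, total_mb):
    """Share of total RAM that is not reported as free, in percent."""
    return 100.0 * (total_mb - free_mb) / total_mb

for name, (free_mb, total_mb, load) in nodes.items():
    pct = used_ram_percent(free_mb, total_mb)
    print(f"{name}: load {load:.2f}, RAM used {pct:.1f}%")

# Node with the highest share of RAM in use.
busiest = max(nodes, key=lambda n: used_ram_percent(nodes[n][0], nodes[n][1]))
print("Highest RAM usage:", busiest)
```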
The only bucket in use (efp_click) has been given a 2GB per node RAM quota over the course of this issue. Increasing that number doesn't seem to mitigate the crashing. Everything in couchbase console seems to indicate resource usage shouldn't be an issue if I'm looking at it correctly.
## Bucket RAM details:
Other Buckets (384 MB)
This Bucket (6 GB)
Free (13.6 GB)
Dynamic RAM Quota: 6GB
----------
I've uploaded several diagnostics zips to http://s3.amazonaws.com/customers.couchbase.com/adknowledge/. I'd prefer not to expose the filenames publicly but let me know if I must.
Regards,
Bryan
Thanks Bryan.
I have escalated this to the server expert. You can send me the file names as a Private Message in the forum. (Send PM just under my name in the discussion)
Regards
Hello,
Were you able to send me the list of file names? You can send it to tug[at]couchbase[dot].com
Thanks
Hi tgrall,
I sent the filenames to you in a private message. I will send them to that email address also.
Regards,
Bryan
Hello,
Could you tell us more about the cluster topology:
- can you describe each node?
- the hardware/OS?
- memory size?
- do you have disk and memory space?
When you say that it is unbalanced, do you see specific numbers in the console? Is anything special running on the two nodes where you cannot use the console?
In addition to this information, you can upload the dump and logs using our standard support process, documented here:
http://www.couchbase.com/wiki/display/couchbase/Working+with+the+Couchba...
Let us know when done.
Regards
Tug
@tgrall