Default bucket has "Hard Out of Memory Error"
We are experiencing the following error on our live membase servers:
Hard Out Of Memory Error. Bucket "default" on node "node.two.co.uk" is full. All memory allocated to this bucket is used for metadata.
We resolved this issue by increasing the memory allocated to the bucket, but we did not expect this error. Below is the configuration of our Membase cache.
No of Nodes in Cluster: 2
No of buckets in total: 6
Bucket with error: default
Bucket using replication: Yes
Bucket per node memory allocation: 512MB
The membase bucket monitor page on the web console reported the bucket to be 53% full (http://node.two.co.uk:8091/index.html#sec=monitor_buckets)
Considering the bucket was only 53% full, why were we getting this error?
Hi Perry,
Here is a link to a screenshot of the page on which the default bucket was reporting ~53% RAM used when we were getting "hard out of memory error". We allocated more RAM to the bucket to resolve this issue. If this occurs again is there anything we can do that will help diagnose this issue?
Also, could you clarify whether this is actually just a warning, or an error as the message suggests? Do the new items coming in not get written at all? Do we completely lose the new items, or are they just persisted? Does the Enyim client get a response that describes what is going on, or is it an unhandled exception?
Note: there was replication enabled on the bucket and all items going into that bucket were set to expire after 15min.
Thanks. Can you run the following command against all servers in the cluster and post the output:
Windows: C:\Program Files\Membase\Server\bin\mbstats :11210 all
Linux: /opt/membase/bin/mbstats :11210 all
It would be helpful to do this now, but more helpful after you start getting those errors (and before you increase the size of the bucket).
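For comparing the output across nodes, a small script can help. The following is a minimal sketch, assuming the output consists of "name: value" lines as in the dump posted later in this thread; the function name parse_mbstats and the sample text are illustrative and not part of Membase.

```python
def parse_mbstats(text):
    """Parse 'name: value' lines into a dict, converting integers where possible."""
    stats = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as Windows paths stay intact.
        name, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines without a 'name: value' shape
        value = value.strip()
        try:
            stats[name.strip()] = int(value)
        except ValueError:
            stats[name.strip()] = value  # keep non-numeric values as strings
    return stats


# Sample lines copied from the stats posted in this thread.
sample = """ep_oom_errors: 1925404
ep_tmp_oom_errors: 8611
ep_dbname: e:/membase/data/default-data/default
mem_used: 818145721"""

stats = parse_mbstats(sample)
print(stats["ep_oom_errors"], stats["mem_used"])
```

Saving one such dict per node makes it easier to spot which counters differ between a healthy node and the one reporting errors.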
To clarify, the "hard out of memory" errors indicate an actual error condition because Membase is unable to allocate RAM to take in new data. Unlike a "soft" error, which implies that we are in the process of draining data to disk to make space for more, the "hard" errors require some administrative intervention to allocate more RAM.
ANYTIME you get an error from a set (anything other than "TRUE" or "STORED") you can assume that the data did not get written at all. There are some errors that might just require a retry, others that are more pathological.
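In client code this means checking the result of every set rather than assuming success. Here is a minimal sketch, assuming a hypothetical store callable that returns True on success and False on failure (a stand-in for whichever client library is in use, not the Enyim API); the retry count and delay are arbitrary.

```python
import time


def set_with_retry(store, key, value, attempts=3, delay=0.1):
    """Call store(key, value) until it reports success or attempts run out.

    `store` is a hypothetical callable returning True when the item was stored.
    Returns True if the write succeeded, False if the item was never written.
    """
    for attempt in range(attempts):
        if store(key, value):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # brief pause before retrying a possibly temporary failure
    return False


# Stand-in store that fails twice (e.g. a temporary out-of-memory condition) and then succeeds.
calls = {"n": 0}


def flaky_store(key, value):
    calls["n"] += 1
    return calls["n"] >= 3


ok = set_with_retry(flaky_store, "session:1", "payload", delay=0.01)
print(ok, calls["n"])
```

If the function returns False, the caller should treat the item as not written and decide whether to log it, queue it, or drop it.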
Perry
Hi,
I am having this issue also.
One of my two replicated Membase servers failed after a "Hard Out Of Memory Error. Bucket..." I had a hard time just failing over and restarting the set, which left Membase out of service for 40 minutes.
My bucket was also about 50% full, and the servers could not be failed over or removed without timeouts.
How often do your membase instances die / have issues per year (under amazon ec2)??
This is not the first time Membase has failed me; I thought replication would protect me. To get things running again I had to delete my default bucket, re-add it, and repeatedly rebalance and stop.
Are there any known workarounds or bugfixes available? Anything to watch out for under amazon ec2 hosting?
Here is my log file of the event.
http://www.mediafire.com/?otw8d0y4y54fs8b
We experienced the same error while evaluating membase. It has been running for a few weeks, doing primarily writes.
We started with two nodes and a 30GB per node memory allocation. After running fine for a week or so (occasional timeouts, but that is another story), we added a third node. It took a long time to rebalance (26 hours?), and at the end reported that there were errors during the rebalance, but I was unable to find any detail whatsoever about said errors. It appeared to run fine for days afterwards.
There are about 130M objects in the "default" bucket. Two nodes are running on Centos5 and the new, third node on Centos6.
This morning the newest node started reporting:
"Hard Out Of Memory Error. Bucket "default" on node 10.xx.yy.zz is full. All memory allocated to this bucket is used for metadata. (repeated 119 times)"
The "CLUSTER OVERVIEW" reports that of 87.8GB allocated, 67.5GB are in use with 20.3GB unused. The stats for the node reporting errors indicate 7.9G for active user data in RAM, 9.24G for replica user data in RAM for a total of 24.2 user data in RAM. Metadata is 269M active, 269M replica, 538M total in RAM. There is nothing listed for pending at all. Disk queues are empty.
It is not viable to add more RAM as we allocated about as much as the boxes are capable of (30GB on 32GB servers).
The memcached process is still running hot even though there have been no new reads or writes for over two hours.
  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM     TIME+ COMMAND
 6273 membase   20   0 32.5g  30g 1848 S 100.5 95.5   1274:59 memcached
I do have full mbstats output available if that helps.
Any suggestions for further diagnostic, solutions, steps forward would be greatly appreciated.
Hi ep,
Are you running under amazon EC2?
After getting this error I got another odd one reporting that my attached /mnt drive was now read only!
It seems that this can occur on some failing EC2 hardware. I started a new instance, installed Membase, and added the new node. I then dropped the bad EC2 instance and have been running with no errors for about a month now.
Hope that helps.
Chris
Hi Chris,
No EC2. We are running on 3 decent servers, on the same gigabit lan, with local disk.
No indications from the kernel that there were any disk or filesystem problems.
2GB of swap are in use, but there is still plenty of swap free.
And hours later, with no read or write operations happening, the memcached process is still using a full CPU!
Thanks,
Erik
Any progress on the "Hard Out Of Memory Error"?
Anyone?
Hi Perry,
Started getting this error again.
Below are the results from the mbstats command:
accepting_conns: 1
auth_cmds: 97
auth_errors: 0
bucket_active_conns: 1
bucket_conns: 85
bytes_read: 61624279639
bytes_written: 32169935674
cas_badval: 0
cas_hits: 0
cas_misses: 0
cmd_flush: 1
cmd_get: 2396431
cmd_set: 2241536
conn_yields: 18
connection_structures: 1906
curr_connections: 945
curr_items: 548
curr_items_tot: 548
daemon_connections: 10
decr_hits: 0
decr_misses: 0
delete_hits: 0
delete_misses: 0
ep_bg_fetched: 0
ep_commit_num: 67855
ep_commit_time: 0
ep_commit_time_total: 4064
ep_data_age: 1
ep_data_age_highwat: 11
ep_db_cleaner_status: complete
ep_db_strategy: multiMTVBDB
ep_dbinit: 0
ep_dbname: e:/membase/data/default-data/default
ep_dbshards: 4
ep_diskqueue_drain: 167162
ep_diskqueue_fill: 167162
ep_diskqueue_items: 0
ep_diskqueue_memory: 0
ep_diskqueue_pending: 0
ep_expired: 254492
ep_flush_all: false
ep_flush_duration: 0
ep_flush_duration_highwat: 12
ep_flush_duration_total: 4126
ep_flush_preempts: 0
ep_flusher_state: running
ep_flusher_todo: 0
ep_io_num_read: 3126
ep_io_num_write: 262991
ep_io_read_bytes: 37714564
ep_io_write_bytes: 5204945519
ep_item_begin_failed: 0
ep_item_commit_failed: 0
ep_item_flush_expired: 220571
ep_item_flush_failed: 0
ep_items_rm_from_checkpoints: 513111
ep_kv_size: 756340090
ep_latency_arith_cmd: 0
ep_latency_get_cmd: 2442038
ep_latency_store_cmd: 2241536
ep_max_data_size: 954204160
ep_max_txn_size: 1000
ep_mem_high_wat: 715653120
ep_mem_low_wat: 572522496
ep_min_data_age: 0
ep_num_active_non_resident: 0
ep_num_checkpoint_remover_runs: 217435
ep_num_eject_failures: 0
ep_num_eject_replicas: 0
ep_num_expiry_pager_runs: 324
ep_num_non_resident: 0
ep_num_not_my_vbuckets: 90619
ep_num_pager_runs: 95760
ep_num_value_ejects: 0
ep_onlineupdate: false
ep_onlineupdate_revert_add: 0
ep_onlineupdate_revert_delete: 0
ep_onlineupdate_revert_update: 0
ep_oom_errors: 1925404
ep_overhead: 61805631
ep_pending_ops: 0
ep_pending_ops_max: 0
ep_pending_ops_max_duration: 0
ep_pending_ops_total: 0
ep_queue_age_cap: 900
ep_queue_size: 0
ep_storage_age: 1
ep_storage_age_highwat: 605
ep_storage_type: featured
ep_store_max_concurrency: 10
ep_store_max_readers: 9
ep_store_max_readwrite: 1
ep_tap_bg_fetch_requeued: 0
ep_tap_bg_fetched: 0
ep_tap_keepalive: 300
ep_tmp_oom_errors: 8611
ep_too_old: 0
ep_too_young: 0
ep_total_cache_size: 1567393132
ep_total_del_items: 220591
ep_total_enqueued: 487497
ep_total_new_items: 220052
ep_total_persisted: 483582
ep_uncommitted_items: 0
ep_value_size: 756228893
ep_vb_total: 1024
ep_vbucket_del: 446
ep_vbucket_del_avg_walltime: 2338
ep_vbucket_del_fail: 0
ep_vbucket_del_max_walltime: 95640
ep_vbucket_del_total_walltime: 1042920
ep_version: 1.6.5.3_257_g82152fd
ep_warmed_up: 2102
ep_warmup: true
ep_warmup_dups: 0
ep_warmup_oom: 0
ep_warmup_thread: complete
ep_warmup_time: 199342130
get_hits: 202141
get_misses: 2194290
incr_hits: 0
incr_misses: 0
libevent: 2.0.11-stable
limit_maxbytes: 67108864
listen_disabled_num: 0
mem_used: 818145721
pid: 1220
pointer_size: 64
rejected_conns: 0
tap_checkpoint_end_received: 550796
tap_checkpoint_end_sent: 22191750
tap_checkpoint_start_received: 574321
tap_checkpoint_start_sent: 22208274
tap_connect_received: 107
tap_delete_received: 53631
tap_delete_sent: 260684
tap_flush_sent: 21
tap_mutation_received: 1247360
tap_mutation_sent: 159893726
tap_opaque_received: 1425
tap_opaque_sent: 260
threads: 4
time: 1319104633
total_connections: 4905407
uptime: 1168762
vb_active_curr_items: 548
vb_active_eject: 0
vb_active_ht_memory: 12775424
vb_active_itm_memory: 18450332
vb_active_num: 512
vb_active_num_non_resident: 0
vb_active_ops_create: 74058
vb_active_ops_delete: 73510
vb_active_ops_reject: 0
vb_active_ops_update: 18008
vb_active_perc_mem_resident: 100
vb_active_queue_age: 0
vb_active_queue_drain: 167162
vb_active_queue_fill: 167162
vb_active_queue_memory: 0
vb_active_queue_pending: 0
vb_active_queue_size: 0
vb_dead_num: 0
vb_pending_curr_items: 0
vb_pending_eject: 0
vb_pending_ht_memory: 0
vb_pending_itm_memory: 0
vb_pending_num: 0
vb_pending_num_non_resident: 0
vb_pending_ops_create: 0
vb_pending_ops_delete: 0
vb_pending_ops_reject: 0
vb_pending_ops_update: 0
vb_pending_perc_mem_resident: 0
vb_pending_queue_age: 0
vb_pending_queue_drain: 0
vb_pending_queue_fill: 0
vb_pending_queue_memory: 0
vb_pending_queue_pending: 0
vb_pending_queue_size: 0
vb_replica_curr_items: 0
vb_replica_eject: 0
vb_replica_ht_memory: 12775424
vb_replica_itm_memory: 0
vb_replica_num: 512
vb_replica_num_non_resident: 0
vb_replica_ops_create: 0
vb_replica_ops_delete: 0
vb_replica_ops_reject: 0
vb_replica_ops_update: 0
vb_replica_perc_mem_resident: 0
vb_replica_queue_age: 0
vb_replica_queue_drain: 0
vb_replica_queue_fill: 0
vb_replica_queue_memory: 0
vb_replica_queue_pending: 0
vb_replica_queue_size: 0
version: 1.4.4_461_gf99c147
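To put a few of these numbers side by side, here is a rough sketch that compares mem_used with the quota and the watermarks. The values are copied from the output above; treating ep_max_data_size as the bucket quota on this node is an assumption, and the helper name memory_summary is illustrative.

```python
def memory_summary(stats):
    """Return memory use as fractions of ep_max_data_size (assumed to be the node's bucket quota)."""
    quota = stats["ep_max_data_size"]
    return {
        "used_fraction": stats["mem_used"] / quota,
        "high_wat_fraction": stats["ep_mem_high_wat"] / quota,
        "low_wat_fraction": stats["ep_mem_low_wat"] / quota,
        "above_high_wat": stats["mem_used"] > stats["ep_mem_high_wat"],
    }


# Values copied from the mbstats output above.
stats = {
    "mem_used": 818145721,
    "ep_max_data_size": 954204160,
    "ep_mem_high_wat": 715653120,
    "ep_mem_low_wat": 572522496,
    "curr_items": 548,
    "ep_kv_size": 756340090,
}

summary = memory_summary(stats)
for name, value in summary.items():
    print(name, value)
```

With these figures, mem_used is roughly 86% of ep_max_data_size and sits above the high watermark even though curr_items is only 548, which is the kind of mismatch worth pointing out when reporting the problem.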
Many, many memory accounting issues have been fixed in the 1.8 series. Have you considered upgrading to 1.8? See couchbase.com/docs for details on what that entails.
I'd need a bit more data to give a complete answer, but that error is generated when Membase is unable to allocate any more RAM to take in new items. The 53% used does seem a bit odd; can you show me a screenshot of the monitoring page?
Perry