High number of cache misses while using <50% of available memory
We have a cluster in EC2 with 195 GB of allocated memory, of which ~78 GB is in use according to the UI. We recently added nodes after noticing more timeouts caused by disk access on "get" operations, but even with much more capacity there are still disk accesses, perhaps even more than before. We are using less than 4% of the available disk space on each node.
Do items that were previously "ejected" from memory (and are on disk) get moved back into memory after more capacity is added? Or does an admin action have to be performed explicitly to make this happen? Does "vacuuming" do this?
We are currently running version 1.7.1.
Below is the relevant mbstats output for a single node. As you can see, there is a REALLY high number of misses on "get" operations. Any help or insight would be greatly appreciated!
$ /opt/membase/bin/mbstats localhost:11210 all | egrep "todo|ep_queue_size|_eject|mem|max_data|hits|misses|wat|ep_kv_size|ep_max_data_size"
cas_hits: 40695709
cas_misses: 0
decr_hits: 0
decr_misses: 0
delete_hits: 3
delete_misses: 208872
ep_data_age_highwat: 9488
ep_dbname: /mnt/membase/data/default-data/default
ep_diskqueue_memory: 5056656
ep_flush_duration_highwat: 8679
ep_flusher_todo: 70114
ep_kv_size: 17307452424
ep_max_data_size: 34898706432
ep_mem_high_wat: 26174029824
ep_mem_low_wat: 20939223859
ep_num_eject_failures: 294497
ep_num_eject_replicas: 50166560
ep_num_value_ejects: 57917639
ep_queue_size: 802
ep_storage_age_highwat: 9488
get_hits: 596247661
get_misses: 2858986374
incr_hits: 0
incr_misses: 0
mem_used: 17600486111
vb_active_eject: 15785441
vb_active_ht_memory: 134548272
vb_active_itm_memory: 8933500611
vb_active_perc_mem_resident: 72
vb_active_queue_memory: 2909016
vb_pending_eject: 0
vb_pending_ht_memory: 0
vb_pending_itm_memory: 0
vb_pending_perc_mem_resident: 0
vb_pending_queue_memory: 0
vb_replica_eject: 34333273
vb_replica_ht_memory: 141626880
vb_replica_itm_memory: 7648633860
vb_replica_perc_mem_resident: 61
vb_replica_queue_memory: 2147640
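To put the raw counters above in proportion, here is a minimal Python sketch that turns a few of them into ratios. The sample text is copied from the output above; the helper names `parse_stats` and `summarize` are made up for illustration and are not part of any Membase tooling.

```python
# Sample lines copied from the mbstats output above.
SAMPLE = """\
ep_max_data_size: 34898706432
ep_mem_high_wat: 26174029824
ep_mem_low_wat: 20939223859
get_hits: 596247661
get_misses: 2858986374
mem_used: 17600486111
"""


def parse_stats(text):
    """Parse 'name: value' lines into a dict, keeping only integer values."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def summarize(stats):
    """Derive a few ratios from the parsed counters (hypothetical helper)."""
    gets = stats["get_hits"] + stats["get_misses"]
    return {
        "get_miss_ratio": stats["get_misses"] / gets if gets else 0.0,
        "mem_used_vs_quota": stats["mem_used"] / stats["ep_max_data_size"],
        "mem_used_vs_low_wat": stats["mem_used"] / stats["ep_mem_low_wat"],
    }


if __name__ == "__main__":
    for key, value in summarize(parse_stats(SAMPLE)).items():
        print(f"{key}: {value:.1%}")
```

For this node that works out to roughly 83% of gets counted as misses, with mem_used at about half of ep_max_data_size and still below ep_mem_low_wat.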
Hey Yen, small world, this is Shawn from the good old days at PIX.
The vacuuming action is a SQLite compact command: it reclaims disk space and speeds up reading k/v pairs from disk when they are not in memory. So it doesn't sound like that is what's causing the high percentage of missed GETs.
I'll ping a couple of folks I know at Couchbase and see if I can get them to take a closer look.
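To illustrate what a SQLite compaction does in general, here is a small Python sketch using the standard `sqlite3` module. It is not how Membase invokes vacuuming; the `kv` table, the file name, and the `vacuum_demo` function are all hypothetical and exist only to show a file shrinking after `VACUUM`.

```python
import os
import sqlite3
import tempfile


def vacuum_demo(rows=2000):
    """Return (size_before, size_after) in bytes around a VACUUM."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.sqlite")  # hypothetical scratch file
        conn = sqlite3.connect(path)
        conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v BLOB)")
        conn.executemany(
            "INSERT INTO kv VALUES (?, ?)",
            ((f"key{i}", b"x" * 1024) for i in range(rows)),
        )
        conn.commit()

        # Deleting rows frees pages inside the file, but with SQLite's
        # default settings the file itself does not shrink.
        conn.execute("DELETE FROM kv")
        conn.commit()
        size_before = os.path.getsize(path)

        # VACUUM rebuilds the database file and drops the free pages.
        conn.execute("VACUUM")
        size_after = os.path.getsize(path)
        conn.close()
        return size_before, size_after


if __name__ == "__main__":
    before, after = vacuum_demo()
    print(f"before VACUUM: {before} bytes, after VACUUM: {after} bytes")
```

The point of the sketch is that compaction only changes the on-disk file; nothing in it touches what is held in memory.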