ep_num_eject_failures is much higher (up to 1000x) than ep_num_value_ejects

It is worth noting that this may be caused by the bug I reported before, or it may be the cause of this bug: http://www.couchbase.com/communities/q-and-a/ram-stats-do-not-reconcile-and-evictions-seems-be-not-working-expected and this one: http://www.couchbase.com/communities/q-and-a/data-loss-251

The server works for some time, then stops ejecting items and throws TEMP OOM errors. Only a restart helps.

We are having problems with ejections, and the stats show that ejection is barely working. What could be the reason, and which stats/logs should I check first?

cbstats output from one of the servers:

accepting_conns: 1
auth_cmds: 4
auth_errors: 0
bucket_active_conns: 1
bucket_conns: 57
bytes: 2923459464
bytes_read: 50697857251
bytes_written: 3663478624
cas_badval: 0
cas_hits: 0
cas_misses: 0
cmd_flush: 0
cmd_get: 9378
cmd_set: 241113
conn_yields: 1947
connection_structures: 5000
curr_connections: 259
curr_conns_on_port_11209: 111
curr_conns_on_port_11210: 146
curr_items: 3715265
curr_items_tot: 7066672
curr_temp_items: 0
daemon_connections: 4
decr_hits: 0
decr_misses: 0
delete_hits: 0
delete_misses: 0
ep_access_scanner_last_runtime: 0
ep_access_scanner_num_items: 0
ep_access_scanner_task_time: 2014-04-26 10:00:00
ep_allow_data_loss_during_shutdown: 1
ep_alog_block_size: 4096
ep_alog_path: /media/data/default/access.log
ep_alog_sleep_time: 1440
ep_alog_task_time: 10
ep_backend: couchdb
ep_bg_fetch_delay: 0
ep_bg_fetched: 3505
ep_bg_load: 65111098
ep_bg_load_avg: 18576
ep_bg_max_load: 467888
ep_bg_max_wait: 177918
ep_bg_meta_fetched: 0
ep_bg_min_load: 102
ep_bg_min_wait: 30
ep_bg_num_samples: 3505
ep_bg_remaining_jobs: 0
ep_bg_wait: 2618920
ep_bg_wait_avg: 747
ep_chk_max_items: 5000
ep_chk_period: 1800
ep_chk_persistence_remains: 0
ep_chk_persistence_timeout: 10
ep_chk_remover_stime: 5
ep_commit_num: 221903
ep_commit_time: 11
ep_commit_time_total: 2955151
ep_config_file:
ep_conflict_resolution_type: seqno
ep_couch_bucket: default
ep_couch_host: 127.0.0.1
ep_couch_port: 11213
ep_couch_reconnect_sleeptime: 250
ep_couch_response_timeout: 180000
ep_data_traffic_enabled: 0
ep_dbname: /media/data/default
ep_degraded_mode: 0
ep_diskqueue_drain: 3670099
ep_diskqueue_fill: 3670101
ep_diskqueue_items: 2
ep_diskqueue_memory: 64
ep_diskqueue_pending: 234
ep_exp_pager_stime: 3600
ep_expired_access: 0
ep_expired_pager: 0
ep_expiry_window: 3
ep_failpartialwarmup: 0
ep_flush_all: false
ep_flush_duration_total: 3868
ep_flushall_enabled: 0
ep_flusher_state: running
ep_flusher_todo: 1
ep_getl_default_timeout: 15
ep_getl_max_timeout: 30
ep_ht_locks: 5
ep_ht_size: 3079
ep_inconsistent_slave_chk: 0
ep_initfile:
ep_io_num_read: 332114
ep_io_num_write: 3734036
ep_io_read_bytes: 3429644805
ep_io_write_bytes: 41108043159
ep_item_begin_failed: 0
ep_item_commit_failed: 0
ep_item_flush_expired: 0
ep_item_flush_failed: 0
ep_item_num_based_new_chk: 1
ep_items_rm_from_checkpoints: 4054
ep_keep_closed_chks: 0
ep_klog_block_size: 4096
ep_klog_compactor_queue_cap: 500000
ep_klog_compactor_stime: 3600
ep_klog_flush: commit2
ep_klog_max_entry_ratio: 10
ep_klog_max_log_size: 2147483647
ep_klog_path:
ep_klog_sync: commit2
ep_kv_size: 2737261629
ep_max_bg_remaining_jobs: 0
ep_max_checkpoints: 2
ep_max_data_size: 7340032000
ep_max_item_size: 20971520
ep_max_num_workers: 4
ep_max_size: 7340032000
ep_max_txn_size: 10000
ep_max_vbuckets: 1024
ep_mem_high_wat: 3039027200
ep_mem_low_wat: 1039027200
ep_mem_tracker_enabled: true
ep_meta_data_memory: 962414800
ep_mlog_compactor_runs: 0
ep_mutation_mem_threshold: 95
ep_num_access_scanner_runs: 0
ep_num_eject_failures: 331083391
ep_num_expiry_pager_runs: 0
ep_num_non_resident: 6907022
ep_num_not_my_vbuckets: 2596
ep_num_ops_del_meta: 0
ep_num_ops_del_meta_res_fail: 0
ep_num_ops_del_ret_meta: 0
ep_num_ops_get_meta: 0
ep_num_ops_get_meta_on_set_meta: 0
ep_num_ops_set_meta: 0
ep_num_ops_set_meta_res_fail: 0
ep_num_ops_set_ret_meta: 0
ep_num_pager_runs: 87
ep_num_value_ejects: 3363851
ep_oom_errors: 0
ep_overhead: 60788945
ep_pager_active_vb_pcnt: 40
ep_pending_ops: 0
ep_pending_ops_max: 0
ep_pending_ops_max_duration: 0
ep_pending_ops_total: 0
ep_postInitfile:
ep_queue_size: 2
ep_startup_time: 1398483585
ep_storage_age: 0
ep_storage_age_highwat: 183
ep_tap_ack_grace_period: 300
ep_tap_ack_initial_sequence_number: 1
ep_tap_ack_interval: 1000
ep_tap_ack_window_size: 10
ep_tap_backfill_resident: 0.9
ep_tap_backlog_limit: 5000
ep_tap_backoff_period: 5
ep_tap_bg_fetch_requeued: 0
ep_tap_bg_fetched: 332
ep_tap_bg_max_pending: 500
ep_tap_keepalive: 300
ep_tap_noop_interval: 20
ep_tap_requeue_sleep_time: 0.1
ep_tap_throttle_cap_pcnt: 10
ep_tap_throttle_queue_cap: 1000000
ep_tap_throttle_threshold: 90
ep_tmp_oom_errors: 0
ep_total_cache_size: 2650348920
ep_total_del_items: 0
ep_total_enqueued: 3734172
ep_total_new_items: 3484029
ep_total_persisted: 3734036
ep_uncommitted_items: 1
ep_uuid: 1feea58bbdfcacf8382f2716b0dc097c
ep_value_size: 1736474231
ep_vb0: 0
ep_vb_snapshot_total: 861
ep_vb_total: 342
ep_vbucket_del: 185
ep_vbucket_del_avg_walltime: 494345
ep_vbucket_del_fail: 0
ep_vbucket_del_max_walltime: 2164792
ep_version: 2.5.1_1083_rel
ep_waitforwarmup: 0
ep_warmup: 1
ep_warmup_batch_size: 1000
ep_warmup_dups: 0
ep_warmup_min_items_threshold: 100
ep_warmup_min_memory_threshold: 100
ep_warmup_oom: 0
ep_warmup_thread: complete
ep_warmup_time: 280564496
ep_workload_optimization: read
get_hits: 8988
get_misses: 390
incr_hits: 0
incr_misses: 0
libevent: 2.0.11-stable
limit_maxbytes: 67108864
listen_disabled_num: 0
max_conns_on_port_11209: 1000
max_conns_on_port_11210: 9000
mem_used: 2923459464
pid: 31212
pointer_size: 64
rejected_conns: 0
rusage_system: 453.715641
rusage_user: 1519.307413
tap_checkpoint_end_received: 171
tap_checkpoint_end_sent: 171
tap_checkpoint_start_received: 957
tap_checkpoint_start_sent: 1127
tap_connect_received: 28
tap_mutation_received: 4239027
tap_mutation_sent: 163939
tap_opaque_received: 428
tap_opaque_sent: 56
tcp_nodelay: enable
threads: 4
time: 1398486612
total_connections: 8124
uptime: 3037
vb_active_curr_items: 3715265
vb_active_eject: 14446
vb_active_expired: 0
vb_active_ht_memory: 33298080
vb_active_itm_memory: 581201744
vb_active_meta_data_memory: 505693856
vb_active_num: 171
vb_active_num_non_resident: 3710822
vb_active_ops_create: 2492
vb_active_ops_delete: 0
vb_active_ops_reject: 0
vb_active_ops_update: 112874
vb_active_perc_mem_resident: 0
vb_active_queue_age: 0
vb_active_queue_drain: 162948
vb_active_queue_fill: 162950
vb_active_queue_memory: 64
vb_active_queue_pending: 234
vb_active_queue_size: 2
vb_dead_num: 0
vb_pending_curr_items: 0
vb_pending_eject: 0
vb_pending_expired: 0
vb_pending_ht_memory: 0
vb_pending_itm_memory: 0
vb_pending_meta_data_memory: 0
vb_pending_num: 0
vb_pending_num_non_resident: 0
vb_pending_ops_create: 0
vb_pending_ops_delete: 0
vb_pending_ops_reject: 0
vb_pending_ops_update: 0
vb_pending_perc_mem_resident: 0
vb_pending_queue_age: 0
vb_pending_queue_drain: 0
vb_pending_queue_fill: 0
vb_pending_queue_memory: 0
vb_pending_queue_pending: 0
vb_pending_queue_size: 0
vb_replica_curr_items: 3351407
vb_replica_eject: 3209177
vb_replica_expired: 0
vb_replica_ht_memory: 26370112
vb_replica_itm_memory: 2069147176
vb_replica_meta_data_memory: 456720944
vb_replica_num: 171
vb_replica_num_non_resident: 3196200
vb_replica_ops_create: 3351407
vb_replica_ops_delete: 0
vb_replica_ops_reject: 0
vb_replica_ops_update: 137134
vb_replica_perc_mem_resident: 4
vb_replica_queue_age: 0
vb_replica_queue_drain: 3507151
vb_replica_queue_fill: 3507151
vb_replica_queue_memory: 0
vb_replica_queue_pending: 0
vb_replica_queue_size: 0
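
To make the numbers above easier to compare, here is a minimal Python sketch that parses `name: value` lines like the ones in this dump and relates the eject counters to the memory water marks. The helper names `parse_cbstats` and `eject_summary` are hypothetical (they are not part of cbstats or any Couchbase tool), and the sketch assumes that ejection is driven by `mem_used` relative to `ep_mem_high_wat` and `ep_mem_low_wat`.

```python
def parse_cbstats(text):
    """Parse 'name: value' lines into a dict; non-integer values stay strings."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that are not 'name: value'
        value = value.strip()
        try:
            stats[name.strip()] = int(value)
        except ValueError:
            stats[name.strip()] = value  # e.g. timestamps, paths, floats
    return stats


def eject_summary(stats):
    """Relate eject failures to successful ejects and to the water marks."""
    failures = stats["ep_num_eject_failures"]
    ejects = stats["ep_num_value_ejects"]
    return {
        "failures_per_eject": failures / ejects if ejects else float("inf"),
        "above_high_wat": stats["mem_used"] > stats["ep_mem_high_wat"],
        "above_low_wat": stats["mem_used"] > stats["ep_mem_low_wat"],
    }


# Values copied from the dump above.
sample = """\
ep_mem_high_wat: 3039027200
ep_mem_low_wat: 1039027200
ep_num_eject_failures: 331083391
ep_num_value_ejects: 3363851
mem_used: 2923459464
"""

summary = eject_summary(parse_cbstats(sample))
print(summary)
```

For this particular dump the ratio is roughly 98 failed ejects per successful one, and mem_used sits between the low and high water marks.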

I think I gave logs from the wrong server; use this instead: https://gist.github.com/buger/35677e83ec4b1abe1644

If you haven’t already seen it, there’s a good blog post on monitoring a Couchbase cluster at: http://blog.couchbase.com/how-many-nodes-part-4-monitoring-sizing