Important Statistics for Diagnosis
The 'watermark' determines when it is necessary to start freeing up available memory. Some important statistics related to watermarks are:
High WaterMark (ep_mem_high_wat) - The system will start ejecting values out of memory when this watermark is met. Ejected values must be fetched from disk when they are accessed again.
Low WaterMark (ep_mem_low_wat) - The system does not do anything when this watermark is reached; it is the 'goal' the system works back down to once ejection has started as a result of the high watermark being met.
Memory Used (mem_used) - The current amount of memory in use. If mem_used hits the RAM quota you will get OOM errors, so mem_used should stay below ep_mem_high_wat, the point at which values begin to be ejected from memory.
Disk Write Queue Size (ep_queue_size) - The number of items in the queue waiting to be written to disk.
Cache Hits (get_hits) - The number of get requests that found the requested item. The rule of thumb is that this should be at least 90% of total requests.
Cache Misses (get_misses) - The number of get requests that did not find the requested item; this should remain a small fraction (roughly 10% or less) of total requests.
You can find values for these important stats with the following command:

/opt/membase/bin/ep_engine/management/stats <IP>:11210 all | egrep "todo|ep_queue_size|_eject|mem|max_data|hits|misses"

This will output:

ep_flusher_todo:
ep_max_data_size:
ep_mem_high_wat:
ep_mem_low_wat:
ep_num_eject_failures:
ep_num_value_ejects:
ep_queue_size:
mem_used:
get_misses:
get_hits:
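As an illustration, the short script below pulls those stats from one node and reports the cache hit ratio and how mem_used compares to the high watermark. It is a minimal sketch assuming the stats tool path and output format shown above; the node address is a placeholder.

    #!/bin/sh
    # Hypothetical helper: report cache hit ratio and memory headroom for one node.
    NODE=${1:-127.0.0.1}
    OUT=$(/opt/membase/bin/ep_engine/management/stats "$NODE":11210 all)
    hits=$(echo "$OUT" | awk '/^ *get_hits:/ {print $2}')
    misses=$(echo "$OUT" | awk '/^ *get_misses:/ {print $2}')
    used=$(echo "$OUT" | awk '/^ *mem_used:/ {print $2}')
    high=$(echo "$OUT" | awk '/^ *ep_mem_high_wat:/ {print $2}')
    total=$(( ${hits:-0} + ${misses:-0} ))
    # Rule of thumb from above: the hit ratio should be at least 90%.
    if [ "$total" -gt 0 ]; then
        echo "cache hit ratio: $(( ${hits:-0} * 100 / total ))%"
    fi
    # mem_used should stay below the high watermark at which ejection begins.
    echo "mem_used: $used (high watermark: $high)"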
Important UI Stats to Watch
You can add the following graphs to watch on the Membase console. These graphs can be selected or deselected by clicking on the "Configure View" link at the top of the Bucket Details (Monitor->Data Buckets) page of the Membase console. A command-line sketch for polling two of the counters behind these graphs follows the list.
Disk write queues - This should not keep growing (the actual numbers will depend on your application and deployment).
RAM ejections - There should be no sudden spikes.
Vbucket errors - An increasing value for vbucket errors is bad.
OOM errors per sec (This should be 0)
Temp OOM errors per sec (This should be 0)
Connections count (This should remain flat in a long running deployment)
Get hits per second
Get misses per second (This should be much lower than Get hits per second)
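The same counters that drive these graphs can also be polled from the command line. The sketch below watches the OOM counters on one node; the stat names ep_oom_errors and ep_tmp_oom_errors are assumptions based on common ep_engine naming and may differ across Membase versions, so verify them against the 'all' stats output shown earlier.

    #!/bin/sh
    # Hypothetical watcher: print the OOM counters every 5 seconds.
    # Both values should stay at 0 in a healthy deployment.
    NODE=${1:-127.0.0.1}
    while true; do
        /opt/membase/bin/ep_engine/management/stats "$NODE":11210 all \
            | egrep "oom_errors"
        sleep 5
    done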
Make sure that you monitor disk space, CPU usage and swapping on all your nodes, using the standard monitoring tools.
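On the operating-system side, the standard tools below cover those three areas; the data path is only an example and should be adjusted to your installation.

    # Disk space on the partition holding the Membase data files (example path)
    df -h /opt/membase
    # CPU usage and swap activity, sampled every 5 seconds;
    # nonzero si/so columns indicate swapping
    vmstat 5
    # Current memory and swap totals
    free -m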
Vacuuming reclaims disk space from sqlite by defragmenting the database file. You should vacuum your sqlite files regularly to return space that is empty inside the file but otherwise unusable.
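As an illustration, the sqlite3 command-line tool can vacuum a database file directly. This is a minimal sketch; the file path is an example only, and the vacuum should be run while Membase is not actively writing to that file.

    # Reclaim unusable free space in one sqlite data file
    # (example path; substitute your actual data file)
    sqlite3 /opt/membase/var/lib/membase/data/default-data/default 'VACUUM;'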
After the rebalancing operation itself is complete, the Membase cluster will start replicating any data that was moved. In the future we may include this replication process in the overall rebalancing itself, but for now you will have to be aware that some data may not be replicated immediately following a rebalance.
It is a best practice to continue to monitor the system until you are confident that replication has completed. There are essentially two stages to replication:
Backfilling - This is the first stage of replication. It involves reading all of the data for a given active vbucket and sending it to the server responsible for the replica. This can put increased load on the disk subsystem as well as on network bandwidth, but it is not designed to impact client activity. You can monitor the progress of this task by watching for ongoing TAP disk fetches and/or watching 'stats tap':

/opt/membase/bin/ep_engine/management/stats <membase_node>:11210 tap | grep backfill

This will return a list of TAP backfill processes and whether they are still running (true) or done (false). When all have completed, you should see the total item count (curr_items_tot, as opposed to the active item count, curr_items) equal the number of active items times one plus the number of configured replicas. If you are continuously adding data to the system, these values may not line up exactly at a given instant in time, but it should be clear from an order-of-magnitude sense whether your items are properly replicated (see the sketch after this list). Until backfilling has completed, you should avoid using the "failover" functionality, since that may result in loss of the data that has not been replicated yet.
Draining - After the backfill process is complete, all nodes that had replicas materialized on them will also need to persist those items to disk. It is important to continue monitoring the disk write queue and memory usage on those nodes until the write queues have drained back to their normal levels.
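Putting both stages together, the sketch below is one way to watch a single node until replication has settled: it waits for all TAP backfills to finish, then waits for the disk write queue to drain, and finally prints the item counts for a manual sanity check. It is a minimal sketch assuming the stats tool and output format shown above; the exact 'stats tap' field layout varies by version, so treat the grep patterns as assumptions to verify on your cluster.

    #!/bin/sh
    # Hypothetical watcher: wait for backfills to finish and the write queue to drain.
    NODE=${1:-127.0.0.1}
    STATS=/opt/membase/bin/ep_engine/management/stats
    # 1. Wait until no backfill reports 'true', i.e. all backfills are done.
    while "$STATS" "$NODE":11210 tap | grep backfill | grep -q true; do
        echo "backfill still running..."
        sleep 10
    done
    # 2. Wait until the disk write queue has drained to zero.
    while true; do
        queue=$("$STATS" "$NODE":11210 all | awk '/^ *ep_queue_size:/ {print $2}')
        [ "${queue:-1}" -eq 0 ] && break
        echo "ep_queue_size: $queue"
        sleep 10
    done
    # 3. Sanity check: with R replicas, curr_items_tot should be
    #    roughly curr_items * (R + 1).
    "$STATS" "$NODE":11210 all | egrep "curr_items"
    echo "replication drained on $NODE"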