Membase Manual 1.7

4.4. Monitoring Membase

4.4.1. Membase Statistics

There are a number of different ways to monitor Membase. Before you set up your monitoring procedure, however, you should be aware of a few basic issues.

Port numbers and accessing different buckets

In a Membase cluster, any communication (stats or data) to a port other than 11210 goes through a Moxi process. This means that any stats request sent to such a port is aggregated across the cluster, which can produce inconsistent or confusing results for stats that are not "aggregatable".

In general, it is best to run all your stat commands against port 11210 which will always give you the information for the specific node that you are sending the request to. It is a best practice to then aggregate the relevant data across nodes at a higher level (in your own script or monitoring system).
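The per-node collection and higher-level aggregation described above can be sketched in a small script. This is a minimal illustration, not part of Membase: the helper names (`parse_stats`, `fetch_node_stats`, `aggregate`) are hypothetical, the tool path is the Linux path quoted in this manual, and the parser assumes each output line has the `name:value` shape shown in the samples in this section.

```python
import subprocess

# Linux path quoted in this manual; adjust for your installation.
STATS_TOOL = "/opt/membase/bin/ep_engine/management/stats"


def parse_stats(text):
    """Turn 'name:value' lines into a dict of strings.

    Assumes the stat name contains no colon; lines without a colon are skipped.
    """
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            stats[name.strip()] = value.strip()
    return stats


def fetch_node_stats(host, port=11210):
    """Run the stats tool against one node's direct port (bypassing Moxi)."""
    result = subprocess.run(
        [STATS_TOOL, f"{host}:{port}", "all"],
        capture_output=True, text=True, check=True,
    )
    return parse_stats(result.stdout)


def aggregate(per_node, names):
    """Sum the named numeric stats across all nodes."""
    totals = {name: 0 for name in names}
    for stats in per_node.values():
        for name in names:
            totals[name] += int(stats.get(name, 0))
    return totals
```

A monitoring script could call `fetch_node_stats` once per node, store the results in a dict keyed by host name, and pass that dict to `aggregate` with only those stats that are meaningful to sum, such as `curr_items`.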

When you run the commands below (or any other stats command) without supplying a bucket name and/or password, they return results for the default bucket, or produce an error if no default bucket exists.

To access a bucket other than the default, you will need to supply the bucket name and/or password at the end of the command. A bucket created on a dedicated port does not require a password.
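As an illustration of supplying the bucket name and password at the end of the command, the sketch below builds the argument list for the stats tool. The function name `build_stats_command` is hypothetical, and the argument order (bucket name first, then password) is an assumption based on the description above; check it against your installed tool before relying on it.

```python
# Linux path quoted in this manual; adjust for your installation.
STATS_TOOL = "/opt/membase/bin/ep_engine/management/stats"


def build_stats_command(host, bucket=None, password=None, port=11210):
    """Build the argv list for a stats call against one node.

    With no bucket given, the tool reports on the default bucket.
    Assumed order of the trailing arguments: bucket name, then password.
    """
    command = [STATS_TOOL, f"{host}:{port}", "all"]
    if bucket is not None:
        command.append(bucket)
        # Buckets on a dedicated port need no password, so it is optional.
        if password is not None:
            command.append(password)
    return command
```

The resulting list can be passed directly to `subprocess.run`, which avoids shell quoting problems with passwords that contain special characters.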

Monitoring startup (warmup)

If a Membase node is starting up for the first time, it will create whatever DB files necessary and begin serving data immediately. However, if there is already data on disk (likely because the node rebooted or the service restarted) the node needs to read all of this data off of disk before it can begin serving data. This is called "warmup". Depending on the size of data, this can take some time.

When starting up a node, there are a few statistics to monitor. Use the /opt/membase/bin/ep_engine/management/stats (Linux) or C:\Program Files\Membase\Server\bin\ep_engine\management\stats (Windows) command to watch the warmup and item stats:

/opt/membase/bin/ep_engine/management/stats localhost:11210 all | egrep "warm|curr_items"

curr_items:0
curr_items_tot:15687
ep_warmed_up:15687
ep_warmup:false
ep_warmup_dups:0
ep_warmup_oom:0
ep_warmup_thread:running
ep_warmup_time:787

And when it is complete:

/opt/membase/bin/ep_engine/management/stats localhost:11210 all | egrep "warm|curr_items"

curr_items:10000
curr_items_tot:20000
ep_warmed_up:20000
ep_warmup:true
ep_warmup_dups:0
ep_warmup_oom:0
ep_warmup_thread:complete
ep_warmup_time:1400

Table 4.10. Stats

Stat              Description
curr_items        The number of items currently active on this node. During warmup, this will be 0 until warmup is complete.
curr_items_tot    The total number of items this node knows about (active and replica). During warmup, this will be increasing and should match ep_warmed_up.
ep_warmed_up      The number of items retrieved from disk. During warmup, this should be increasing.
ep_warmup_dups    The number of duplicate items found on disk. Ideally this should be 0, but a few duplicates are not a problem.
ep_warmup_oom     How many times the warmup process received an Out of Memory response from the server while loading data into RAM.
ep_warmup_thread  The status of the warmup thread. Can be either running or complete.
ep_warmup_time    How long the warmup thread has been running. During warmup this number should be increasing; when warmup is complete, it tells you how long the process took.
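A monitoring script can use these stats to decide whether a node has finished warming up. The sketch below is one possible check, assuming the `name:value` output shape shown in the samples above; the helper names `parse_stats` and `warmup_status` are hypothetical.

```python
def parse_stats(text):
    """Turn 'name:value' lines into a dict of strings."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            stats[name.strip()] = value.strip()
    return stats


def warmup_status(stats):
    """Summarize warmup progress from one node's stats."""
    return {
        # The thread status is either "running" or "complete".
        "complete": stats.get("ep_warmup_thread") == "complete",
        "items_loaded": int(stats.get("ep_warmed_up", 0)),
        "duplicates": int(stats.get("ep_warmup_dups", 0)),
        "oom_errors": int(stats.get("ep_warmup_oom", 0)),
    }
```

A caller might poll this every few seconds until `complete` is true, and raise an alert if `oom_errors` is greater than zero.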

Disk Write Queue

Membase is a persistent database which means that part of monitoring the system is understanding how we interact with the disk subsystem.

Since Membase is an asynchronous system, any mutation operation is committed first to DRAM and then queued to be written to disk. The client is returned an acknowledgement almost immediately so that it can continue working. There is replication involved here too, but we're ignoring it for the purposes of this discussion.

We have implemented disk writing as a 2-queue system, and both queues are tracked by the stats. The first queue is where mutations are immediately placed. Whenever there are items in that queue, our "flusher" (disk writer) comes along, takes all the items off of that queue, places them into the other one and begins writing to disk. Since disk is so much slower than RAM, this allows us to continue accepting new writes while we are (possibly slowly) writing earlier ones to the disk.

The flusher will process 250k items at a time, then perform a disk commit, and continue this cycle until its queue is drained. When it has completed everything in its queue, it will either grab the next group from the first queue or essentially sleep until there are more items to write.
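The two-queue cycle described above can be modelled with a toy simulation. This is only an illustration of the described behaviour, not Membase's actual implementation: the class and attribute names are hypothetical, the "disk" is a plain list, and everything runs in a single thread.

```python
from collections import deque


class TwoQueueWriter:
    """Toy model of the incoming queue, the flusher queue, and batch commits."""

    def __init__(self, batch_size=250_000):
        self.incoming = deque()  # new mutations are placed here
        self.todo = deque()      # items the flusher is currently writing
        self.batch_size = batch_size
        self.disk = []           # stand-in for persisted items
        self.commits = 0

    def enqueue(self, item):
        """Accept a mutation; the caller does not wait for the disk."""
        self.incoming.append(item)

    def flush(self):
        """One flusher cycle: move everything from incoming to todo,
        then write todo in batches, committing after each batch.
        Returns the number of items written."""
        self.todo.extend(self.incoming)
        self.incoming.clear()
        written = 0
        while self.todo:
            batch = min(self.batch_size, len(self.todo))
            for _ in range(batch):
                self.disk.append(self.todo.popleft())
            self.commits += 1  # one disk commit per batch
            written += batch
        return written
```

With a batch size of 3 and 7 queued items, one call to `flush` writes all 7 items in three commits (3, 3, and 1 items). Items enqueued afterwards wait in `incoming` until the next cycle.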

Handling Reads

In its current implementation, SQLite only has one connection into the database for both reads and writes. This means that we need to do something special when reads come in while the flusher is writing to disk.

If a request comes in for an item that is on disk, it will preempt the writing process in order to serve this read.

Monitoring the Disk Write Queue

There are two ways to monitor the disk write queue: at a high level from the Web UI, or at a low level from the individual node statistics.

From the Web UI, click on Monitor Data Buckets and select the bucket that you want to monitor. Click "Configure View" in the top right corner and select the "Disk Write Queue" statistic. When you close this window, a new mini-graph appears, showing the Disk Write Queue for all nodes in the cluster. To get a deeper view into this statistic, you can monitor each node individually using the 'stats' output (see here for more information about gathering node-level stats). There are two statistics to watch here:

ep_queue_size (where new mutations are placed)
flusher_todo (the queue of items currently being written to disk)
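For node-level monitoring, a script can combine the two statistics into a single backlog figure per node. The sketch below assumes that the total number of items waiting to reach disk is the sum of the two queues, and that each node's stats have already been parsed into a dict; the function names are hypothetical.

```python
def disk_write_queue(stats):
    """Items waiting to be persisted on one node (both queues combined)."""
    return int(stats.get("ep_queue_size", 0)) + int(stats.get("flusher_todo", 0))


def cluster_disk_write_queue(per_node):
    """Map each node name to its combined disk write queue length."""
    return {node: disk_write_queue(stats) for node, stats in per_node.items()}
```

Tracking this value over time per node makes it easier to spot a single node whose disk is falling behind, which a cluster-wide graph can hide.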

See dispatcher for more information about monitoring what the disk subsystem is doing at any given time.