Confusion about the real meaning of some Couchbase stats
Hi,
During a few minutes of high IO-wait load spikes on Couchbase nodes, I see the following stats drop to zero:
vb_active_ops_create
vb_active_ops_update
vb_replica_ops_create
vb_replica_ops_update
vb_pending_ops_create
vb_pending_ops_update
Meanwhile, the "total amount of operations" does not drop during these spikes, and I see no errors on the PHP memcached client side when writing to Couchbase. Note that the load is write-only (set/update): we are testing a real write workload with real production data, so no reads/gets happen yet, which means the entire "total amount of operations" consists of writes.
So what do these 6 stats really mean?
I thought they give the number of write operations (creates/updates) performed in RAM on bucket items, and that if writes to the bucket in RAM drop to zero, then all write operations are failing and I should see errors on the client side like "unable to write to memcache".
Note this is on EC2 medium instances.
Then I tried an EC2 small instance and observed the same behaviour, with one difference: during the high IO load spikes, while the single CPU in the small instance was at 100% wait, a small percentage of the write operations did fail on the client side with "unable to write to memcache". So I thought that, since the CPU is at 100% IO wait, adding more CPUs would prevent this and the 6 stats would not drop to zero. I switched to medium instances and, surprisingly, the client-side write operations no longer fail, but the stats still drop to zero.
PS. I also noticed that the IO writes are 100% random - no write merges happen at all at the disk level - which is surprisingly rare. Is this how sqlite writes are done? AFAIK sqlite has a commit log where writes are done sequentially, so what I see is very strange.
Thanks
Alex
Hi,
OK, this is also how I understood the ops_create and ops_update stats. But as noted, they drop to zero during heavy IO.
This can be clearly seen in the graphs - is there a way I could post snapshots as attachments in this forum?
One thing I can tell for sure is that the stats drop to zero during the hourly process of cleaning up expired items - during this process the disk write queue grows from just 100-300 items to 300,000-500,000 items. But since all sets/updates continue to succeed according to the client side, does it mean the ops_create and ops_update stats are just wrong?
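The mismatch described above can be checked mechanically if the stats are sampled periodically. Below is a minimal Python sketch, assuming each sample is a dict of stat name to value. The key "total_ops" is a hypothetical name standing in for the "total amount of operations" figure, and suspicious_samples is an illustrative helper, not a Couchbase API.

```python
# The six write-related stats named earlier in this thread.
WRITE_STATS = (
    "vb_active_ops_create", "vb_active_ops_update",
    "vb_replica_ops_create", "vb_replica_ops_update",
    "vb_pending_ops_create", "vb_pending_ops_update",
)


def suspicious_samples(samples, total_key="total_ops"):
    """Return indexes of samples where all six write stats are zero
    while the total operation count is still above zero.

    total_key is an assumed name for the "total amount of operations" value.
    """
    flagged = []
    for i, sample in enumerate(samples):
        # A missing stat is treated as zero.
        all_zero = all(sample.get(name, 0) == 0 for name in WRITE_STATS)
        if all_zero and sample.get(total_key, 0) > 0:
            flagged.append(i)
    return flagged
```

Lining up the flagged sample times against the hourly expiration run would show whether the two really coincide.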
I'm not sure I understand your question about client-side activity - this is just the memcached PHP extension setting/updating items all the time. Nothing special.
Thanks
Alex
Also, is there a way to control/tune the hourly item expiration process? As it is, it creates a very heavy IO load.
You can change how often the expiration job runs with the following command:
/opt/couchbase/bin/cbflushctl localhost:11210 set exp_pager_stime 3600
The value is in seconds and here I have specified it to run every hour (which is the default).
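Since the value is given in seconds, a small helper can do the conversion when trying other intervals. This is only a sketch in Python: exp_pager_command is a hypothetical helper that builds the argument list in the same form as the command above and does not run anything. Which interval values the server accepts is not stated here, so treat any value other than 3600 as untested.

```python
def exp_pager_command(hours, host="localhost:11210"):
    """Build the cbflushctl argument list for a given interval in hours.

    Hypothetical helper: it only mirrors the command form shown above.
    """
    seconds = int(hours * 3600)  # exp_pager_stime is expressed in seconds
    if seconds <= 0:
        raise ValueError("interval must be positive")
    return ["/opt/couchbase/bin/cbflushctl", host,
            "set", "exp_pager_stime", str(seconds)]


# Prints the default hourly command:
print(" ".join(exp_pager_command(1)))
```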
Also, I'll try to take a look at the stats issue tomorrow morning.
vb_active - This prefix means that the stat is the accumulation over all of the vbuckets in your server marked active. An active vbucket is the master vbucket.
vb_replica - This is the accumulation of all of the vbuckets in a server marked replica.
vb_pending - This is the accumulation of all of the vbuckets in a server marked pending. A vbucket can be in the pending state during a cluster rebalance, which means that the vbucket is being copied. The vbucket marked pending is the destination of the copy; the source is the corresponding active or replica vbucket. Once the copying is finished, the vbucket will be marked either active or replica.
ops_create - The number of times we created new data.
ops_update - The number of times we updated existing data.
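To make the naming scheme concrete: each of the six stats is a vbucket-state prefix combined with an operation suffix. The Python sketch below splits names that way and sums creates and updates per state for one sample. split_stat and writes_by_state are hypothetical helpers for illustration, not part of Couchbase, and the sample is assumed to be a dict of stat name to value.

```python
STATES = ("active", "replica", "pending")
OPS = ("ops_create", "ops_update")


def split_stat(name):
    """Return (state, op) for names of the form vb_<state>_<op>, else None."""
    for state in STATES:
        prefix = "vb_" + state + "_"
        if name.startswith(prefix) and name[len(prefix):] in OPS:
            return state, name[len(prefix):]
    return None


def writes_by_state(sample):
    """Sum creates and updates per vbucket state for one stats sample."""
    totals = {state: 0 for state in STATES}
    for name, value in sample.items():
        parsed = split_stat(name)
        if parsed is not None:  # ignore stats outside the six of interest
            totals[parsed[0]] += value
    return totals
```

For example, split_stat("vb_replica_ops_update") returns ("replica", "ops_update").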
The stats shouldn't be dropping to zero, so you might have found a bug. Can you explain exactly what you are doing on the client side so I can reproduce this and get it fixed?