Replica disk queue not draining but active is
Hardware Summary: 11-node 1.8.1 cluster with 128GB of RAM per node (~97GB per node allocated to Couchbase)
Our main bucket (set up on dedicated 11216 port, no password) had 1.5 billion keys in it until about 550 million expired or were manually deleted over the past 6 weeks and it is now at 950 million keys.
Over the past week the Web UI has been constantly timing out or not showing full stats for all servers. When I drilled down to the disk queue section I saw that the Replica disk queue has been showing ~7 million items for the past few weeks while the Active disk queue shows under 2000.
A few pictures:
http://imageshack.us/a/img820/1494/statsheg.png
I have no idea why this image reports 0 bytes available.
http://imageshack.us/a/img849/8299/clusterh.png
Here you can see some bogus disk information on one of the servers. There are 3 servers that show bad disk information and the other 8 look fine.
http://imageshack.us/a/img541/7672/diskspace.png
A manual check of free space on the volume for that server shows it is fine:
:/opt/couchbase> df -h /dev/sda3
Filesystem  Size  Used  Avail  Use%  Mounted on
/dev/sda3   2.2T  60G   2.0T   3%    /vm
All server hardware/operating systems are identical. Before this happened all 11 servers showed correct disk information. Block size is 4096 for the file system so it's not an issue of the SQLite files reaching 16GB each.
The pager has been running:
:/opt/couchbase> bin/cbstats localhost:11216 all | grep pager
ep_exp_pager_stime: 39600
ep_num_expiry_pager_runs: 33724
ep_num_pager_runs: 0
I tried to change the pager stime to an hour, but instead of replacing the setting it only adds whatever value I put in to the existing value.
Here you can see the 7 million replica items:
:/opt/couchbase> bin/cbstats localhost:11216 all | grep queue
ep_diskqueue_drain: 23737401049
ep_diskqueue_fill: 23690077845
ep_diskqueue_items: 7674564
ep_diskqueue_memory: 306982560
ep_diskqueue_pending: 597676270
ep_queue_age_cap: 9900
ep_queue_size: 2579
ep_tap_bg_fetch_requeued: 0
ep_total_enqueued: 23702822810
vb_active_queue_age: 10296983000
vb_active_queue_drain: 11739287663
vb_active_queue_fill: 11739288889
vb_active_queue_memory: 56480
vb_active_queue_pending: 108442
vb_active_queue_size: 1412
vb_pending_queue_age: 0
vb_pending_queue_drain: 0
vb_pending_queue_fill: 0
vb_pending_queue_memory: 0
vb_pending_queue_pending: 0
vb_pending_queue_size: 0
vb_replica_queue_age: 6679257552578000
vb_replica_queue_drain: 11998113386
vb_replica_queue_fill: 11950788956
vb_replica_queue_memory: 306926080
vb_replica_queue_pending: 597567828
vb_replica_queue_size: 7673152
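For anyone comparing nodes, here is a minimal Python sketch that parses "name: value" lines like the cbstats output above and puts the active and replica queues side by side. The function names parse_cbstats and queue_summary are made up for this illustration, and the sample is just a subset of the numbers posted above. It makes no claim about what the fill and drain counters mean internally. It only shows the arithmetic: on this node the active and replica queue sizes add up to ep_diskqueue_items, and the replica drain counter is larger than the replica fill counter even though the queue is not empty.

```python
# Subset of the cbstats output quoted in the post above.
SAMPLE = """\
ep_diskqueue_items: 7674564
vb_active_queue_size: 1412
vb_active_queue_fill: 11739288889
vb_active_queue_drain: 11739287663
vb_replica_queue_size: 7673152
vb_replica_queue_fill: 11950788956
vb_replica_queue_drain: 11998113386
"""


def parse_cbstats(text):
    """Parse 'name: value' lines into a dict of ints; skip anything else."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.lstrip("-").isdigit():
            stats[name.strip()] = int(value)
    return stats


def queue_summary(stats, kind):
    """Return (size, fill - drain) for kind 'active', 'replica' or 'pending'."""
    prefix = "vb_%s_queue_" % kind
    size = stats[prefix + "size"]
    net = stats[prefix + "fill"] - stats[prefix + "drain"]
    return size, net


if __name__ == "__main__":
    stats = parse_cbstats(SAMPLE)
    for kind in ("active", "replica"):
        size, net = queue_summary(stats, kind)
        print("%-8s size=%d fill-drain=%d" % (kind, size, net))
    total = stats["vb_active_queue_size"] + stats["vb_replica_queue_size"]
    print("active+replica=%d ep_diskqueue_items=%d"
          % (total, stats["ep_diskqueue_items"]))
```

Feeding it the full output of the same cbstats command from each node would show whether the replica backlog is spread evenly or concentrated on a few servers.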
If I check memory on the servers they all look about like this:
                     total    used    free  shared  buffers  cached
Mem:                129058  128573     485       0      569   56311
-/+ buffers/cache:           71691   57366
Swap:                 2053     205    1848
It concerns me that the kernel cache is so high (roughly 55GB) and that swap has been used.
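As a rough check of those numbers, the sketch below redoes the arithmetic from the memory listing, assuming the figures are in MB (as `free -m` would print them; the post does not say which flag was used). The variable names are only for illustration. Subtracting buffers and page cache from "used" gives the memory actually held by processes, which lands within a couple of MB of the "-/+ buffers/cache" row; the small difference is presumably rounding in the listing.

```python
# Figures copied from the memory listing above; units assumed to be MB.
total, used, free_mb = 129058, 128573, 485
buffers, cached = 569, 56311
swap_total, swap_used = 2053, 205


def mb_to_gb(mb):
    """Convert MB to GB using 1024 MB per GB."""
    return mb / 1024.0


# Memory held by processes, excluding kernel buffers and page cache.
process_used = used - buffers - cached
# Memory the kernel could hand back: free plus buffers plus page cache.
reclaimable = free_mb + buffers + cached

if __name__ == "__main__":
    print("page cache:     %.1f GB" % mb_to_gb(cached))
    print("process memory: %.1f GB" % mb_to_gb(process_used))
    print("reclaimable:    %.1f GB" % mb_to_gb(reclaimable))
    print("swap in use:    %d of %d MB" % (swap_used, swap_total))
```

By this reckoning processes hold about 70GB per node, which is below the ~97GB allocated to Couchbase, and most of the remainder is page cache rather than memory that is truly unavailable.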
Any ideas on how to get the replica queue to flush? Should I try stopping/starting persistence? The disks are not getting fully utilized; iostat bounces around between 1-30% but usually hangs around under 10%.
If I restart the cluster it usually takes 3 hours for it to warm back up. My concern is that if items in the replica disk queue have not been written to disk, there is the possibility of the replicas not matching the actives when I bring the cluster back up.
Thanks!
Dan
I don't have an answer for you, Dan, but I am running into a similar problem. In my case, a bucket is not flushing to cache at all. I have already tried stopping and restarting persistence (no change), and several other attempts at solutions. The settings for the bucket that is not working look exactly the same as my default bucket, which is working. The primary difference is that we set TTL for the items that go into the broken bucket. In addition to not flushing to cache, by the way, items are also not properly flushing from RAM when they expire, so the bucket just keeps growing in RAM until it runs out.
I haven't tried bouncing the cluster yet, because I'm in a worse state than you are and suspect I would lose everything. I may either resort to rolling restarts of my nodes or return to version 1.8.0 if I don't make some headway soon. I'll keep you posted if I learn anything about my problem, in case they are related. I'd be grateful if you could do the same.
Good luck!