Replica disk queue not draining but active is

Mon, 12/10/2012 - 11:59
dan

Hardware Summary: 11-node 1.8.1 cluster with 128GB of RAM per node (~97GB per node allocated to Couchbase)

Our main bucket (set up on a dedicated port, 11216, with no password) held 1.5 billion keys until about 550 million expired or were manually deleted over the past 6 weeks; it now holds about 950 million keys.

Over the past week the Web UI has been constantly timing out or failing to show full stats for all servers. When I drilled down to the disk queue section, I saw that the Replica disk queue has been showing ~7 million items for the past few weeks, while the Active disk queue shows under 2000:

A few pictures:

http://imageshack.us/a/img820/1494/statsheg.png

I have no idea why this image reports 0 bytes available.

http://imageshack.us/a/img849/8299/clusterh.png

Here you can see some bogus disk information on one of the servers. Three servers show bad disk information and the other 8 look fine.

http://imageshack.us/a/img541/7672/diskspace.png

A manual check of free space on the volume for that server shows it is fine:

:/opt/couchbase> df -h /dev/sda3
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3       2.2T   60G  2.0T   3% /vm

All server hardware/operating systems are identical. Before this happened all 11 servers showed correct disk information. Block size is 4096 for the file system so it's not an issue of the SQLite files reaching 16GB each.

The pager has been running:

:/opt/couchbase> bin/cbstats localhost:11216 all | grep pager
 ep_exp_pager_stime:             39600
 ep_num_expiry_pager_runs:       33724
 ep_num_pager_runs:              0

I tried to change the pager stime to an hour, but instead of replacing the original value it only adds whatever value I enter to it.

Here you can see the 7 million replica items:

:/opt/couchbase> bin/cbstats localhost:11216 all | grep queue
 ep_diskqueue_drain:             23737401049
 ep_diskqueue_fill:              23690077845
 ep_diskqueue_items:             7674564
 ep_diskqueue_memory:            306982560
 ep_diskqueue_pending:           597676270
 ep_queue_age_cap:               9900
 ep_queue_size:                  2579
 ep_tap_bg_fetch_requeued:       0
 ep_total_enqueued:              23702822810
 vb_active_queue_age:            10296983000
 vb_active_queue_drain:          11739287663
 vb_active_queue_fill:           11739288889
 vb_active_queue_memory:         56480
 vb_active_queue_pending:        108442
 vb_active_queue_size:           1412
 vb_pending_queue_age:           0
 vb_pending_queue_drain:         0
 vb_pending_queue_fill:          0
 vb_pending_queue_memory:        0
 vb_pending_queue_pending:       0
 vb_pending_queue_size:          0
 vb_replica_queue_age:           6679257552578000
 vb_replica_queue_drain:         11998113386
 vb_replica_queue_fill:          11950788956
 vb_replica_queue_memory:        306926080
 vb_replica_queue_pending:       597567828
 vb_replica_queue_size:          7673152
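
To double-check the numbers above, here is a minimal Python sketch that splits the disk queue by vbucket state. It assumes only the "name: value" layout of the cbstats output pasted in this post; the helper names parse_cbstats and queue_breakdown are made up for illustration and are not part of any Couchbase tool.

```python
def parse_cbstats(text):
    """Parse 'name: value' lines (as pasted from cbstats) into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def queue_breakdown(stats):
    """Return per-state disk queue sizes and their sum."""
    sizes = {
        state: stats.get(f"vb_{state}_queue_size", 0)
        for state in ("active", "replica", "pending")
    }
    return sizes, sum(sizes.values())


# Values copied from the cbstats output in this post.
SAMPLE = """
 ep_diskqueue_items:             7674564
 vb_active_queue_size:           1412
 vb_pending_queue_size:          0
 vb_replica_queue_size:          7673152
"""

stats = parse_cbstats(SAMPLE)
sizes, total = queue_breakdown(stats)
for state, size in sizes.items():
    share = 100.0 * size / total if total else 0.0
    print(f"{state:8s} {size:>10d} items ({share:.2f}% of queue)")
print("sum equals ep_diskqueue_items:", total == stats["ep_diskqueue_items"])
```

On these numbers the active, pending and replica sizes add up exactly to ep_diskqueue_items, so practically the whole disk queue on this node is replica items.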

If I check memory on the servers they all look about like this:

             total       used       free     shared    buffers     cached
Mem:        129058     128573        485          0        569      56311
-/+ buffers/cache:      71691      57366
Swap:         2053        205       1848

It concerns me that the kernel cache is so high (about 55GB) and swap has been used.
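
For reference, a small Python sketch of how the "-/+ buffers/cache" row relates to the "Mem:" row. It assumes the figures above are in MB (as printed by something like free -m; the exact command is not shown in this post), and parse_free is a hypothetical helper name.

```python
def parse_free(text):
    """Parse the 'Mem:' row of free-style output into a dict keyed by column name."""
    lines = text.strip("\n").splitlines()
    header = lines[0].split()
    for line in lines[1:]:
        fields = line.split()
        if fields and fields[0] == "Mem:":
            return dict(zip(header, map(int, fields[1:])))
    raise ValueError("no 'Mem:' row found")


# Header and Mem: row copied from the output in this post (assumed to be MB).
SAMPLE = """
             total       used       free     shared    buffers     cached
Mem:        129058     128573        485          0        569      56311
"""

mem = parse_free(SAMPLE)
reclaimable = mem["buffers"] + mem["cached"]  # memory the kernel can give back
app_used = mem["used"] - reclaimable          # memory actually held by processes
print(f"buffers + page cache: {reclaimable} MB ({reclaimable / 1024:.1f} GB)")
print(f"used by processes:    {app_used} MB ({app_used / 1024:.1f} GB)")
```

The process figure computed this way lands within a couple of MB of the 71691 shown on the "-/+ buffers/cache" line; the small gap is presumably rounding in the tool's own output.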

Any ideas on how to get the replica queue to flush? Should I try stopping/starting persistence? The disks are not fully utilized: iostat bounces around between 1-30% but usually stays under 10%.

If I restart the cluster it usually takes 3 hours to warm back up. My concern is that if items in the replica disk queue have not been written to disk, the replicas may not match the actives when I bring the cluster back up.

Thanks!
Dan

Mon, 12/10/2012 - 12:14
rsbjeremy

I don't have an answer for you, Dan, but I am running into a similar problem. In my case, a bucket is not flushing to cache at all. I have already tried stopping and restarting persistence (no change) and made several other attempts at a solution. The settings for the bucket that is not working look exactly the same as those of my default bucket, which is working. The primary difference is that we set a TTL on the items that go into the broken bucket. In addition to not flushing to cache, items are also not being properly flushed from RAM when they expire, so the bucket just keeps growing in RAM until it runs out.

I haven't tried bouncing the cluster yet, because I'm in a worse state than you are and suspect I would lose everything. I may resort either to rolling restarts of my nodes or to returning to version 1.8.0 if I don't make some headway soon. I'll keep you posted if I learn anything about my problem, in case they are related. I'd be grateful if you could do the same.

Good luck!
