
Basic Question: Why would there be disk fetches when there is unallocated RAM?

17 replies [Last post]
Sun, 02/13/2011 - 16:57
esilverberg
Offline
Joined: 01/03/2011
Groups: None

 Folks,

I have a web app with 7 c1.mediums running 20 nginx/passenger processes apiece, each of which connects to a membase cluster of 3 membase m1.small instances. 

Periodically the global waiting queue spikes to 60 on every box, which typically means one of the three external dependencies my web app has - MySQL, Membase or Redis - is blocking or stalled. 

I believe membase is the cause of this global waiting queue spike I am seeing now. When I look at the membase web console, I see several graphs that I cannot explain.

So, let me start with a basic question - why would there be any disk fetches if my membase cluster is only 1/3 used (1GB in use out of 3GB total)? Shouldn't disk fetching happen only when I am at or near the 3GB limit? 

Thanks!

-Eric

 

Mon, 02/14/2011 - 11:11
perry
Offline
Joined: 10/11/2010
Groups:

Eric, could you send over some screenshots showing the behavior that you're seeing?  It would also be good to run this command on the 3 Membase servers:

'/opt/membase/bin/ep_engine/management/stats <IP>:11210 all <bucket_name> [<bucket_password>]'

 

-Replace <IP> with the IP address of each Membase server.  Optionally, you can use 'localhost' if you run the command on each system individually

-Replace <bucket_name> with the name of your bucket.  If using the default bucket, you can omit this

-If you've got a bucket configured with SASL authentication you'll also need to supply the password on the end

 

Thanks

 

Perry

__________________

Forum support is great for free but sometimes you need a guaranteed response time and dedicated resources for your questions or issues.
Consider purchasing enterprise-level support from Couchbase: http://www.couchbase.com/products-and-services/overview
Call or email "sales -at- couchbase-dot- com" today!

Mon, 02/14/2011 - 12:18
esilverberg
Offline
Joined: 01/03/2011
Groups: None

OK, I have attached stats from all of my machines and a screenshot showing a bunch of disk reads yesterday. In an effort to combat this effect, I added a fourth membase machine, even though based on my reading of free RAM, that would make me extremely over-provisioned. Since I did that, the disk reads appear to have gone down to zero. 

Here are the screen shot and stats files

There are three screen shot files in this ZIP file:

  • One showing up to 25 disk fetches/sec yesterday
  • One showing day-long sparklines of all the other graph types provided by membase. Note the RAM ejections/sec graph, also very volatile yesterday. The Disk Write Queue size is also curious in its shape, though I do not know what that means. 
  • One showing my current cluster overview. Note that I had one less machine (m1.small) during the period I was experiencing so many disk reads

 

You can also see my memory bytes used is basically unchanged from yesterday/today. 

I have also attached the output of the stats command from my four machines. Again, only three were present during the period of disk reading yesterday -- the two 1B machines and one of the 1As. 

Finally, although I enabled SASL authentication, when I attempted to supply a username/password, I received an authentication error. When I ran the command without the username/password, it worked fine. I had ssh'd directly into the machine and was running the command with an IP address of localhost. 

Thank you,

-Eric

Mon, 02/14/2011 - 12:42
perry
Offline
Joined: 10/11/2010
Groups:

 Thanks Eric, I'm looking through those now.  At first glance it appears you're running a very old version of the software, and it would be very helpful to upgrade and confirm this behavior with 1.6.5.  We've made a number of stability and performance improvements, specifically around our disk activity.

 

Can you please upgrade and then let me know if this issue is resolved?

 

Perry

Mon, 02/14/2011 - 12:47
esilverberg
Offline
Joined: 01/03/2011
Groups: None

 OK I will upgrade. For the record, I selected the AMI available when searching on the words "membase" under community AMIs in Amazon. It would be great if you guys could push out a new AMI running your newer versions. 

Thanks,

-Eric

Mon, 02/14/2011 - 12:52
perry
Offline
Joined: 10/11/2010
Groups:

 Ahh yes, we'll need to get on that as soon as possible.

 

Thanks

 

Perry

Mon, 02/14/2011 - 13:42
esilverberg
Offline
Joined: 01/03/2011
Groups: None

Also, FWIW, I just hit this bug attempting to roll my own instance using the latest Alestic 10.4 instance store ami:  ami-7000f019 

It seems to be OK using the latest Alestic 10.10 instance store ami:  ami-a6f504cf 

Cheers,

Eric

Mon, 02/14/2011 - 14:13
perry
Offline
Joined: 10/11/2010
Groups:

 Thanks Eric, we're hoping to have that bug fixed shortly in an upcoming release.

Mon, 02/14/2011 - 21:34
esilverberg
Offline
Joined: 01/03/2011
Groups: None

 I have completed an upgrade to the latest membase - 1.6.5 - and I continue to see inexplicable disk writes and RAM ejections. I also see the same behavior in my app of random spikes in the global wait queue, as I explained in my first post to this thread. I have attached the server output per your instructions earlier and taken screenshots of the graphs that I do not understand. I have three membase servers, and about 50% of the RAM is unallocated. Let me know if there is additional diagnostic information I can provide. 

Here is a link to my membase diagnostic information. 

Also, to clarify what I mean by random spikes in my global wait queue, here is a graph of my inactive vs. waiting passenger processes across six hosts. Normally there should be between 15 and 20 inactive processes. When the wait queue gets to 60, requests are dropped. You can see these weird random spikes happening, and when I look at membase I see that it's doing disk writes. 

Thank you,

Eric

 

 

Tue, 02/15/2011 - 12:39
esilverberg
Offline
Joined: 01/03/2011
Groups: None

NOTE: I have started a new thread because the title of this thread is no longer directly relevant to the issue I am seeing.

 A bit more data: I wrote some code to monitor the average latency experienced in each of my three components across all six of my front-end machines. I measure the time to connect to MySQL and run one query; the time to connect to Redis and run a ping request; and the time to connect to memcached (moxi is running locally, FYI) and run a single get request for an object that I know is not present in the cache. 
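For readers who want to reproduce the memcached part of this measurement, here is a minimal sketch in Python. It is not the code used in this post: the function name, the probe key, and the timeout are illustrative assumptions. It relies only on the memcached text protocol, in which a `get` for a missing key is answered with `END`.

```python
import socket
import time


def time_memcached_miss(host, port, key="latency-probe-missing-key", timeout=1.0):
    """Return the seconds taken to connect and run one `get` for an absent key."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # memcached text protocol: a miss is answered with just "END\r\n".
        sock.sendall(f"get {key}\r\n".encode("ascii"))
        reply = b""
        while not reply.endswith(b"END\r\n"):
            chunk = sock.recv(4096)
            if not chunk:
                raise ConnectionError("server closed the connection early")
            reply += chunk
    return time.perf_counter() - start
```

Calling `time_memcached_miss("127.0.0.1", 11211)` from each front-end box at a fixed interval (assuming moxi or memcached listens locally on port 11211) and graphing the results would give a latency series comparable to the one described here. A new connection is opened for every probe on purpose, so connect time is included in the measurement.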

Check out this graph.

Note the choppiness in the memcached connect time, which seemed to correspond to a spike in my global wait queue. 

How can I debug this further to get to the bottom of this?

Thanks,

Eric

Tue, 02/15/2011 - 12:41
perry
Offline
Joined: 10/11/2010
Groups:

 Looking at those stats now...

 

At a high level, you can see via the UI stats that you've got a "relatively" low cache miss ratio which means you are pulling some active data from disk.  This is also evidenced by the 94.7% Resident item ratio which means that 94.7% of your dataset is "resident" in memory.   If you want to have your entire active dataset in memory, I think you're going to need just a little bit more RAM to do so.  The idea is that you have your "working set" live in memory and the rest spill over to disk.  In your case, you "may" want your whole dataset to live in RAM although if your performance isn't being impacted too badly by these spikes it might be reasonable to leave it as-is.  It doesn't seem like you've got a lot of disk fetches going on all the time which means most of your activity is being served quite well from RAM.

 

If you read through this page: http://wiki.membase.org/display/membase/Growing+Data+Sets+Beyond+Memory,  it explains our high and low water marks a little bit better and your stats "seem" to indicate that everything is working properly.  Once your memory goes above ~2.9GB (the high water mark) Membase will start ejecting data to make room for more.  The extra 1GB of RAM "can" be adjusted though we find it's best to leave a little headroom in there for spikes in traffic, etc.

 

I do see a recent spike in your disk write queue which was likely caused by replication being caught up after the upgrade.  That probably contributed to the memory growing just a bit over the high water mark and causing some RAM ejections.

 

Does that make sense, Eric?  Anything not "jiving" with you?

 

Perry

Tue, 02/15/2011 - 12:44
esilverberg
Offline
Joined: 01/03/2011
Groups: None

 *Thank you* for taking a look at these stats - this all makes sense. I have started a new thread that documents the core issue I am experiencing, random spikes in my global wait queue - which I believe is being caused by membase per the graphs I have attached. I thought it might be because of the disk activity, but that may be a red herring, and am not sure where to go next. If you have any thoughts about why I may be experiencing these random delays, could you please reply (to the new thread I started, since that will be more easily found by Google if/when other people experience this same issue).

Many thanks,

Eric

Tue, 05/17/2011 - 17:08
RV
Offline
Joined: 05/17/2011
Groups: None

Hi
I am seeing similar behavior. I have 2 servers in the cluster and about 24 million records. I have allocated 4.6 GB per server for a total of about 9.4GB. In the cluster overview it says Total Allocated is 9.38GB and in use is 6.82GB, so there is plenty left over. RAM used for the bucket shows 72.7%.
I am updating a bunch of values using the replace function. The load on the membase server is at 6-8 and Disk fetches are in the hundreds. Ops/sec is about 2500. Since the entire data is in memory and I still have memory to spare, I don't understand why there are so many disk fetches.
I am using membase server 1.6.5.3

Thanks
RV

Tue, 05/17/2011 - 17:19
esilverberg
Offline
Joined: 01/03/2011
Groups: None

Hi,

If it helps you any, I can tell you my understanding of membase was fundamentally wrong when I wrote this question. I thought I was getting a distributed memcached. In fact membase is a distributed persistent key/value store. Once you start to think of membase as a database, not as memcached, it makes sense that some subset of its operations requires disk activity for persistence.

I ended up ripping out membase and installing memcached directly on several boxes, and at the application level I write to multiple identical memcached instances to achieve resilience in the face of machine failure.
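The application-level approach described above can be sketched roughly as follows. This is an illustration in Python, not the code from this post: the class names and the `get`/`set` backend interface are assumptions, and `DictBackend` is only an in-memory stand-in for a real memcached client.

```python
class ReplicatedCache:
    """Write every key to all backends; read from the first that answers."""

    def __init__(self, backends):
        # Each backend is any object with get(key) and set(key, value),
        # e.g. one memcached client per box (interface assumed here).
        self.backends = list(backends)

    def set(self, key, value):
        stored = 0
        for backend in self.backends:
            try:
                backend.set(key, value)
                stored += 1
            except Exception:
                # A dead box must not fail the whole write; skip it.
                continue
        return stored  # number of replicas that accepted the write

    def get(self, key):
        for backend in self.backends:
            try:
                value = backend.get(key)
            except Exception:
                continue  # try the next replica
            if value is not None:
                return value
        return None


class DictBackend:
    """In-memory stand-in for a memcached client, for demonstration only."""

    def __init__(self):
        self.data = {}
        self.alive = True

    def set(self, key, value):
        if not self.alive:
            raise ConnectionError("backend down")
        self.data[key] = value

    def get(self, key):
        if not self.alive:
            raise ConnectionError("backend down")
        return self.data.get(key)
```

The trade-off of this design is that every write costs one round trip per instance, and a box that comes back after a failure starts empty, so reads fall through to the other replicas until it is repopulated.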

Good luck,
Eric

Tue, 05/17/2011 - 18:58
RV
Offline
Joined: 05/17/2011
Groups: None

Thanks Eric, that does make sense. However, I was seeing 0 disk fetches when the data was first being inserted and tons of them on the updates. I would think the updates would fetch from memory and then write updates to disk, just like the inserts, so I don't understand why there are so many disk fetches. Also, I see different behavior in server load when I use the set function vs. replace for the updates. Any thoughts?
I think we are prepared for the overhead of writes to disk with membase but I just want to be sure that it behaves in a more or less predictable manner.
Thanks
Riya

Wed, 05/18/2011 - 11:20
perry
Offline
Joined: 10/11/2010
Groups:

Riya, it sounds like at least some of your data has been pushed to disk. I'll have to investigate a bit more why you would see different behavior with 'set' versus 'replace' but it might also be good for you to review this page: http://techzone.couchbase.com/products/membase/1-7-beta

Basically, Membase purposefully leaves some headroom (by default it starts ejecting data at about 75% of the RAM quota) to handle spikes and to leave some RAM available for rebalancing. Even though you may have RAM left over, Membase will still have ejected some data to disk, which is likely why you are seeing those disk fetches.
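As a rough worked example, the figures quoted earlier in the thread (9.38GB allocated, 6.82GB in use) can be checked against the roughly 75% default mentioned here. The sketch below is in Python; the function names are illustrative, and treating 75% of the quota as the point where ejection begins is an assumption taken from this reply, not a documented constant.

```python
def ram_usage_ratio(used_gb, quota_gb):
    """Fraction of the RAM quota currently in use."""
    return used_gb / quota_gb


def headroom_before_ejection_gb(used_gb, quota_gb, high_water_fraction=0.75):
    """GB left before usage reaches the assumed high water mark.

    A negative result means usage is already past the mark.
    """
    return quota_gb * high_water_fraction - used_gb


# Figures from the cluster overview quoted earlier in the thread.
quota_gb = 9.38
used_gb = 6.82

ratio = ram_usage_ratio(used_gb, quota_gb)
headroom = headroom_before_ejection_gb(used_gb, quota_gb)
print(f"in use: {ratio:.1%} of quota")
print(f"room before the 75% mark: {headroom:.1f} GB")
```

The ratio matches the 72.7% shown in the console, which puts the cluster only a couple of hundred megabytes under the assumed mark. These are cluster-wide totals, so an individual node may sit closer to its own threshold (or over it) than the aggregate suggests.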

We are working on some features that will allow us to use our RAM more efficiently, but that is how the system is currently designed.

Perry

Wed, 05/18/2011 - 11:34
RV
Offline
Joined: 05/17/2011
Groups: None

Thanks for the information, Perry. I'll give it more memory and see if I have fewer disk fetches. I will also look into the 1.7 version.

On the topic of adding memory, I tried adding memory to one of the servers in the cluster by removing it from the cluster and trying to add it back in with more memory, but of course it picks up the memory quota of the servers already in the cluster, so I cannot increase it. I was trying to increase memory without losing data, that is, without dropping the entire cluster and bucket and reloading data. Is there a way to do this? I want to be able to add memory to one server, put it back in the cluster, replicate the data over, then upgrade the next server, and so on, so that I don't have any downtime.
Thanks again
Riya

Wed, 05/18/2011 - 11:55
perry
Offline
Joined: 10/11/2010
Groups:

Riya, that's definitely the best way to do this. Once you've got all nodes with an increased amount of memory, you can raise the quota using our commandline:
"/opt/membase/bin/cli/membase cluster-init -c localhost:8091 -u -p --cluster-init-ramsize="

Perry
