Couchbase
  • Why NoSQL?
  • Couchbase Server
  • Download
  • Resources
  • Careers
Home | Forums | Membase | Membase Server 1.7.x

Question about ep_tmp_oom_errors and failed writes

7 replies [Last post]
  • Login or register to post comments
Tue, 08/02/2011 - 12:44
loopforever
Offline
Joined: 07/14/2011
Groups: None

Hi all,

I am frequently running into issues "bulk" loading data into my membase cluster. I am fairly certain this behavior is mostly expected (from the wiki article "consequences of memory faster than disk"), so I'm cool with that. On my production hardware I can throw faster disks at the problem to mitigate the SERVER_ERROR / "temporary failure" exceptions that seem to get thrown when the disk write queue gets too large.

However, I have encountered a scenario which I took some screenshots of here:

http://imageshack.us/g/825/writefailures1.png/

One screenshot is of the monitor GUI for the bucket, the other is output produced by my client, which should be fairly self explanatory.

Basically, the disk write queue is empty, and every time I attempt to "set" an object, ep_tmp_oom_errors increases by 1. My client tries to gracefully back-off (up to 10 times) and waits longer between each attempt, but eventually, if the I get a server error / "temporary failure" 10 times, I give up on the write.

I also managed to collect mb_collectinfo output from all 3 nodes while this was happening if that is useful. I don't really have a complete enough understanding of membase to really understand what's going on under the covers that causes this to happen.

- Matt

Top
  • Login or register to post comments
Tue, 08/02/2011 - 12:49
loopforever
Offline
Joined: 07/14/2011
Groups: None

I should add, there seems to be items stuck in the TAP replication queue (ep_tap_replica_qlen) that just sit there for a really, really long time.

Top
  • Login or register to post comments
Thu, 08/04/2011 - 11:12
perry
Offline
Joined: 10/11/2010
Groups:

At a first glance, it appears that you don't have enough RAM in general. I'm not sure why we're not ejecting items that have already been written though.

Can you send the logs to perry@couchbase.com and also try to increase the RAM available (either by increasing the quota or adding another node) and retry?

Perry

__________________

Forum support is great for free but sometimes you need a guaranteed response time and dedicated resources for your questions or issues.
Consider purchasing enterprise-level support from Couchbase: http://www.couchbase.com/products-and-services/overview
Call or email "sales -at- couchbase-dot- com" today!

Top
  • Login or register to post comments
Thu, 08/04/2011 - 12:23
loopforever
Offline
Joined: 07/14/2011
Groups: None

Perry, I've emailed you the logs.

I increased the RAM per node from 512MB to 796MB and I was still able to reproduce this.

Let me know if there's any other details I can provide. Thanks!

Top
  • Login or register to post comments
Fri, 08/05/2011 - 10:57
perry
Offline
Joined: 10/11/2010
Groups:

According to the logs, 01 still has a number of items in its disk write queue which may be affecting our ability to 'eject' items in order to free up more RAM.

If you let the queue reach 0, are you able to put more data in?

If not, could you also look at the output of this command across all servers and see which one is increasing it's value when you get those errors on the client:
"/opt/membase/bin/mbstats :11210 all | grep tmp_oom"

You can also look in the UI for each server by going into the bucket screen, finding the tmp_oom stat and then clicking the little blue arrow next to it which will let you see that stat across all servers.

Perry

__________________

Forum support is great for free but sometimes you need a guaranteed response time and dedicated resources for your questions or issues.
Consider purchasing enterprise-level support from Couchbase: http://www.couchbase.com/products-and-services/overview
Call or email "sales -at- couchbase-dot- com" today!

Top
  • Login or register to post comments
Tue, 08/09/2011 - 07:39
loopforever
Offline
Joined: 07/14/2011
Groups: None

Thanks for the info Perry.

When the disk write queue is zero (nothing pending for any node). The writes still fail.

In my case, the ep_tmp_oom_errors value increases on all 3 nodes .

Basically, if cmd_set increases by 1, so too does ep_tmp_oom_errors. That is to say, every attempted write is associated with a "temp OOM error".

Cluster setup is:

3 nodes
512MB membased bucket
2 replicas

Just to show you some output from my script, writing 50000 32KB objects:

[msavona@unixdeva15:~/membase_scripts]$ ruby load 50000@32
Objects: 32KB x 50000 = 1562MB
SHA1:    abfbc93c32d4dfd737e570c2a98d69100f165479
---
Checkpoint: 5000
Checkpoint: 10000
Server is overloaded (at key 14023, ~438MB stored already), holding off for 5s (#1).
Server is overloaded (at key 14023, ~438MB stored already), holding off for 10s (#2).
Server is overloaded (at key 14027, ~438MB stored already), holding off for 5s (#1).
Server is overloaded (at key 14027, ~438MB stored already), holding off for 10s (#2).
Server is overloaded (at key 14027, ~438MB stored already), holding off for 15s (#3).
Checkpoint: 15000
Server is overloaded (at key 19357, ~604MB stored already), holding off for 5s (#1).
Server is overloaded (at key 19357, ~604MB stored already), holding off for 10s (#2).
Checkpoint: 20000
Server is overloaded (at key 22845, ~713MB stored already), holding off for 5s (#1).
Server is overloaded (at key 23162, ~723MB stored already), holding off for 5s (#1).
Server is overloaded (at key 24258, ~758MB stored already), holding off for 5s (#1).
Server is overloaded (at key 24258, ~758MB stored already), holding off for 10s (#2).
Checkpoint: 25000
Server is overloaded (at key 25656, ~801MB stored already), holding off for 5s (#1).
Server is overloaded (at key 25656, ~801MB stored already), holding off for 10s (#2).
Server is overloaded (at key 26331, ~822MB stored already), holding off for 5s (#1).
Server is overloaded (at key 26660, ~833MB stored already), holding off for 5s (#1).
Server is overloaded (at key 26660, ~833MB stored already), holding off for 10s (#2).
Server is overloaded (at key 26660, ~833MB stored already), holding off for 15s (#3).
Server is overloaded (at key 26670, ~833MB stored already), holding off for 5s (#1).
Server is overloaded (at key 26670, ~833MB stored already), holding off for 10s (#2).
Server is overloaded (at key 26670, ~833MB stored already), holding off for 15s (#3).
Server is overloaded (at key 26670, ~833MB stored already), holding off for 20s (#4).
Server is overloaded (at key 26671, ~833MB stored already), holding off for 5s (#1).
Server is overloaded (at key 26671, ~833MB stored already), holding off for 10s (#2).
Server is overloaded (at key 26672, ~833MB stored already), holding off for 5s (#1).
Server is overloaded (at key 26680, ~833MB stored already), holding off for 5s (#1).
Server is overloaded (at key 26680, ~833MB stored already), holding off for 10s (#2).
Server is overloaded (at key 26680, ~833MB stored already), holding off for 15s (#3).
Server is overloaded (at key 26682, ~833MB stored already), holding off for 5s (#1).
Server is overloaded (at key 26682, ~833MB stored already), holding off for 10s (#2).
Server is overloaded (at key 26682, ~833MB stored already), holding off for 15s (#3).
Server is overloaded (at key 26682, ~833MB stored already), holding off for 20s (#4).
Server is overloaded (at key 26683, ~833MB stored already), holding off for 5s (#1).
Server is overloaded (at key 26683, ~833MB stored already), holding off for 10s (#2).
Server is overloaded (at key 26683, ~833MB stored already), holding off for 15s (#3).
Server is overloaded (at key 26683, ~833MB stored already), holding off for 20s (#4).
Server is overloaded (at key 26683, ~833MB stored already), holding off for 25s (#5).
Server is overloaded (at key 26683, ~833MB stored already), holding off for 30s (#6).
Server is overloaded (at key 26683, ~833MB stored already), holding off for 35s (#7).
Server is overloaded (at key 26683, ~833MB stored already), holding off for 40s (#8).
Server is overloaded (at key 26683, ~833MB stored already), holding off for 45s (#9).

Top
  • Login or register to post comments
Mon, 08/15/2011 - 15:32
perry
Offline
Joined: 10/11/2010
Groups:

From the output, it looks like some data is definitely getting in there, but then stops at some point (likely when the memory is totally full). This is further supported by the fact that you can add more memory and it will continue.

I'm suspecting that tmp_oom actually means hard oom in this case and that there's a bug with which error we're sending back, I'll have to reproduce that here.

Can you start a clean test and send me the stats output again when it first starts erroring?

Thanks

Perry

__________________

Forum support is great for free but sometimes you need a guaranteed response time and dedicated resources for your questions or issues.
Consider purchasing enterprise-level support from Couchbase: http://www.couchbase.com/products-and-services/overview
Call or email "sales -at- couchbase-dot- com" today!

Top
  • Login or register to post comments
Mon, 08/15/2011 - 15:35
perry
Offline
Joined: 10/11/2010
Groups:

Hmm, another problem might be that you're changing the data too quickly for us to "eject" it to disk in order to make space for more. I see a pretty high disk write queue on some of the nodes, and there are still active and replica items "resident" in memory which means that we could reclaim space if we could eject them...but we can't.

Can you send me the stats output after the disk write queue is 0 but you are still getting tmp_oom?

__________________

Forum support is great for free but sometimes you need a guaranteed response time and dedicated resources for your questions or issues.
Consider purchasing enterprise-level support from Couchbase: http://www.couchbase.com/products-and-services/overview
Call or email "sales -at- couchbase-dot- com" today!

Top
  • Login or register to post comments
  • Login or register to post comments
  • Login
  • Register

Company

  • About Us
  • Leadership
  • Customers
  • Partners
  • Contact Us

Product

  • Couchbase Server
  • Couchbase SDKs
  • Use Cases
  • Documentation
  • Forums

Open Source

  • Couchbase Project
  • Couchbase vs. CouchDB

Commercial

  • Subscriptions & Support
  • Training & Services

News

  • Blog
  • Newsletter
  • Press Releases
  • Buzz

Follow Us

    
  • Customer Login
  • Terms of Service
  • Privacy Policy
  • Trademark Policy
  • Site Map

© 2013 COUCHBASE All rights reserved.

Sign in to Couchbase Community

close
  • Create new account
  • Request new password
You are logging into the Forums, Wiki and Issue Tracker