Question about ep_tmp_oom_errors and failed writes
Hi all,
I am frequently running into issues "bulk" loading data into my membase cluster. I am fairly certain this behavior is mostly expected (from the wiki article "consequences of memory faster than disk"), so I'm cool with that. On my production hardware I can throw faster disks at the problem to mitigate the SERVER_ERROR / "temporary failure" exceptions that seem to get thrown when the disk write queue gets too large.
However, I have encountered a scenario which I took some screenshots of here:
http://imageshack.us/g/825/writefailures1.png/
One screenshot is of the monitor GUI for the bucket, the other is output produced by my client, which should be fairly self explanatory.
Basically, the disk write queue is empty, and every time I attempt to "set" an object, ep_tmp_oom_errors increases by 1. My client tries to gracefully back-off (up to 10 times) and waits longer between each attempt, but eventually, if the I get a server error / "temporary failure" 10 times, I give up on the write.
I also managed to collect mb_collectinfo output from all 3 nodes while this was happening if that is useful. I don't really have a complete enough understanding of membase to really understand what's going on under the covers that causes this to happen.
- Matt
At a first glance, it appears that you don't have enough RAM in general. I'm not sure why we're not ejecting items that have already been written though.
Can you send the logs to perry@couchbase.com and also try to increase the RAM available (either by increasing the quota or adding another node) and retry?
Perry
Perry, I've emailed you the logs.
I increased the RAM per node from 512MB to 796MB and I was still able to reproduce this.
Let me know if there's any other details I can provide. Thanks!
According to the logs, 01 still has a number of items in its disk write queue which may be affecting our ability to 'eject' items in order to free up more RAM.
If you let the queue reach 0, are you able to put more data in?
If not, could you also look at the output of this command across all servers and see which one is increasing it's value when you get those errors on the client:
"/opt/membase/bin/mbstats :11210 all | grep tmp_oom"
You can also look in the UI for each server by going into the bucket screen, finding the tmp_oom stat and then clicking the little blue arrow next to it which will let you see that stat across all servers.
Perry
Thanks for the info Perry.
When the disk write queue is zero (nothing pending for any node). The writes still fail.
In my case, the ep_tmp_oom_errors value increases on all 3 nodes .
Basically, if cmd_set increases by 1, so too does ep_tmp_oom_errors. That is to say, every attempted write is associated with a "temp OOM error".
Cluster setup is:
3 nodes
512MB membased bucket
2 replicas
Just to show you some output from my script, writing 50000 32KB objects:
[msavona@unixdeva15:~/membase_scripts]$ ruby load 50000@32 Objects: 32KB x 50000 = 1562MB SHA1: abfbc93c32d4dfd737e570c2a98d69100f165479 --- Checkpoint: 5000 Checkpoint: 10000 Server is overloaded (at key 14023, ~438MB stored already), holding off for 5s (#1). Server is overloaded (at key 14023, ~438MB stored already), holding off for 10s (#2). Server is overloaded (at key 14027, ~438MB stored already), holding off for 5s (#1). Server is overloaded (at key 14027, ~438MB stored already), holding off for 10s (#2). Server is overloaded (at key 14027, ~438MB stored already), holding off for 15s (#3). Checkpoint: 15000 Server is overloaded (at key 19357, ~604MB stored already), holding off for 5s (#1). Server is overloaded (at key 19357, ~604MB stored already), holding off for 10s (#2). Checkpoint: 20000 Server is overloaded (at key 22845, ~713MB stored already), holding off for 5s (#1). Server is overloaded (at key 23162, ~723MB stored already), holding off for 5s (#1). Server is overloaded (at key 24258, ~758MB stored already), holding off for 5s (#1). Server is overloaded (at key 24258, ~758MB stored already), holding off for 10s (#2). Checkpoint: 25000 Server is overloaded (at key 25656, ~801MB stored already), holding off for 5s (#1). Server is overloaded (at key 25656, ~801MB stored already), holding off for 10s (#2). Server is overloaded (at key 26331, ~822MB stored already), holding off for 5s (#1). Server is overloaded (at key 26660, ~833MB stored already), holding off for 5s (#1). Server is overloaded (at key 26660, ~833MB stored already), holding off for 10s (#2). Server is overloaded (at key 26660, ~833MB stored already), holding off for 15s (#3). Server is overloaded (at key 26670, ~833MB stored already), holding off for 5s (#1). Server is overloaded (at key 26670, ~833MB stored already), holding off for 10s (#2). Server is overloaded (at key 26670, ~833MB stored already), holding off for 15s (#3). Server is overloaded (at key 26670, ~833MB stored already), holding off for 20s (#4). Server is overloaded (at key 26671, ~833MB stored already), holding off for 5s (#1). Server is overloaded (at key 26671, ~833MB stored already), holding off for 10s (#2). Server is overloaded (at key 26672, ~833MB stored already), holding off for 5s (#1). Server is overloaded (at key 26680, ~833MB stored already), holding off for 5s (#1). Server is overloaded (at key 26680, ~833MB stored already), holding off for 10s (#2). Server is overloaded (at key 26680, ~833MB stored already), holding off for 15s (#3). Server is overloaded (at key 26682, ~833MB stored already), holding off for 5s (#1). Server is overloaded (at key 26682, ~833MB stored already), holding off for 10s (#2). Server is overloaded (at key 26682, ~833MB stored already), holding off for 15s (#3). Server is overloaded (at key 26682, ~833MB stored already), holding off for 20s (#4). Server is overloaded (at key 26683, ~833MB stored already), holding off for 5s (#1). Server is overloaded (at key 26683, ~833MB stored already), holding off for 10s (#2). Server is overloaded (at key 26683, ~833MB stored already), holding off for 15s (#3). Server is overloaded (at key 26683, ~833MB stored already), holding off for 20s (#4). Server is overloaded (at key 26683, ~833MB stored already), holding off for 25s (#5). Server is overloaded (at key 26683, ~833MB stored already), holding off for 30s (#6). Server is overloaded (at key 26683, ~833MB stored already), holding off for 35s (#7). Server is overloaded (at key 26683, ~833MB stored already), holding off for 40s (#8). Server is overloaded (at key 26683, ~833MB stored already), holding off for 45s (#9).
From the output, it looks like some data is definitely getting in there, but then stops at some point (likely when the memory is totally full). This is further supported by the fact that you can add more memory and it will continue.
I'm suspecting that tmp_oom actually means hard oom in this case and that there's a bug with which error we're sending back, I'll have to reproduce that here.
Can you start a clean test and send me the stats output again when it first starts erroring?
Thanks
Perry
Hmm, another problem might be that you're changing the data too quickly for us to "eject" it to disk in order to make space for more. I see a pretty high disk write queue on some of the nodes, and there are still active and replica items "resident" in memory which means that we could reclaim space if we could eject them...but we can't.
Can you send me the stats output after the disk write queue is 0 but you are still getting tmp_oom?
I should add, there seems to be items stuck in the TAP replication queue (ep_tap_replica_qlen) that just sit there for a really, really long time.