[MB-8199] [Doc'd] many concurrent view requests cause excessive resource consumption and even crash Created: 04/May/13  Updated: 09/Aug/13  Resolved: 09/Aug/13

Status: Closed
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.0.1, 2.1.0
Fix Version/s: 2.2.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Matt Ingenthron Assignee: kzeller
Resolution: Fixed Votes: 0
Labels: customer
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 4-core CPU, 16GB RAM, Linux

Operating System: Centos 64-bit

In response to many view requests against the scatter/gather view merger, a node can allocate so many resources that it will fail to recover.

In one case, this caused many timeouts in the log, leading to reached_max_restart_intensity:
=========================SUPERVISOR REPORT=========================
     Supervisor: {local,ns_node_disco_sup}
     Context: shutdown
     Reason: reached_max_restart_intensity
     Offender: [{pid,<0.17237.774>},

Comment by Matt Ingenthron [ 04/May/13 ]
Note, I put this on 2.0.2 since I know it shouldn't be 2.1 and there does not appear to be a 2.0.3. I feared it would be lost if it didn't have a fixfor version. Please move as appropriate.
Comment by Maria McDuff (Inactive) [ 07/May/13 ]
per bug scrub, alk - can you chk if aleksey a. can take a look at this?
Comment by Aleksey Kondratenko [ 07/May/13 ]
We know about this problem, so I don't believe we need to investigate it again.

Fixing it for 2.0.2 feels a bit late but possible if really needed
Comment by Dipti Borkar [ 07/May/13 ]
When you say, "we know this problem" can you elaborate on it a bit more? With more customers using views, they are likely to hit this as well.
Can you help us understand the scenario a bit more? When can this problem happen? What is the probability of hitting it?
Comment by Aleksey Kondratenko [ 07/May/13 ]
If you send too many view requests to any node, they'll swamp it and kill it. I recall seeing that during pre-2.0 testing, and there must be an MB- somewhere.
Comment by Maria McDuff (Inactive) [ 09/May/13 ]
per bug triage, upgrading to blocker.
the fix is to throttle the requests and not to crash/terminate.
it's fine to be slow but not crash.
alk k to take a look for 2.0.2
Comment by Aliaksey Artamonau [ 16/May/13 ]
We merged a simple request throttler that can be configured via internal settings: http://review.couchbase.org/26334.
Comment by Aleksey Kondratenko [ 16/May/13 ]
It should also be noted that, since we don't have experience with how well this approach works in production, we decided to make "unlimited" the default for all limits.

We can try playing with that stuff in-house plus get some experience with customers after 2.0.2 is out and then we'll have enough data to enable it by default and set right limits.
Comment by Aleksey Kondratenko [ 16/May/13 ]
CHANGES text is here: http://review.couchbase.org/#/c/26361/2/CHANGES,unified
Comment by Matt Ingenthron [ 16/May/13 ]
Alk: we should request QE to develop a test for this. See it cause the problem in 2.0.1 and see it not cause the problem in 2.0.2, right? Assigning it to Maria for that purpose, then it should be closed perhaps when verified? Not sure what QE's process is here now.
Comment by Matt Ingenthron [ 16/May/13 ]
Maria: Can you work with the team on the appropriate way to test that this is fixed and won't cause other problems?
Comment by Maria McDuff (Inactive) [ 17/May/13 ]

pls verify by:
-instrumenting a test that sends many view requests. do manual first then automate (if you already have a test that does similar test scenario such as this, just tweak that and use it here for this verification testing).
-verifying no crashes happen. if you observe slowness, note it here. slowness is ok.
-noting alk k's "unlimited" dflt limit set. verify all his changes on review link.
-using stable build of 2.0.2 which should be built tonight or tomorrow.
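
A minimal sketch of the kind of load driver the steps above describe, written in Python. It is an illustration, not QE's actual test: the view URL (bucket, design document, and view names) is a placeholder, and `fetch_status` and `hammer` are hypothetical helper names. It fires many concurrent view requests and tallies the status codes, so a run can be checked for 200s, 503s from the throttler, or connection failures.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from urllib.error import HTTPError
from urllib.request import urlopen

# Placeholder view URL: the bucket, design document, and view names
# are assumptions and must be replaced with ones that exist.
VIEW_URL = "http://127.0.0.1:8092/default/_design/dev_test/_view/by_id?limit=10"


def fetch_status(url, timeout=30):
    """Return the HTTP status code for one request, or "error" on failure."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status
    except HTTPError as err:
        # A throttled node answers 503; record it instead of raising.
        return err.code
    except OSError:
        # Connection refused, reset, or timed out.
        return "error"


def hammer(url, total=1000, concurrency=200, fetch=fetch_status):
    """Send `total` requests, up to `concurrency` at a time, and tally results."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return Counter(pool.map(fetch, [url] * total))
```

Calling `hammer(VIEW_URL)` against a test node returns a `Counter` keyed by status code. The `fetch` parameter exists so the tally logic can be exercised without a live cluster.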
Comment by Dipti Borkar [ 17/May/13 ]
We also need to document this.

* (MB-8199) REST and CAPI request throttler implemented.

  Its behavior is controlled by three parameters, which can be set via
  the /internalSettings REST endpoint:

  - restRequestLimit

    Maximum number of simultaneous connections each node should
    accept on the REST port. Diagnostics-related endpoints and
    /internalSettings are not counted.

  - capiRequestLimit

    Maximum number of simultaneous connections each node should
    accept on the CAPI port. It should be noted that this includes
    XDCR connections.

  - dropRequestMemoryThresholdMiB

    The amount of memory used by the Erlang VM that should not be
    exceeded. If it is exceeded, the server will start dropping
    incoming connections.

  When the server decides to reject an incoming connection because
  some limit was exceeded, it does so by responding with a status code
  of 503 and a Retry-After header set appropriately (more or less). On
  the REST port, a textual description of why the request was rejected
  is returned in the body. On the CAPI port, in CouchDB tradition, a
  JSON object is returned with "error" and "reason" fields.

  By default all the thresholds are set to be unlimited.
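
A hedged Python sketch of how the three parameters and the 503 behavior described above might be used from a client. The parameter names and the /internalSettings endpoint come from the CHANGES text; the rest is assumption: that the endpoint accepts a form-encoded POST with administrator basic auth on port 8091, and the helper names `build_limits_request` and `retry_delay_seconds` are hypothetical. The first function only builds the request and does not send it.

```python
import base64
from urllib.parse import urlencode
from urllib.request import Request


def build_limits_request(host, user, password, rest_limit, capi_limit, memory_mib):
    """Build (but do not send) a POST to /internalSettings with the three limits."""
    body = urlencode({
        "restRequestLimit": rest_limit,
        "capiRequestLimit": capi_limit,
        "dropRequestMemoryThresholdMiB": memory_mib,
    }).encode("ascii")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    # Assumption: REST port 8091 and a form-encoded POST body.
    req = Request(f"http://{host}:8091/internalSettings", data=body, method="POST")
    req.add_header("Authorization", f"Basic {token}")
    return req


def retry_delay_seconds(status, headers, default=1.0):
    """Return how long to wait before retrying, or None if no retry is needed."""
    if status != 503:
        return None
    try:
        # Retry-After given as a number of seconds.
        return max(0.0, float(headers.get("Retry-After", "")))
    except ValueError:
        # Header missing or given as an HTTP date; fall back to a default.
        return default
```

Passing the built request to `urllib.request.urlopen` would apply the limits. A client that receives a 503 can sleep for `retry_delay_seconds(...)` before retrying instead of immediately resending.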
Comment by kzeller [ 27/Jun/13 ]
Comment by Perry Krug [ 28/Jun/13 ]
Has QE verified that this does in fact solve the problem?
Comment by Perry Krug [ 28/Jun/13 ]
Karen, just one thing:
-[FIXED] The release notes link on page 352 points to "Adjusting Rebalance during Compaction" but should be "8.8.1. Limiting Simultaneous Node Requests" right?
Comment by Aleksey Kondratenko [ 28/Jun/13 ]
Renamed the ticket's subject to more accurately reflect its nature, i.e. this is not, strictly speaking, a leak.
Comment by kzeller [ 01/Jul/13 ]
Fixed link: In the past, too many simultaneous view requests could overwhelm a node.
You can now limit the number of simultaneous requests a node can receive. For
more information, see the REST API, <xref linkend="couchbase-restapi-request-limits" />.

removing labeling until relevant for 2.2
Comment by Chiyoung Seo [ 09/Aug/13 ]

Please close it if it is already resolved.
Generated at Mon Nov 24 16:02:04 CST 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.