beam.smp 120% CPU usage?
I have CouchBase running on 3 large instances at Amazon, and beam.smp seems to be pegging the CPU.
The cluster is not getting a lot of traffic right now, roughly 250 ops/s, and 8k items.
Is there any reason why beam.smp would be using so much CPU?
On which release are you running?
Is there anything interesting in the logs?
We may need to get you to gather some more data. This isn't something we'd expect, but we have observed it without getting much data on it in some of the developer previews.
I'm running developer preview 3. There's nothing interesting in the logs. I did experience two node failures yesterday (the underlying EC2 instance crashed), but that's unrelated to the CPU load (the problem was the same before the crash and after I restored the machine). I've even attempted gracefully removing every node and re-adding it.
The CPU usage correlates with activity, that is, the average goes up with the requests. The CPU itself is very bursty. It's not "pegged" as I initially described it (it just looked that way in top at first).
Right now the hourly view of the sole bucket shows the cluster is doing about:
700 ops/s
38 updates/s
586 gets/s
43 sets/s
There's about 20K items, and the disk write queue is about 29.
Looking at the minutely view, gets do spike up around 2.5k/s, updates over 200/s, sets over 200/s, and disk write queue up to 100.
This is translating into CPU consumption of 30 to 40% with top set to a 10 second delay (looking at %us only, %wa is showing significant io delay to the ephemeral storage).
At peak time, which will hit in a few hours, the numbers all triple. Looking at the daily graph for yesterday's peak:
1.25k ops/s
45 updates/s
1.25k gets/s
85 sets/s
120 disk queue
This results in the CPU usage being up to 160% (around 80% of both cores) on each of the 3 Large EC2 instances.
Is just that more hardware is needed?
I'd give you the Diagnostic Report, but it's around 100 MB.
Hey Mark, wanted to jump in here as well and understand the usage a bit more.
-What client language and library version are you using?
-Any views created? If so, how many and what do they look like (don't need specifics, just map and/or reduce?)
-Also, does the CPU go down (not to 0, but near there) when the load is off or down?
-You should be able to compress the diagnostic report quite significantly before sending. Also, you should be able to run "/opt/couchbase/bin/cbcollect_info " which will produce an already zipped file with a variety of logging and stats info...we'd be curious to look at that as well if possible.
Perry
I'm using the memcached PHP library included in Ubuntu Oneiric.
There are no views.
The CPU does go down when load goes down.
Where should I send the output of cbcollect_info? I'll run that when I'm in the office tomorrow.
You can send to perry@couchbase.com.
Given that the CPU seems directly related to the load, this is likely connected to some known issues currently with our JSON parsing and disk writing processes. We're planning on addressing them with future DP releases...well before GA.
We're still quite interested in seeing the results so we can confirm, but I think we've got a pretty good handle on it.
Even with high CPU, we've seen that the application-side performance is actually quite good...does that still track with what you're seeing?
Any other comments or feedback regarding the release you'd like me to share with the team?
Thanks Mark!
Perry
If it does help, `top` showing threads shows it's exactly the same two threads using most of the CPU until I restart CouchBase.
Things seem responsive on the application side. I haven't experienced any problems.
I do seem to experience about one node crash per day (always the same one, happens at random times in the evening). It restarts and reconnects to the cluster. I'd love it if I wouldn't have to manually re-add the node and rebalance the cluster. Instead of having to babysit an option to automatically rejoin and rebalance after a period of time (say a few minutes) would be amazing. I'm sure the CPU usage issues will be worked out, but not having automatic full recovery is the number one reason why I wouldn't continue to use CouchBase. Galera already does it, and it works beautifully after a EC2 instance failure.
Also, I'd like a better process for running CouchBase with changing IPs/in the cloud. Yes, I've read the bit about running in EC2, but it's so much more complicated than the wonderfully simple basic install and configuration. I'd put the priority on automatic full recovery though.
Mark, let's dig into your auto recovery a bit more. If you don't fail the node over, it will come back online and automatically rejoin the cluster when ready. The data on that node will be temporarily unavailable while that's happening, but if you're okay with erroring on some requests for a brief period of time, that should work (and perhaps you want to turn off auto-failover if it's on?)
On the other hand, if the node does get failed over, I firmly believe that it is in the best interest of the user to have the rejoin+rebalance be administratively controlled. There are enough situations (both in EC2 and outside) where the software could easily make the "wrong" decision, end up putting increased load on the cluster and therefore causing a cascading failure. I certainly understand that in certain environments it may be desirable, but many of our customers specifically enjoy the administrative control that we give them.
For instance, I don't know if you've worked with MongoDB or Cassandra at all...but they both *only* have this notion of automatic leaving and joining the cluster. That causes lots of data to be moved around unneccessarily, putting increased load on the cluster and even creating hotspots due to their inefficient sharding algorithms.
Couchbase is very strongly focused on providing production-ready software that can be run at both large and small scale. While I certainly agree that your suggestion would make for slightly easier software to manage, I suspect that the benefits will be outweighed by the potential for further disruption.
It may be more advisable for us to help you write a simple script that will monitor your cluster and can apply this logic if it fits your environment. Our software is *completely* controllable via our command-line interface and/or REST API so it is actually quite easy to implement the actual rejoin+rebalance. The hard part is not *doing* the action, it's deciding when *not* to.
I'm definitely not saying we'll never do this (see our history of auto-failover...a very similar feature) but wanted to give you some insight into why we haven't already.
Let's keep the conversation up, I'm going to file an enhancement request so that product management is aware of this request.
The IP situation is another interesting one that we've not found a good way of tackling. A significant part of the trouble comes from the way Erlang identifies itself to other nodes. Without some sort of central lookup repository (and therefore more management overhead, other things to fail, etc) we haven't found a better way of working around this other than the current instructions. I agree that they could be made easier to implement.
Perry
Oh, also...do those logs you sent me include the timeframe of one of these crashes you've experienced? We certainly want to make sure we investigate and address those as well.
Perry
Perry
The crashes are actually the underlying EC2 instance crashing. I think Amazon is having issues in one of their availability zones in us-east-1 (it's always the instance in the same availability zone). I don't think the crashes are actually related to CouchBase.
Gotcha, thanks Mark.
Perry
We're using CouchBase for two things:
1. Sessions (I don't want a memcache instance failure to log a user out, and the ability to scale is awesome)
2. Memcache replacement (I haven't had time to split up our memcache/CouchBase instance into pools, but this stuff doesn't matter if it fails)
I'm using 1 replica (3 node cluster), so the data should still be available while the failure has occurred. In the future, I'll continue using replicas for the session storage, but not for the memcache replacement stuff (it's easy enough to recreate, just significantly increases the DB load while the missing values are being regenerated).
On the other hand, if the node does get failed over, I firmly believe that it is in the best interest of the user to have the rejoin+rebalance be administratively controlled. There are enough situations (both in EC2 and outside) where the software could easily make the "wrong" decision, end up putting increased load on the cluster and therefore causing a cascading failure. I certainly understand that in certain environments it may be desirable, but many of our customers specifically enjoy the administrative control that we give them.
Having one node rejoin/rebalance at a time may be a good way to throttle issues. I think it's best to always have extra capacity provisioned and active to tolerate a failure and a repair. I think it should be up to the administrator to decided whether automatic repair is enabled or not; I'd much rather a system that takes care of itself than getting phone calls at night. :-)
For instance, I don't know if you've worked with MongoDB or Cassandra at all...but they both *only* have this notion of automatic leaving and joining the cluster. That causes lots of data to be moved around unneccessarily, putting increased load on the cluster and even creating hotspots due to their inefficient sharding algorithms.
I haven't used either (yet). I started looking at MongoDB and got scared away by its scaling issues. I here Cassandra's creators, Facebook, don't even use it any longer. I may be using Riak for another project soon. It's possible that CouchBase would also work well. Basically, it'll be a reimplementation of this: http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html
Couchbase is very strongly focused on providing production-ready software that can be run at both large and small scale. While I certainly agree that your suggestion would make for slightly easier software to manage, I suspect that the benefits will be outweighed by the potential for further disruption.
Again, I think making it optional is best. We're a startup and don't have a night watch ;-)
It may be more advisable for us to help you write a simple script that will monitor your cluster and can apply this logic if it fits your environment. Our software is *completely* controllable via our command-line interface and/or REST API so it is actually quite easy to implement the actual rejoin+rebalance. The hard part is not *doing* the action, it's deciding when *not* to.
Now that's an option. As the issue seems to happen when hardware randomly reboots, I could run a script on @reboot in cron.
I'm definitely not saying we'll never do this (see our history of auto-failover...a very similar feature) but wanted to give you some insight into why we haven't already.
It does make sense. And your conservative approach speaks volumes about the philosophy you take towards writing solid software.
Let's keep the conversation up, I'm going to file an enhancement request so that product management is aware of this request.
Will do!
The IP situation is another interesting one that we've not found a good way of tackling. A significant part of the trouble comes from the way Erlang identifies itself to other nodes. Without some sort of central lookup repository (and therefore more management overhead, other things to fail, etc) we haven't found a better way of working around this other than the current instructions. I agree that they could be made easier to implement.
Perry
They certainly didn't have cloudy environments when Erlang was invented, so I can't blame Erlang's designers for thinking IPs never change. =)
I think the current situation is manageable, actually. As long as you don't stop an instance, IPs don't change. If you do a soft reboot, or AWS moves your instance to new hardware, the IP stays the same.
Thanks for you help, Perry!
-Mark
All that sounds good Mark, thanks again.
One further comment is regarding the ability to read from replicas. In the current version, we specifically disallow it to ensure our promise of immediate consistency. We're a CP system...if only it were as simple as that.
Going forward, we're already working on ways to allow replicas to be read from. It will likely be a very specifically separate command from the application so that there is no confusion about where data is coming from. Yes, a bit of re-coding to implement, but making things much more deterministic is all good in my world.
That won't quite address your self-healing request (http://www.couchbase.org/issues/browse/MB-4642) but it would allow you to not *have* to failover a node just to be able to read the data available in replicas. Note that it definitely won't allow writing to replicas (while they're replicas)...that's just begging for disaster.
Thanks again Mark, really happy to hear you're having a relatively positive experience even with our pre-release software...can't wait until it's GA for you!
Please keep us in the loop with your progress and any feelings you have towards the suitability of Couchbase for your projects, and/or the other solutions you're looking at.
Perry
One further comment is regarding the ability to read from replicas. In the current version, we specifically disallow it to ensure our promise of immediate consistency. We're a CP system...if only it were as simple as that.
If a node auto-fails, doesn't the replica take over as the authoritative value for the key? Or?
Going forward, we're already working on ways to allow replicas to be read from. It will likely be a very specifically separate command from the application so that there is no confusion about where data is coming from. Yes, a bit of re-coding to implement, but making things much more deterministic is all good in my world.
That won't quite address your self-healing request (http://www.couchbase.org/issues/browse/MB-4642) but it would allow you to not *have* to failover a node just to be able to read the data available in replicas. Note that it definitely won't allow writing to replicas (while they're replicas)...that's just begging for disaster.
This would also be useful for another reason: lower latency and cost in multiple availability zones. I think I talked about this in another topic in the forums. Having locality awareness -- making sure the replicas are stored in other availability zones -- would also be very nice.
Thanks again Mark, really happy to hear you're having a relatively positive experience even with our pre-release software...can't wait until it's GA for you!
When is a rough estimate when you think GA will be released? Our growth plans involve two orders of magnitude increase in traffic this year.
I'm not afraid of new software, especially if it heals itself. I try to design everything with planned failures of every component.
Please keep us in the loop with your progress and any feelings you have towards the suitability of Couchbase for your projects, and/or the other solutions you're looking at.
Perry
Will do!
If a node auto-fails, doesn't the replica take over as the authoritative value for the key? Or?
Yes, once failover actually takes place (whether automatic or manual) the replica copy becomes the authoritative copy. But it's important to realize that that is now the active location for the data, and all requests (both read and write) will be directed there. That's how will still maintain immediate consistency. In some cases you might want it to switch back...but again there's the argument over control, predictability and various corner cases where you wouldn't want that to happen.
Going forward, we're already working on ways to allow replicas to be read from. It will likely be a very specifically separate command from the application so that there is no confusion about where data is coming from. Yes, a bit of re-coding to implement, but making things much more deterministic is all good in my world.
That won't quite address your self-healing request (http://www.couchbase.org/issues/browse/MB-4642) but it would allow you to not *have* to failover a node just to be able to read the data available in replicas. Note that it definitely won't allow writing to replicas (while they're replicas)...that's just begging for disaster.
This would also be useful for another reason: lower latency and cost in multiple availability zones. I think I talked about this in another topic in the forums. Having locality awareness -- making sure the replicas are stored in other availability zones -- would also be very nice.
Also something we're aware of and already planning on for a release shortly after 2.0.
When is a rough estimate when you think GA will be released? Our growth plans involve two orders of magnitude increase in traffic this year.
The current plan is for early next quarter.
Thanks again Mark, please keep in touch.
I ended up pulling CouchBase Saturday morning. We started getting session errors where "it" was failing. I moved back to a stand-alone memcache instance.
However, looking more in depth into the problem, it seems to have been caused by a crashed Moxi on one of the web servers. Obviously, if a request where made to the web server without the functioning Moxi would fail to retrieve the session.
I'm rewriting our session handling to be backed by MySQL, and I'll test CouchBase again as an intermediary cache.
Hi Mark,
just wondering if what you have observed on the cluster was related to having multiple views running while you were loading data through moxi ?
-Farshid
No, I had no views.
I've been using memcache buckets for a while now though and beam.smp CPU usage seems reasonable. It's only the CouchBase buckets that cause trouble.
Mark,
just wondering if you opened a new issue in couchbase jira and attached the diagnostics there?
it would be interesting to know what specific process in erlang is actually taking up so much cpu.
if you have couchbase 2.0 installed it would also be easy to tail on the logs under /opt/couchbase/var/lib/couchbase/logs/log.*
-Farshid
Investigating with top showing threads, there appears to be two beam.smp threads that are responsible for the CPU consumption.
Removing a node from the cluster drops CPU consumption down to near zero, but the same two threads continue to run frequently. Adding the node back changes nothing until rebalancing the cluster when the CPU usage jumps back up (same two threads again).
The threads aren't running full out all the time. Instead, they seem to spike between low usage and high usage.
The storage is using local ephemeral. The data is PHP sessions and other memcache-replacement usage.