High cpu usage in memcached process (couchbase 2.0)
Hi
I am testing couchbase 2.0 as a "persistent" memcached alternative in a semi-production environment.
I set up a cluster made of 4 m1.large servers, with a single bucket and a dedicated 4GB of RAM per node (16GB total).
The bucket is configured as "couchbase" and not simply "memcached" because we would like to have strict expiration times.
On all the client I setup the moxi-server v1.8, so that they can connect to localhost using the memcached protocol, and it deals with connecting to the right server for fetching the requested data.
The rate of ops is currently oscillating between 600 ("idle time") and 1000 per second(peak time).
We are testing this since mid december, but suddenly on december 30th the CPU usage of the memcached process started oscillating between 100 and 111%.
There are no evident issues with the service performance (looking at the response time), but the cpu usage is alarming, since in this way we feel not able to detect if the cluster is really operating at maximum capacity or not.
Thanks,
Francesco
I am also seeing similar messages in the log, but it does not happen when the memcached load spikes. Really curious that it seems we are the only two having this issue.
Are you using ubuntu 1204, too?
I'm running Ubuntu 10.04.
Yeah, it is very strange.
We ran membase-1.7.2 for about a year straight. The load on the server was never above .2. Then we upgraded to v2.0.0 and a day later all our dedicated couchbase servers are running at 2-3 load average. definitely something wrong, but hard to say what. It does not happen on our staging or test servers (setup identically to our production server), so it must be somewhat load related. There is nothing interesting in the statistics in the ui to correspond with the load spikes and everything appears normal and healthy except for the high cpu. Also, once our cpu spikes it never goes back down either. You can even delete all memcached buckets in the ui, but CPU is still 100%. I have to bounce couchbase for the cpu to return to normal, which will fix it for anywhere from 2-24 hours before memcached process goes to 100%+ again.
I'm seeing the same behavior.
Three-node cluster, newest community release, light use (testing only right now, maybe 100 operations/sec) and one node has been pegged at 100% for over 18 hours now. Memcached process is the culprit.
Also, I downgraded our production cluster 2 days ago back to 1.7.2 and there are no more cpu spikes now. So definitely some issue that was introduced in one of the newer versions.
A little more information:
This seems to be associated with views and their upkeep. I've had a new cluster going for several days now with bulk loading, rebalancing and compacting with no spiky CPU stuff. The only difference seems to be that I don't have views yet.
Hope this helps :)
I had the cluster without any view (both dev and prod) for some days, but the problem didn't disappear for me...
You mean you've *never* had views in the cluster, or that the views have been removed?
The cluster I have is brand new (I'm in testing so clusters are getting hosed wholesale) and has never had a view on it.
UPDATE:
Well never mind -- did a rebalance and now the CPU's pegged at 100% again. :)
I've never had views in our cluster either.
Hello,
To help us to understand the source of the issue it would be great if you could share your logs with us.
For this please follow the steps that are located on this wiki page:
http://www.couchbase.com/wiki/display/couchbase/Working+with+the+Couchba...
Let me know when it is uploaded.
Regards
Hi,
I just uploaded our logs.
Thanks
I am experiencing similar issues - Couchbase community server 2 - latest release - no views.
Hi there,
any news about this issue?
Have my log files been meaningful?
Thanks
Not really, and we were not able to reproduce it on our environment.
I continue to investigate (I am sending you a PM to see how we can work closer with you)
Regards
Tug,
if you still have trouble reproducing, i could work with you too in mid february some time. we downgraded back to membase 1.7.2 for the time being and are in production lockdown until feb so we can't test 2.0 again until then.
-Chris
Sure.
Feel free to send me a PM or an email or add my skype contact.
The servers are on amazon aws so there will be no problem to allow you ssh on them; I can support you on the task since it is quite a priority for us to decide to go OK/NOK with couchbase 2 in the next month or so.
In case it helps, these are the steps we followed to install couchbase on our ubuntu 12.04 server base image (it has been run as root):
#######################
export SERVER=servername.ourdomain.com
wget http://packages.couchbase.com/releases/2.0.0/couchbase-server-community_...
dpkg -i couchbase-server-community_x86_64_2.0.0.deb
/etc/init.d/couchbase-server stop
mv /opt/couchbase/bin/couchbase-server /opt/couchbase/bin/couchbase-server.orig
cat /opt/couchbase/bin/couchbase-server.orig | sed "s/-run ns_bootstrap/-run ns_bootstrap -name ns_1@$SERVER/g" > /opt/couchbase/bin/couchbase-server
chown couchbase:couchbase /opt/couchbase/bin/couchbase-server
chmod 755 /opt/couchbase/bin/couchbase-server
echo $SERVER > /opt/couchbase/var/lib/couchbase/ip
rm -rf /opt/couchbase/var/lib/couchbase/data/*
rm -rf /opt/couchbase/var/lib/couchbase/mnesia/*
rm -rf /opt/couchbase/var/lib/couchbase/config/config.dat
echo "* soft nofile 16536 " >> /etc/security/limits.conf
echo "* hard nofile 32768" >> /etc/security/limits.conf
/etc/init.d/couchbase-server start
#######################
Regards
We are also receiving emails alerts with subject:
Couchbase Server alert: IP address changed
and body
IP address seems to have changed. Unable to listen on 'ns_1@01.couch.ourdomain.com'.
It happens roughly once a day per server, randomly.
It is clearly wrong because we are on AWS and the ip address should not change...
Hello Francesco,
We are fixing this in the product to avoid issue with IP address in 2.0.1
Also when running on the Cloud/Amazon you have to be sure that your server is using the correct IP address as documented
- http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-bestpractic...
- http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-bestpractic...
Do you still have the issue?
Regards
Hi yes,
I followed the rules above using as name fqdn configured as CNAME of the amazon public dns name (e.g. the one that looks like ec2-x-y-z-w.compute1.amazonaws.com) and it still happens every 3-4 days.
I installed another cluster, under the same load, using the same steps amazon public dns name as server name and have not received any mail yet, but it's up and running only since two days.
thanks
Francesco
Hi,
we have the same issue here. According to the last post on http://www.couchbase.com/forums/thread/diagnostics-memcached-running-hot... i reboot one of the two node. Since the reconnection failed i fail over on the remaining node and then did a rebalance, and so far the rebooted node consume much less cpu. We do not know if it's the rebalance or the reboot which helped.
The log message when it failed to reconnect was this one if it can help:
Port server memcached on node 'ns_1@192.168.46.204' exited with status 134. Restarting. Messages: Mon Feb 18 17:58:25.856450 CET 3: Trying to connect to mccouch: "localhost:11213" Mon Feb 18 17:58:25.857209 CET 3: Connected to mccouch: "localhost:11213" Mon Feb 18 17:58:25.914571 CET 3: Extension support isn't implemented in this version of bucket_engine Mon Feb 18 17:58:25.916258 CET 3: Trying to connect to mccouch: "localhost:11213" Mon Feb 18 17:58:25.916765 CET 3: Connected to mccouch: "localhost:11213" Mon Feb 18 17:58:25.972524 CET 3: Extension support isn't implemented in this version of bucket_engine Mon Feb 18 17:58:25.994470 CET 3: Failed to load mutation log, falling back to key dump Mon Feb 18 17:58:26.052894 CET 3: Failed to load mutation log, falling back to key dump Mon Feb 18 17:58:26.107872 CET 3: metadata loaded in 188 ms Mon Feb 18 17:58:26.154400 CET 3: 4 items loaded from access log, completed in 41 usec Mon Feb 18 17:58:26.156707 CET 3: warmup completed in 234 ms Mon Feb 18 17:58:27.840482 CET 3: TAP (Producer) eq_tapq:replication_ns_1@192.168.46.68 - Sending TAP_OPAQUE with command "opaque_enable_auto_nack" and vbucket 0 Mon Feb 18 17:58:27.840582 CET 3: TAP (Producer) eq_tapq:replication_ns_1@192.168.46.68 - Sending TAP_OPAQUE with command "enable_checkpoint_sync" and vbucket 0 Mon Feb 18 17:58:34.925510 CET 3: TAP (Consumer) eq_tapq:anon_1 - disconnected EOL on stdin. Initiating shutdown memcached: src/stored-value.cc:154: mutation_type_t HashTable::insert(const Item&, bool, bool): Assertion `itm.getCas() != static_cast<uint64_t>(-1)' failed.
I just wanted to add a few notes to this thread.
1. The high cpu issue should be fixed in 2.0.1. We tracked this down to a problem with our background fetcher.
2. This forum post was linked to bug MB-7695 which was about seeing mccouch errors with error number 86. This error is a tmp fail error which occurs when the compactor is working on a file that the database is trying to write to. It is not anything to be worried about since the operations that failed are re-queued and tried again very soon after receiving the error. This should only be viewed as a problem if these error messages are seen very close together (like every 1ms).
There were a lot of posts on this question and I didn't have time to look through all of them. Please post back if there are still any unresolved questions you guys still have.
When is 2.0.1 due to be released?
Thanks,
Marvin
I don't think there is a hard release date yet, but it should be released in the next few weeks. It's currently code complete and as soon as it makes it through QE it will be released.
Hello,
We very recently upgraded to 2.0.1 but still seeing memcached process eating >100% cpu at all times.
It was ok for an hour or two after upgrade (around 20%) and now it's bad again..
Can anyone advice on this?
Is this issue being even investigated at all?
Jacek
Jacek,
Could you please describe a little bit your environment?
- OS
- size of the cluster?
- number of documents/keys?
- volume of data?
- type of activity?
-....
Anything in the logs?
regards
I see this was released last month for the enterprise edition, is there any plan to release a community edition update in the near future?
Thanks
Marvin.
Hello,
You message was caught in the spam filter, this is why it just appears now in the forum.
You have probably noticed that the Couchbase 2.0.1 is now available for all platform CE and EE.
You can download it from http://www.couchbase.com/download
Regards
Hi same issues here,
Ubuntu 12.04 64 bit on esx both 2.0.0 and 2.0.1 public version generate constant 100% cpu usage.
I remove it and setup 2.0.1 enterprise, it look stable...
Hi,
We are seeing the same behavior. On our 3-node cluster, all the memcached processes are eating up 100-120% cpu. Also we do not observe any performance degradation, but the cpu usage is alarming. If I restart couchbase, the cpu usage returns to normal (10-20%) for several hours, but then will spike up again at a random time. It does not appear to spike at the same time on all nodes and some nodes will run for a day or more at normal cpu before spiking to 100%. We are running 2.0 community. The total ops on our cluster is 400-800sec, with memcached ops being about 200 ops/sec. We have 3 couchbase buckets, and 2 memcached buckets. There are a couple of these lines in our memcached logs, but they don't appear to coincide with the cpu spiking:
Mon Jan 7 03:36:35.580103 PST 3: Received error[86] from mccouch for unknown
Mon Jan 7 03:36:35.580149 PST 3: Retry notify CouchDB of update, vbucket=473 rev=2
This message appears in the log before the memcached process spiked to 100, so seems like it isn't related. I have also tried blowing away /opt/couchbase and reinstalling, but that does not help either, after a few hours the cpu spike returns on the node. I also tried deleting the memcached buckets and recreating them, but cpu usage did not change.
Messages like this appear in our couchbase error.1 log:
[ns_server:error,2013-01-07T4:28:14.279,ns_1@10.8.2.121:<0.1136.1>:ns_memcached:verify_report_long_call:297]call topkeys took too long: 681402 us
[ns_server:error,2013-01-07T4:36:04.171,ns_1@10.8.2.121:<0.883.0>:ns_memcached:verify_report_long_call:297]call topkeys took too long: 615103 us
[ns_server:error,2013-01-07T5:47:59.007,ns_1@10.8.2.121:<0.14986.0>:ns_memcached:verify_report_long_call:297]call topkeys took too long: 529394 us
[ns_server:error,2013-01-07T5:48:34.107,ns_1@10.8.2.121:<0.1137.1>:ns_memcached:verify_report_long_call:297]call topkeys took too long: 612580 us
[ns_server:error,2013-01-07T5:58:33.711,ns_1@10.8.2.121:<0.864.0>:ns_memcached:verify_report_long_call:297]call topkeys took too long: 539101 us
[ns_server:error,2013-01-07T6:08:23.995,ns_1@10.8.2.121:<0.963.0>:ns_memcached:verify_report_long_call:297]call topkeys took too long: 637879 us
[ns_server:error,2013-01-07T6:08:59.025,ns_1@10.8.2.121:<0.1134.1>:ns_memcached:verify_report_long_call:297]call topkeys took too long: 575802 us
[error_logger:error,2013-01-07T6:28:50.271,ns_1@10.8.2.121:error_logger<0.5.0>:ale_error_logger_handler:log_msg:76]Mnesia('ns_1@10.8.2.121'): ** WARNING ** Mnesia is overloaded: {dump_log,
write_threshold}
[ns_server:error,2013-01-07T6:43:54.046,ns_1@10.8.2.121:<0.1137.1>:ns_memcached:verify_report_long_call:297]call topkeys took too long: 597171 us
[ns_server:error,2013-01-07T6:55:18.806,ns_1@10.8.2.121:<0.884.0>:ns_memcached:verify_report_long_call:297]call topkeys took too long: 536158 us
[error_logger:error,2013-01-07T7:16:50.252,ns_1@10.8.2.121:error_logger<0.5.0>:ale_error_logger_handler:log_msg:76]Mnesia('ns_1@10.8.2.121'): ** WARNING ** Mnesia is overloaded: {dump_log,
write_threshold}
[ns_server:error,2013-01-07T7:17:33.664,ns_1@10.8.2.121:<0.863.0>:ns_memcached:verify_report_long_call:297]call topkeys took too long: 511681 us
[ns_server:error,2013-01-07T7:18:04.052,ns_1@10.8.2.121:<0.1137.1>:ns_memcached:verify_report_long_call:297]call topkeys took too long: 622107 us
[ns_server:error,2013-01-07T7:39:08.925,ns_1@10.8.2.121:<0.963.0>:ns_memcached:verify_report_long_call:297]call topkeys took too long: 538549 us
[ns_server:error,2013-01-07T8:38:39.410,ns_1@10.8.2.121:<0.1137.1>:ns_memcached:verify_report_long_call:297]call topkeys took too long: 580306 us
[ns_server:error,2013-01-07T8:42:13.786,ns_1@10.8.2.121:<0.14986.0>:ns_memcached:verify_report_long_call:297]call {stats,<<>>} took too long: 603351 us
[ns_server:error,2013-01-07T8:42:13.945,ns_1@10.8.2.121:<0.863.0>:ns_memcached:verify_report_long_call:297]call topkeys took too long: 785144 us
[ns_server:error,2013-01-07T8:42:44.368,ns_1@10.8.2.121:<0.863.0>:ns_memcached:verify_report_long_call:297]call topkeys took too long: 1216497 us
[ns_server:error,2013-01-07T8:48:10.947,ns_1@10.8.2.121:ns_memcached-prices<0.859.0>:ns_memcached:handle_info:630]handle_info(ensure_bucket,..) took too long: 542696 us
As with the memcached logs though, these messages appear before the cpu spikes too, so I'm not sure they are related...
I would also add that we've been running 1.7.2 for a year before and this only started with the 2.0 update, so I believe it is due to something in the update as we have not made any changes to our app or underlying data recently.
Thanks,
Chris