Node down, impossible to get it back up
Hi all,
We have run into an issue overnight.
First off, our config is :
- We only have 1 node (4GB of RAM) and 1 bucket ('default' bucket)
- We run Couchbase 2.0 Dev Preview 3
- The server is homed in a large Amazon EC2 instance (2 virtuals cores)
In our Couchbase Manager interface (localhost:8091, obtained after a "/etc/init.d/couchbase-server start" command), our active server is "down" and we can't understand why we can't access to the data in the bucket.
The logs that were obtained from : var/lib/couchbase/logs/errors :
stats:error] [2011-12-23 10:17:51] [ns_1@127.0.0.1:<0.1900.9>:stats_reader:log_bad_responses:191] Some nodes didn't respond: ['ns_1@127.0.0.1']
[error_logger:error] [2011-12-23 10:17:52] [ns_1@127.0.0.1:error_logger:ale_error_logger_handler:log_report:72]
=========================SUPERVISOR REPORT=========================
Supervisor: {local,menelaus_sup}
Context: child_terminated
Reason: {noproc,{gen_server,call,
['ns_memcached-default',topkeys,30000]}}
Offender: [{pid,<0.1599.9>},
{name,hot_keys_keeper},
{mfargs,{hot_keys_keeper,start_link,[]}},
{restart_type,permanent},
{shutdown,5000},
{child_type,worker}]
[ns_server:error] [2011-12-23 10:17:58] [ns_1@127.0.0.1:<0.7176.9>:ns_janitor:do_cleanup:68] Bucket "default" not yet ready on ['ns_1@127.0.0.1']
[stats:error] [2011-12-23 10:17:57] [ns_1@127.0.0.1:<0.1900.9>:stats_reader:log_bad_responses:191] Some nodes didn't respond: ['ns_1@127.0.0.1']
[error_logger:error] [2011-12-23 10:18:02] [ns_1@127.0.0.1:error_logger:ale_error_logger_handler:log_report:72]
=========================SUPERVISOR REPORT=========================
Supervisor: {local,menelaus_sup}
Context: child_terminated
Reason: {noproc,{gen_server,call,
['ns_memcached-default',topkeys,30000]}}
Offender: [{pid,<0.7194.9>},
{name,hot_keys_keeper},
{mfargs,{hot_keys_keeper,start_link,[]}},
{restart_type,permanent},
{shutdown,5000},
{child_type,worker}]
[stats:error] [2011-12-23 10:18:00] [ns_1@127.0.0.1:<0.18155.5>:stats_reader:log_bad_responses:191] Some nodes didn't respond: ['ns_1@127.0.0.1']
[ns_server:error] [2011-12-23 10:18:07] [ns_1@127.0.0.1:xdc_rep_manager:xdc_rep_manager:handle_info:263] database update notifier died. Reason: killed
[error_logger:error] [2011-12-23 10:18:07] [ns_1@127.0.0.1:error_logger:ale_error_logger_handler:log_msg:76] Error in process <0.16293.9> on node 'ns_1@127.0.0.1' with exit value: {{badmatch,{ok,<<295271 bytes>>}},[{couch_file,read_raw_iolist_int,3},{couch_file,maybe_read_more_iolist,4},{couch_file,reader_loop,1}]}
[stats:error] [2011-12-23 10:18:05] [ns_1@127.0.0.1:<0.7201.9>:stats_reader:log_bad_responses:191] Some nodes didn't respond: ['ns_1@127.0.0.1']
[ns_server:error] [2011-12-23 10:18:16] [ns_1@127.0.0.1:<0.16329.9>:ns_janitor:do_cleanup:68] Bucket "default" not yet ready on ['ns_1@127.0.0.1']
[stats:error] [2011-12-23 10:18:10] [ns_1@127.0.0.1:<0.10559.9>:stats_reader:log_bad_responses:191] Some nodes didn't respond: ['ns_1@127.0.0.1']
[error_logger:error] [2011-12-23 10:18:17] [ns_1@127.0.0.1:error_logger:ale_error_logger_handler:log_msg:76] ** Generic server xdc_rep_manager terminating
** Last message in was {'EXIT',<0.1625.9>,killed}
** When Server state == {rep_db_state,<0.1624.9>,<<"_replicator">>,<0.1625.9>}** Reason for termination ==
** {timeout,
{gen_server,call,
['logger-ns_server',
{log,
{log_info,ns_server,error,xdc_rep_manager,handle_info,263,
{1324,635487,988595},
xdc_rep_manager,'ns_1@127.0.0.1',undefined},
"database update notifier died. Reason: ~p",
[killed]}]}} When trying to access a document that we know exists in the database (with key "list_sessions"), the following error is brought up :
Memcached::ProtocolError: Key {"fb:50"=>"127.0.0.1:11211:8"}
from /usr/local/ruby/lib/ruby/gems/1.9.1/gems/memcached-1.3.5/lib/memcached/memcached.rb:613:in `reraise'
from /usr/local/ruby/lib/ruby/gems/1.9.1/gems/memcached-1.3.5/lib/memcached/memcached.rb:593:in `check_return_code'
from /usr/local/ruby/lib/ruby/gems/1.9.1/gems/memcached-1.3.5/lib/memcached/memcached.rb:512:in `get'
from /usr/local/ruby/lib/ruby/gems/1.9.1/gems/couchbase-0.9.7/lib/couchbase/memcached.rb:130:in `get'
from /home/funkytown/funkytown/app/models/user.rb:68:in `find'
from (irb):11
from /usr/local/ruby/lib/ruby/gems/1.9.1/gems/railties-3.1.3/lib/rails/commands/console.rb:45:in `start'
from /usr/local/ruby/lib/ruby/gems/1.9.1/gems/railties-3.1.3/lib/rails/commands/console.rb:8:in `start'
from /usr/local/ruby/lib/ruby/gems/1.9.1/gems/railties-3.1.3/lib/rails/commands.rb:40:in `<top (required)>'
from script/rails:6:in `require'
from script/rails:6:in `<main>'The problem seems to have occurred last night, time from which the logs start to look odd. On a normal basis, the logs of the couchbase interface often show a series of shutting downs of the bucket followed by loadings of the same bucket. This time however, only shutting downs take place as you can see :
2011-12-22 20:09:49.964 ns_cookie_manager:2:info:cookie update(ns_1@127.0.0.1) - Node 'ns_1@127.0.0.1' synchronized otp cookie jqktjyrvlnxeahky from cluster 2011-12-22 20:09:50.034 menelaus_sup:1:info:web start ok(ns_1@127.0.0.1) - Couchbase Server has started on web port 8091 on node 'ns_1@127.0.0.1'. 2011-12-22 20:09:55.629 ns_memcached:2:info:message(ns_1@127.0.0.1) - Shutting down bucket "default" on 'ns_1@127.0.0.1' for server shutdown 2011-12-22 20:15:50.917 ns_memcached:2:info:message(ns_1@127.0.0.1) - Shutting down bucket "default" on 'ns_1@127.0.0.1' for server shutdown (repeated 2 times) 2011-12-22 20:15:58.915 ns_memcached:2:info:message(ns_1@127.0.0.1) - Shutting down bucket "default" on 'ns_1@127.0.0.1' for server shutdown 2011-12-22 20:21:49.951 ns_memcached:2:info:message(ns_1@127.0.0.1) - Shutting down bucket "default" on 'ns_1@127.0.0.1' for server shutdown (repeated 3 times) 2011-12-22 20:22:49.325 ns_memcached:2:info:message(ns_1@127.0.0.1) - Shutting down bucket "default" on 'ns_1@127.0.0.1' for server shutdown 2011-12-22 20:27:49.956 ns_memcached:2:info:message(ns_1@127.0.0.1) - Shutting down bucket "default" on 'ns_1@127.0.0.1' for server shutdown (repeated 1 times)
Do you have an idea on why this suddenly happened ?
Any help you could give us would be very much appreciated. We would like to know where this could come from, seeing as we intend on using the service for our casual game application, and reinstalling Couchbase everytime is not a viable option.
Thank you for your attention,
Vincent
First off, thanks so much for the detailed report.
I don't immediately know of any issues that would cause this problem. Anything that is causing a protocol violation (as seen from the ruby client) is definitely an issue we want to look into. I've just filed:
http://www.couchbase.org/issues/browse/MB-4600
Is there any chance you can get a packet capture of port 11211, port 11210, port 8091 when it's in a failed state while you try to make the same request from the Ruby client? This may give us some clues as to what's happening at the protocol level. You should be able to do this with tcpdump.
If you can, please attach it to that issue above.