Process Crash Loop.
Need some help with logfile analysis. Looks like communications issues of some sort are causing processes to crash and restart constantly. Lots of the following bits in logs on all nodes:
ERROR REPORT <5597.5916.936> 2011-04-22 13:09:44
===============================================================================
ns_1@10.101.5.181:ns_memcached:374: Unable to connect: {error,
{badmatch,
{error,
econnrefused}}}, retrying.
This is repeated a bajillion times, then:
ERROR REPORT <5597.5815.936> 2011-04-22 13:09:44
===============================================================================
** Generic server <5597.5815.936> terminating
** Last message in was {#Port<5597.28202070>,{exit_status,134}}
** When Server state == {state,#Port<5597.28202070>,memcached,
{["memcached: stored-value.hh:974: add_type_t HashTable::add(const Item&, bool, bool): Assertion `v->isDirty() == isDirty' failed.",
"Item was expired at load: 530734zMCEeNfJs85z"],
["Item was expired at load: 530734klPJrGH0Hsjc"]},
{ok,{1303492184891651,
#Ref<5597.0.1973.200420>}},
["Item was expired at load: 533406XCeVuIJBMPNF",
"Item was expired at load: 533406twnwhPE4OhiC",
"Item was expired at load: 5313992tePTLISLPLM",
"Item was expired at load: 531399LyWwvVzKuYw2",
"Item was expired at load: 533406Q08ygjQkH6vN",
"Item was expired at load: 533406cZj1hVJqXOeB",
"Item was expired at load: 533406QpeJ8WaU7vM2",
"Item was expired at load: 533406fuRZwT6zPiHM",
"Item was expired at load: 533406EODkYgy4ipxC",
"Item was expired at load: 531399c45mzSAJcPD3"],
2394}
** Reason for termination ==
** {abnormal,134}
CRASH REPORT <5597.5815.936> 2011-04-22 13:09:44
===============================================================================
Crashing process
initial_call {ns_port_server,init,['Argument__1']}
pid <5597.5815.936>
registered_name []
error_info
{exit,{abnormal,134},
[{gen_server,terminate,6},{proc_lib,init_p_do_apply,3}]}
ancestors
[<5597.5814.936>,ns_port_sup,ns_server_sup,ns_server_cluster_sup,
<5597.52.0>]
messages [{'EXIT',#Port<5597.28202070>,normal}]
links [<5597.5814.936>]
dictionary []
trap_exit true
status running
heap_size 6765
stack_size 24
reductions 440446
INFO REPORT <5597.5814.936> 2011-04-22 13:09:44
===============================================================================
Cushion managed supervisor for memcached failed: {abnormal,134}
ERROR REPORT <5597.5814.936> 2011-04-22 13:09:44
===============================================================================
** Generic server <5597.5814.936> terminating
** Last message in was {die,{error,cushioned_supervisor,{abnormal,134}}}
** When Server state == {state,memcached,5000,{1303,492177,453890},undefined}
** Reason for termination ==
** {error,cushioned_supervisor,{abnormal,134}}
CRASH REPORT <5597.5814.936> 2011-04-22 13:09:44
===============================================================================
Crashing process
initial_call {supervisor_cushion,init,['Argument__1']}
pid <5597.5814.936>
registered_name []
error_info
{exit,{error,cushioned_supervisor,{abnormal,134}},
[{gen_server,terminate,6},{proc_lib,init_p_do_apply,3}]}
ancestors [ns_port_sup,ns_server_sup,ns_server_cluster_sup,<5597.52.0>]
messages []
links [<5597.106.0>]
dictionary []
trap_exit true
status running
heap_size 377
stack_size 24
reductions 216
Then we get scads of supervisor reports. Can I send you some logs, Perry?
1.6.5.
Time for an upgrade?
Indeed.
While you're at it, take a look at our 1.7 pre-release and let me know what you think: http://techzone.couchbase.com/forums/thread/membase-server-17-developer-...
Perry
Perry,
Seeing the below error every time I try to remove/rebalance a node:
INFO REPORT <5597.28355.1061> 2011-04-25 09:21:00
===============================================================================
ns_1@10.101.5.181:ns_rebalancer:420: Waiting for ['ns_1@10.101.5.182',
'ns_1@10.101.5.183',
'ns_1@10.101.5.184',
'ns_1@10.101.5.185',
'ns_1@10.101.5.186']
[previous message repeated every second]
INFO REPORT <5597.157.0> 2011-04-25 09:21:07
===============================================================================
ns_log: logging ns_orchestrator:2:Rebalance exited with reason wait_for_memcached_failed
Looks like a simple timeout, but it's unshakeable. Hard to upgrade the cluster when I can't remove nodes. Any assistance you could lend would be appreciated.
If one node is continuously restarting (as per your previous error message) you won't be able to rebalance it out of the cluster since we can't pull the necessary data off of it.
Your best option would be to fail that node over, upgrade it, and add it back to the cluster.
If there are more nodes crashing than you have replicas available, you'll have to do an "in place" upgrade which means shutting all the nodes down, upgrading them and restarting.
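The rule of thumb above can be written down as a tiny decision helper. This is only a sketch of the reasoning in this post, not anything shipped with Membase: the function name, its parameters, and the returned labels are all made up for illustration.

```python
def choose_upgrade_path(crashing_nodes, replica_count):
    """Pick an upgrade approach from the rule of thumb in this thread.

    crashing_nodes: how many nodes are stuck in the crash/restart loop.
    replica_count:  how many replicas the bucket is configured with.
    Both names and the returned labels are hypothetical.
    """
    if crashing_nodes < 0 or replica_count < 0:
        raise ValueError("counts must be non-negative")
    if crashing_nodes > replica_count:
        # More broken nodes than replicas: failing them over would lose data,
        # so shut everything down, upgrade, and restart.
        return "in_place"
    # Otherwise fail each crashing node over, upgrade it, and add it back.
    return "failover"


if __name__ == "__main__":
    print(choose_upgrade_path(crashing_nodes=1, replica_count=1))
    print(choose_upgrade_path(crashing_nodes=3, replica_count=1))
```

With one crashing node and one replica this returns "failover"; with three crashing nodes and one replica it returns "in_place".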
Make sense? Make sure to check out the release notes and upgrade instructions for 1.6.5.3: http://techzone.couchbase.com/wiki/display/membase/Membase+Server+1.6.5.3
Perry
This line points to a bug fixed in 1.6.5.3:
{["memcached: stored-value.hh:974: add_type_t HashTable::add(const Item&, bool, bool): Assertion `v->isDirty() == isDirty' failed.",
What version are you running?
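If you want to pull the interesting bits out of a large log yourself, something like the following rough sketch may help. It assumes the log text looks like the excerpts pasted in this thread; the function names and regular expressions are my own and not part of any Membase tool. It also assumes the usual Unix convention that an exit status above 128 means "killed by signal (status - 128)", which would make 134 a SIGABRT, consistent with a failed assertion calling abort().

```python
import re
import signal

# Excerpt copied from the log lines posted in this thread.
LOG_EXCERPT = r"""
ns_1@10.101.5.181:ns_memcached:374: Unable to connect: {error,
{badmatch,
{error,
econnrefused}}}, retrying.
** Last message in was {#Port<5597.28202070>,{exit_status,134}}
memcached: stored-value.hh:974: add_type_t HashTable::add(const Item&, bool, bool): Assertion `v->isDirty() == isDirty' failed.
"""


def summarize_crash(log_text):
    """Collect port exit statuses, assertion failures, and refused connections."""
    statuses = [int(s) for s in re.findall(r"\{exit_status,(\d+)\}", log_text)]
    assertions = re.findall(r"Assertion `([^']+)' failed", log_text)
    return {
        "exit_statuses": statuses,
        "assertions": assertions,
        "connection_refused": log_text.count("econnrefused"),
    }


def describe_exit_status(status):
    """Map an exit status to a signal name, assuming the 128 + N convention."""
    if status > 128:
        try:
            return signal.Signals(status - 128).name
        except ValueError:
            return "unknown signal"
    return "plain exit code"


if __name__ == "__main__":
    summary = summarize_crash(LOG_EXCERPT)
    print(summary)
    for status in summary["exit_statuses"]:
        print(status, describe_exit_status(status))
```

Running it over a full log would show whether every memcached exit lines up with the same assertion or whether something else is going on as well.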
Perry