Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 2.0-developer-preview-4
Fix Version/s: 2.0-beta
Component/s: view-engine
Security Level: Public
Labels: None
Environment:
Description
Test case:
./testrunner -i resources/jenkins/centos-64-7node-failover.ini -t viewtests.ViewFailoverTests.test_view_failover_multiple_design_docs_x_node_replica_y,num-design-docs=20,num-docs=10000,replica=2,failover-factor=2
Create 20 design docs and insert 200k docs into the cluster, then fail over two nodes and rebalance those failed-over nodes out of the cluster.
- 10.3.121.155-8091-diag.txt.gz, 29/Apr/12 11:05 PM, 14.87 MB, Farshid Ghods
- 10.3.121.156-8091-diag.txt.gz, 29/Apr/12 11:05 PM, 13.84 MB, Farshid Ghods
- 10.3.121.159-8091-diag.txt.gz, 29/Apr/12 11:05 PM, 15.09 MB, Farshid Ghods
- 10.3.121.161-8091-diag.txt.gz, 29/Apr/12 11:05 PM, 13.86 MB, Farshid Ghods
- ns-diag-20120501135727.txt.zip, 01/May/12 6:11 PM, 16.08 MB, Farshid Ghods (contains ns-diag-20120501135727.txt, 233.70 MB)
Activity
Farshid Ghods
added a comment -
The cluster is live here: http://10.3.121.155:8091/index.html#sec=views (I will leave it running for now until the test times out).
Farshid Ghods
added a comment -
Starting rebalance, KeepNodes = ['ns_1@10.3.121.156','ns_1@10.3.121.161',
'ns_1@10.3.121.159','ns_1@10.3.121.155'], EjectNodes = []
ns_orchestrator004 ns_1@10.3.121.155 00:11:46 - Sat Apr 28, 2012
Failed over 'ns_1@10.3.121.157': ok ns_orchestrator006 ns_1@10.3.121.155 00:11:36 - Sat Apr 28, 2012
Starting failing over 'ns_1@10.3.121.157' ns_orchestrator000 ns_1@10.3.121.155 00:11:33 - Sat Apr 28, 2012
Failed over 'ns_1@10.3.121.158': ok ns_orchestrator006 ns_1@10.3.121.155 00:11:33 - Sat Apr 28, 2012
Starting failing over 'ns_1@10.3.121.158'
Farshid Ghods
added a comment -
Seeing this issue in a test that starts and stops rebalance multiple times while rebalancing from 1 to 2 nodes.
Aleksey Kondratenko
added a comment -
Everything is stuck waiting on this guy (on node .159)
{<0.5209.19>,
[{registered_name,[]},
{status,waiting},
{initial_call,{proc_lib,init_p,5}},
{backtrace,
[<<"Program counter: 0x00002aaab415e658 (couch_set_view_group:stop_cleaner/1 + 576)">>,
<<"CP: 0x0000000000000000 (invalid)">>,
<<"arity = 0">>,<<>>,
<<"0x00002aaac9d28e58 Return addr 0x00002aaab4154608 (couch_set_view_group:terminate/2 + 680)">>,
<<"y(0) []">>,<<"y(1) []">>,
<<"y(2) []">>,<<"y(3) []">>,
<<"y(4) []">>,
<<"(5) {state,{\"/opt/couchbase/var/lib/couchbase/data/\",<<7 bytes>>,{set_view_group,<<16 ">>,
<<"(6) {set_view_group,<<16 bytes>>,<0.5212.19>,<<7 bytes>>,<<29 bytes>>,<<10 bytes>>,[],">>,
<<"(7) {\"/opt/couchbase/var/lib/couchbase/data/\",<<7 bytes>>,{set_view_group,<<16 bytes>>">>,
<<"y(8) <0.22547.19>">>,<<>>,
<<"0x00002aaac9d28ea8 Return addr 0x00002b12258a4f08 (gen_server:terminate/6 + 184)">>,
<<"y(0) []">>,<<"y(1) []">>,
<<"y(2) []">>,
<<"y(3) {noproc,{gen_server,call,[<0.5220.19>,{set_state,\"\\\"#$&()*,\",[],\"'\"},infinity]}}">>,
<<>>,
<<"0x00002aaac9d28ed0 Return addr 0x00002b122584c458 (proc_lib:init_p_do_apply/3 + 56)">>,
<<"y(0) []">>,
<<"(1) {state,{\"/opt/couchbase/var/lib/couchbase/data/\",<<7 bytes>>,{set_view_group,<<16 ">>,
<<"y(2) couch_set_view_group">>,
<<"(3) {set_state,[34,35,36,38,40,41,42,44,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,6">>,
<<"y(4) <0.5209.19>">>,
<<"y(5) {noproc,{gen_server,call,[<0.5220.19>,{set_state,\"\\\"#$&()*,\",[],\"'\"},infinity]}}">>,
<<"y(6) Catch 0x00002b12258a4f08 (gen_server:terminate/6 + 184)">>,
<<>>,
<<"0x00002aaac9d28f10 Return addr 0x00000000008a4a38 (<terminate process normally>)">>,
<<"y(0) Catch 0x00002b122584c478 (proc_lib:init_p_do_apply/3 + 88)">>,
<<>>]},
{error_handler,error_handler},
{garbage_collection,
[{min_bin_vheap_size,46368},
{min_heap_size,233},
{fullsweep_after,65535},
{minor_gcs,16}]},
{heap_size,6765},
{total_heap_size,35422},
{links,[<0.222.0>]},
{memory,288776},
{message_queue_len,45},
{reductions,4538890},
{trap_exit,true}]},
capi_set_view_manager is stuck defining the index, but the couch_set_view process is itself stuck in a partition_deleted call to that stuck process.
There are also a number of queries stuck on this view group. See the process list in the diag of .159.
Aleksey Kondratenko
added a comment -
My guess is that this process is waiting on a cleaner process that is actually dead. The shutdown reason is noproc from a gen_server call, perhaps to the cleaner. And it is waiting on its EXIT signal in a not 100% robust way (I'd monitor the cleaner in addition to waiting for EXIT, at least in terminate).
Another weird thing is that there are 0 (!) traces of this cleaner dying in the logs.
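The "monitor in addition to waiting for EXIT" idea can be sketched roughly as follows. This is a minimal illustration, not the couch_set_view_group code: the module name, the stop_and_wait/2 helper, and the use of exit(Pid, shutdown) are all assumptions made for the example. It assumes the caller traps exits and spawned the worker with spawn_link.

```erlang
-module(stop_worker_sketch).
-export([stop_and_wait/2]).

%% Hypothetical helper (not from the couchdb source): ask a linked worker
%% to stop and wait until it is known to be gone.
stop_and_wait(Pid, Timeout) when is_pid(Pid) ->
    %% If Pid is already dead, the monitor delivers a 'DOWN' message with
    %% reason noproc right away, so the receive below cannot block forever
    %% on an 'EXIT' message that was consumed or never sent.
    MRef = erlang:monitor(process, Pid),
    exit(Pid, shutdown),
    receive
        {'EXIT', Pid, Reason} ->
            %% Drop the monitor and any 'DOWN' message already queued.
            erlang:demonitor(MRef, [flush]),
            {ok, Reason};
        {'DOWN', MRef, process, Pid, Reason} ->
            %% A matching 'EXIT' message may still arrive later and is
            %% left in the mailbox for the caller to handle.
            {ok, Reason}
    after Timeout ->
        erlang:demonitor(MRef, [flush]),
        {error, timeout}
    end.
```

Waiting on both messages, with a timeout as a last resort, means a worker that died earlier shows up as {ok, noproc} instead of leaving the caller blocked in terminate.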
Filipe Manana
added a comment -
Thanks for the analysis.
The problem is that I recently made a change to make dialyzer happy and removed an exit() call inside the cleaner process, which now exits with reason normal, which is ignored by the gen_server.
And dealing with EXITs is 100% robust as long as spawn_link was used to spawn the process.
Aleksey Kondratenko
added a comment -
You're still fundamentally missing my point (as also seen in your reply in couch_file thread). You tend to trust your code IMHO a little bit too much. A bit of extra robustness is imho needed. There will always be bugs here and there and there needs to be some sensible way to a) detect and propagate them easily b) preferably not misbehave too badly. This case only proves my point.
Filipe Manana
added a comment -
Fine.
Can you tell me why it's not reliable to trap EXIT signals inside terminate (when the gen_server is trapping exits and spawned the process with spawn_link)?
I don't see that documented anywhere, nor does it seem to be the behaviour judging by OTP's source.
Also, I don't see the following log message in terminate in the logs:
https://github.com/couchbase/couchdb/blob/master/src/couch_set_view/src/couch_set_view_group.erl#L1840
Can you explain that to me as well?
Aleksey Kondratenko
added a comment -
Rebalance was started (and got stuck) more than a day before the logs were grabbed, so the logs have rotated past the log messages you're looking for.
About trapping exits: it is reliable, _but_ you don't really know in what condition (after some bug, crash, or whatever) your terminate function will be called. Assuming everything is OK (i.e. nobody consumed your EXIT message) is a bit too much, IMHO. So my point is: certain places need to be more robust and should not make overly strong assumptions.
Filipe Manana
added a comment -
If there's something consuming the exit message before terminate is called, then there's a bug somewhere.
I would rather find out exactly what's happening than be too defensive and mask bugs.
Filipe Manana
added a comment -
Farshid, can this be retried with more recent builds?
Was this happening with a specific testrunner case that I can try?
Filipe Manana
added a comment -
Farshid, never mind, I found out why it happens. There's a change in gerrit already.
Farshid Ghods
added a comment -
Keith,
Can you let Filipe know which test causes this issue if you notice it on the Jenkins test runs?
Thuan Nguyen
added a comment -
Integrated in github-couchdb-preview #397 (See [http://qa.hq.northscale.net/job/github-couchdb-preview/397/])
MB-5189 Fix deadlock on view group shutdown (Revision 2188a7b1b5483a145cba710b37963e10b815cb6c)
Result = SUCCESS
Filipe David Borba Manana :
Files :
* src/couch_set_view/src/couch_set_view_group.erl
Farshid Ghods
added a comment -
Keith will verify whether the test is passing now.