[MB-4732]  Compaction seems to be stuck (or not running) Created: 31/Jan/12  Updated: 10/Apr/12  Due: 31/Jan/12  Resolved: 01/Mar/12

Status: Closed
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.0-developer-preview-4
Fix Version/s: 2.0-developer-preview-4
Security Level: Public

Type: Bug Priority: Major
Reporter: Tommie McAfee Assignee: Tommie McAfee
Resolution: Fixed Votes: 0
Labels: 2.0-dev-preview-4-release-notes
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: I have an 8-node cluster with disk size much larger than data.

Attachments: Zip Archive ns-diag-20120131214354.zip    

 Description   
Quoting Sharon:

"Troubleshooting, I found many nodes where disk size was 4 times greater than on other nodes.
 
Looking at one of these nodes where data is not compacted,
Compaction seems to be stuck.
 
http://50.18.98.4:8092/default%2F101
{"db_name":"default/101","doc_count":1807,"doc_del_count":0,"update_seq":2986,"purge_seq":0,"compact_running":false,"disk_size":4452469,"data_size":922673,"instance_start_time":"1328040896372522","disk_format_version":7,"committed_update_seq":2985}
 
Cluster is at http://50.18.98.4:8091 (Administrator/password)"
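
For reference, the gap between disk size and data size in the database info quoted above can be put into numbers. This is a minimal sketch, assuming fragmentation is defined as the share of disk_size not accounted for by data_size; that definition and the helper name `fragmentation` are illustrative assumptions, not taken from the compaction daemon's code.

```python
import json

# Database info for default/101, copied from the report above.
DB_INFO = (
    '{"db_name":"default/101","doc_count":1807,"doc_del_count":0,'
    '"update_seq":2986,"purge_seq":0,"compact_running":false,'
    '"disk_size":4452469,"data_size":922673,'
    '"instance_start_time":"1328040896372522",'
    '"disk_format_version":7,"committed_update_seq":2985}'
)


def fragmentation(info: dict) -> float:
    """Fraction of the on-disk file that is not live data (assumed definition)."""
    disk, data = info["disk_size"], info["data_size"]
    if disk == 0:
        return 0.0
    return (disk - data) / disk


info = json.loads(DB_INFO)
print(f"disk/data ratio: {info['disk_size'] / info['data_size']:.2f}")
print(f"fragmentation:   {fragmentation(info):.1%}")
print(f"compact_running: {info['compact_running']}")
```

With these figures the file is nearly five times the size of its live data while compact_running is false, which is consistent with the report that compaction is not making progress on this node.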


Quoting Aliaksey:

The cause of the compaction daemon hang is the same as that of the view hangs, so
this is generally the same bug.






 Comments   
Comment by Aliaksey Artamonau [ 31/Jan/12 ]
Compaction daemon processes' backtraces:

{<0.5083.0>,
                      [{registered_name,[]},
                       {status,waiting},
                       {initial_call,{proc_lib,init_p,5}},
                       {backtrace,
                        [<<"Program counter: 0x00002aaaabef2a70 (gen_server:loop/6 + 256)">>,
                         <<"CP: 0x0000000000000000 (invalid)">>,
                         <<"arity = 0">>,<<>>,
                         <<"0x00002aaaad443728 Return addr 0x00002aaaabe931c8 (proc_lib:init_p_do_apply/3 + 56)">>,
                         <<"y(0) []">>,<<"y(1) infinity">>,
                         <<"y(2) supervisor_cushion">>,
                         <<"y(3) {state,couchbase_compaction_daemon,3000,{1328,40081,401252},<0.5084.0>}">>,
                         <<"y(4) <0.5083.0>">>,<<"y(5) <0.4997.0>">>,
                         <<>>,
                         <<"0x00002aaaad443760 Return addr 0x000000000088e318 (<terminate process normally>)">>,
                         <<"y(0) Catch 0x00002aaaabe931e8 (proc_lib:init_p_do_apply/3 + 88)">>,
                         <<>>]},
                       {error_handler,error_handler},
                       {garbage_collection,
                        [{min_bin_vheap_size,46368},
                         {min_heap_size,233},
                         {fullsweep_after,0},
                         {minor_gcs,0}]},
                       {heap_size,233},
                       {total_heap_size,233},
                       {links,[<0.4997.0>,<0.5084.0>]},
                       {memory,2840},
                       {message_queue_len,0},
                       {reductions,75},
                       {trap_exit,true}]},
                     {<0.5084.0>,
                      [{registered_name,couchbase_compaction_daemon},
                       {status,waiting},
                       {initial_call,{proc_lib,init_p,5}},
                       {backtrace,
                        [<<"Program counter: 0x00002aaaabef2a70 (gen_server:loop/6 + 256)">>,
                         <<"CP: 0x0000000000000000 (invalid)">>,
                         <<"arity = 0">>,<<>>,
                         <<"0x00002aaabe24c5f8 Return addr 0x00002aaaabe931c8 (proc_lib:init_p_do_apply/3 + 56)">>,
                         <<"y(0) []">>,<<"y(1) infinity">>,
                         <<"y(2) couchbase_compaction_daemon">>,
                         <<"y(3) {state,<0.5085.0>}">>,
                         <<"y(4) couchbase_compaction_daemon">>,
                         <<"y(5) <0.5083.0>">>,<<>>,
                         <<"0x00002aaabe24c630 Return addr 0x000000000088e318 (<terminate process normally>)">>,
                         <<"y(0) Catch 0x00002aaaabe931e8 (proc_lib:init_p_do_apply/3 + 88)">>,
                         <<>>]},
                       {error_handler,error_handler},
                       {garbage_collection,
                        [{min_bin_vheap_size,46368},
                         {min_heap_size,233},
                         {fullsweep_after,0},
                         {minor_gcs,0}]},
                       {heap_size,987},
                       {total_heap_size,987},
                       {links,[<0.5083.0>,<0.5085.0>]},
                       {memory,8944},
                       {message_queue_len,0},
                       {reductions,2388},
                       {trap_exit,true}]},
                     {<0.5085.0>,
                      [{registered_name,[]},
                       {status,waiting},
                       {initial_call,{erlang,apply,2}},
                       {backtrace,
                        [<<"Program counter: 0x00002aaaabe73ef0 (gen:do_call/4 + 576)">>,
                         <<"CP: 0x0000000000000000 (invalid)">>,
                         <<"arity = 0">>,<<>>,
                         <<"0x00002aaabf118b68 Return addr 0x00002aaaabef1498 (gen_server:call/3 + 128)">>,
                         <<"y(0) #Ref<0.0.51.118144>">>,
                         <<"y(1) 'ns_1@10.176.215.197'">>,
                         <<"y(2) []">>,<<"y(3) infinity">>,
                         <<"y(4) {get_group_server,<<7 bytes>>,{set_view_group,<<16 bytes>>,nil,<<7 bytes>>,<<15 by">>,
                         <<"y(5) '$gen_call'">>,<<"y(6) <0.4821.0>">>,
                         <<>>,
                         <<"0x00002aaabf118ba8 Return addr 0x00002aaaafa76380 (couch_set_view:get_group_server/2 + 128)">>,
                         <<"y(0) infinity">>,
                         <<"y(1) {get_group_server,<<7 bytes>>,{set_view_group,<<16 bytes>>,nil,<<7 bytes>>,<<15 by">>,
                         <<"y(2) couch_set_view">>,
                         <<"y(3) Catch 0x00002aaaabef1498 (gen_server:call/3 + 128)">>,
                         <<>>,
                         <<"0x00002aaabf118bd0 Return addr 0x00002aaaafa76550 (couch_set_view:get_group_info/2 + 40)">>,
                         <<>>,
                         <<"0x00002aaabf118bd8 Return addr 0x00002aaaafa7f9a0 (couch_set_view:'-cleanup_index_files/1-f">>,
                         <<>>,
                         <<"0x00002aaabf118be0 Return addr 0x00002aaaabeb06c0 (lists:map/2 + 120)">>,
                         <<>>,
                         <<"0x00002aaabf118be8 Return addr 0x00002aaaafa76828 (couch_set_view:cleanup_index_files/1 + 5">>,
                         <<"y(0) #Fun<couch_set_view.0.102244014>">>,
                         <<"y(1) [{doc,<<19 bytes>>,{4,<<4 bytes>>},{[{<<5 bytes>>,{[{<<11 bytes>>,{[{<<3 bytes>>,<">>,
                         <<>>,
                         <<"0x00002aaabf118c00 Return addr 0x00002aaab0d65490 (couchbase_compaction_daemon:maybe_compac">>,
                         <<"y(0) []">>,<<"y(1) []">>,
                         <<"y(2) <<7 bytes>>">>,<<>>,
                         <<"0x00002aaabf118c20 Return addr 0x00002aaaabeb1170 (lists:foreach/2 + 120)">>,
                         <<"y(0) [<<15 bytes>>,<<19 bytes>>]">>,
                         <<"y(1) Catch 0x00002aaab0d654b0 (couchbase_compaction_daemon:maybe_compact_bucket/3 + 688">>,
                         <<"y(2) {config,30,80,nil,false,false}">>,
                         <<"y(3) [<<11 bytes>>,<<11 bytes>>,<<11 bytes>>,<<11 bytes>>,<<11 bytes>>,<<11 bytes>>,<<1">>,
                         <<"y(4) <<7 bytes>>">>,<<>>,
                         <<"0x00002aaabf118c50 Return addr 0x00002aaab0d65028 (couchbase_compaction_daemon:compact_loop">>,
                         <<"y(0) #Fun<couchbase_compaction_daemon.3.77482903>">>,
                         <<"y(1) [{<<14 bytes>>,[<<18 bytes>>,<<18 bytes>>,<<18 bytes>>,<<18 bytes>>,<<18 bytes>>,<">>,
                         <<>>,
                         <<"0x00002aaabf118c68 Return addr 0x000000000088e318 (<terminate process normally>)">>,
                         <<"y(0) []">>,<<"y(1) []">>,
                         <<"y(2) <0.5084.0>">>,<<>>]},
                       {error_handler,error_handler},
                       {garbage_collection,
                        [{min_bin_vheap_size,46368},
                         {min_heap_size,233},
                         {fullsweep_after,0},
                         {minor_gcs,0}]},
                       {heap_size,46368},
                       {total_heap_size,46368},
                       {links,[<0.5084.0>]},
                       {memory,371952},
                       {message_queue_len,0},
                       {reductions,390457},
                       {trap_exit,false}]}
Comment by damien [ 31/Jan/12 ]
It appears we have a btree-related bug. There is a badarith error in the logs that is causing the view compaction to crash. The badarith error is in couch_view_compactor:update_task/2 and I believe it is caused by division by zero, but if that happens then the indexes should be empty and update_task/2 should not be called.

The only way that seems possible is if there are values in the primary btree indexes but the row counts are 0. I believe this must be caused by the cleanup of vbucket values from the indexes, which must not be computing the reductions properly when this happens.

I believe the compactor crash then causes the couch_file for the compaction file to be leaked, which means it cannot be opened again (due to couch_file_write_guard). There is in fact a file_already_opened error in the logs, which indicates this is happening.
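
The leak described here can be illustrated with a toy model. This is a sketch under stated assumptions: `WriteGuard`, `compact`, and `second_attempt_outcome` are hypothetical Python stand-ins, not the actual couch_file_write_guard or compactor code, and the file name is made up. The model only shows why a crash that skips cleanup makes the next open fail.

```python
class FileAlreadyOpened(Exception):
    """Raised when a path is already registered as open for writing."""


class WriteGuard:
    """Toy registry that allows one writer per path (hypothetical stand-in)."""

    def __init__(self):
        self._open = set()

    def open(self, path):
        if path in self._open:
            raise FileAlreadyOpened(path)
        self._open.add(path)

    def close(self, path):
        self._open.discard(path)


def compact(guard, path, total_rows, cleanup):
    """Register the compaction file, then compute a progress step."""
    guard.open(path)
    try:
        progress = 100 // total_rows  # ZeroDivisionError when total_rows == 0
    except ZeroDivisionError:
        if cleanup:
            guard.close(path)  # release the registration before re-raising
        raise
    guard.close(path)
    return progress


def second_attempt_outcome(cleanup):
    """Crash once with a zero row count, then try to compact again."""
    guard = WriteGuard()
    try:
        compact(guard, "view.compact", 0, cleanup)
    except ZeroDivisionError:
        pass
    try:
        compact(guard, "view.compact", 10, cleanup)
        return "ok"
    except FileAlreadyOpened:
        return "file_already_opened"


print(second_attempt_outcome(cleanup=False))
print(second_attempt_outcome(cleanup=True))
```

Without cleanup the registration outlives the crashed attempt, so every retry is rejected; with cleanup the retry succeeds.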

I'm adding code to check for division by zero and exit with a diagnostic message. Reassigning to Filipe to look into the btree issue.
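
The kind of check described above could look roughly like the following. It is a sketch only, written in Python rather than Erlang: the function name `progress_percent`, its parameters, and the message wording are assumptions for illustration and do not reflect the actual change to update_task/2.

```python
def progress_percent(changes_done: int, total_changes: int) -> int:
    """Return compaction progress as a whole percentage.

    Fails with a diagnostic message instead of a bare arithmetic error
    when the total is zero, since that case is not expected to occur.
    """
    if total_changes == 0:
        raise RuntimeError(
            "progress update requested with total_changes=0 "
            f"(changes_done={changes_done}); index row counts may be inconsistent"
        )
    return changes_done * 100 // total_changes


print(progress_percent(50, 200))
try:
    progress_percent(3, 0)
except RuntimeError as err:
    print(f"diagnostic: {err}")
```

Failing with the counts in the message keeps the crash but records the state that led to it, which is what makes the next occurrence easier to diagnose.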
Comment by Filipe Manana [ 02/Feb/12 ]
Would be great if someone could repeat this test.

Neither Damien nor I know how to reproduce this or why it could happen.
The following commit will help diagnose it better the next time it happens.

https://github.com/couchbase/couchdb/commit/dd6546cad52c72421442b54eb59fe5984d913269
Comment by Steve Yen [ 03/Feb/12 ]
Please try to reproduce (with Filipe's changes).
Comment by Filipe Manana [ 07/Feb/12 ]
This is the same issue as MB-4774. One of them should be closed and marked as a duplicate.
Fix in http://review.couchbase.org/#change,13067
Comment by Filipe Manana [ 08/Feb/12 ]
Fix merged today:

https://github.com/couchbase/couchdb/commit/6319846fa68c73580e5ead96dbe27868447f730f
Generated at Tue Sep 23 10:22:21 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.