[MB-4732] Compaction seems to be stuck (or not running) Created: 31/Jan/12 Updated: 10/Apr/12 Resolved: 01/Mar/12 |
|
| Status: | Closed |
| Project: | Couchbase Server |
| Component/s: | couchbase-bucket |
| Affects Version/s: | 2.0-developer-preview-4 |
| Fix Version/s: | 2.0-developer-preview-4 |
| Security Level: | Public |
| Type: | Bug | Priority: | Major |
| Reporter: | Tommie McAfee | Assignee: | Tommie McAfee |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | 2.0-dev-preview-4-release-notes | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | I have an 8 node cluster with disk size much larger than data. | ||
| Attachments: |
|
| Description |
|
Quoting Sharon:
"Troubleshooting, I found many nodes where disk size was 4 times greater than on other nodes. Looking at one of these nodes where data is not compacted, compaction seems to be stuck.
http://50.18.98.4:8092/default%2F101
{"db_name":"default/101","doc_count":1807,"doc_del_count":0,"update_seq":2986,"purge_seq":0,"compact_running":false,"disk_size":4452469,"data_size":922673,"instance_start_time":"1328040896372522","disk_format_version":7,"committed_update_seq":2985}
Cluster is at http://50.18.98.4:8091 (Administrator/password)"
Quoting Aliaksey: The cause of the compaction daemon hang is the same as that of the view hangs, so this is generally the same bug. |
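To make the numbers in the quoted database info easier to read, here is a minimal Python sketch that computes how much of the file is reclaimable space. The `fragmentation` helper and the formula `(disk_size - data_size) / disk_size` are assumptions made for illustration; the ticket does not show how the compaction daemon itself decides when to compact.

```python
import json

# Database info for default/101, copied from the report above.
DB_INFO = (
    '{"db_name":"default/101","doc_count":1807,"doc_del_count":0,'
    '"update_seq":2986,"purge_seq":0,"compact_running":false,'
    '"disk_size":4452469,"data_size":922673,'
    '"instance_start_time":"1328040896372522",'
    '"disk_format_version":7,"committed_update_seq":2985}'
)


def fragmentation(info: dict) -> float:
    """Fraction of the on-disk file not occupied by live data.

    Hypothetical helper: assumes fragmentation is measured as
    (disk_size - data_size) / disk_size.
    """
    disk = info["disk_size"]
    data = info["data_size"]
    if disk <= 0:
        # An empty or missing file has nothing to reclaim.
        return 0.0
    return (disk - data) / disk


info = json.loads(DB_INFO)
print(f"compact_running: {info['compact_running']}")
print(f"fragmentation:   {fragmentation(info):.1%}")
print(f"disk/data ratio: {info['disk_size'] / info['data_size']:.2f}")
```

For this vbucket the file is mostly reclaimable space while `compact_running` is false, which is consistent with compaction not making progress on this node.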
| Comments |
| Comment by Aliaksey Artamonau [ 31/Jan/12 ] |
|
Compaction daemon processes' backtraces:
{<0.5083.0>, [{registered_name,[]}, {status,waiting}, {initial_call,{proc_lib,init_p,5}}, {backtrace, [<<"Program counter: 0x00002aaaabef2a70 (gen_server:loop/6 + 256)">>, <<"CP: 0x0000000000000000 (invalid)">>, <<"arity = 0">>,<<>>, <<"0x00002aaaad443728 Return addr 0x00002aaaabe931c8 (proc_lib:init_p_do_apply/3 + 56)">>, <<"y(0) []">>,<<"y(1) infinity">>, <<"y(2) supervisor_cushion">>, <<"y(3) {state,couchbase_compaction_daemon,3000,{1328,40081,401252},<0.5084.0>}">>, <<"y(4) <0.5083.0>">>,<<"y(5) <0.4997.0>">>, <<>>, <<"0x00002aaaad443760 Return addr 0x000000000088e318 (<terminate process normally>)">>, <<"y(0) Catch 0x00002aaaabe931e8 (proc_lib:init_p_do_apply/3 + 88)">>, <<>>]}, {error_handler,error_handler}, {garbage_collection, [{min_bin_vheap_size,46368}, {min_heap_size,233}, {fullsweep_after,0}, {minor_gcs,0}]}, {heap_size,233}, {total_heap_size,233}, {links,[<0.4997.0>,<0.5084.0>]}, {memory,2840}, {message_queue_len,0}, {reductions,75}, {trap_exit,true}]}, {<0.5084.0>, [{registered_name,couchbase_compaction_daemon}, {status,waiting}, {initial_call,{proc_lib,init_p,5}}, {backtrace, [<<"Program counter: 0x00002aaaabef2a70 (gen_server:loop/6 + 256)">>, <<"CP: 0x0000000000000000 (invalid)">>, <<"arity = 0">>,<<>>, <<"0x00002aaabe24c5f8 Return addr 0x00002aaaabe931c8 (proc_lib:init_p_do_apply/3 + 56)">>, <<"y(0) []">>,<<"y(1) infinity">>, <<"y(2) couchbase_compaction_daemon">>, <<"y(3) {state,<0.5085.0>}">>, <<"y(4) couchbase_compaction_daemon">>, <<"y(5) <0.5083.0>">>,<<>>, <<"0x00002aaabe24c630 Return addr 0x000000000088e318 (<terminate process normally>)">>, <<"y(0) Catch 0x00002aaaabe931e8 (proc_lib:init_p_do_apply/3 + 88)">>, <<>>]}, {error_handler,error_handler}, {garbage_collection, [{min_bin_vheap_size,46368}, {min_heap_size,233}, {fullsweep_after,0}, {minor_gcs,0}]}, {heap_size,987}, {total_heap_size,987}, {links,[<0.5083.0>,<0.5085.0>]}, {memory,8944}, {message_queue_len,0}, {reductions,2388}, {trap_exit,true}]}, {<0.5085.0>, [{registered_name,[]}, 
 {status,waiting}, {initial_call,{erlang,apply,2}}, {backtrace, [<<"Program counter: 0x00002aaaabe73ef0 (gen:do_call/4 + 576)">>, <<"CP: 0x0000000000000000 (invalid)">>, <<"arity = 0">>,<<>>, <<"0x00002aaabf118b68 Return addr 0x00002aaaabef1498 (gen_server:call/3 + 128)">>, <<"y(0) #Ref<0.0.51.118144>">>, <<"y(1) 'ns_1@10.176.215.197'">>, <<"y(2) []">>,<<"y(3) infinity">>, <<"y(4) {get_group_server,<<7 bytes>>,{set_view_group,<<16 bytes>>,nil,<<7 bytes>>,<<15 by">>, <<"y(5) '$gen_call'">>,<<"y(6) <0.4821.0>">>, <<>>, <<"0x00002aaabf118ba8 Return addr 0x00002aaaafa76380 (couch_set_view:get_group_server/2 + 128)">>, <<"y(0) infinity">>, <<"y(1) {get_group_server,<<7 bytes>>,{set_view_group,<<16 bytes>>,nil,<<7 bytes>>,<<15 by">>, <<"y(2) couch_set_view">>, <<"y(3) Catch 0x00002aaaabef1498 (gen_server:call/3 + 128)">>, <<>>, <<"0x00002aaabf118bd0 Return addr 0x00002aaaafa76550 (couch_set_view:get_group_info/2 + 40)">>, <<>>, <<"0x00002aaabf118bd8 Return addr 0x00002aaaafa7f9a0 (couch_set_view:'-cleanup_index_files/1-f">>, <<>>, <<"0x00002aaabf118be0 Return addr 0x00002aaaabeb06c0 (lists:map/2 + 120)">>, <<>>, <<"0x00002aaabf118be8 Return addr 0x00002aaaafa76828 (couch_set_view:cleanup_index_files/1 + 5">>, <<"y(0) #Fun<couch_set_view.0.102244014>">>, <<"y(1) [{doc,<<19 bytes>>,{4,<<4 bytes>>},{[{<<5 bytes>>,{[{<<11 bytes>>,{[{<<3 bytes>>,<">>, <<>>, <<"0x00002aaabf118c00 Return addr 0x00002aaab0d65490 (couchbase_compaction_daemon:maybe_compac">>, <<"y(0) []">>,<<"y(1) []">>, <<"y(2) <<7 bytes>>">>,<<>>, <<"0x00002aaabf118c20 Return addr 0x00002aaaabeb1170 (lists:foreach/2 + 120)">>, <<"y(0) [<<15 bytes>>,<<19 bytes>>]">>, <<"y(1) Catch 0x00002aaab0d654b0 (couchbase_compaction_daemon:maybe_compact_bucket/3 + 688">>, <<"y(2) {config,30,80,nil,false,false}">>, <<"y(3) [<<11 bytes>>,<<11 bytes>>,<<11 bytes>>,<<11 bytes>>,<<11 bytes>>,<<11 bytes>>,<<1">>, <<"y(4) <<7 bytes>>">>,<<>>, <<"0x00002aaabf118c50 Return addr 0x00002aaab0d65028 (couchbase_compaction_daemon:compact_loop">>, 
<<"y(0) #Fun<couchbase_compaction_daemon.3.77482903>">>, <<"y(1) [{<<14 bytes>>,[<<18 bytes>>,<<18 bytes>>,<<18 bytes>>,<<18 bytes>>,<<18 bytes>>,<">>, <<>>, <<"0x00002aaabf118c68 Return addr 0x000000000088e318 (<terminate process normally>)">>, <<"y(0) []">>,<<"y(1) []">>, <<"y(2) <0.5084.0>">>,<<>>]}, {error_handler,error_handler}, {garbage_collection, [{min_bin_vheap_size,46368}, {min_heap_size,233}, {fullsweep_after,0}, {minor_gcs,0}]}, {heap_size,46368}, {total_heap_size,46368}, {links,[<0.5084.0>]}, {memory,371952}, {message_queue_len,0}, {reductions,390457}, {trap_exit,false}]} |
| Comment by Damien Katz [ 31/Jan/12 ] |
|
It appears we have a btree-related bug. There is a badarith error in the logs that is causing the view compaction to crash. The badarith error is in couch_view_compactor:update_task/2 and I believe it is caused by division by zero, but if that were the case the indexes should be empty and update_task/2 should not be called.
The only way that seems possible is if there are values in the primary btree indexes but the row counts are 0. I believe this must be caused by the cleaning of vbucket values from the indexes, which must not be computing the reductions properly when this happens. I believe the compactor crash then causes the couch_file for the compaction file to be leaked, which means it cannot be opened again (due to couch_file_write_guard). There is in fact a file_already_opened error in the logs, which indicates this is happening. I'm adding code to check for division by zero and exit with a diagnostic message. Reassigning to Filipe to look into the btree issue. |
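The kind of check described above can be sketched as follows. This is an illustrative Python analogue, not the actual Erlang change: `progress_percent`, `CompactionProgressError`, and the parameter names are hypothetical, and the real couch_view_compactor:update_task/2 may compute progress differently.

```python
class CompactionProgressError(Exception):
    """Raised when progress cannot be computed from the reported counts."""


def progress_percent(rows_copied: int, total_rows: int) -> int:
    """Return compaction progress as a whole percentage.

    Hypothetical helper. Instead of letting a zero row count surface as a
    bare arithmetic error, fail with a message that records both counts,
    so the inconsistent index state is visible in the logs.
    """
    if total_rows == 0:
        raise CompactionProgressError(
            "row count is 0 but progress update was requested "
            f"(rows_copied={rows_copied}, total_rows={total_rows})"
        )
    return (rows_copied * 100) // total_rows


if __name__ == "__main__":
    print(progress_percent(50, 200))
    try:
        progress_percent(3, 0)
    except CompactionProgressError as err:
        print(f"diagnostic: {err}")
```

Raising a dedicated error keeps the failure explicit at the point where the counts disagree, rather than leaving a generic arithmetic crash to be interpreted after the fact.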
| Comment by Filipe Manana [ 02/Feb/12 ] |
|
Would be great if someone could repeat this test.
Neither Damien nor I know how to reproduce this, nor why it could happen. The following commit will help diagnose it better the next time it happens. https://github.com/couchbase/couchdb/commit/dd6546cad52c72421442b54eb59fe5984d913269 |
| Comment by Steve Yen [ 03/Feb/12 ] |
| please try to reproduce (with Filipe's changes) |
| Comment by Filipe Manana [ 07/Feb/12 ] |
|
This is the same issue as the one fixed in http://review.couchbase.org/#change,13067 |
| Comment by Filipe Manana [ 08/Feb/12 ] |
|
Fix merged today:
https://github.com/couchbase/couchdb/commit/6319846fa68c73580e5ead96dbe27868447f730f |