[MB-4849] Server Crash - {write_loop_died,{badmatch,{error,enospc}}} Created: 29/Feb/12  Updated: 09/Jan/13  Resolved: 17/Apr/12

Status: Closed
Project: Couchbase Server
Component/s: view-engine
Affects Version/s: 2.0-developer-preview-4
Fix Version/s: 2.0-beta
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Tommie McAfee Assignee: damien
Resolution: Won't Fix Votes: 0
Labels: 2.0-dev-preview-4-release-notes
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: dp4 build 717
3 node cluster
5 million docs
20 ddocs (1 view each, generic emit-all map functions)

Attachments: File 10.2.2.31_errors.1     GZip Archive diags.tar.gz     PNG File Screen Shot 2012-02-29 at 10.06.58 AM.png    

 Description   
Looks like a cluster I left to create replica indexes overnight has crashed. At the time of the crash an empty MnesiaCore file was created, and attempts to restart the couchbase service create an empty erl_crash.dump. Excerpt from the error log below, with diags attached:


[error_logger:error] [2012-02-28 22:40:00] [ns_1@10.2.2.32:error_logger:ale_error_logger_handler:log_msg:76] ** Generic server <0.29993.3> terminating
** Last message in was {'EXIT',<0.29996.3>,{badmatch,{error,enospc}}}
** When Server state == {file,<0.29995.3>,<0.29996.3>,15623309}
** Reason for termination ==
** {write_loop_died,{badmatch,{error,enospc}}}

[error_logger:error] [2012-02-28 22:40:00] [ns_1@10.2.2.32:error_logger:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
  crasher:
    initial call: couch_file:init/1
    pid: <0.29993.3>
    registered_name: []
    exception exit: {write_loop_died,{badmatch,{error,enospc}}}
      in function gen_server:terminate/6
      in call from couch_file:init/1
    ancestors: [<0.29990.3>,<0.29975.3>,<0.29974.3>]
    messages: [{'$gen_call',


 Comments   
Comment by Aleksey Kondratenko [ 29/Feb/12 ]
No space means no space. There's not much you can do when you exhaust FS space.
Comment by Tommie McAfee [ 29/Feb/12 ]
Did compaction fail?

The cluster has 120 GB of space.
Comment by Tommie McAfee [ 29/Feb/12 ]
Spoke with Filipe about this; he explained that compaction doesn't start until indexing finishes.
So my couch data disk size is 3.5 GB, and Couchbase is going to try to create main and replica index files for 20 views.
In the worst case (if I re-emit the entire db), my cluster would have to reserve an extra 140 GB (3.5 GB * 40) for queries.
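
The worst-case estimate above can be written out as a small calculation. This is only a sketch of the arithmetic in this comment, in Python for illustration; the function name and parameters are hypothetical and not part of Couchbase, and it assumes every index file (main and replica) can grow to the full data size before compaction runs.

```python
def worst_case_index_overhead_gb(data_size_gb, num_views, copies_per_view=2):
    """Worst-case extra disk space for index files, in GB.

    Assumes each view re-emits the entire data set and that each view
    has `copies_per_view` index files (main + replica = 2).
    """
    if data_size_gb < 0 or num_views < 0 or copies_per_view < 0:
        raise ValueError("sizes and counts must be non-negative")
    return data_size_gb * num_views * copies_per_view


# Figures from this ticket: 3.5 GB of data, 20 views, main + replica.
overhead = worst_case_index_overhead_gb(3.5, 20)
print(overhead)        # 140.0
print(overhead > 120)  # True: more than the 120 GB mentioned earlier
```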

Filipe says it's possible to implement some sort of incremental compaction, or possibly to give compaction threads priority when necessary.

Comment by Tommie McAfee [ 29/Feb/12 ]
UI view of disk usage overhead
Comment by damien [ 13/Apr/12 ]
Filipe, can you look at this? If this bug is invalid or a limitation, just mark it as Won't Fix with a small explanation.
Comment by Filipe Manana [ 16/Apr/12 ]
Unfortunately, once we run out of disk space, we can't query views with ?stale=ok or ?stale=update_after (default).
We get a file_error from within ns_server (somewhere in the HTTP handlers / ALE logger):

$ curl 'http://localhost:9500/default/_design/test/_view/view1?limit=10'
{"error":"badmatch","reason":"{error,{file_error,\"logs/n_0/log\",enospc}}"}

The relevant full stack trace:

[menelaus:warn] [2012-04-16 15:10:51] [n_0@192.168.1.80:<0.29010.0>:menelaus_web:loop:358] Server error during processing: ["web request failed",
                                 {path,"/pools/default"},
                                 {type,error},
                                 {what,function_clause},
                                 {trace,
                                  [{menelaus_stats,
                                    '-invoke_archiver/3-lc$^0/1-0-',
                                    [{'EXIT',
                                      {{badmatch,
                                        {error,
                                         {file_error,"logs/n_0/log",enospc}}},
                                       [{'ale_logger-stats',error,5},
                                        {stats_reader,latest,4},
                                        {menelaus_stats,invoke_archiver,3},
                                        {menelaus_stats,last_membase_sample,2},
                                        {menelaus_stats,last_bucket_stats,3},
                                        {menelaus_stats,basic_stats,3},
                                        {ns_storage_conf,
                                         '-do_cluster_storage_info/1-fun-2-',
                                         3},
                                        {lists,foldl,3}]}}]},
                                   {menelaus_stats,last_membase_sample,2},
                                   {menelaus_stats,last_bucket_stats,3},
                                   {menelaus_stats,basic_stats,3},
                                   {ns_storage_conf,
                                    '-do_cluster_storage_info/1-fun-2-',3},
                                   {lists,foldl,3},
                                   {ns_storage_conf,do_cluster_storage_info,1},
                                   {menelaus_web,build_pool_info,4}]}]

Technically the view engine is capable of serving queries with stale=ok|update_after if there's no space left on disk, as long as the logger doesn't crash when there's no disk space left.
Queries with ?stale=false will always get an error mentioning the posix error code 'enospc'.
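
The condition "as long as the logger doesn't crash" can be illustrated with a small sketch. This is not how ns_server or the ALE logger is implemented; it is a Python illustration of the idea, with hypothetical names, in which a failed log write caused by ENOSPC is dropped instead of propagating into the request handler.

```python
import errno


def log_without_crashing(write, message):
    """Try to write a log message; tolerate a full disk.

    Returns True if the message was written and False if it was dropped
    because the disk is full. Any other error is re-raised.
    """
    try:
        write(message)
        return True
    except OSError as exc:
        if exc.errno == errno.ENOSPC:
            return False  # drop the log line, keep serving the request
        raise


def full_disk_write(_message):
    # Stand-in for a log sink on a full file system.
    raise OSError(errno.ENOSPC, "No space left on device")


print(log_without_crashing(full_disk_write, "stats sample"))  # False
```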
Comment by Filipe Manana [ 16/Apr/12 ]
Damien, what's your call?
Comment by damien [ 17/Apr/12 ]
Running out of disk space shouldn't cause corruption, but other than that we cannot do anything. A possible future feature is an admin function to purge all indexes and have them rebuilt from scratch, but that would take a lot of disk IO and could cause application downtime, and the situation might be more easily resolved in another way by the administrator.
Generated at Sun Sep 21 12:20:49 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.