Membase keeps crashing - How to get logs?
Hi,
We are having a problem where Membase Server (Red Hat 64-bit, 1.7) stops responding under high load (network load ~10Mb/sec) and ends up in a totally unrecoverable state. Only uninstalling and reinstalling helps.
Running 'service membase-server restart' doesn't help.
Also, it looks like the log locations changed in 1.7.
Does anybody know how to get the logs?
Thanks,
Yaroslav
And if that doesn't work (maybe REST is not operational), you can always use /opt/membase/bin/mbbrowse_logs. It writes the logs to standard output, so feel free to redirect and compress the output before sending.
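A minimal Python sketch of the "redirect it and compress" step, in case it saves someone some typing. It only assumes that the tool at the path quoted above writes the logs to standard output when run without arguments; the helper name `dump_logs_compressed` and the output file name are made up for illustration.

```python
import gzip
import shutil
import subprocess

# Path quoted in the post above; adjust it if your install differs.
MBBROWSE_LOGS = "/opt/membase/bin/mbbrowse_logs"


def dump_logs_compressed(out_path, command=(MBBROWSE_LOGS,)):
    """Run `command` and stream its standard output into a gzip file.

    `dump_logs_compressed` is a hypothetical helper, not part of Membase.
    """
    proc = subprocess.Popen(list(command), stdout=subprocess.PIPE)
    with gzip.open(out_path, "wb") as out:
        # Copy in chunks so a large log never has to fit in memory.
        shutil.copyfileobj(proc.stdout, out)
    proc.stdout.close()
    status = proc.wait()
    if status != 0:
        raise RuntimeError("%s exited with status %d" % (command[0], status))
    return out_path
```

Calling `dump_logs_compressed("membase_logs.txt.gz")` on the affected node should leave a compressed file that is small enough to attach or upload.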
We're very interested in investigating your case, especially if you find any mentions of mnesia crashes in the logs.
Thanks for the help. I got the ERROR REPORT.
Do you know where the problem could be?
INFO REPORT <5865.3989.0> 2011-06-14 14:53:17
===============================================================================
ns_1@127.0.0.1:<5865.3989.0>:ns_doctor:86: Current node statuses:
[{'ns_1@127.0.0.1',
[{last_heard,{1308,81195,893808}},
{active_buckets,
["2024533825","526073312","2065562229","450510688","961970259",
"1312345239","372087324","1289532462","689367529","1270610252",
"611875953","925308374","150849161","274420529","293367125",
"2040537661","892908939","78977041","347778411","1278521626",
"807741344","1433124005","178921995","1870259058","2065552831",
"629384034","795896643","1705438677","1992160224","1520582243",
"2065533328","1433748454","1382230166","800866463","1579860062",
"1931124676","36686171","1738060323","1426238935","11511542",
"166701701","218675437","1564600726","1271712315","668440688",
"1294169211","611973538","1427788785","1225587272","1551454742",
"942932103","2124042833","1402309635","1885026157","1261449675",
"863932940","487637568","1798100439","1884585055","1107762095",
"1338314146","973241774","599711141","1205997512","1722341718",
"670784308","2065470006","315940234","2004215042","961052648",
"766990485","669985598","1945034342","876407633","183468321",
"542393271","411340964","1507667825","1610116869","1703377065",
"1282566341","538975779","1055286788","204557393","1420013172",
"1427846795","1281898772","1056109004","1198134579","918644692",
"938247851","303638623","2065470008","1397515540","300598727",
"286431701","223933190","2008531099","1813031363","test",
"2121832081","2065348094","314839826","2065359214","996086360",
"919165340","1809897274","1427720965","1108828915","1586152628",
"2065338598","1437800007","293192818","1240174047","701827162",
"959268259","1984424065","1713732086","1980383565","1197171018",
"2138968960","586795174","1904430191","1014193498","2140093275",
"1804167506","1647165803","390529037","2113147203","948940135",
"653368563","70292529","2140694356","1384633945","321228108",
"199570422","1161438366","2079752159","793316311","1638472500",
"930217798"]},
{ready_buckets,
["2024533825","526073312","2065562229","450510688","961970259",
"1312345239","372087324","1289532462","689367529","1270610252",
"611875953","925308374","150849161","274420529","293367125",
"2040537661","892908939","78977041","347778411","1278521626",
"807741344","1433124005","178921995","1870259058","2065552831",
"629384034","795896643","1705438677","1992160224","1520582243",
"2065533328","1433748454","1382230166","800866463","1579860062",
"1931124676","36686171","1738060323","1426238935","11511542",
"166701701","218675437","1564600726","1271712315","668440688",
"1294169211","611973538","1427788785","1225587272","1551454742",
"942932103","2124042833","1402309635","1885026157","1261449675",
"863932940","487637568","1798100439","1884585055","1107762095",
"1338314146","973241774","599711141","1205997512","1722341718",
"670784308","2065470006","315940234","2004215042","961052648",
"766990485","669985598","1945034342","876407633","183468321",
"542393271","411340964","1507667825","1610116869","1703377065",
"1282566341","538975779","1055286788","204557393","1420013172",
"1427846795","1281898772","1056109004","1198134579","918644692",
"938247851","303638623","2065470008","1397515540","300598727",
"286431701","223933190","2008531099","1813031363","test",
"2121832081","2065348094","314839826","2065359214","996086360",
"919165340","1809897274","1427720965","1108828915","1586152628",
"2065338598","1437800007","293192818","1240174047","701827162",
"959268259","1984424065","1713732086","1980383565","1197171018",
"2138968960","586795174","1904430191","1014193498","2140093275",
"1804167506","1647165803","390529037","2113147203","948940135",
"653368563","70292529","2140694356","1384633945","321228108",
"199570422","1161438366","2079752159","793316311","1638472500",
"930217798"]},
{replication,
[{"2065470006",1.0},
{"689367529",1.0},
{"1433124005",1.0},
{"938247851",1.0},
{"183468321",1.0},
{"1426238935",1.0},
{"1722341718",1.0},
{"1804167506",1.0},
{"2008531099",1.0},
{"1437800007",1.0},
{"411340964",1.0},
{"1108828915",1.0},
{"1271712315",1.0},
{"293192818",1.0},
{"1884585055",1.0},
{"793316311",1.0},
{"538975779",1.0},
{"300598727",1.0},
{"2079752159",1.0},
{"166701701",1.0},
{"487637568",1.0},
{"1384633945",1.0},
{"390529037",1.0},
{"653368563",1.0},
{"1382230166",1.0},
{"178921995",1.0},
{"2065533328",1.0},
{"1610116869",1.0},
{"1507667825",1.0},
{"2121832081",1.0},
{"1198134579",1.0},
{"925308374",1.0},
{"863932940",1.0},
{"919165340",1.0},
{"1703377065",1.0},
{"1240174047",1.0},
{"347778411",1.0},
{"1312345239",1.0},
{"1270610252",1.0},
{"372087324",1.0},
{"2124042833",1.0},
{"1278521626",1.0},
{"1225587272",1.0},
{"78977041",1.0},
{"942932103",1.0},
{"800866463",1.0},
{"996086360",1.0},
{"1579860062",1.0},
{"1564600726",1.0},
{"303638623",1.0},
{"599711141",1.0},
{"1980383565",1.0},
{"2024533825",1.0},
{"807741344",1.0},
{"2065348094",1.0},
{"961052648",1.0},
{"1205997512",1.0},
{"2065470008",1.0},
{"959268259",1.0},
{"1870259058",1.0},
{"204557393",1.0},
{"293367125",1.0},
{"1931124676",1.0},
{"450510688",1.0},
{"314839826",1.0},
{"668440688",1.0},
{"150849161",1.0},
{"315940234",1.0},
{"1282566341",1.0},
{"286431701",1.0},
{"1294169211",1.0},
{"199570422",1.0},
{"2065552831",1.0},
{"218675437",1.0},
{"223933190",1.0},
{"321228108",1.0},
{"1056109004",1.0},
{"2113147203",1.0},
{"961970259",1.0},
{"948940135",1.0},
{"1289532462",1.0},
{"670784308",1.0},
{"1647165803",1.0},
{"1809897274",1.0},
{"1904430191",1.0},
{"1520582243",1.0},
{"2138968960",1.0},
{"1055286788",1.0},
{"1705438677",1.0},
{"1638472500",1.0},
{"1586152628",1.0},
{"1992160224",1.0},
{"2065359214",1.0},
{"274420529",1.0},
{"2140093275",1.0},
{"1427720965",1.0},
{"542393271",1.0},
{"1885026157",1.0},
{"930217798",1.0},
{"1402309635",1.0},
{"1713732086",1.0},
{"629384034",1.0},
{"11511542",1.0},
{"36686171",1.0},
{"2065562229",1.0},
{"1945034342",1.0},
{"1107762095",1.0},
{"1397515540",1.0},
{"892908939",1.0},
{"1420013172",1.0},
{"701827162",1.0},
{"766990485",1.0},
{"1813031363",1.0},
{"1738060323",1.0},
{"1014193498",1.0},
{"2140694356",1.0},
{"586795174",1.0},
{"1427846795",1.0},
{"1984424065",1.0},
{"1338314146",1.0},
{"1798100439",1.0},
{"2040537661",1.0},
{"2004215042",1.0},
{"918644692",1.0},
{"876407633",1.0},
{"2065338598",1.0},
{"1197171018",1.0},
{"669985598",1.0},
{"1427788785",1.0},
{"1281898772",1.0},
{"973241774",1.0},
{"611875953",1.0},
{"526073312",1.0},
{"1551454742",1.0},
{"795896643",1.0},
{"1261449675",1.0},
{"1161438366",1.0},
{"611973538",1.0},
{"70292529",1.0},
{"1433748454",1.0},
{"test",1.0}]},
{memory,
[{total,4535272224},
{processes,4396935040},
{processes_used,4396457472},
{system,138337184},
{atom,950561},
{atom_used,917759},
{binary,10837208},
{code,7644730},
{ets,100805832}]},
{system_stats,
[{cpu_utilization_rate,97.72893772893772},
{swap_total,2146787328},
{swap_used,0}]},
{interesting_stats,
[{curr_items,0},{curr_items_tot,0},{vb_replica_curr_items,0}]},
{cluster_compatibility_version,1},
{version,
[{os_mon,"2.2.5"},
{mnesia,"4.4.17"},
{kernel,"2.14.3"},
{sasl,"2.1.9.3"},
{ns_server,"1.7.0"},
{stdlib,"1.17.3"}]},
{system_arch,"x86_64-unknown-linux-gnu"},
{wall_clock,61},
{memory_data,{16892911616,9545502720,{<5865.4004.0>,39954920}}},
{disk_data,
[{"/",136124392,4},{"/boot",101086,19},{"/dev/shm",8248492,0}]},
{meminfo,
<<"MemTotal: 16496984 kB\nMemFree: 7569012 kB\nBuffers: 4840008 kB\nCached: 1238144 kB\nSwapCached: 0 kB\nActive: 3367848 kB\nInactive: 5263676 kB\nHighTotal: 0 kB\nHighFree: 0 kB\nLowTotal: 16496984 kB\nLowFree: 7569012 kB\nSwapTotal: 2096472 kB\nSwapFree: 2096472 kB\nDirty: 42616 kB\nWriteback: 0 kB\nAnonPages: 2553460 kB\nMapped: 54596 kB\nSlab: 249188 kB\nPageTables: 15280 kB\nNFS_Unstable: 0 kB\nBounce: 0 kB\nCommitLimit: 10344964 kB\nCommitted_AS: 5653140 kB\nVmallocTotal: 34359738367 kB\nVmallocUsed: 264000 kB\nVmallocChunk: 34359474147 kB\nHugePages_Total: 0\nHugePages_Free: 0\nHugePages_Rsvd: 0\nHugepagesize: 2048 kB\n">>},
{system_memory_data,
[{system_total_memory,16892911616},
{free_swap,2146787328},
{total_swap,2146787328},
{cached_memory,1263378432},
{buffered_memory,4956164096},
{free_memory,7923769344},
{total_memory,16892911616}]},
{statistics,
[{wall_clock,{53321,1}},
{context_switches,{2223914,0}},
{garbage_collection,{135331,1086465720,0}},
{io,{{input,126544095},{output,86911502}}},
{reductions,{287781019,55981197}},
{run_queue,6},
{runtime,{378050,70490}}]}]}]
ERROR REPORT <5865.66.0> 2011-06-14 14:53:17
===============================================================================
ns_1@127.0.0.1:<5865.66.0>:mb_mnesia:176: Mnesia detected overload during dump_log because of write_threshold
ERROR REPORT <5865.72.0> 2011-06-14 14:53:17
===============================================================================
Mnesia('ns_1@127.0.0.1'): ** WARNING ** Mnesia is overloaded: {dump_log,
write_threshold}
ERROR REPORT <5865.4650.0> 2011-06-14 14:53:20
===============================================================================
ns_1@127.0.0.1:<5865.4650.0>:stats_collector:121: Dropped 1 ticks
Hi,
This is how it has started:
CRASH REPORT <5865.31444.86> 2011-06-13 20:19:56
===============================================================================
Crashing process
initial_call {ns_janitor,cleanup,['Argument__1']}
pid <5865.31444.86>
registered_name []
error_info
{error,badarg,
[{erlang,hd,[[]]},
{mb_map,balance,3},
{ns_janitor,cleanup,1},
{proc_lib,init_p_do_apply,3}]}
ancestors
[<5865.187.0>,mb_master_sup,mb_master,ns_server_sup,
ns_server_cluster_sup,<5865.51.0>]
messages []
links [<5865.187.0>]
dictionary []
trap_exit false
status running
heap_size 17711
stack_size 24
reductions 1359
INFO REPORT <5865.187.0> 2011-06-13 20:19:56
===============================================================================
ns_1@127.0.0.1:<5865.187.0>:ns_orchestrator:178: Janitor run exited for bucket "611973538" with reason badarg
Please help if anybody knows what is going on.
Thanks,
Yaroslav
I have filed a bug for this issue (MB-3982). We will have an engineer look into the issue and follow up with you.
Yaroslav, I think the issue here has to do with how many buckets you have configured. We do have some known issues with supporting large numbers of buckets. Would it be possible to try the system with fewer than 10 buckets and ensure that it works properly for you?
Perry
Hi,
We have one bucket per website and use memcached buckets only. We use one bucket per site mostly because it's the only way to invalidate all keys that belong to one site. Otherwise we would need to keep the site <-> bucket/keys relationships somewhere else and sync them every time something changes.
The documentation says that we could have up to 1024 buckets, so we assumed that a few hundred would be just fine. Is this a limitation of Membase Server or of memcached?
Would it be possible to tune some configuration parameters to fix this issue?
I saw some suggestions at http://streamhacker.com/2008/12/10/how-to-eliminate-mnesia-overload-events/. Are they relevant?
Thanks,
Yaroslav
Hi Perry,
We can't even try setting up 10 buckets now because the server is down. I guess this is the biggest problem: the Membase server crashed unrecoverably. It wouldn't be a huge problem if it worked again after a service restart.
We could try to set up only 10 buckets, but then we need to reinstall the packages, and the current service state will be lost.
Do you need any data before we uninstall it?
Thanks,
Yaroslav
Hi Yaroslav,
Being scared by your post about Membase crashing, I did the following. I set up a cluster of two machines and put a little load on it: about 1 000 000 000 get-sets over 10 hours. For me everything runs smoothly. I use Ubuntu 10.10 Server.
Could you please describe your environment so people will know where the problem could arise?
If you can send us the output of /opt/membase/bin/mbbrowse_logs from the node that fails to start, that'll help.
Hi,
We run just one node on a dedicated Dell PowerEdge 2970 server:
Dell Memory: 16 GB DELL RAM, GB Memory: 16
Dell Servers: Dual Socket Quad Core AMD Opteron 2374HE 2.2 GHz, #Processors: 2, #Cores per Proc: 4
Hard Drive: 146GB SAS 15K RPM Drive, HDD RPM: 15000, GB Hard Drive: 146
Hard Drive: 146GB SAS 15K RPM Drive, HDD RPM: 15000, GB Hard Drive: 146
Hard Drive Size: 3.5 in. Hard Drives
IP Allocation: 1 IP, # IPs: 1
Linux OS: Red Hat Enterprise Linux 5 - 64 bit
RAID Configuration: RAID 1
advanced_networking: 1000Mb Port
Antivirus: Sophos
Membase Server Community Edition: 64-bit Red Hat Linux (RPM) 1.7
Thanks,
Yaroslav
Hi,
I have uploaded the full report to https://rapidshare.com/files/2809827931/browse_logs.log
Please let me know if you need anything else.
Thanks,
Yaroslav
Thanks Yaroslav, I'll take a look at those, but I'm pretty sure you'll have to reinstall the software in order to properly test it.
I believe you may have misread the documentation. For Membase buckets, we create 1024 "vbuckets", which are the underlying data structures used for "auto-sharding" and rebalancing.
As I mentioned, we have some known issues with supporting large numbers of buckets, and I think even a few hundred would be quite problematic. We've identified some areas for optimization/improvement, but at this time the software is not going to work well for you like that.
In terms of being able to invalidate the items for specific datasets (websites) you might consider using key versioning instead:
-Basically, you have a "version" key for a website, let's call it "website1_version" and set it to "0"
-In your app code, you can do a get on this key first to retrieve the "0"
-When accessing/creating a particular key for this website, simply add the version field to the keyname: "website1_users_". At first, that will be "website1_users_0"
-Now, when you want to "flush" that particular website, you can simply increment the version key to "1".
-When you go to access "website1_users_1" the key won't exist, and your application will regenerate it like you would anyway
While this does create a "second hop" for getting to the actual data, memcached is so fast that it really doesn't matter. This method is used almost across the board even for very large websites in order to be very granular in their control and versioning of keys.
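The steps above can be sketched in Python roughly as follows. This is only an illustration: `DictCache` is a made-up in-memory stand-in for a real memcached client (only `get`, `set` and `incr` are modelled), and the helper names are hypothetical. The key names follow the "website1_version" / "website1_users_0" pattern described above.

```python
class DictCache:
    """Tiny in-memory stand-in for a memcached client (illustration only)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value

    def incr(self, key, delta=1):
        self._data[key] = int(self._data[key]) + delta
        return self._data[key]


def current_version(cache, site):
    """Read the site's version key, creating it as 0 on first use."""
    version = cache.get(site + "_version")
    if version is None:
        version = 0
        cache.set(site + "_version", version)
    return int(version)


def versioned_key(cache, site, name):
    """Build the real key name, e.g. 'website1_users_0'."""
    return "%s_%s_%d" % (site, name, current_version(cache, site))


def flush_site(cache, site):
    """'Flush' one site by bumping its version; old keys are never read again."""
    current_version(cache, site)  # make sure the counter exists before incr
    return cache.incr(site + "_version")


cache = DictCache()
key = versioned_key(cache, "website1", "users")     # "website1_users_0"
cache.set(key, ["alice", "bob"])
flush_site(cache, "website1")                       # version is now 1
new_key = versioned_key(cache, "website1", "users") # "website1_users_1"
print(new_key, cache.get(new_key))                  # miss: regenerate the data
```

Note that nothing is deleted on a flush. The entries stored under the old version simply stop being requested and are left to be evicted by the cache.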
You may still want multiple buckets, but you don't need one for each website which also helps when deploying new sites by not forcing you to create and configure a new bucket each time.
How does that sound?
You can also search for "memcache key versioning" and get more tutorials online.
Perry
Perry,
We will try versioning. What is the maximum number of keys and suggested keyname length?
Thanks,
Yaroslav
There is no technical limit on the number of keys...it's just related to how much RAM you have.
There is a 255 byte limit on the keyname. Shorter ones will take up less space, longer ones more.
Perry
You can do this one of two ways.
1) Log in to the web UI. Click Logs in the menu on the left side of the page. Then click "Generate Diagnostic Report" at the top of the screen.
2) Go to http://host:8091/diag
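If you would rather script option 2 than use a browser, a minimal Python sketch could look like this. It assumes only the URL given above; the helper names and output file are invented for illustration, and if your server asks for credentials you would have to add them yourself (not shown here).

```python
import shutil
import urllib.request


def diag_url(host, port=8091):
    """Build the diagnostics URL mentioned above for a given host."""
    return "http://%s:%d/diag" % (host, port)


def save_diag(host, out_path, port=8091, timeout=60):
    """Download the diagnostic report and write it to `out_path`."""
    with urllib.request.urlopen(diag_url(host, port), timeout=timeout) as resp:
        with open(out_path, "wb") as out:
            # Stream to disk; the report can be large.
            shutil.copyfileobj(resp, out)
    return out_path
```

For example, `save_diag("localhost", "diag.txt")` run on the node itself would save the report next to the script.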