Memcache failing
Hi,
I have successfully tested NSMS on my local machine and then in a QA environment. I deployed it to production today and the memcached server seems to fail. I am not sure how to open the log file to read it; if you can give me some guidance on that, I will check what it says. The console application that comes with the server says the server is up.
When I restart the server from the Services section, it starts and runs for about 10 seconds, and then both the console and the server stop responding. I am getting "Memcache Protocol, Unknown opcode (33) Request", but I think that is just the sites trying to determine whether the cache is working.
Config:
NSMS is on W2k8 with 32 GB RAM
Running 2 web servers with roughly 85 sites, all configured to use the same bucket.
Using the Enyim ASP.NET client with:
minPoolSize="10"
maxPoolSize="100"
connectionTimeout="00:00:10"
Thanks for all the help you guys have given so far.
Jim
I have just found some documentation that says the default number of connections for a memcached server is 1024. If this is correct, then a load of up to 8000 connections would definitely cause problems. Would this cause the NSMS to stop responding? Is there a way to increase the number of connections for the NSMS?
Thanks,
Jim
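To put rough numbers on the connection load described above, here is a minimal Python sketch. It assumes each site keeps its own client pool against a single memcached node and that every pool can grow to maxPoolSize at the same time; the function name and the reading of "85 sites" as the total across both web servers are illustrative assumptions, not measured values.

```python
def worst_case_connections(sites: int, pool_size: int) -> int:
    """Upper bound on sockets opened against one memcached node.

    Assumes one client pool per site and that all pools reach
    pool_size simultaneously (hypothetical worst case).
    """
    return sites * pool_size


SERVER_LIMIT = 1024  # default memcached connection limit mentioned above

floor = worst_case_connections(sites=85, pool_size=10)     # minPoolSize
ceiling = worst_case_connections(sites=85, pool_size=100)  # maxPoolSize

print(floor, floor > SERVER_LIMIT)      # 850 False
print(ceiling, ceiling > SERVER_LIMIT)  # 8500 True
```

Under these assumptions the idle pools alone sit just below the default limit, and any growth toward maxPoolSize would go past it.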
I am able to telnet into the cache and the web console log has the last time that I restarted the service.
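For anyone who wants to script the telnet check rather than run it by hand, a minimal Python sketch follows. It relies only on the `version` command of the memcached text protocol; the function name, the default port 11211 and the timeout value are assumptions for illustration.

```python
import socket


def memcached_version(host: str, port: int = 11211, timeout: float = 5.0) -> str:
    """Send the text-protocol `version` command and return the reply line."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"version\r\n")
        reply = b""
        # Read until the terminating CRLF or until the server hangs up.
        while not reply.endswith(b"\r\n"):
            chunk = sock.recv(1024)
            if not chunk:
                break
            reply += chunk
    return reply.decode("ascii", errors="replace").strip()
```

Calling `memcached_version("your-nsms-host")` (the host name is a placeholder) should return a line starting with `VERSION` when the cache is reachable, and raise `OSError` when the connection is refused or times out.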
What other differences are there between your QA and production environments? If you private message me your phone number, we can call you directly and help you resolve this more quickly.
The NSMS is the same instance for both environments, just using different buckets. The code is currently the same between QA and production. All servers are running W2k8; the QA server has less RAM. Everything else is the same. All servers are hosted on the same network switch.
Hi, Jim. You're correct, the default limit is 1024. You can increase this in the config file.
In Program Files\Northscale\Memcached Server\priv\config, look for the following section:
{port_servers,
  [{'_ver', {0, 0, 0}},
   {memcached, "./priv/memcached",
    ["-p", "11212",
     "-E", "./priv/engines/bucket_engine.so",
     "-e", "admin=_admin;engine=./priv/engines/default_engine.so;default_bucket_name=default;auto_create=false"
    ],
    [{env, [{"MEMCACHED_CHECK_STDIN", "event"},
            {"MEMCACHED_TOP_KEYS", "100"},
            {"ISASL_PWFILE", "./priv/isasl.pw"}, % Also isasl path above.
            {"ISASL_DB_CHECK_TIME", "1"}
           ]},
     use_stdio,
     stderr_to_stdout,
     stream]
   }
  ]
}.
Change this to:
{port_servers,
  [{'_ver', {0, 0, 0}},
   {memcached, "./priv/memcached",
    ["-p", "11212",
     "-c", "10000",
     "-E", "./priv/engines/bucket_engine.so",
     "-e", "admin=_admin;engine=./priv/engines/default_engine.so;default_bucket_name=default;auto_create=false"
    ],
    [{env, [{"MEMCACHED_CHECK_STDIN", "event"},
            {"MEMCACHED_TOP_KEYS", "100"},
            {"ISASL_PWFILE", "./priv/isasl.pw"}, % Also isasl path above.
            {"ISASL_DB_CHECK_TIME", "1"},
            {"EVENT_NOSELECT", "1"}
           ]},
     use_stdio,
     stderr_to_stdout,
     stream]
   }
  ]
}.
The relevant changes are the addition of -c 10000 to the command line and the EVENT_NOSELECT environment variable.
You'll also need to remove the Program Files\Northscale\Memcached Server\config\ns_1 directory, as this has a cached version of the config. You'll need to re-do any other configuration changes you've made and re-create your cluster if you've joined a cluster.
If the number of incoming connections exceeded 1024, would that crash the server? I do feel the limit needs to be changed, but a crash does not make sense to me. Shouldn't memcached just ignore those connections and move on?
Changing the number of connections has done the trick. I am still a little confused about why exceeding the connection limit, even as excessively as I was, would cause the server to crash.
I did notice that the hit ratio view on the bucket seems to be a bit off, or I am just thinking about it wrong. The ratio is sitting at 0 - 0.001 while my gets/sec average 35-39 and the hits/sec are sitting at 32-36. It seems like my hit ratio should be between 80% and 90%.
Also it would be nice to have a bucket column on the cluster analytic page.
Otherwise I am very impressed with your quick and incredibly helpful support via the forum. I can't wait to see what additional features come out in your next release!
Thanks again,
Jim
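As a quick sanity check on the hit ratio figures quoted above, here is a small Python sketch. It assumes the ratio is simply hits divided by gets over the same interval; the function name is made up for illustration and nothing here reflects how the console actually computes its value.

```python
def hit_ratio(hits_per_sec: float, gets_per_sec: float) -> float:
    """Fraction of gets that were hits; 0.0 when there were no gets."""
    if gets_per_sec <= 0:
        return 0.0
    return hits_per_sec / gets_per_sec


# Pairings taken from the reported ranges (32-36 hits/sec, 35-39 gets/sec).
for hits, gets in [(32, 35), (36, 39), (32, 39)]:
    print(f"{hits}/{gets} = {hit_ratio(hits, gets):.1%}")
# 32/35 = 91.4%
# 36/39 = 92.3%
# 32/39 = 82.1%
```

Any of these pairings gives a ratio far above the 0 - 0.001 shown in the console, which is consistent with the expectation stated in the post.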
There are several things the cluster manager needs to connect to memcached for, and we haven't really tested it in the situation where it can't connect for a long period of time. The normal thing for Erlang to do when it encounters a situation it doesn't expect is to crash the specific Erlang process (there are hundreds of these) that encountered the unexpected condition, with the expectation that the process will simply be restarted by its supervisor. Eventually, if a process repeatedly crashes, this propagates up the supervisor chain. This will eventually result in memcached's being restarted, which is usually the right thing to do since inability to connect for a long period of time would usually indicate that something was wrong with memcached.
We could make it so inability to connect to memcached wouldn't cause the cluster manager to restart memcached (or crash), but hopefully once we've increased the default connection limit beyond what anyone could reasonably use this situation won't occur.
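The supervisor escalation described in this reply can be illustrated with a toy model. This is a Python sketch of the general OTP idea of restart intensity (a supervisor that sees more than a set number of child restarts within a time window gives up and fails upward to its own supervisor), not the actual NorthScale supervision tree; the class name, limits and timings are all hypothetical.

```python
from collections import deque


class Supervisor:
    """Toy model: more than max_restarts child restarts within period_s
    seconds makes this supervisor give up and report a crash upward."""

    def __init__(self, name, max_restarts, period_s, parent=None):
        self.name = name
        self.max_restarts = max_restarts
        self.period_s = period_s
        self.parent = parent
        self.restarts = deque()  # timestamps of recent child restarts

    def child_crashed(self, now):
        # Forget restarts that fall outside the sliding window.
        while self.restarts and now - self.restarts[0] > self.period_s:
            self.restarts.popleft()
        self.restarts.append(now)
        if len(self.restarts) > self.max_restarts:
            self.restarts.clear()
            if self.parent is not None:
                self.parent.child_crashed(now)  # escalate one level up
            return "escalated"
        return "restarted"


root = Supervisor("root", max_restarts=1, period_s=60)
worker_sup = Supervisor("worker_sup", max_restarts=3, period_s=10, parent=root)

# A child that crashes once per second for five seconds.
outcomes = [worker_sup.child_crashed(now=t) for t in range(5)]
print(outcomes)
# ['restarted', 'restarted', 'restarted', 'escalated', 'restarted']
```

In this model the fourth crash inside the 10-second window exceeds the limit, so the failure moves one level up the chain, which mirrors how a persistently failing connection could eventually lead to memcached being restarted.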
Hi Jim,
Very glad to hear everything is working as expected now. We will be including this fix in our next release (1.0.2). We are also looking into the hit ratio inconsistency you are seeing. Please send me your contact information in a direct message (or email me directly at rod [at] northscale.com) so we can follow up sometime early next week. Thanks.
Hello Sean, Thanks for the detailed explanation.
We have a clustered server setup with memcached server version 1.0.3 - Win 64 and one bucket configured. I tried to increase the allowed connections on the server as per your instructions above: I modified the config, removed the ns_1 folder, restarted the server, joined the cluster and recreated the bucket.
After that, when I increase the maxPoolSize in the client to anything above 1000, I get an error. The client used is .NET C# with Enyim client version 3.5.
I am guessing there needs to be additional config for the bucket. I am attaching the diag herewith. Please clarify.
Thanks
Ravi
Thanks, Ravi. What is the error that you are receiving from the client? Also, just to confirm, you are using Enyim 2.5, correct?
Hello Perry,
Yes, it is Enyim client version 2.5.
Error details:
Message: "The type initializer for 'NorthScale.Store.NorthScaleClient' threw an exception."
StackTrace:
  at System.Configuration.BaseConfigurationRecord.EvaluateOne(String[] keys, SectionInput input, Boolean isTrusted, FactoryRecord factoryRecord, SectionRecord sectionRecord, Object parentResult)
  at System.Configuration.BaseConfigurationRecord.Evaluate(FactoryRecord factoryRecord, SectionRecord sectionRecord, Object parentResult, Boolean getLkg, Boolean getRuntimeObject, Object& result, Object& resultRuntimeObject)
  at System.Configuration.BaseConfigurationRecord.GetSectionRecursive(String configKey, Boolean getLkg, Boolean checkPermission, Boolean getRuntimeObject, Boolean requestIsHere, Object& result, Object& resultRuntimeObject)
  at System.Configuration.BaseConfigurationRecord.GetSectionRecursive(String configKey, Boolean getLkg, Boolean checkPermission, Boolean getRuntimeObject, Boolean requestIsHere, Object& result, Object& resultRuntimeObject)
  at System.Configuration.BaseConfigurationRecord.GetSectionRecursive(String configKey, Boolean getLkg, Boolean checkPermission, Boolean getRuntimeObject, Boolean requestIsHere, Object& result, Object& resultRuntimeObject)
  at System.Configuration.BaseConfigurationRecord.GetSection(String configKey, Boolean getLkg, Boolean checkPermission)
  at System.Configuration.BaseConfigurationRecord.GetSection(String configKey)
  at System.Configuration.ClientConfigurationSystem.System.Configuration.Internal.IInternalConfigSystem.GetSection(String sectionName)
  at System.Configuration.ConfigurationManager.GetSection(String sectionName)
  at NorthScale.Store.NorthScaleClient..cctor() in d:\EnyimMemcached\Northscale.Store\NorthScaleClient.cs:line 14
Hi Jim,
Can you telnet to the servers on port 11211 and successfully connect to the cache? What does the web console log say, if anything? It sounds like it has stopped responding.
Thanks.
Rod