'Web request failed' error repeatedly seen in Membase cluster
Hi, we had an outage the other day with our production cache cluster. Looking at the logs, the servers seemed to have problems communicating with one another: we see entries where one server sees other nodes go down and then come back up soon after, while other nodes see this server go down and back up, and so on. Soon after that we see this error message continuously in the log:
menelaus_web:19:warning:server error during request processing - Server error during processing: ["web request failed",
for the next 15 hours or so, until we restarted the membase process on each of the servers in the cluster.
At the first instance of this error, all our servers seemed to lose their ability to talk to the cache, but they then recovered and continued to work while the above error kept being logged. The only thing that wasn't working was the web console on port 8091. Later on, however, when another process ran and tried to persist data from the cache into SimpleDB, it seemed to pick up an old version of the objects, at least for some of the players, almost as if the hashing of keys to servers had changed.
Another curious thing we noticed is that on every one of the cache servers, every day at 10:38 AM and 10:38 PM, there is an entry from 'Dhcp-Client' in the system event log that says:
"Your computer was successfully assigned an address from the network, and it can now connect to other computers."
This seems to be an AWS thing, but the private and public IP addresses of the instance don't seem to be affected, at least not usually.
Sorry that's a bit long-winded, but ultimately we have a couple of questions:
- What could cause those errors to happen continuously?
- Would the mapping from virtual buckets to cache servers change if these errors occur?
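
To make the second question more concrete, here is a rough Python sketch of our understanding of how a key ends up on a server. It is not Membase's actual code: the CRC32-modulo hash, the vBucket count of 1024, the server names, and the failover scenario are all assumptions for illustration only.

```python
import zlib

NUM_VBUCKETS = 1024  # assumed vBucket count, for illustration only


def vbucket_for(key, num_vbuckets=NUM_VBUCKETS):
    """Map a key to a vBucket id (simplified: CRC32 modulo the vBucket count)."""
    return zlib.crc32(key.encode("utf-8")) % num_vbuckets


def server_for(key, vbucket_map, servers):
    """Look up the server that owns the key's vBucket in the given map."""
    return servers[vbucket_map[vbucket_for(key, len(vbucket_map))]]


def moved_keys(keys, map_before, map_after, servers):
    """Return the keys whose owning server differs between the two maps."""
    return [
        k for k in keys
        if server_for(k, map_before, servers) != server_for(k, map_after, servers)
    ]


# Hypothetical three-node cluster with vBuckets spread round-robin.
servers = ["cache-a", "cache-b", "cache-c"]
map_before = [i % len(servers) for i in range(NUM_VBUCKETS)]

# Hypothetical map after "cache-b" is treated as down: its vBuckets
# are reassigned to "cache-c" (index 2); everything else is unchanged.
map_after = [2 if owner == 1 else owner for owner in map_before]

if __name__ == "__main__":
    keys = ["player:%d" % i for i in range(10)]
    for k in moved_keys(keys, map_before, map_after, servers):
        print(k, server_for(k, map_before, servers), "->",
              server_for(k, map_after, servers))
```

If something like the second map was in effect for a while and the original map came back later, reads for the affected keys could return whatever the original server still held, which would look like the stale objects we saw. That is only our guess at the mechanism.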