Membase limitations, or am I missing something?
I recently started evaluating Membase for one of our projects and was quite impressed with how easily it can be installed and rebalanced, and how easily nodes can be added to or removed from the cluster. When I started executing test cases, however, I hit some problems, which I've described in this post. I'd appreciate it if someone could confirm whether these findings are correct, and how we can address them.
Here we go...
Membase version 1.7
Application design and testing approach: A multi-threaded JUnit test. The JUnit test creates multiple threads, and each thread uses Apache HttpClient to send POST/GET requests to a RESTful service built with Spring 3.x. The RESTful service in turn uses the Spymemcached client library to interact either with Membase directly (when Spymemcached is used as a Type 2, vBucket-aware client) or with Moxi (when Spymemcached is used as a Type 1 client). Apache HTTP Server is the point of entry for requests, and behind it I have 3 Tomcat instances running in a cluster. The RESTful service is deployed as a web application on these Tomcat instances.
1. vBucket support in Spymemcached: I tried to use the Spymemcached client library as a Type 2 (vBucket-aware) client. During testing I found that when a node in the cluster is brought down, the whole cluster becomes unresponsive (see the following link, which suggests that others have hit the same issue: http://code.google.com/p/spymemcached/issues/detail?id=181). Other related issues: http://code.google.com/p/spymemcached/issues/detail?id=136, http://code.google.com/p/spymemcached/issues/detail?id=108
The rest of the findings here are based on using Spymemcached as a Type 1 client, which sends requests to Moxi, which in turn talks to the Membase cluster.
2. It's not possible to avoid OOM errors in high-load situations - When the maximum memory allocated to a bucket is reached, Membase reports temporary OutOfMemory errors if a high number of requests per second is sent to it. My test results showed that tuning the queue_age_cap, ep_mem_high_wat and ep_mem_low_wat parameters doesn't guarantee that OOM errors will not be reported. There is already a JIRA issue for this: http://www.couchbase.org/issues/browse/MB-4020.
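Since the temp OOM errors are, by design, transient (the ejector frees memory once items are persisted), the usual client-side mitigation is to treat them as retryable and back off rather than fail the request. Here is a minimal sketch of that idea; the Store interface and its boolean failure signal are hypothetical stand-ins for however your client surfaces a temporary failure (with Spymemcached, for example, you would inspect the status of the future returned by a set), not an actual Membase API:

```java
import java.util.concurrent.TimeUnit;

public class RetryOnTempOom {
    /** Hypothetical stand-in for a store call; false means a temporary OOM. */
    interface Store {
        boolean trySet(String key, String value);
    }

    /** Retries a set with exponential backoff until it succeeds or gives up. */
    static boolean setWithBackoff(Store store, String key, String value,
                                  int maxRetries) throws InterruptedException {
        long backoffMs = 50;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            if (store.trySet(key, value)) {
                return true;
            }
            // Temporary OOM: give the ejector time to free memory, then retry.
            TimeUnit.MILLISECONDS.sleep(backoffMs);
            backoffMs = Math.min(backoffMs * 2, 2000);
        }
        return false; // still failing after maxRetries; caller must handle it
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulated store that rejects the first two attempts with temp OOM.
        final int[] calls = {0};
        Store flaky = (k, v) -> ++calls[0] > 2;
        boolean ok = setWithBackoff(flaky, "user:1", "{}", 5);
        System.out.println(ok ? "stored" : "gave up");
    }
}
```

This doesn't make the OOM errors go away, of course; it only smooths over short bursts, and under sustained overload the retries themselves add load, so a bounded retry count matters.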
Note that I am sending requests in a while loop, which doesn't really reflect the production environment.
3. Loss of data when the application is under heavy load - I found that when multiple threads send concurrent save requests to the service, not all requests result in the creation of an object in Membase. In other words, data loss happens when the system is under heavy load. Membase itself doesn't report OOM errors in this case, which suggests the data loss may be happening upstream, due to request congestion on the Apache HTTP Server or the Tomcat servers. Has anyone come across this, and how did you tune your Apache HTTP Server and Tomcat instances?
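One thing worth ruling out first is fire-and-forget writes: if the test (or the service) never checks each save's outcome, a dropped request looks identical to silent data loss. A sketch of counting failed saves across concurrent writers follows; the Saver interface is a hypothetical stand-in for the HTTP save call or for blocking on the client library's operation result, not a real API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class ConcurrentSaveCheck {
    /** Hypothetical stand-in for the save request; false means it failed. */
    interface Saver {
        boolean save(String key, String value);
    }

    /** Fires n saves across a thread pool and returns how many reported failure. */
    static int countFailures(Saver saver, int n, int threads)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger failures = new AtomicInteger();
        List<Callable<Void>> tasks = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            final String key = "obj:" + i;
            tasks.add(() -> {
                if (!saver.save(key, "{}")) {
                    failures.incrementAndGet(); // candidate for retry or logging
                }
                return null;
            });
        }
        pool.invokeAll(tasks); // blocks until every save has completed
        pool.shutdown();
        return failures.get();
    }

    public static void main(String[] args) throws InterruptedException {
        Saver alwaysOk = (k, v) -> true;
        System.out.println(countFailures(alwaysOk, 1000, 16));
    }
}
```

If the failure count here is non-zero, the losses are at least being reported somewhere in the chain (HTTP status, client timeout) and can be retried; if every save reports success yet objects are missing, the problem is further down and worth a bug report.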
4. Failure of a server node will result in loss of data - When a Membase server fails, in-memory data that has not yet been persisted or redistributed is lost. Under heavy load I also saw a lot of 'Can't redistribute...Trying primary again..' messages, which suggests that redistribution may be failing under heavy load as well.
5. Shutting down a node drastically degrades Membase performance - Cluster throughput drops to 5-10 creates per second. Performance bounces back only after the cluster is rebalanced manually or after you click the Failover button in the Membase console. Is this still the issue described in this thread: http://www.couchbase.org/forums/thread/understanding-membased-availabili...?
6. Data loss during the cluster warm-up phase - I found that a newly configured Membase cluster needs some time before it is ready to receive requests. During this warm-up phase, requests did not complete successfully, and I would see a "PENDING" status in the Membase console. Sometimes, when a node in the cluster became unresponsive, I also saw the "PENDING" status with a yellow status color, and during this time Membase performance was not up to expectations and there was data loss.
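For the warm-up case, a practical guard is to gate traffic behind a readiness probe: poll the cluster with a cheap check (for instance, a get of a sentinel key, or the bucket status from the management interface) and only start the real workload once it answers successfully. A minimal sketch, where the BooleanSupplier is a hypothetical placeholder for whatever readiness check you use:

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

public class WarmupGate {
    /**
     * Polls a readiness check until it succeeds or the deadline passes.
     * Returns true as soon as the cluster answers successfully.
     */
    static boolean awaitWarmup(BooleanSupplier ready, long timeoutMs,
                               long pollMs) throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (System.nanoTime() < deadline) {
            if (ready.getAsBoolean()) {
                return true;
            }
            TimeUnit.MILLISECONDS.sleep(pollMs);
        }
        return false; // still warming up; keep traffic off the cluster
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulated cluster that becomes ready on the third poll.
        final int[] polls = {0};
        boolean up = awaitWarmup(() -> ++polls[0] >= 3, 1000, 10);
        System.out.println(up ? "ready" : "timed out");
    }
}
```

This avoids the test (or a deploy script) counting warm-up rejections as data loss, though it obviously doesn't help with the mid-run node-unresponsive case, which still looks like a genuine availability gap.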