Replica data not properly flushed, TAP sometimes hangs, and eventually a node crashes.
We have been using Membase for a year at low traffic without many problems, but we are now seeing more and more instability as we scale up. Here is what we do; please let us know if any of it goes against the Membase philosophy:
- We run three high-memory instances on Amazon Web Services (Ubuntu 10.04) with one replica and auto-failover (Membase community edition, v126.96.36.199).
- Our application performs around 3000 gets and 500 mutations per second, and some of our data expires after only 30 seconds. At peak times we have around one million keys in the database, which is really nothing compared to the size of the instances we are using (16 GB of RAM).
- We make extensive use of CAS updates (we would prefer to just use increment but we also want to update the time to live).
- Most of our data are counters and locks, but some of it is a serialized hash that can grow to a reasonable size.
- We perform a vacuum on the SQLite databases every hour.
- We run a custom TAP script to take a backup each night. We used to take backups with mbbackup, but some data that had not been flushed to disk was missing, and using the TAP interface gave us better results.
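
To make the CAS item above concrete, here is a minimal sketch of the read-modify-write loop that this kind of update requires when a counter must be incremented and its time to live reset in the same operation. `FakeStore` and its `gets`/`add`/`cas` methods are an in-memory stand-in invented for illustration (modelled loosely on memcached-style semantics); they are assumptions, not the Membase client API and not the actual application code.

```python
import time


class FakeStore:
    """In-memory stand-in for a memcached-style store (hypothetical, illustration only)."""

    def __init__(self):
        self._data = {}      # key -> (value, cas_id, expires_at)
        self._next_cas = 1

    def gets(self, key):
        """Return (value, cas_id), or (None, None) if the key is missing or expired."""
        entry = self._data.get(key)
        if entry is None or entry[2] <= time.time():
            return None, None
        return entry[0], entry[1]

    def add(self, key, value, ttl):
        """Store only if the key is absent or expired; return True on success."""
        if self.gets(key)[1] is not None:
            return False
        self._store(key, value, ttl)
        return True

    def cas(self, key, value, cas_id, ttl):
        """Store only if cas_id still matches the stored one; return True on success."""
        if self.gets(key)[1] != cas_id:
            return False
        self._store(key, value, ttl)
        return True

    def _store(self, key, value, ttl):
        # Every successful write gets a fresh CAS id and a fresh expiry time.
        self._data[key] = (value, self._next_cas, time.time() + ttl)
        self._next_cas += 1


def incr_with_ttl(store, key, delta, ttl, max_retries=10):
    """Increment a counter and reset its TTL using an optimistic CAS loop."""
    for _ in range(max_retries):
        value, cas_id = store.gets(key)
        if cas_id is None:
            # Key missing or expired: try to create it.
            if store.add(key, delta, ttl):
                return delta
        elif store.cas(key, value + delta, cas_id, ttl):
            return value + delta
        # Another writer got in first; re-read and retry.
    raise RuntimeError("too much contention on %r" % key)
```

Each update costs at least one read and one write, and more under contention, which is why a plain increment would be cheaper if it could also refresh the expiry.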
Now the symptoms:
- The number of replica items on a node keeps growing.
- The RAM consumption can suddenly and dramatically increase on a node, and our TAP backup hangs (I don't know whether the hanging TAP script is the cause or a symptom. We send a simple DUMP request to each node, and we run it locally, so I'm not sure how this can become a problem).
- Eventually, one of the nodes crashes and auto-failover kicks in, but we lose a very large amount of data (up to 24 hours' worth!). This indicates that neither replication nor persistence is working as we would expect.
Any suggestions? Our cluster crashes about once a week on average, which is quite a problem for us.