Losing data when restarting a node?
I have a 16-node cluster running the dev-preview 4 version of couchbase 2.0 on centos 6.2. I have two couchbase buckets set up on it, each with 8gb memory allocated per node. I'm using dedicated ports for these buckets and accessing them with a memcached client via moxi, currently residing on the couchbase cluster. Total number of items is currently ~16 million keys in one bucket and 7 million in the other. Load is ~3k ops/sec on the larger bucket and ~1.5k ops/sec on the smaller, only doing create/update/sets currently. Nothing is doing gets. We were having some issues with rebalancing and failover on dp3, so I've been testing failure modes on dp4.
I restarted one node twice (service couchbase-server restart), a few hours apart. The first time (~1 million active items/2 million replica items per node) it went down, then came back up and said it had 0 active and 0 replica items. A few seconds later, it had ~500k active items and ~1 million replica items, then within a minute or two it was back to the normal load.
The second time I restarted the same node (~1.2 million active items/2.4 million replica items per node), it initially came up with 0 active and 0 replica items again. The smaller bucket then loaded properly and had ~400k items, as it should have. The larger bucket appears to have dropped from ~850k items to ~450k items on the node I restarted, while the other nodes in the cluster stayed constant. The node I restarted is consistently showing ~400k fewer items than the other nodes in the cluster. The number of replicas dropped as well after the node rejoined the cluster.
There were a few errors in the logs on the node I restarted, talking about exiting with badmatch errors, but it looks like similar errors showed up on the restart that didn't appear to drop any (or at least not nearly as many) items.
Is there anything that can be done about this? Just something to watch out for when a node goes down? I saved the log.# file for both restarts in case there's anything else valuable in there.
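In case it helps anyone watching for the same symptom, here is a rough sketch of the per-node comparison I described above: flag any node whose active item count is far from the cluster median. The function name, the node names, and the 20% tolerance are all made up for illustration, and the counts are assumed to be collected by whatever stats tooling you already use; this is not a Couchbase API.

```python
from statistics import median

def find_outlier_nodes(item_counts, tolerance=0.2):
    """Return the nodes whose item count deviates from the cluster
    median by more than `tolerance` (a fraction, e.g. 0.2 = 20%).

    `item_counts` maps a node name to its active item count for one
    bucket. Both the name and the tolerance are hypothetical.
    """
    if not item_counts:
        return {}
    mid = median(item_counts.values())
    if mid == 0:
        # With a median of zero, any node holding items is the odd one out.
        return {node: count for node, count in item_counts.items() if count != 0}
    return {
        node: count
        for node, count in item_counts.items()
        if abs(count - mid) / mid > tolerance
    }

# Illustrative numbers only, loosely based on the larger bucket above.
counts = {
    "node01": 850_000,
    "node02": 848_000,
    "node03": 450_000,  # the restarted node
    "node04": 851_000,
}
print(find_outlier_nodes(counts))
```

Using the median rather than the mean keeps one badly behaved node from dragging the baseline down with it.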
I have noticed the same thing on a single instance, but have found ignoring the Admin UI helps.
Trying to select documents from the web console seems to throw the server for a loop, with cycles of Up/Pend/Up, and the Item Count reported is almost never correct. When tested from application code, though, the views at least seem intact. I am only doing set and get operations, so I don't know what to expect from add operations, which may behave differently.
This is definitely not expected.
Just to give a bit of background, pending is typically shown when a system is warming up and the vbuckets are 'pending'. Once warmed up, they typically stay 'active', and there's no reason for them to drop back into pending.
One question for both of you: is there a possibility that you have either cloned the underlying VMs or have another cluster with similar admin credentials, and one node has moved from one cluster to another? There's one known issue related to that.
Just to ensure we track this to solution, I've filed an issue:
http://www.couchbase.com/issues/browse/MB-5423
If you still have any logs or additional information, it would be great to attach them to that issue.
Yup, I have a 9-node cluster with 32GB of RAM. It does the same thing.
1. A node will fill its RAM, then go into a PEND state.
2. I restart the service on the node.
3. The node starts up and loads its docs.
4. It drops all its documents and goes back into PEND state.
5. It loads its docs up again, then goes to UP state.
6. Then it repeats the loss of docs and the PEND/UP cycle.
Basically, it seems DP4 is completely unusable for BETA testing in my current opinion. Am I missing something? Is anyone else having these issues?
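For anyone who wants to quantify the PEND/UP cycling from the steps above, a minimal sketch: sample the node's state at regular intervals (however you obtain it) and count how often it leaves UP. The function name and the state strings are assumptions for illustration, not something Couchbase provides.

```python
def count_flaps(states, healthy="UP"):
    """Count how many times a node leaves the healthy state.

    `states` is a list of state strings sampled over time, oldest
    first. The "UP"/"PEND" labels are assumed to match what the
    admin UI shows.
    """
    flaps = 0
    # Compare each sample with the one that follows it.
    for previous, current in zip(states, states[1:]):
        if previous == healthy and current != healthy:
            flaps += 1
    return flaps

# Illustrative samples: the node drops out of UP twice.
samples = ["UP", "PEND", "UP", "PEND", "UP", "UP"]
print(count_flaps(samples))
```

A node that warms up once and stays up would score 0 here, so anything above that after a restart is worth attaching to the bug report along with the logs.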