Using different passwords for buckets provokes failure of the whole cluster - reproduced twice already: the first time accidentally, the second time to confirm. The cluster is 3 servers, plus 1 in another DC (via XDCR).
The password is now the same everywhere, but the nodes are still affected - sometimes a node randomly goes down... And I have to monitor it and reboot it manually...
It looks like your cluster is not stable, and you have posted a separate question about that.
Let's fix that other issue first.
The cluster was stable enough before today.
The cluster is 3 nodes with 512 MB of RAM each (I know that's not much RAM, but there are actually only 12 very small items in the whole cluster).
Today I tried to create a second (still empty) bucket, and the cluster became unstable.
You can find the logs archive at the link in my previous message - I hope it will be useful for your team to find the cause of the bug and fix it. As for me, tonight I have to move the data from the Couchbase cluster to some more stable DB. Sorry to say, but... it's too unstable.
And for the record: after removing the second bucket (which was empty), the cluster became stable again.
I'd like to understand what happened in your case, but you've uploaded a partial diag. It doesn't contain a single line of our logs, for example. May I have a complete diag?
That's everything I got from the Couchbase web dashboard, without any cuts or edits.
The servers (in the last hours they were 4 GB Linode VPSes) have already been removed, sorry.
If the reason was insufficient memory and not different passwords, then please show some warning in the dashboard, something like "not enough memory to keep an empty bucket". Because when 2 servers with 512 MB each are not enough to hold 1 empty bucket and 1 bucket with 12 items... well, it's not obvious. I'm still not sure what the reason was.
Also, when I removed 1 node, 3 items from the bucket were lost after rebalancing. They just disappeared. You know, when each item contains a set of permissions for the CEO, even 1 lost item is a problem :)
Lack of memory can be the reason.
We are aware that our memory requirements are a bit too high. AFAIK the officially supported configuration for 2.0 starts at 8 gigs of RAM.
But once you're past that initial "investment" of RAM, we are quite good at using it.
In other words, we need somewhat beefy hardware; there is no way to run us on cheap virtualized instances as of now. Our focus is somewhat more on production environments, where any serious deployment will probably start with at least 32 gigs.
But we're aware that the current situation hampers developer adoption. I can't say we'll fix that high initial cost soon, but we're working on it.
Lack of memory can be the reason.
Then why were there 0 messages about memory in the dashboard? Everything was "green".
Dual-core CPU running at 2GHz
4GB RAM (physical)
For development and testing purposes a reduced CPU and RAM configuration than the minimum specified can be used. This can be as low as 256MB of free RAM (beyond operating system requirements) and a single CPU core.
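For anyone reading this later: rather than relying on dashboard colors, you can check memory headroom yourself via the REST API's `/pools/default` endpoint on port 8091. Below is a minimal sketch; the exact field layout (`storageTotals.ram.quotaTotal` / `usedByData`) is an assumption based on 2.x responses, and the sample values are invented for illustration - verify against your own server's JSON.

```python
# Sketch: compute RAM headroom from a Couchbase /pools/default response.
# Field names below are assumptions based on 2.x-era responses.

def ram_headroom_mb(pools_default):
    """Return (quota_mb, used_mb, free_mb) from a /pools/default dict."""
    ram = pools_default["storageTotals"]["ram"]
    quota_mb = ram["quotaTotal"] // (1024 * 1024)
    used_mb = ram["usedByData"] // (1024 * 1024)
    return quota_mb, used_mb, quota_mb - used_mb

# Hypothetical response fragment (hardcoded so the sketch is self-contained):
sample = {
    "storageTotals": {
        "ram": {
            "quotaTotal": 2 * 512 * 1024 * 1024,  # e.g. two 512 MB nodes
            "usedByData": 900 * 1024 * 1024,      # data already resident
        }
    }
}

quota, used, free = ram_headroom_mb(sample)
print("quota=%d MB used=%d MB free=%d MB" % (quota, used, free))
```

In live use you would fetch the JSON (with admin credentials) from `http://<node>:8091/pools/default` using curl or urllib instead of the hardcoded sample.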
We need a cluster of 4 VPSes at minimum (3 in one DC and 1 in another). Even with 4 GB of RAM it is not cheap ($636/month), and it's not the main database in our project. 8 GB x 4 = $1,280/month - this "initial" price is not cheap. http://www.forbes.ru/reitingi-photogallery/234873-30-krupneishih-kompani... - as you can see, we can afford this DB. We currently use 3 x 32 GB MySQL nodes in Amazon RDS (multi-AZ), and that's even more than we need.
But that doesn't mean the R&D department can buy anything without reason :)
And when a cluster is not stable with 12 small items... I just can't trust it. I saw a lot of big companies' names in your customer list, but what I see in my own practice scares me. There were plans to use Couchbase for sessions, for logs, for permission lists, for analytics stats... Maybe later, when Couchbase is a stable, mature DB.
Well, in order to draw any conclusions we'll need some logs from you. It appears that you're running XDCR, and at present XDCR is a known memory hog. I'm pretty sure it was documented in some sizing guidelines.
As I noted above, we are aware of that cost issue and it's being worked on.
Looks like this case is more interesting than it initially appears.
Unfortunately we don't have full logs from this incident. Somehow the diag (sent by Eugeny over email) is truncated before its "header" part is complete.
But that header part has an ns_config dump. And from that dump I can see that, at least in the past, this cluster had nodes in different data centers. And _on different continents_.
While in principle this configuration should work, we have never tested how it behaves, and some bugs are possible. There is clearly much more latency between such nodes than we expect.
In case somebody reads this: you are not supposed to have a "cluster" distributed across continents, or even across data centers. Cluster nodes are supposed to be "close" in network terms.
This also has a security aspect. We don't intend to encrypt or otherwise protect _intra_-cluster replication. We assume intra-cluster links are secure.
And in case somebody has read this far: XDCR (inter-cluster replication) currently doesn't have any encryption either. I believe our manuals don't stress this enough. You are supposed to run XDCR over a _VPN or some other secure "tunnel"_. We currently don't provide secure inter-cluster replication; we rely on administrators to set it up securely. Running XDCR "in the open" exposes you to a really severe danger: not just exposure of your data, but complete "pwn-age" of your destination cluster.
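To illustrate the "secure tunnel" advice above (a sketch only, not an official recipe): one common approach is an SSH local port forward toward the destination cluster. The host `dest.example.com`, the user name, and the choice of the default admin/REST port 8091 are all placeholders I'm assuming for the example; a VPN or stunnel would serve the same purpose.

```shell
# Sketch: forward the destination cluster's admin port over SSH.
# dest.example.com and "admin" are placeholder assumptions.
# -N: no remote command; -L: forward local 127.0.0.1:18091 to the
#     destination node's 127.0.0.1:8091 through the encrypted channel.
ssh -N -L 127.0.0.1:18091:127.0.0.1:8091 admin@dest.example.com
```

You would then point the XDCR remote-cluster reference at `127.0.0.1:18091` instead of the raw destination address, so replication traffic travels inside the tunnel. Note that XDCR also moves data over other ports, so in practice a full VPN between the clusters is the simpler setup.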
Now let me return to the original topic. I don't know whether the problem occurred while the cluster was "distributed" or not, but if someone needs geographically distributed replication, please note that the only supported way to do that is XDCR.
It's actually not that complicated: there were 3 nodes in London (Linode) and 1 node in Ireland (AWS), but NOT as part of the cluster; it was connected via XDCR, just for replication. So the cluster was not geographically "distributed".
Again, all my logs:
And a screenshot of the dashboard:
The first log in this list is clearly from some other cluster, and from much earlier.
The last two are sadly both incomplete. The middle one is the most complete, but as I said, it only gave me the config.
Now on XDCR.
I do see XDCR defined in the config.
But I also see from the vbucket map history that there _was_ a point in the cluster's life when both the Linode and AWS nodes were part of the cluster. But I think I may have confused eu-west and us-west and assumed it spanned continents when it did not.
And as I have already noted, I have no idea whether the issue occurred while the cluster was distributed or not. I'd need full logs to see what happened and when.
© 2013 COUCHBASE All rights reserved.