Rebalance fail and more

Sun, 12/23/2012 - 07:12
uvmarko

Hi,
I am trying to upgrade a cluster from Community build 1.8.1 to 2.0.
I created a two-node cluster running 1.8.1 with one bucket, on 64-bit AWS Linux.
The bucket shows 19.3K items on one node and 19.6K items on the second, with 0 replica items, although replication is set up for the bucket.
To upgrade to version 2.0, I tried removing one node and rebalancing, but the rebalance fails with this error:
Rebalance exited with reason {{change_filter_failed,
{'EXIT',
{{badmatch,{error,timeout}},
{gen_server,call,
[<0.21971.566>,start_vbucket_filter_change,
30000]}}}},
[{ns_vbm_sup,change_vbucket_filter,4},
{ns_vbm_sup,'-set_replicas/3-fun-2-',5},
{lists,foldl,3},
{ns_vbm_sup,set_replicas,3},
{ns_vbm_sup,'-set_replicas_on_nodes/3-fun-1-',
3},
{lists,foreach,2},
{ns_vbm_sup,apply_changes,2},
{ns_vbucket_mover,sync_replicas,0}]}

After several failed attempts I tried adding another node running version 2.0. The server is added successfully, but when I hit rebalance I again get the following error:
Rebalance exited with reason {{change_filter_failed,
{'EXIT',
{{badmatch,{error,timeout}},
{gen_server,call,
[<18363.23001.566>,
start_vbucket_filter_change,30000]}}}},
[{ns_vbm_sup,change_vbucket_filter,4},
{ns_vbm_sup,'-set_replicas/3-fun-2-',5},
{lists,foldl,3},
{ns_vbm_sup,set_replicas,3},
{ns_vbm_sup,'-set_replicas_on_nodes/3-fun-1-',
3},
{lists,foreach,2},
{janitor_agent,
do_bulk_set_vbucket_state_old_style,4},
{ns_vbucket_mover,handle_call,3}]}

The rebalance appears to be a swap rebalance.
The 2.0 server is added to the cluster, appears green in the admin console, and even takes over as master, but trying to remove one of the 1.8.1 servers still fails at rebalance.
What should I do to resolve this and upgrade all my nodes?
Also, why is the cluster not replicating the items?

Thanks
Yuval

Thu, 12/27/2012 - 16:20
ingenthr

That looks like MB-7108, but that was fixed in 2.0 GA.  Is this with the current GA release?  That said, the error also seems to match a few other rebalance failures.

To address your questions...

What should I do to resolve this and upgrade all my nodes?

I'd recommend looking at how the vbuckets are distributed across the cluster in the Web UI.  Hopefully each rebalance request makes some progress.  A rebalance may fail, but if it successfully moves some vbuckets before failing, repeated attempts can eventually get there.  This isn't expected behavior, but we can work our way to a resolution from here.

Also, why is the cluster not replicating the items?

I don't have a good explanation for this.  Could you run cbcollect_info and upload the result per the directions here?  We'll get someone to have a look at it.

Sat, 12/29/2012 - 00:56
uvmarko

Rebalancing doesn't seem to do anything: the item count per node stays the same, as does the active vBuckets count in the resources section.
Rebalance also fails without adding a 2.0 node, in a cluster of all 1.8.1 nodes.
I've uploaded the cbstats from all nodes (ignore the .info file I uploaded; it's a duplicate of one of the nodes).

Thanks for the help!

Wed, 01/02/2013 - 10:15
ingenthr

Sorry to hear it's staying the same. The first change we'll want to try is raising some timeout values to give the rebalance a chance to complete.

This is a bit low level, but I can help you through it.  What's been recommended to me is extending a number of different timeout values.  You can do this with this kind of approach:

wget -O- --user=Administrator --password=password --post-data='ns_config:set({node, node(), {timeout, ns_memcached_outer_very_heavy}}, 120000).' http://10.3.2.83:8091/diag/eval

This would adjust 'ns_memcached_outer_very_heavy'.  What values to use depends quite a bit on the environment.  For now, let's double them in your environment to see if that helps us get further with the rebalance.  Using the wget-style request outlined above, set each timeout to the value listed:

  1. ns_memcached_outer, 120000
  2. ns_memcached_outer_heavy, 120000
  3. ns_memcached_outer_very_heavy, 240000
  4. ns_memcached_open_checkpoint, 120000
  5. ns_memcached_connected, 10000
  6. ebucketmigrator_connect, 120000

 This can be done on the live running cluster. It takes effect immediately.  Then retry the rebalance.
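
As a rough sketch of the steps above, the six requests can be generated in a loop. This is a dry run that only prints the wget commands so they can be reviewed before being run by hand. CB_HOST, CB_USER and CB_PASS are placeholder values, and the helper names timeout_expr and print_timeout_cmd are made up for this example. It assumes the credentials contain no single-quote characters.

```shell
#!/bin/sh
# Placeholder connection details (assumptions); replace with your own.
CB_HOST="10.3.2.83"
CB_USER="Administrator"
CB_PASS="password"

# Build the Erlang expression that is posted to /diag/eval for one timeout.
timeout_expr() {
  printf 'ns_config:set({node, node(), {timeout, %s}}, %s).' "$1" "$2"
}

# Print (do not run) the wget command for one timeout name and value.
print_timeout_cmd() {
  printf "wget -O- --user='%s' --password='%s' --post-data='%s' http://%s:8091/diag/eval\n" \
    "$CB_USER" "$CB_PASS" "$(timeout_expr "$1" "$2")" "$CB_HOST"
}

# Timeout names and values taken from the list above.
while read -r name value; do
  print_timeout_cmd "$name" "$value"
done <<'EOF'
ns_memcached_outer 120000
ns_memcached_outer_heavy 120000
ns_memcached_outer_very_heavy 240000
ns_memcached_open_checkpoint 120000
ns_memcached_connected 10000
ebucketmigrator_connect 120000
EOF
```

Each printed line has the same shape as the single command shown earlier, so it can be copied into a terminal once the host and credentials look right.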

Let us know how this goes.

Thu, 01/03/2013 - 03:29
uvmarko

Thanks, but I can't get that command to work properly.
When I type it as you entered it, changing only the IP and password, I get this response:

 
-bash: http://10.28.91.176:8091/diag/eval: No such file or directory
--2013-01-03 10:12:08--  http://10.28.91.176:8091/diag/eval
Connecting to 10.28.91.176:8091... connected.
HTTP request sent, awaiting response... 401 Unauthorized
Connecting to 10.28.91.176:8091... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2
Saving to: `STDOUT'

 0% [                    ] 0           --.-K/s   in 0s

Cannot write to `-' (Broken pipe).

and when I omit the -O- option from the command, I get this:

-bash: http://10.29.237.181:8091/diag/eval: No such file or directory
--2013-01-03 10:18:41--  http://10.29.237.181:8091/diag/eval
Connecting to 10.29.237.181:8091... connected.
HTTP request sent, awaiting response... 401 Unauthorized
Connecting to 10.29.237.181:8091... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2
Saving to: `eval.5'

Since I see the Unauthorized error, I am guessing it fails. What am I doing wrong?
Should I run this command on every node in the cluster, or are these cluster-wide settings?
Thanks and a Happy New Year!
Yuval

Thu, 01/03/2013 - 12:48
ingenthr

The 401 response would be owing to invalid credentials. Either the Administrator username or the password is different in your case. Keep in mind that special characters in the username or password may need to be quoted.
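
To illustrate the quoting point: in a POSIX shell, single quotes keep characters such as $, ! and & literal. The helper below is a minimal sketch (shell_quote is a made-up name, and the password is an invented placeholder) that wraps any string in single quotes so it can be pasted into a command line safely.

```shell
#!/bin/sh
# Wrap a string in single quotes for safe use on a shell command line.
# An embedded single quote is rewritten as '\'' (close, escaped quote, reopen).
shell_quote() {
  printf "'%s'" "$(printf '%s' "$1" | sed "s/'/'\\\\''/g")"
}

# Invented placeholder password containing shell-special characters.
CB_PASS='p@$$w0rd!&x'

# Print a --password argument that the shell will pass through unchanged.
printf '%s\n' "--password=$(shell_quote "$CB_PASS")"
```

Without the quotes, `$$` would expand to the shell's process ID and `&` would end the command, so wget would receive a different password from the one intended.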

Once you get this working, it needs to be done on only one node of the cluster. The change should propagate to all other nodes.

Fri, 01/04/2013 - 04:25
uvmarko

If I type my credentials wrong I get an "Authorization failed" message, as follows, so I guess I am entering them correctly:

-bash: http://10.29.237.181:8091/diag/eval: No such file or directory
--2013-01-04 11:22:40--  http://10.29.237.181:8091/diag/eval
Connecting to 10.29.237.181:8091... connected.
HTTP request sent, awaiting response... 401 Unauthorized
Connecting to 10.29.237.181:8091... connected.
HTTP request sent, awaiting response... 401 Unauthorized
Authorization failed.

Fri, 01/04/2013 - 22:33
ingenthr

I'm not certain why. I'll have a colleague look into it and post back.

Mon, 01/07/2013 - 14:22
jin

Hi, it appears that the above error (401) was due to an authentication failure. Maybe a typo in the user ID or password? Please verify that both were correct.
The other error responses (the first two 401 responses from the earlier posts) seem to be normal behavior. wget first attempted to send the request without the password, then immediately succeeded on the second attempt with the password. Thus the timeout value must have been changed correctly. Thanks for your time and help!
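
The point about the two attempts can be checked mechanically: only the last response in a wget log reflects the outcome. The sketch below uses a hypothetical helper, final_http_status, fed with lines copied from the output earlier in this thread. (wget writes this log to stderr, or to a file when given -o.)

```shell
#!/bin/sh
# Print the HTTP status of the last response in a wget log read from stdin.
final_http_status() {
  grep 'awaiting response\.\.\.' | tail -n 1 | sed 's/.*awaiting response\.\.\. *//'
}

# Log lines from the first attempt reported in this thread.
final_http_status <<'EOF'
Connecting to 10.28.91.176:8091... connected.
HTTP request sent, awaiting response... 401 Unauthorized
Connecting to 10.28.91.176:8091... connected.
HTTP request sent, awaiting response... 200 OK
EOF
# prints: 200 OK
```

A final status of 200 OK means the request was accepted after the authentication challenge; a final 401 Unauthorized means the credentials were rejected.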

Tue, 01/08/2013 - 04:01
uvmarko

Thanks Jin,
Well, if that's the case and I changed the values correctly the first time, it did not solve the rebalance issues I've been having. What should I do next?

Tue, 01/08/2013 - 08:24
uvmarko

Hi,
I've decided to drop the cluster and install a fresh one before going into production. I installed two Couchbase Community servers on AWS EC2 instances, using the RPM package rather than the AMI. Everything worked fine, but no replication was going on, and adding a new server caused rebalance to fail again, this time with a "no reply" message from the other servers. The third node was added successfully, but it did not receive any new items and never finished rebalancing, just like my 1.8 installation.
So I did a fresh install using the Community server AMI this time, but the default data directory there does not point to the EBS volume the AMI installed.
I had to change the ownership of that volume and then select it in the setup dialog, which is not noted anywhere in your docs.
Now adding new nodes and rebalancing works.
My guess is that there was something wrong with the way I configured the addresses in my previous faulty installation.
I did it by modifying /opt/couchbase/bin/couchbase-server and adding the private IP from Amazon. I've noticed that the AMI installation does not add the IP address to the couchbase-server file.
How does the AMI installation configure the IP addresses?

Tue, 01/08/2013 - 16:15
jin

Hi, thanks very much for your detailed step-by-step explanation. We will pass your info to the right support people and get back to you soon.
