
How can we fail over as quickly as possible when one of the machines is down?

Mon, 07/25/2011 - 01:30
wangbin579
Offline
Joined: 05/11/2011
Groups: None

Our system needs to fail over as quickly as possible when one of the machines goes down.
But Membase seems very slow to detect that a server is down.
Is this related to the TCP timeout?
How can I speed it up?

One more question:
I have three servers: 148, 161, and 162.
About 10 seconds after running the reboot command on 161, I tried to fail it over, but the failover failed.

This is the result:

[wangbin@bgp176_162 ~]$ /opt/membase/bin/membase failover -c 10.130.12.162:8091
--server-failover=10.130.12.161:8091 -u Administrator -p xxxxxx
ERROR: unable to failover ns_1@10.130.12.161 (500) Internal Server Error
[u'Unexpected server error, request logged.']
ERROR: command: failover: 10.130.12.162:8091, 2

Mon, 07/25/2011 - 15:13
perry
Offline
Joined: 10/11/2010
Groups:

What version of Membase are you using? We've had a few bugs here and there in previous versions that should be resolved.

Also, 1.7.1 will be released today which will include an automatic failover feature.

Perry

__________________

Forum support is great for free but sometimes you need a guaranteed response time and dedicated resources for your questions or issues.
Consider purchasing enterprise-level support from Couchbase: http://www.couchbase.com/products-and-services/overview
Call or email "sales -at- couchbase-dot- com" today!

Mon, 07/25/2011 - 18:37
wangbin579
Offline
Joined: 05/11/2011
Groups: None

Membase 1.7.

There are two situations:
1. The Membase node process is not running: the cluster detects that the node is down quickly.
2. The machine itself is down: the cluster detects that the node is down very slowly (after about 75 seconds).

It seems to be related to the TCP timeout.

Mon, 07/25/2011 - 18:41
perry
Offline
Joined: 10/11/2010
Groups:

Not exactly; the default TCP keepalive timeout is 2 hours...we're definitely not waiting that long ;-)

There are other layers in this stack, mostly related to Erlang's internal timeouts. I'll check, but I believe we've changed/improved this with 1.7.1 (just came out today). Can you retest with that?

Perry

Mon, 07/25/2011 - 18:55
wangbin579
Offline
Joined: 05/11/2011
Groups: None

Yes, I can retest this.
Waiting for 1.7.1.

Mon, 07/25/2011 - 20:23
wangbin579
Offline
Joined: 05/11/2011
Groups: None

I captured the TCP traffic between the master and the node that went down.

60 3.309212 123.58.176.161 123.58.176.148 TCP 55017 > 21100 [PSH, ACK] Seq=3941 Ack=2951 Win=501 Len=59 TSV=4294798668 TSER=667099797
61 3.309456 123.58.176.148 123.58.176.161 TCP 21100 > 55017 [PSH, ACK] Seq=2951 Ack=4000 Win=571 Len=47 TSV=667099799 TSER=4294798668
62 3.474891 123.58.176.161 123.58.176.148 TCP 55017 > 21100 [ACK] Seq=4000 Ack=2998 Win=501 Len=0 TSV=4294798834 TSER=667099799
63 3.999973 123.58.176.148 123.58.176.161 TCP 21100 > 55017 [PSH, ACK] Seq=2998 Ack=4000 Win=571 Len=75 TSV=667100490 TSER=4294798834
64 4.005538 123.58.176.148 123.58.176.161 TCP 21100 > 55017 [PSH, ACK] Seq=3073 Ack=4000 Win=571 Len=52 TSV=667100495 TSER=4294798834
65 4.226317 123.58.176.148 123.58.176.161 TCP [TCP Retransmission] 21100 > 55017 [PSH, ACK] Seq=2998 Ack=4000 Win=571 Len=127 TSV=667100716 TSER=4294798834
66 4.678314 123.58.176.148 123.58.176.161 TCP [TCP Retransmission] 21100 > 55017 [PSH, ACK] Seq=2998 Ack=4000 Win=571 Len=127 TSV=667101168 TSER=4294798834
67 5.581316 123.58.176.148 123.58.176.161 TCP [TCP Retransmission] 21100 > 55017 [PSH, ACK] Seq=2998 Ack=4000 Win=571 Len=127 TSV=667102072 TSER=4294798834
68 7.389333 123.58.176.148 123.58.176.161 TCP [TCP Retransmission] 21100 > 55017 [PSH, ACK] Seq=2998 Ack=4000 Win=571 Len=127 TSV=667103880 TSER=4294798834
69 11.004330 123.58.176.148 123.58.176.161 TCP [TCP Retransmission] 21100 > 55017 [PSH, ACK] Seq=2998 Ack=4000 Win=571 Len=127 TSV=667107496 TSER=4294798834
70 18.233342 123.58.176.148 123.58.176.161 TCP [TCP Retransmission] 21100 > 55017 [PSH, ACK] Seq=2998 Ack=4000 Win=571 Len=127 TSV=667114728 TSER=4294798834
71 32.693356 123.58.176.148 123.58.176.161 TCP [TCP Retransmission] 21100 > 55017 [PSH, ACK] Seq=2998 Ack=4000 Win=571 Len=127 TSV=667129192 TSER=4294798834

72 37.692354 WwPcbaTe_03:f2:e6 HewlettP_3f:43:54 ARP Who has 123.58.176.161? Tell 123.58.176.148

The cluster detected that the node was down after about 60 seconds (the time is not always the same).
That delay is too long for us: it should be less than 10 seconds, but it is currently 60 or more.
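
For reference, the spacing of the retransmissions in the capture above can be checked with a short Python sketch. It only uses the frame timestamps shown in this post (frame 63 and frames 65 to 71); the variable names are illustrative. It suggests the sender is in ordinary TCP exponential backoff, with each wait roughly double the previous one, and no reply ever arriving from 161.

```python
# Timestamps (seconds) copied from the capture above: the original
# segment (frame 63) followed by its seven retransmissions (frames 65-71).
timestamps = [3.999973, 4.226317, 4.678314, 5.581316,
              7.389333, 11.004330, 18.233342, 32.693356]

# Time between each send attempt and the next one.
gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]

# Ratio of each gap to the one before it; values near 2 mean doubling.
ratios = [b / a for a, b in zip(gaps, gaps[1:])]

for gap in gaps:
    print(f"waited {gap:.3f} s before retransmitting")
print("gap ratios:", [round(r, 2) for r in ratios])
```

The gaps come out at roughly 0.23, 0.45, 0.90, 1.81, 3.6, 7.2 and 14.5 seconds, so about 28.7 seconds pass with retries only. TCP on its own does not declare the peer dead within that window, which is consistent with the detection delay coming from a layer above it.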

Mon, 07/25/2011 - 21:19
perry
Offline
Joined: 10/11/2010
Groups:

Unfortunately you won't get it down to 10 seconds...that's too short of a time to be absolutely sure that a node is down. You wouldn't want a failover to kick in just because a node is a little slow, or there's a small network hiccup that would resolve itself.

In 1.7.1, the minimum is 30 seconds. We based that off of a number of conversations with many large customers.

If you really feel the need to have something even more immediate, you'll want to implement your own checks and use the REST API to trigger the failover.

Perry
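
As a rough illustration of the "own checks plus REST API" approach suggested above, here is a minimal Python sketch. It rests on assumptions that are not confirmed in this thread: that the failover endpoint is POST /controller/failOver with an otpNode form field, and that three consecutive failed probes is a reasonable threshold. The helper names build_failover_request and should_fail_over are hypothetical. The request is built but not sent.

```python
import base64
import urllib.parse
import urllib.request


def build_failover_request(cluster, otp_node, user, password):
    """Build (but do not send) a POST asking `cluster` to fail over `otp_node`."""
    # Assumption: the endpoint is POST /controller/failOver with an otpNode
    # form field. Check the REST documentation for your Membase version.
    url = f"http://{cluster}/controller/failOver"
    body = urllib.parse.urlencode({"otpNode": otp_node}).encode("ascii")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, data=body, method="POST")
    request.add_header("Authorization", f"Basic {token}")
    return request


def should_fail_over(probe_results, required_failures=3):
    """Return True when the most recent `required_failures` probes all failed.

    `probe_results` is a list of booleans, oldest first (True = node answered).
    The threshold of 3 is an illustrative assumption, not a recommendation.
    """
    recent = probe_results[-required_failures:]
    return len(recent) == required_failures and not any(recent)


# Addresses and node name taken from the first post in this thread.
request = build_failover_request(
    "10.130.12.162:8091", "ns_1@10.130.12.161", "Administrator", "xxxxxx")
print(request.get_method(), request.full_url, request.data)
# Sending it would be: urllib.request.urlopen(request, timeout=5)
```

Requiring several consecutive failures is one way to avoid failing over on a single slow reply or a brief network hiccup, which is the risk described above. How the probes themselves are made (a ping, a TCP connect, an application-level read) is left to the caller.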

Sun, 08/07/2011 - 23:03
wangbin579
Offline
Joined: 05/11/2011
Groups: None

I tried 1.7.1, but it still takes too long to fail over.
Every time I reboot one machine, Membase takes more than 75 seconds to discover that the machine (node) is down. After it detects that the machine is down, it takes 60 or more seconds to auto-failover.

Another scenario:
After rebooting the machine, I clicked Failover in the web UI or called failover through the REST API, and Membase took about 2 minutes to fail over the node.
So even if I implemented my own checks to determine that the node is down, it seems they would have no effect.

Mon, 08/08/2011 - 11:27
perry
Offline
Joined: 10/11/2010
Groups:

This is a known issue, and has to do with how long it takes Erlang to identify the machine is down. We're planning on removing that limitation with 2.0.

Sun, 08/14/2011 - 20:02
wangbin579
Offline
Joined: 05/11/2011
Groups: None

It seems that setting net_ticktime can solve this problem.

The following excerpt from the Erlang documentation explains net_ticktime.

net_ticktime = TickTime :
Specifies the net_kernel tick time. TickTime is given
in seconds. Once every TickTime/4 second, all connected
nodes are ticked (if anything else has been
written to a node) and if nothing has been received
from another node within the last four (4) tick times
that node is considered to be down. This ensures that
nodes which are not responding, for reasons such as
hardware errors, are considered to be down.

The time T, in which a node that is not responding is
detected, is calculated as: MinT < T < MaxT where:

MinT = TickTime - TickTime / 4
MaxT = TickTime + TickTime / 4

TickTime is by default 60 (seconds). Thus, 45 < T < 75
seconds.

Note: All communicating nodes should have the same
TickTime value specified.

Note: Normally, a terminating node is detected immediately.
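
To make the formula above concrete, here is a small Python sketch that evaluates MinT and MaxT for a few tick times. The function name detection_window and the sample values 20 and 8 are illustrative choices, not settings recommended anywhere in this thread.

```python
def detection_window(tick_time):
    """Return (MinT, MaxT) in seconds for a given net_ticktime.

    Implements MinT = TickTime - TickTime / 4 and
    MaxT = TickTime + TickTime / 4 from the excerpt above.
    """
    quarter = tick_time / 4
    return tick_time - quarter, tick_time + quarter


for tick_time in (60, 20, 8):
    min_t, max_t = detection_window(tick_time)
    print(f"net_ticktime={tick_time}: {min_t:g} < T < {max_t:g} seconds")
```

With the default of 60 this gives the 45 to 75 second window quoted above, which matches the roughly 75 seconds observed earlier in the thread. A value of 8 would bring the upper bound down to 10 seconds. In plain Erlang the value is normally passed as a kernel parameter (erl -kernel net_ticktime 8); how to pass it to the Erlang VM that Membase starts is not covered in this thread, and a short tick time makes false "node down" events more likely on a slow or congested network.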
