Java client not aware of failed-over node under certain circumstances

This is similar to http://www.couchbase.com/communities/q-and-a/java-client-not-aware-about...

This is with the 1.4.2 Java client.

If I fail over from the admin console while the node is still up, the problem does not reproduce. The best way I found to simulate a node suddenly becoming unresponsive is to use iptables to block all incoming traffic except SSH on port 22, like so:

To block a node

iptables -A INPUT -p tcp --dport 22 -j ACCEPT; iptables -A INPUT -j DROP

To unblock a node

iptables -F

It also does not seem to reproduce if I remove the nested try/catch (i.e. if I don't try to read from the replica).

Failover does not seem to be instantaneous; it takes 1-2 minutes with my hardware and setup. The following steps reproduce the problem with the code below (it might be reproducible with fewer steps, but this sequence is consistent):

1) Create a two node cluster with 1 level of replication
2) Set the code below with the proper host names, bucket name and password
3) Run the code and you will see "Got From Master" 5 times
4) It will then pause and ask you to block traffic from the master node
5) Look at the admin console to see which node is the master for the key
6) Block the master node with iptables and then hit 'Enter'
7) Go back to the program output; it will now print "Got From Replica" 10 times
8) It will then pause and ask you to go to the admin console (on the replica node)
9) Wait until the master node is marked as "Down"
10) Once it is marked as "Down", fail over the master node
11) Go back to the program's console and hit 'Enter'

At this point the program should continue printing "Got From Replica". If you look at the admin console, the replica node still shows 0 items active and 1 item replicated. After 1-3 minutes it should suddenly show 1 item active and 0 items replicated (the failover seems delayed). At that same moment, the program starts printing exceptions.
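
To pin down how long each phase actually lasts (my estimates above are rough), one option is to timestamp every change in the read outcome. The sketch below is a hypothetical helper and not part of the Couchbase client; the class name OutcomeTimeline and the outcome labels are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: records how long each read outcome lasted before it changed. */
final class OutcomeTimeline {

    private final List<String> transitions = new ArrayList<String>();
    private String lastOutcome;      // outcome seen on the previous call, null at start
    private long lastChangeMillis;   // time at which lastOutcome was first seen

    /** Call once per read with a label such as "MASTER", "REPLICA" or "EXCEPTION". */
    synchronized void record(String outcome, long nowMillis) {
        if (outcome.equals(lastOutcome)) {
            return; // same state as before, nothing to log
        }
        if (lastOutcome != null) {
            long elapsed = nowMillis - lastChangeMillis;
            transitions.add(lastOutcome + " -> " + outcome + " after " + elapsed + " ms");
        }
        lastOutcome = outcome;
        lastChangeMillis = nowMillis;
    }

    /** Returns a copy of the transitions recorded so far. */
    synchronized List<String> transitions() {
        return new ArrayList<String>(transitions);
    }
}
```

Calling record(label, System.currentTimeMillis()) next to each of the three print statements in the test program would show exactly when reads move from master to replica and from replica to exceptions.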

Expected: Once the node is fully failed over, the client should no longer need to read from the replica and should read from the promoted master.

Observed: It doesn't seem to be able to read from the master or the replica. It appears that the client is not marking the promoted replica as the new master.

Questions:

1) What is going on during the failover? I would have expected failover to be very fast rather than take 1-5 minutes, especially since I only have one item in the store.
2) Does anyone know of a workaround? If I catch the exception and rebuild the client, it works. But this would be horrible, since the client is accessed by multiple threads.

package com.couchbase.failover.testing;
 
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
 
import net.spy.memcached.OperationTimeoutException;
 
import com.couchbase.client.CouchbaseClient;
 
public class CouchbaseClientTester {
 
    public static void main( String[] args ) throws Exception {
 
        List<URI> hosts = new ArrayList<URI>();
 
        hosts.add(new URI("http://node1:8091/pools"));
        hosts.add(new URI("http://node2:8091/pools"));
 
        CouchbaseClient client = new CouchbaseClient(hosts, "bucketName", "password");
 
        String key = UUID.randomUUID().toString();
 
        client.add(key, "value");
 
        for(int i = 0; i < 5; i++) {
 
            couchbaseGet(client, key);
 
            Thread.sleep(1000);
        }
 
        System.out.println("Block traffic from master node then Press 'Enter'");
        System.in.read();
 
        for(int i = 0; i < 10; i++) {
 
            couchbaseGet(client, key);
 
            Thread.sleep(1000);
        }
 
        System.out.println("Look at admin console 'Server Nodes' tab (from the replica node) and wait until the master node is marked as down (this can take 10-30 seconds). Then fail it over and then press 'Enter'");
        System.in.read();
 
        while(true) {
 
            couchbaseGet(client, key);
 
            Thread.sleep(1000);
        }
    }
 
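    /**
     * Reads the key from the master and falls back to a replica read if the
     * master read times out. Returns null if both reads time out (or if the
     * key does not exist).
     */
    private static Object couchbaseGet(CouchbaseClient client, String key) {
 
        Object result = null;
 
        try {
            result = client.get(key);
            System.out.println("Got From Master");
        }
        catch (OperationTimeoutException ex) {
            // The master read timed out, so try the replica instead.
            try {
                result = client.getFromReplica(key);
                System.out.println("Got From Replica");
            }
            catch (OperationTimeoutException innerEx) {
                // Rebuilding the client here works around the problem
                // (see question 2), but is left disabled for the repro.
                //client.shutdown();
                //createClient();
                System.out.println("Exception");
                innerEx.printStackTrace();
            }
        }
 
        return result;
    }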
}
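
On the workaround in question 2: one possible way to make "rebuild the client" tolerable when many threads share it is to let only the first thread that notices the failure do the rebuild. This is a minimal sketch under assumptions, not a fix for the underlying issue. RebuildableClient is a hypothetical helper (not part of the Couchbase client), it assumes Java 8 or later for Supplier and Consumer, and the factory and closer are assumed to wrap something like new CouchbaseClient(hosts, bucket, password) (with its checked exception handled) and client.shutdown().

```java
import java.util.function.Consumer;
import java.util.function.Supplier;

/**
 * Hypothetical helper: holds one shared client and lets many threads ask for
 * a rebuild, while only one of them actually performs it.
 */
final class RebuildableClient<T> {

    private final Supplier<T> factory;  // assumed to create a fresh client
    private final Consumer<T> closer;   // assumed to shut an old client down
    private volatile T current;         // volatile so readers need no lock

    RebuildableClient(Supplier<T> factory, Consumer<T> closer) {
        this.factory = factory;
        this.closer = closer;
        this.current = factory.get();
    }

    /** Returns the client that callers should use right now. */
    T get() {
        return current;
    }

    /**
     * Replaces the client only if 'failed' is still the current one.
     * A thread that lost the race gets the client another thread already built.
     */
    synchronized T rebuild(T failed) {
        if (current != failed) {
            return current; // somebody else already rebuilt it
        }
        T fresh = factory.get();
        current = fresh;
        closer.accept(failed);
        return fresh;
    }
}
```

Each worker would call get(), and on a timeout from both master and replica call rebuild(theClientItUsed) and retry once. Note that shutting down the old client can fail operations that other threads still have in flight on it; in this sketch those threads would simply hit the same path and pick up the new client.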

I just realized where the JIRA for the Java client is, so I filed a ticket:

http://www.couchbase.com/issues/browse/JCBC-467
