This is similar to http://www.couchbase.com/communities/q-and-a/java-client-not-aware-about-failed-over-node#comment-1940
This is using the 1.4.2 Java client.
If I fail over from the admin console while the node is still up, it does not reproduce. The best way I found to simulate a node suddenly becoming unresponsive is to use iptables to block all traffic except SSH on port 22, like so:
To block a node
iptables -A INPUT -p tcp --dport 22 -j ACCEPT; iptables -A INPUT -j DROP
To unblock a node
iptables -F
It also does not seem to reproduce if I remove the nested try/catch (i.e. don’t try to read from the replica).
Failover does not seem to be instantaneous; it takes 1-2 minutes with my hardware and setup. The following steps reproduce the problem with the code below (it might be reproducible with fewer steps, but this sequence is consistent):
- Create a two node cluster with 1 level of replication
- Set the code below with the proper host names, bucket name and password
- Run the code and you will see “Got From Master” 5 times
- It will then pause and ask you to block traffic from the master node
- Look at the admin console to see which node is the master for the key
- Block the master node with iptables and then hit ‘Enter’
- Go back to the output and it will then output “Got From Replica” 10 times
- It will then pause and ask you to go to the admin console (on the replica node)
- Wait until the master node is marked as “Down”
- Once it is marked as “Down”, fail over the master node
- Go back to the console and hit ‘Enter’
At this point the console should continue printing “Got From Replica”. In the admin console, the replica node still shows 0 active items and 1 replica item. After 1-3 minutes it suddenly shows 1 active item and 0 replica items (failover seems delayed). At that same moment, an exception starts showing up in the console.
Expected: Once the node is fully failed over, the client should no longer need to read from the replica and should read from the promoted master.
Observed: The client doesn’t seem to be able to read from either the master or the replica. It appears that the client is not marking the promoted replica as the new master.
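To pin down how long each phase actually lasts (rather than eyeballing "1-3 minutes"), something like the following could be called once per read in the test loop. This is only a sketch: `OutcomeTimeline` and its methods are hypothetical names, not part of the Couchbase client, and the outcome labels ("master", "replica", "exception") are whatever the caller chooses to pass in.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical helper (not part of the Couchbase SDK): records how long
 * each read outcome lasted before the outcome changed.
 */
final class OutcomeTimeline {
    private final List<String> entries = new ArrayList<String>();
    private String lastOutcome;
    private long lastChangeMillis;

    /**
     * Call once per read with a label such as "master", "replica" or
     * "exception". Returns true when the outcome differs from the last one.
     */
    boolean record(String outcome, long nowMillis) {
        if (outcome.equals(lastOutcome)) {
            return false; // same phase as before, nothing to log
        }
        if (lastOutcome != null) {
            // Close out the previous phase with its measured duration.
            entries.add(lastOutcome + " lasted " + (nowMillis - lastChangeMillis) + " ms");
        }
        lastOutcome = outcome;
        lastChangeMillis = nowMillis;
        return true;
    }

    /** Completed phases, oldest first. */
    List<String> entries() {
        return entries;
    }
}
```

In the reproduction loop this would be fed with `System.currentTimeMillis()` after each read, so the gap between the manual failover and the first exception shows up as the duration of the "replica" phase.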
Questions:
- What is going on during the failover? I would have thought failover would be very fast and not take 1-5 minutes, especially since I only have one item in the store.
- Does anyone know of a workaround? If I catch the exception and rebuild the client, it works. But this would be horrible since the client is accessed by multiple threads.
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

import net.spy.memcached.OperationTimeoutException;

import com.couchbase.client.CouchbaseClient;

public class CouchbaseClientTester {

    public static void main(String[] args) throws Exception {
        List<URI> hosts = new ArrayList<URI>();
        hosts.add(new URI("http://node1:8091/pools"));
        hosts.add(new URI("http://node2:8091/pools"));
        CouchbaseClient client = new CouchbaseClient(hosts, "bucketName", "password");

        String key = UUID.randomUUID().toString();
        client.add(key, "value");

        for (int i = 0; i < 5; i++) {
            couchbaseGet(client, key);
            Thread.sleep(1000);
        }

        System.out.println("Block traffic from master node then Press 'Enter'");
        System.in.read();

        for (int i = 0; i < 10; i++) {
            couchbaseGet(client, key);
            Thread.sleep(1000);
        }

        System.out.println("Look at admin console 'Server Nodes' tab (from the replica node) and wait until the master node is marked as down (this can take 10-30 seconds). Then fail it over and then press 'Enter'");
        System.in.read();

        while (true) {
            couchbaseGet(client, key);
            Thread.sleep(1000);
        }
    }

    private static Object couchbaseGet(CouchbaseClient client, String key) {
        Object result = null;
        try {
            result = client.get(key);
            System.out.println("Got From Master");
        }
        catch (OperationTimeoutException ex) {
            try {
                result = client.getFromReplica(key);
                System.out.println("Got From Replica");
            }
            catch (OperationTimeoutException innerEx) {
                //client.shutdown();
                //createClient();
                System.out.println("Exception");
                innerEx.printStackTrace();
            }
        }
        return result;
    }
}
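For the rebuild workaround mentioned in the questions, one way to make it less horrible with multiple threads might be to put the shared client behind a small holder that lets only one thread swap it out. The following is a rough, untested-against-Couchbase sketch: `RebuildableClient` is a hypothetical class, not something the SDK provides, and it assumes Java 8 for `Supplier` and `Consumer`. The factory would wrap `new CouchbaseClient(hosts, "bucketName", "password")` (including its checked exception) and the closer would call `shutdown()` on the old client.

```java
import java.util.function.Consumer;
import java.util.function.Supplier;

/**
 * Hypothetical helper (not part of the Couchbase SDK): holds one shared
 * client and lets exactly one thread replace it after a failure.
 */
final class RebuildableClient<T> {
    private final Supplier<T> factory; // assumed to build a fresh client
    private final Consumer<T> closer;  // assumed to shut the old client down
    private volatile T current;

    RebuildableClient(Supplier<T> factory, Consumer<T> closer) {
        this.factory = factory;
        this.closer = closer;
        this.current = factory.get();
    }

    /** Returns the client that callers should use right now. */
    T get() {
        return current;
    }

    /**
     * Replaces the client only if 'failed' is still the current instance.
     * Threads that hit the same failure concurrently skip the rebuild and
     * get the replacement that another thread already installed.
     */
    synchronized T rebuild(T failed) {
        if (current == failed) {
            current = factory.get();
            closer.accept(failed);
        }
        return current;
    }
}
```

A caller would do `T c = holder.get()`, use it, and on the inner `OperationTimeoutException` call `holder.rebuild(c)` and retry with the returned instance. Passing the failed instance in is what keeps ten threads that time out together from building ten clients. Note that other threads may still be mid-operation on the old client when it is shut down, so this only narrows the problem; it does not explain why the original client never picks up the promoted master.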