From time to time we get questions on topics that do not directly fit into the documentation since they dig deeper into the internals of the client libraries. In this series we cover different themes of interest; this post looks at how the SDK bootstraps itself.

Note that this post was written with the Java SDK 2.5.9 / 2.6.0 releases in mind. Things might change over time, but the overall approach should stay mostly the same.

First Steps

To know which remote hosts to ask for a configuration, the SDK relies on the list of hostnames that the user provides through the CouchbaseCluster API. So for example if this code is provided:
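A sketch with the 2.x synchronous API (the hostnames are placeholders):

```java
// The three hosts are only stored as seed nodes; no sockets are opened yet.
CouchbaseCluster cluster = CouchbaseCluster.create("node1", "node2", "node3");
```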

Then we are going to use node1, node2 and node3 as part of the initial bootstrap process. It’s important to understand that at this point no actual connection is established just yet. The code just stores these three hosts in the ConfigurationProvider as seed hosts. They are not stored as strings though, but rather as NetworkAddress instances, each of which provides a convenient and more feature-rich wrapper over the plain JDK InetAddress. For example, it can be configured through system properties to disallow reverse DNS lookups or to force IPv4 over IPv6 even if the JDK is configured differently.

If no host is provided, localhost will be used as the seed host.

For readers out there who are also using an SDK which relies on the connection string for configuration: the Java SDK will parse such a string when given, but will not use any of the configuration settings other than the list of bootstrap nodes. Every configuration must be applied through the CouchbaseEnvironment and its builder, or through system properties where available.
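If you want to reuse an existing connection string, a sketch with the 2.x API (the hostnames are placeholders):

```java
// Only the node list is taken from the connection string; every other
// setting must still come from the CouchbaseEnvironment.
CouchbaseCluster cluster = CouchbaseCluster
    .fromConnectionString("couchbase://node1,node2,node3");
```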

One last thing we need to cover before we can open a bucket and get into the weeds. If the CouchbaseEnvironment is configured with dnsSrvEnabled, then during the cluster construction phase the code will perform a DNS SRV lookup to retrieve the actual hosts instead of the DNS SRV reference host. Plain JVM classes are used to perform this lookup.
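A minimal sketch of such a lookup with plain JDK (JNDI) classes; the class name, parsing helpers and service name are illustrative, not the SDK’s actual code:

```java
import java.util.Hashtable;
import javax.naming.NamingEnumeration;
import javax.naming.NamingException;
import javax.naming.directory.Attribute;
import javax.naming.directory.InitialDirContext;

class DnsSrvExample {

    /** An SRV record has the shape "priority weight port target". */
    static String hostFromRecord(String record) {
        String target = record.split(" ")[3];
        // Strip the trailing dot from the fully qualified target host.
        return target.endsWith(".") ? target.substring(0, target.length() - 1) : target;
    }

    static int portFromRecord(String record) {
        return Integer.parseInt(record.split(" ")[2]);
    }

    /** Fetches the raw SRV records, e.g. for "_couchbase._tcp.example.com". */
    static NamingEnumeration<?> lookup(String service) throws NamingException {
        Hashtable<String, String> env = new Hashtable<>();
        env.put("java.naming.factory.initial", "com.sun.jndi.dns.DnsContextFactory");
        Attribute srv = new InitialDirContext(env)
            .getAttributes(service, new String[] { "SRV" })
            .get("SRV");
        return srv.getAll();
    }
}
```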

Getting a Config

Now that we have a list of seed nodes to work with, as soon as a bucket gets opened the real work begins. Let’s consider the following code snippet as a starting point:
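A sketch with the 2.x synchronous API (the bucket name is a placeholder):

```java
// Opening the bucket kicks off the config bootstrap described below.
Bucket bucket = cluster.openBucket("travel-sample");
```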

The act of opening a bucket has lots of small steps that need to be completed, but the first and most important step is to fetch an initial, good configuration from one of the nodes in the cluster.

The ConfigurationProvider builds up a chain of config loaders, where each loader has a different mechanism of trying to get a configuration. Currently there are two different loaders in place:

  • CarrierLoader: tries to fetch the configuration via the memcached binary protocol from the KV service (port 11210 by default).
  • HttpLoader: tries to fetch the config from the cluster manager (port 8091 by default).

The CarrierLoader takes precedence since it is the more scalable option (the KV service can handle many more requests per second than the cluster manager), and if it is not available the SDK will fall back to the HttpLoader. One of the reasons why it could fail is that certain (older) Couchbase Server releases don’t support this mechanism. Other failures include port 11210 being blocked by firewalls (which leads to other problems down the road anyway) or sockets timing out.
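This precedence can be sketched as a simple fallback chain; the types and names here are illustrative, not the SDK’s internal ones:

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Supplier;

class LoaderChain {

    /**
     * Tries each loader in order of preference (carrier first, then HTTP)
     * and returns the first config that could be fetched.
     */
    static Optional<String> load(List<Supplier<Optional<String>>> loaders) {
        for (Supplier<Optional<String>> loader : loaders) {
            Optional<String> config = loader.get();
            if (config.isPresent()) {
                return config;
            }
        }
        return Optional.empty();
    }
}
```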

The HttpLoader itself also has two HTTP endpoints to fetch from (again depending on the Couchbase Server version, for backwards compatibility): one that provides a terse config and one that provides a verbose one. The verbose one has been there since the Couchbase Server 1.8/2.0 days and is guaranteed to work, but newer features are likely not exposed there.

You get the theme here: the newest and most scalable approaches are tried first, with fallback options for older server versions and custom configurations. In general the cluster manager is always responsible for managing and providing the current up-to-date config, but it is sent to the KV service as well so that the SDKs can retrieve it with a special command.

As a special case for older servers, if the config was loaded through the HttpLoader, the SDK will attach a streaming connection to one of the HTTP endpoints to receive config updates. If the CarrierLoader was successful, we’ll instead keep polling for a config via the binary protocol at a configurable polling interval, since that is more resource efficient on the server side (the KV service is much better suited to handle a potentially large number of requests than the cluster manager, which is why carrier loading was implemented in the first place).

We’ll cover the different error cases later on; for now assume we were able to get a valid configuration from one of the loaders. At this point it is sent back to the configuration provider, where it is checked and then forwarded to the RequestHandler.

It is possible that the configuration provider does not forward the configuration, the main reason being that it is outdated or not newer than the current one. Every configuration comes with a revision, and the SDK internally tracks this revision number so that only configurations newer than the one already processed are propagated. During a rebalance operation this can happen quite often, and if you have debug logging enabled you can spot it by looking for a log line with the message: Not applying new configuration, older or same rev ID.
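The revision check boils down to a strictly-greater-than comparison; a minimal sketch (the class and method names are made up for illustration):

```java
// A new config is only applied when its revision is strictly newer
// than the one currently in use.
class ConfigTracker {
    private long currentRev = -1;

    /** Returns true if a config with the given rev should be applied. */
    boolean shouldApply(long rev) {
        if (rev <= currentRev) {
            // The "Not applying new configuration, older or same rev ID" case.
            return false;
        }
        currentRev = rev;
        return true;
    }
}
```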

As soon as the RequestHandler  gets a new config pushed it starts a reconfiguration. The reconfiguration has only one job: to take the configuration provided by the server and change the internal SDK’s state to mirror it. This can mean adding or removing node references, service endpoints or sockets. The SDK mirrors the common language of the server, so it manages Nodes which provide Services that can have one or more Endpoints.

If you enable debug logging, you can watch during bootstrap how the different nodes, services and endpoints are enabled based on the configuration provided.

You can also see that it just ignores certain operations if the desired outcome is already achieved. You’ll see most of these messages when new configurations arrive, that is during bootstrap, rebalance or failover.

Reconfiguration, much like rebalance on the server, is an “online” operation. This means that in-flight operations are not affected most of the time, and if they are, they are rescheduled and put back into the ringbuffer until they succeed or time out.

As soon as the first reconfiguration has succeeded, our bucket open request is completed and the Bucket handle is returned on the synchronous API. Herein lies a non-obvious tradeoff: we could complete the bucket open request right away, but then we would not catch any errors during bootstrap. Waiting for the first reconfiguration to complete allows us to catch more errors right away, but it also means that the more nodes are added to the cluster, the longer it will take, so make sure to measure the bootstrap performance for your cluster. The default timeout of 5 seconds should be plenty for most cases, but in slower and bigger cloud deployments it can make sense to increase it.

You can do this by either modifying the environment:
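For example (a sketch with the 2.x DefaultCouchbaseEnvironment builder; connectTimeout is specified in milliseconds and the 10-second value is illustrative):

```java
// Raise the connect (bucket open) timeout from the default 5s to 10s.
CouchbaseEnvironment env = DefaultCouchbaseEnvironment.builder()
    .connectTimeout(TimeUnit.SECONDS.toMillis(10))
    .build();
CouchbaseCluster cluster = CouchbaseCluster.create(env, "node1");
```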

or directly on the openBucket call:
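Using the overload that takes an explicit timeout (the bucket name and the value are placeholders):

```java
// Wait up to 10 seconds for this particular bucket open.
Bucket bucket = cluster.openBucket("travel-sample", 10, TimeUnit.SECONDS);
```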

Bootstrap Failures and Configuration

What happens if one or all of the provided seed hosts are not available? To answer this question we need to take a quick look at how the loaders try to fetch a configuration.

For every host in the seed node list, a primary and a fallback loader are assembled, and all of them fire off in parallel trying to grab a proper config. Once the first one is found, the observable sequence is unsubscribed and only that config is used going forward. Using this approach rather than sequencing through the hosts provides faster bootstrap times, but it may also lead to more errors in the log if one or more nodes are not reachable. Since we are optimizing for the good case (in the failure case something is wrong anyway), this seems like an acceptable tradeoff.
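The race across nodes can be sketched with plain JDK futures; error handling and unsubscription are elided, and the names are illustrative rather than the SDK’s Rx-based internals:

```java
import java.util.concurrent.CompletableFuture;

class BootstrapRace {

    /**
     * All bootstrap attempts run in parallel; the first one to complete
     * wins and the remaining attempts are simply ignored.
     */
    @SafeVarargs
    static String firstConfig(CompletableFuture<String>... attempts) {
        return (String) CompletableFuture.anyOf(attempts).join();
    }
}
```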

In essence, as long as there is a good node to bootstrap from in the seed node list, the SDK will be able to pick up a configuration. Let’s consider a special case: in an MDS (multi-dimensional scaling) setup you instruct the SDK to only bootstrap from a node which does not contain the KV service (say, only the query service and of course the cluster manager).

If you have debug logging enabled, you’ll see the SDK trying to fetch a config from the KV service but being rejected with an error.

The specific error doesn’t matter; later in the log you can see how the loader switches to the HTTP fallback, which is available.

One implication of this is that even if the KV service is available on another node, in this case the HTTP loader will still open the streaming connection to fetch subsequent configurations.

It is possible to customize the bootstrap behavior through environment settings:

  • bootstrapHttpEnabled: If the HttpLoader is activated at all (true by default).
  • bootstrapCarrierEnabled: If the CarrierLoader is activated at all (true by default).
  • bootstrapHttpDirectPort: Which port the HttpLoader should initially contact if SSL is not enabled (default 8091).
  • bootstrapHttpSslPort: Which port the HttpLoader should initially contact if SSL is enabled (default 18091).
  • bootstrapCarrierDirectPort: Which port the CarrierLoader should initially contact if SSL is not enabled (default 11210).
  • bootstrapCarrierSslPort: Which port the CarrierLoader should initially contact if SSL is enabled (default 11207).

These settings have sane defaults and in general do not need to be modified. They can become useful in debugging scenarios or when custom builds are tested inside Couchbase.
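They are set on the environment builder like any other tunable (a sketch with the 2.x builder; the values shown are just the documented defaults):

```java
// Explicitly set the bootstrap-related options (these are the defaults).
CouchbaseEnvironment env = DefaultCouchbaseEnvironment.builder()
    .bootstrapHttpEnabled(true)
    .bootstrapCarrierEnabled(true)
    .bootstrapHttpDirectPort(8091)
    .bootstrapCarrierDirectPort(11210)
    .build();
```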

Final Thoughts

Usually bootstrap is smooth and resilient to individual node failures, but if something goes wrong and the bucket doesn’t open properly, the first course of action is always to look at the logs at INFO/WARN level and, if possible, with DEBUG logging enabled. This usually provides insight into which sockets did not connect properly or where in the bootstrap phase the client had to stop (because of errors or timeouts).

Also remember that if you run large clusters, specifying a longer bucket open timeout than the usual 5 seconds might make sense if the network is slow or has a higher latency variance than usual.


Posted by Michael Nitschinger

Michael Nitschinger works as a Principal Software Engineer at Couchbase. He is the architect and maintainer of the Couchbase Java SDK, one of the first completely reactive database drivers on the JVM. He also authored and maintains the Couchbase Spark Connector. Michael is active in the open source community, a contributor to various other projects like RxJava and Netty.


  1. Susantha Bathige October 4, 2018 at 12:24 pm

    Nice article! Thanks.
    Just curious about what logs at server side has the bootstrap info. Is it “memcached.log”? Appreciate if you could shed more light into this. Thanks.

  2. Yes, the memcached.log on the server side has some of the information. Although it is not logging a lot (other than the newer “hello” command) since it could be very spammy.

    Anything specifically you are looking for?
