CCCP: Cluster Configuration Carrier Publication
Matt Ingenthron <firstname.lastname@example.org>
30 July 2013
SDK Development Team, Cluster Development Team, Sync Gateway Development Team
email@example.com mailing list
Introduce a new method of publishing the configuration to client libraries designed to access the cluster. This project delivers a new public interface and documented guidance for how client libraries and proxies are to initially learn the topology of the cluster and receive updates as the topology changes, and it introduces a feature that would protect data integrity in certain complex failure situations.
The current approach to configuration publishing is reaching the limits of its implementation, though it has served us reasonably well over the last three years. While the implementation could be improved incrementally, a new architectural approach could give deployments higher reliability, less impact on applications during changes to cluster topology, and new features for reliably operating a cluster in certain complex failure situations.
As mentioned above, we do have a number of issues with our current approach.
The largest and most visible of these is that even a moderate number of configuration streams can overwhelm individual servers, and could easily overwhelm a cluster, mostly because of the large amount of cluster-wide state included in each response. Maintaining an additional HTTP connection also unnecessarily complicates the client. Nor is it the most efficient route for communicating changes: today a topology change is frequently signaled by a not_my_vbucket response, after which the client must retry the operation once it receives the new configuration, either proactively or by asking over the HTTP channel, which can cost tens or hundreds of milliseconds. Finally, client libraries currently have to apply heuristics to guess whether the configuration service is unavailable when they begin to see data protocol operations fail.
This project would move the minimal configuration required by a client to publication via the memcached protocol.
Further details on the anticipated implementation are in the appropriate section below.
This project assumes that the existing HTTP stream may be deprecated, but not removed as a supported public interface. To keep the cluster functioning reliably, changes introduced in version 2.1.0 relieve pressure by pushing back on client requests, and this is expected to keep the HTTP stream a reasonably reliable, though not necessarily scalable, interface through the 2.x series.
This project also does not cover any needed changes to moxi and assumes that moxi, which ships as both a bundled and unbundled component of Couchbase Server, will continue to use the HTTP stream until it makes sense to change within moxi's lifecycle.
As is currently the case in Couchbase clusters, it is assumed that the configuration received from any one node is safe to use. Client libraries need not do any validation.
Couchbase has moved into larger and larger deployments. Some of this growth has occurred while Couchbase, starting with the 1.8 release, has come to prefer client libraries that interact with the cluster more directly. This was done to provide higher reliability and performance for our customers, though we knew it would require some changes to their applications.
As described above, there are some problems with our current implementation of HTTP streaming that cause performance and stability issues as we scale clusters.
The market is any deployment in which there are a large number of uncoordinated clients. This can be a large number of Java or .NET clients or, owing to its multiprocessing model, deployments typically using libcouchbase with many processes per system. It can also help deployments of any size, since it simplifies fault detection and communication of changes.
A set of QE-driven tests will be carried out to demonstrate that a reasonably sized cluster can handle two scenarios.
One is thousands of client library instances making and receiving configuration updates with no more than a 5% difference in resource utilization. The actual resource utilization is expected to be much less.
A second test that should be performed is one in which the network is interrupted between hundreds or thousands of client-library instances and the cluster which would cause the clients to re-establish all configuration and data channels. This test would be successful if all clients are able to reestablish their connections in less than 30 seconds and continue data operations.
This project would introduce a new way to configure client libraries. It would be a new public interface at the cluster and add new API to use it from client libraries.
With this new interface, the client libraries will retrieve the bootstrap configuration for Couchbase buckets from one or more nodes in the cluster. memcached bucket types are not covered by this proposal and would return an error if a configuration request were received.
This would consist of a new command opcode, usable either before authentication (in a "default" bucket situation) or after authentication, to get the configuration for a given bucket. It would also introduce new behavior: when a request indicates that the client does not know the current topology, the new configuration is sent as a rider on a modified version of the existing NOT_MY_VBUCKET response.
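To make the wire-level shape of such a request concrete, the sketch below frames a standard 24-byte memcached binary-protocol header for the new command. The opcode value and the command's name are assumptions for illustration; the proposal does not assign either, and the final value would come from the memcached protocol maintainers.

```python
import struct

# Hypothetical opcode for the new configuration command; the actual
# value is NOT specified by this proposal and would be assigned later.
CMD_GET_CLUSTER_CONFIG = 0xB5
REQ_MAGIC = 0x80  # memcached binary protocol request magic byte


def build_config_request(opaque=0):
    """Frame a request with no key, no extras, and no value: just the
    fixed 24-byte memcached binary header (magic, opcode, key length,
    extras length, data type, vbucket, body length, opaque, CAS)."""
    return struct.pack(
        ">BBHBBHIIQ",
        REQ_MAGIC,
        CMD_GET_CLUSTER_CONFIG,
        0,  # key length
        0,  # extras length
        0,  # data type
        0,  # vbucket id
        0,  # total body length
        opaque,
        0,  # CAS
    )


packet = build_config_request(opaque=0xDEADBEEF)
assert len(packet) == 24
```

The response body would carry the (bucket-scoped) configuration itself, which keeps the payload far smaller than the cluster-wide state in today's HTTP stream.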
The client library would randomly select a host from which to request an initial configuration, and subsequently update the cluster map it is using based on NOT_MY_VBUCKET responses. In the configuration it must also retrieve the UUID associated with the bucket, for use in any future reconfiguration events, so the bucket instance UUID is expected to be available in the configuration request and response. For the lifetime of a client library instance, the bucket and UUID will be used and verified. However, the developer/administrator is not expected to provide the UUID, and the client library is not expected to store it across the creation of (memory-volatile) client library instances.
As an architectural note, it is assumed that either the configuration from any node of the cluster can be trusted as an initial starting point, or that node will not be available. If a node is 'partially' available, meaning it allows connections to be established, it must not allow authentication or configuration requests if a configuration request cannot be properly serviced. Any error on an authentication or configuration request will cause the client to request a configuration from another node in the list supplied by the developer/administrator. Since the initial configuration request is sent to a random node in that list, it is critically important that the developer/administrator maintain any application deployment with a list of trusted servers.
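The selection-and-failover behavior described above can be sketched as follows. The `fetch_config` callable stands in for the real transport (the new memcached command); its dict-shaped return value and the `BootstrapError` name are assumptions for illustration, not part of the proposal.

```python
import random


class BootstrapError(Exception):
    """Raised when no node in the supplied list yields a configuration."""


def bootstrap(hosts, fetch_config):
    """Request an initial configuration from a randomly chosen node,
    failing over to the remaining nodes on any error.

    `hosts` is the developer/administrator-supplied trusted node list;
    `fetch_config` is a placeholder for the real transport call and is
    assumed to return a dict holding the vbucket map and bucket UUID.
    """
    candidates = list(hosts)
    random.shuffle(candidates)  # spread initial requests across the cluster
    for host in candidates:
        try:
            config = fetch_config(host)
        except (IOError, OSError):
            continue  # node down or only partially available: try the next
        # The bucket UUID is kept for the lifetime of this client instance
        # and verified on reconfiguration, but never persisted.
        return config["uuid"], config["vbucket_map"]
    raise BootstrapError("no node in the supplied list returned a configuration")
```

Because every node's answer is trusted, the loop needs no cross-node validation; it only needs one healthy node from the supplied list.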
It is not proposed in this project that any existing bootstrap methods be removed from either clients or servers. Where existing bootstrap methods are hostname- or hostname:port-based, the semantics would change to try the new configuration method first and then fail back to the older behavior. Where existing bootstrap methods are URI-based, they will continue to use the old configuration methods. All client libraries implementing this proposal (and all future client libraries) will have similar bootstrapping behavior, as described above.
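The bootstrap ordering above can be sketched briefly. `try_cccp` and `try_http` are placeholders for the two transports (each assumed to return a configuration or raise on failure), and the `http://` prefix check is a simplified stand-in for real URI detection.

```python
def configure(bootstrap_spec, try_cccp, try_http):
    """Choose a bootstrap transport per the ordering in this proposal:
    URI-based specs keep the old HTTP behavior unchanged; hostname or
    hostname:port specs try the new memcached-based method first and
    fail back to the HTTP stream."""
    if bootstrap_spec.startswith("http://"):
        # URI-based bootstrap: old behavior, unchanged.
        return try_http(bootstrap_spec)
    try:
        # hostname / hostname:port: prefer the new configuration method.
        return try_cccp(bootstrap_spec)
    except Exception:
        # Older cluster or command unsupported: fail back to HTTP.
        return try_http(bootstrap_spec)
```

This keeps a single client binary compatible with both pre- and post-change clusters, which is why no existing bootstrap method needs to be removed.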
Sync Gateway issue 129 - change to support new config method
Create a new RESTful endpoint that has only a small amount of data, and thus is lighter weight on the cluster.
This could well work, but does not decrease the latency with which configuration changes are expected to be propagated or increase reliability.
Changes to ns-server, ep-engine, possibly memcached. Changes to libcouchbase, the Java client, the .NET client.
Changes to moxi to use this new configuration method.
Any configuration which allows for a load balancer for traffic in this new approach.
Any change in logic to how view requests are distributed. These would continue to operate as they currently do against the same configuration.
Any attempt to compensate for misconfigured or unusual DNS/hostname configurations. In the current implementation, the cluster and cluster configuration are keyed by node hostnames (or IP addresses, which we unfortunately default to), which are published via the configuration. Consistent name resolution across the hosts participating in the cluster and the client-side application is expected.
A new set of operations at the memcached protocol level along with a documented way to use those interfaces from a client library.
Updates to each client library's documentation on how to use the new configuration bootstrap method. An internal or external document that describes how to use this bootstrap method by new client libraries.
To migrate to the new approach, software developers may need to make changes to their client configuration and/or software. The specific changes are not to be covered in the scope of this document.
There is no administration impact anticipated, other than possibly needing to roll out the new version of the cluster and new client libraries.
New dot-minor versions of all components mentioned would be required. Given that no existing interfaces are expected to be removed, there is no additional impact over a regular dot-minor upgrade.
This change reduces the dependency on HTTP for retrieving the configuration. It does not directly enhance security, but it removes one more item from the scope of things to be concerned with. There is a separate project addressing credential and transport security for memcached protocol needs.
None outside the mentioned components. This spans many components on both the client and the server.
Work under MB-8578 to track bucket UUIDs at ep-engine would be required, but this is already completed.
None at this time. Should link to private commands in memcached and public commands in memcached, but it's not clear where those are documented outside the code. Need assistance from ep-engine/memcached teams.
This project would be delivered in Couchbase Server 3.0.
Estimated as a week or two of engineering time across server components and an equivalent week or two for clients. The QE effort is likely higher, and no estimate for it is provided in this draft.
It is expected that builds leading up to 2.2, or separate builds, would deliver the first prototype, with the reference implementation and documentation likely being led by the libcouchbase project.
Since this is a committed change, cost has not been estimated.
Removed section on polling for updated configuration to support memcached bucket types. 2013-07-24
Removed notion of quorum. 2013-07-24
Added note about change in view request distribution remaining the same. 2013-07-24
Noted that initial configuration requests will be randomly distributed over the cluster. 2013-07-25
Added a note that this helps with even small deployments. 2013-07-25
Additional review and simplification. 2013-07-27
Fixed scope on memcached bucket. 2013-07-30
Published as ACCEPTED 2013-07-30