All Your Environments are Belong to Us!
What Couchbase certifies
The Couchbase Autonomous Operator team uses commodity cloud platforms to test that everything works. We chose these platforms for their ubiquity within the Kubernetes ecosystem and because they can be provisioned and de-provisioned in a simple, generic manner. Amazon EKS, Google GKE, Microsoft AKS and Red Hat OpenShift (on AWS) are our main combinations. For the most part, these platforms can be created on demand with generic technologies like HashiCorp Terraform.
A small subset, no?
The short answer is both yes and no. On the one hand, certification of these big-four platforms covers 99% of users (a made-up number!). This gives us the most bang for the buck from a purely business perspective, because testing doesn’t come for free in terms of either time or money.
On the other hand, dozens of different platform vendors can be used, all certified under the umbrella of the CNCF software conformance program. Kubernetes conformance guarantees that a pod will be a pod, and behave like a pod wherever it runs, thus breaking vendor lock-in to a certain degree. In fact, given these constraints and if the Operator is only using generic resources, there is a high degree of certainty that the Operator will run on any of these platforms with no modification necessary.
So self-certification isn’t necessary, right?
Wrong. When I said “certain degree” earlier, that was deliberate. While the core of Kubernetes is well defined in format and behavior, certain parts can be customized for a specific platform, namely storage and networking, both of which are pretty fundamental to the operation of a distributed database.
Therefore, it is understandable when end-users want their specific platform to be certified for operation with Couchbase. In the past, we’ve endeavored to certify platforms when requested. For some platforms, like Rancher, this is simple. For others it is not: if we have to deal with, say, a $1,000,000 storage appliance, certification becomes financially infeasible, and it also requires a data center, servers, switches, routers, power, cooling, etc. Obviously, that is neither practical nor scalable. The problem is particularly acute when your platform happens to be a ship!
Self-certification is the answer to all the problems with platform certification. Looking at the problem, if we were to certify Acme Cloud, we’d provision Kubernetes on it, then run our test suite against it using native APIs.
The logical next step is to take that test suite and package it up as a container that anyone can run on any Kubernetes platform. That’s all there is to it; what we are providing to you is what we use internally for all our certification efforts. Distributed certification!
Anecdotally, it wasn’t that easy. We had to fundamentally change the network model (making it simpler and faster in the process). We also had to be careful with memory utilization, given Kubernetes is a memory-constrained platform.
So what does it do?
Experience has taught me that while most people should treat this as a black box and just run it, it’s the nature of computer engineers to pick things apart and ask lots of (too many?) questions, so I shall be candid. This product may not check all the boxes for the entire target audience.
At a fundamental level, self-certification runs a Kubernetes pod that executes the tests and stores the results on a persistent volume. The results are then extracted from Kubernetes and submitted to Couchbase for acceptance.
Unfortunately, the permissions required are quite intrusive, so here is where self-certification may not work for you. The testing will require creating and deleting cluster-scoped resources, like namespaces, custom resource definitions, roles and role bindings. For this reason, we recommend that this tool be run on a non-production, throw-away Kubernetes cluster.
With Couchbase Autonomous Operator 2.3, we’ve coalesced all existing tools into one-tool-to-rule-them-all. Here is where the certification command resides. Tooling is available from the Couchbase downloads Web page.
Running it is as simple as this command:
$ cao certify
You may want to review the documentation to see if any flags need overriding for your specific environment.
Pre-flight test phase
The first thing the tests do is run a general health check of the Kubernetes cluster:
couchbase-operator-certification 2.3.0 (build 999, revision 966973b797e9b310c541c84599e2cea79cfd69ef)
INFO Platform Preflight Checks
INFO Number of processes = unlimited (>= 10000) ✔
INFO Number of open files = 1048576 (>= 70000) ✔
INFO Node gke-couchbase-operator-c-default-pool-65e1d52e-28dn = 3920m CPU, 13605192Ki memory (>= 2 CPU, 4Gi memory) ✔
INFO Node gke-couchbase-operator-c-default-pool-65e1d52e-8bz8 = 3920m CPU, 13605200Ki memory (>= 2 CPU, 4Gi memory) ✔
INFO Node gke-couchbase-operator-c-default-pool-65e1d52e-r9v7 = 3920m CPU, 13605200Ki memory (>= 2 CPU, 4Gi memory) ✔
INFO Node gke-couchbase-operator-c-default-pool-65e1d52e-svtk = 3920m CPU, 13605200Ki memory (>= 2 CPU, 4Gi memory) ✔
INFO Node gke-couchbase-operator-c-default-pool-65e1d52e-wb5r = 3920m CPU, 13605200Ki memory (>= 2 CPU, 4Gi memory) ✔
INFO Node gke-couchbase-operator-c-default-pool-65e1d52e-zwjy = 3920m CPU, 13605200Ki memory (>= 2 CPU, 4Gi memory) ✔
INFO Node gke-couchbase-operator-c-default-pool-77099377-19ee = 3920m CPU, 13605200Ki memory (>= 2 CPU, 4Gi memory) ✔
INFO Node gke-couchbase-operator-c-default-pool-77099377-65dj = 3920m CPU, 13605200Ki memory (>= 2 CPU, 4Gi memory) ✔
INFO Node gke-couchbase-operator-c-default-pool-77099377-kmqm = 3920m CPU, 13605192Ki memory (>= 2 CPU, 4Gi memory) ✔
INFO Node gke-couchbase-operator-c-default-pool-77099377-n7pe = 3920m CPU, 13605200Ki memory (>= 2 CPU, 4Gi memory) ✔
INFO Node gke-couchbase-operator-c-default-pool-77099377-owzf = 3920m CPU, 13605200Ki memory (>= 2 CPU, 4Gi memory) ✔
INFO Node gke-couchbase-operator-c-default-pool-77099377-v0yb = 3920m CPU, 13605200Ki memory (>= 2 CPU, 4Gi memory) ✔
INFO Node gke-couchbase-operator-c-default-pool-f817c816-0jle = 3920m CPU, 13605200Ki memory (>= 2 CPU, 4Gi memory) ✔
INFO Node gke-couchbase-operator-c-default-pool-f817c816-3kn5 = 3920m CPU, 13605200Ki memory (>= 2 CPU, 4Gi memory) ✔
INFO Node gke-couchbase-operator-c-default-pool-f817c816-dceb = 3920m CPU, 13605200Ki memory (>= 2 CPU, 4Gi memory) ✔
INFO Node gke-couchbase-operator-c-default-pool-f817c816-eaj2 = 3920m CPU, 13605200Ki memory (>= 2 CPU, 4Gi memory) ✔
INFO Node gke-couchbase-operator-c-default-pool-f817c816-oajx = 3920m CPU, 13605200Ki memory (>= 2 CPU, 4Gi memory) ✔
INFO Node gke-couchbase-operator-c-default-pool-f817c816-uxms = 3920m CPU, 13605200Ki memory (>= 2 CPU, 4Gi memory) ✔
INFO Cluster = 70560m CPU, 244893584Ki memory (>= 50 CPU, 64Gi memory) ✔
The first check is for platform resource limits, as specified by the Couchbase Server non-root installation guide. We’ve seen instances in the past, particularly with CoreOS, where the number of processes is set low (1024) and Couchbase Server cannot actually start. As an added benefit, spotting these errors early lets users fix them on their own.
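To make the idea concrete, here is a minimal Python sketch of this kind of check. It reads the process and open-file limits of whatever process it runs in and compares them with the thresholds shown in the log above (10000 and 70000). The function names are made up for this example, it assumes a Linux host, and it is not how the certification tool itself is implemented.

```python
import resource

# Thresholds taken from the preflight log output above.
MIN_PROCESSES = 10000
MIN_OPEN_FILES = 70000


def limit_ok(soft_limit, minimum):
    """Return True if a soft limit is unlimited or at least `minimum`."""
    return soft_limit == resource.RLIM_INFINITY or soft_limit >= minimum


def describe(soft_limit):
    """Render a limit the way the preflight log does."""
    return "unlimited" if soft_limit == resource.RLIM_INFINITY else str(soft_limit)


def check_limits():
    """Check the limits of the current process (hypothetical helper)."""
    checks = [
        ("Number of processes", resource.RLIMIT_NPROC, MIN_PROCESSES),
        ("Number of open files", resource.RLIMIT_NOFILE, MIN_OPEN_FILES),
    ]
    results = {}
    for name, which, minimum in checks:
        soft, _hard = resource.getrlimit(which)
        results[name] = limit_ok(soft, minimum)
        mark = "✔" if results[name] else "✗"
        print(f"{name} = {describe(soft)} (>= {minimum}) {mark}")
    return results
```

Calling check_limits() inside a container shows the limits that a Couchbase Server process would inherit there. A process limit of 1024, like the CoreOS case mentioned above, would be flagged as a failure.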
Checking memory and CPU resources
Next up, the platform’s memory and CPU resources are checked. The Couchbase Server system requirements define the minimum resource sizes required for a single Couchbase Server instance. The tests themselves will run with the automatic memory reservation feature activated, thus making scheduling problems easier to debug.
Finally, overall memory and CPU totals are examined. In short, self-certification runs tests in parallel. Knowing the level of concurrency and Couchbase Server’s requirements, we can estimate how many resources are required to execute the tests. The full calculation is documented in the self-certification concepts documentation.
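The node and cluster checks can be illustrated with a short Python sketch. It parses Kubernetes quantity strings such as 3920m and 13605200Ki and compares them with the minimums printed in the log above (2 CPU and 4Gi per node, 50 CPU and 64Gi for the cluster). The cluster minimums are simply copied from that log rather than derived from the documented calculation, the helper names are hypothetical, and only the suffixes that appear here are handled.

```python
# Binary suffixes used by Kubernetes memory quantities (subset only).
BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}


def parse_cpu_millis(quantity):
    """Convert a CPU quantity such as '3920m' or '2' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def parse_memory_bytes(quantity):
    """Convert a memory quantity such as '13605200Ki' or '4Gi' to bytes."""
    for suffix, factor in BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: plain bytes


def meets_minimum(cpu, memory, min_cpu, min_memory):
    """True if both CPU and memory reach the given minimums."""
    return (parse_cpu_millis(cpu) >= parse_cpu_millis(min_cpu)
            and parse_memory_bytes(memory) >= parse_memory_bytes(min_memory))


# The 18 nodes from the log: all report 3920m CPU; two report 13605192Ki
# of memory and the other sixteen report 13605200Ki.
nodes = [("3920m", "13605192Ki")] * 2 + [("3920m", "13605200Ki")] * 16

all_nodes_ok = all(meets_minimum(cpu, mem, "2", "4Gi") for cpu, mem in nodes)

total_cpu = sum(parse_cpu_millis(cpu) for cpu, _ in nodes)
total_mem_ki = sum(parse_memory_bytes(mem) for _, mem in nodes) // 1024
cluster_ok = meets_minimum(f"{total_cpu}m", f"{total_mem_ki}Ki", "50", "64Gi")

print(f"Cluster = {total_cpu}m CPU, {total_mem_ki}Ki memory")
```

Summing the per-node figures reproduces the cluster line from the log: 70560m CPU and 244893584Ki of memory.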
Next comes the Kubernetes cluster setup. This performs one-time tasks such as installing the custom resource definitions and the dynamic admission controller, along with any other platform clean-up operations:
INFO Configuring Cluster 0
INFO Removing node taints
INFO Cleaning-Up Namespaces
INFO Recreating CRD
INFO Deleting admission controller
INFO Recreating docker auth secret in default namespace
INFO Creating admission controller
As already hinted, tests run in parallel. If all tests were run one after the other, executing everything would take days. Through concurrency, we can achieve this in 3-4 hours!
Cloud Jedi Masters will know that treating anything like a special snowflake is “doing it wrong.” By preparing for disaster and assuming things will need to be recreated from scratch, you’ll be up and running again in no time, while others panic trying to fix things and interact with support organizations.
As such, rather than managing resource tear down, the tests use Kubernetes namespaces. Each test gets a namespace and creates resources only in that namespace. Once the test is done, the namespace is deleted, and all resources are automatically reclaimed.
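The namespace-per-test pattern, combined with parallel execution, can be sketched in a few lines of Python. This is an in-memory illustration only: FakeCluster stands in for a real Kubernetes API client, and every name in it is hypothetical. On a real cluster, it is the namespace deletion that makes Kubernetes reclaim everything the test created.

```python
import threading
import uuid
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager


class FakeCluster:
    """In-memory stand-in for a Kubernetes API client (illustration only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.namespaces = {}  # namespace name -> list of resource names

    def create_namespace(self, name):
        with self._lock:
            self.namespaces[name] = []

    def create_resource(self, namespace, resource):
        with self._lock:
            self.namespaces[namespace].append(resource)

    def delete_namespace(self, name):
        # Deleting a namespace removes everything that lives inside it.
        with self._lock:
            del self.namespaces[name]


@contextmanager
def isolated_namespace(cluster):
    """Give a test its own namespace and always delete it afterwards."""
    name = f"test-{uuid.uuid4().hex[:8]}"
    cluster.create_namespace(name)
    try:
        yield name
    finally:
        cluster.delete_namespace(name)


def run_one(cluster, test_id):
    """A pretend test that only touches resources in its own namespace."""
    with isolated_namespace(cluster) as namespace:
        cluster.create_resource(namespace, f"couchbase-cluster-{test_id}")
        return len(cluster.namespaces[namespace]) == 1


cluster = FakeCluster()
# Run eight pretend tests, four at a time (arbitrary numbers).
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda i: run_one(cluster, i), range(8)))

print(results.count(True), "passed;", len(cluster.namespaces), "namespaces left")
```

Because clean-up lives in the finally clause, a failing test still leaves nothing behind, and no test needs to know which resources it created.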
Tests generally check that:
- the Kubernetes APIs work as desired
- our custom resources behave correctly
- tooling works
- the Operator takes the correct action during updates and recovery scenarios
- Couchbase Server behaves consistently across releases
What you run as an end-user will be a subset of tests that covers the vast majority of all supported functionality.
Once all the tests have been run to completion, the self-certification suite will display a results summary:
INFO Suite Summary (overall)
INFO ✔ Passes: 432 (91.91%)
INFO ✗ Failures: 7 (1.49%)
INFO ? Skipped: 31 (6.60%)
The fewer failures, the better! As with all things, certain tests are subject to unavoidable race conditions. I’ve shown you this as an example, but what you run will be a cut-down version of this where only stable and predictable tests are included. Skipped tests are typically those that cannot run given the capabilities of the platform you are testing or certain combinations of parameters.
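The percentages in the summary are simply each count divided by the total number of tests. A minimal Python sketch, using the counts from the example above and a made-up function name:

```python
def summarize(passes, failures, skipped):
    """Return (label, count, percentage) rows for a results summary."""
    total = passes + failures + skipped
    if total == 0:
        return []
    rows = [("Passes", passes), ("Failures", failures), ("Skipped", skipped)]
    return [(label, count, 100.0 * count / total) for label, count in rows]


for label, count, percent in summarize(432, 7, 31):
    print(f"{label}: {count} ({percent:.2f}%)")
```

With 470 tests in total, this prints 91.91%, 1.49% and 6.60%, matching the summary shown above.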
A second pod will then be created, mounting the persistent volume, and the results archive will be copied over. The archive will be named something similar to couchbase-operator-certification-20060102T150405-0700.tar.bz2 and will contain the test results summary, plus any logs related to failed tests, which can be used to debug issues.
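If you want to peek inside the archive before submitting it, Python’s standard library is enough. The sketch below assumes the file name follows the pattern of the example name above (a timestamp with a UTC offset); that pattern is inferred from the single example, and the helper names are hypothetical. The layout of the files inside the archive is not described here, so the code only lists member names.

```python
import re
import tarfile
from datetime import datetime

# Pattern assumed from the example file name in the text.
ARCHIVE_PATTERN = re.compile(
    r"couchbase-operator-certification-(\d{8}T\d{6}[+-]\d{4})\.tar\.bz2$"
)


def archive_timestamp(filename):
    """Extract the timestamp embedded in an archive name, or None."""
    match = ARCHIVE_PATTERN.search(filename)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y%m%dT%H%M%S%z")


def list_archive(path):
    """Return the member names of a bzip2-compressed tar archive."""
    with tarfile.open(path, "r:bz2") as archive:
        return archive.getnames()


stamp = archive_timestamp(
    "couchbase-operator-certification-20060102T150405-0700.tar.bz2"
)
print(stamp.isoformat())  # 2006-01-02T15:04:05-07:00
```

Calling list_archive() with the path to your own results file shows which logs were collected, which is a quick way to see which failed tests left evidence behind.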
While you may see a 100% pass rate and want to plow on, it’s not over yet. We expect all self-certification results to be submitted to Couchbase for approval. This gives us insight into what Kubernetes, storage and network platforms are being used by our end users. With this information, we can then advertise these platforms as tried-and-tested and avoid duplication of effort.
Even if you get a few failures, this is not necessarily a roadblock to certification acceptance. It may be that certain features cannot be supported on specific platforms. We can use this information to let other users know, and to update the self-certification container image so that it skips these tests in certain circumstances, making the process easier for all.
Operator self-certification is a game-changer for us. It helps us access and support Couchbase users on a wider variety of platforms. Using a collaborative approach, we can share the burden and advertise acceptance to an even larger audience.
We look forward to seeing your results and hearing any feedback you may have.
Follow up by accessing these resources referenced in the article: