Most databases are configured and sized correctly based on the information available at the time of their initial launch, but they tend to become unhealthy or undersized as their datasets grow organically and their workload profiles change. Couchbase is a distributed database that can handle both challenges, but it needs regular health checks to find the best ways to tune its performance and to add capacity when absolutely necessary. This blog explains what a Couchbase cluster health check is and which best practices to follow when preparing for and conducting one.
What is a health check?
Couchbase offers several paid service packages under the CoE (Center of Excellence) wing of its Professional Services. One of those service packages is a health check. The result of a health check is a report, which is delivered free of charge in basic form and as a paid service in comprehensive form. The former is a subset of the latter. In brief, this report is an assessment of the overall health of a Couchbase cluster. It is meant to find the issues that result in poor health. A comprehensive report also includes recommendations to address or resolve the issues found.
Why get health-checked?
Health check recipients gain many benefits, ranging from performance tuning to capacity planning to a reduction in TCO.
Oftentimes, a Couchbase cluster is sized and configured correctly based on the data available at the beginning of a deployment. Over time, however, datasets can grow, data access patterns can change and the overall workload on the cluster can shift, leaving the cluster unhealthy. A cluster in poor health can exhibit one or more of the following symptoms.
- High resource utilization
- Slow response times
- Poor end-user experience
- OOM Killer action
- Frequent failovers
- Lack of resilience
The root causes behind these symptoms are identified and analyzed during a health check. As noted earlier, the paid health check service also includes recommendations to address the underlying issues.
Health checks also help cluster owners prepare for upcoming workload increases that arise during seasonal peaks, e.g. around Black Friday or Cyber Monday, or during the shopping season ahead of the holidays. In the retail space, workloads can also increase during new product releases or special offers.
Additional benefits include improved resource utilization, fewer production issues, a potential reduction in TCO, and compliance with security policies and/or government regulations.
During a health check, Couchbase CoE experts look at a rich mix of operational data, server logs, workload, sizing, and bucket-, node-, OS- and cluster-level configurations. The operational data comprises short- and long-term metrics related to client operations, metrics generated by deeper layers of the server software, and metrics recorded by the OS, such as utilization of IO, memory, network and CPU resources.
In addition, errors and exceptions written in server logs and OS logs get analyzed.
In the basic form, a health check report covers:
- Cluster overview
- Node profiles
- Index definitions (if any)
- View definitions (if any)
- Issues summary
The comprehensive form adds the following sections to the basic form:
- Detailed information on issues on a per-node basis
- Recommendations to address those issues
- The customer contacts its Couchbase account team or a Couchbase partner requesting a health check.
- The person(s) receiving the customer’s request identify the cluster(s) that need checking. They set expectations for the scope and outcome of the health check.
- The customer shares as much information as possible about the use case(s), clients and data flows related to the cluster being checked.
- The customer collects Couchbase Server logs following the instructions given here and here. The log collection should produce one zip file for each Couchbase node. This file is called a cbcollect. It includes environmental information, OS-level logs and logs written by the various processes that form Couchbase Server. The format and contents of the cbcollect file should not be changed; doing so renders it useless.
- The customer uploads cbcollects into Couchbase’s S3 store, usually by using a cURL command syntax like the following.
curl --upload-file fileN.zip S3Target/customerName/clusterName/
fileN.zip = the name of a cbcollect zip file
S3Target = https://uploads.couchbase.com
customerName = the customer name registered with Couchbase Technical Support
clusterName = a unique name for the Couchbase cluster being health-checked
Note: the / at the end of the curl command is VERY important. Please don’t forget to add it.
- The customer informs its Couchbase account team or its Couchbase partner once all the cbcollects have been uploaded successfully.
- A Couchbase expert takes over at this point and carries out an extensive analysis of the data bundled in the cbcollects. At the end, a health check report is generated.
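The upload step above can be sketched as a small shell script. This is a minimal illustration, not an official Couchbase tool: only the upload host and the curl --upload-file form come from the steps above, while CUSTOMER_NAME, CLUSTER_NAME, PAUSE_SECONDS and DRY_RUN are hypothetical names chosen for this example. The script pauses between uploads so files are not sent back-to-back, and by default it only prints the commands it would run. The trailing slash matters because curl appends the local file name to a URL that has no file part.

```shell
#!/usr/bin/env sh
# Sketch: upload cbcollect zip files one at a time, pausing between uploads.
# S3_TARGET is taken from the article; every other variable name is a
# hypothetical placeholder for this example.

S3_TARGET="https://uploads.couchbase.com"
CUSTOMER_NAME="${CUSTOMER_NAME:-ExampleCorp}"    # assumed placeholder
CLUSTER_NAME="${CLUSTER_NAME:-prod-cluster-1}"   # assumed placeholder
PAUSE_SECONDS="${PAUSE_SECONDS:-60}"             # assumed pause length
DRY_RUN="${DRY_RUN:-1}"                          # 1 = print commands only

# Build the upload URL. The trailing slash is required so that curl
# appends the local file name to the URL.
upload_url() {
  printf '%s/%s/%s/\n' "$S3_TARGET" "$1" "$2"
}

# Upload every file given as an argument, one after another.
upload_all() {
  url="$(upload_url "$CUSTOMER_NAME" "$CLUSTER_NAME")"
  first=1
  for f in "$@"; do
    # Pause between real uploads, but not before the first one.
    if [ "$first" -eq 0 ] && [ "$DRY_RUN" != "1" ]; then
      sleep "$PAUSE_SECONDS"
    fi
    first=0
    if [ "$DRY_RUN" = "1" ]; then
      echo "curl --upload-file $f $url"
    else
      # --fail makes curl return a non-zero status on HTTP errors.
      curl --fail --upload-file "$f" "$url" || {
        echo "upload failed: $f" >&2
        return 1
      }
    fi
  done
}

upload_all "$@"
```

Saved under a name such as upload_cbcollects.sh (a hypothetical file name), it could be run as `DRY_RUN=0 CUSTOMER_NAME=... CLUSTER_NAME=... sh upload_cbcollects.sh node1.zip node2.zip` once the printed commands look right.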
- Have a clear mutual understanding of what the health check will cover and what it will achieve.
- Provide a complete set of cbcollects. Don’t use dated or partial files.
- S3 sometimes has issues when too many cbcollects are uploaded at once or when very large cbcollects are uploaded back-to-back. Such issues can result in files being rejected or corrupted. To avoid them, upload cbcollects more slowly and space the uploads out.
- Review dataset growth and workload history as part of the health check.
- Review node sizing, bucket sizing and multi-dimensional scaling topology as part of the health check.
- Once the report is generated, go over it with the customer before finalizing and sending it to them.
- Get a commitment from the customer for implementing your recommendations against a deadline.
- Follow up to see if recommendations were implemented as discussed.
- Recommend another health check a few weeks or months after recommendations are implemented to see the before and after difference.
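The advice to provide a complete, unmodified set of cbcollects can be checked before uploading. The sketch below is one possible approach under stated assumptions: it supposes each node's file is named after the node (for example node1.zip), which is a hypothetical convention and not something Couchbase prescribes, and it uses `unzip -tq` only to confirm that each archive is readable. The function name check_cbcollects is also made up for this example.

```shell
#!/usr/bin/env sh
# Sketch: confirm that every node has a cbcollect zip and that each archive
# can be read. The "<node>.zip" naming and the node names are assumptions.

check_cbcollects() {
  dir="$1"
  shift
  problems=0
  for node in "$@"; do
    file="$dir/$node.zip"
    if [ ! -f "$file" ]; then
      # No file for this node: the set is incomplete.
      echo "MISSING $file"
      problems=$((problems + 1))
    elif ! unzip -tq "$file" >/dev/null 2>&1; then
      # unzip -t tests the archive without extracting it.
      echo "UNREADABLE $file"
      problems=$((problems + 1))
    else
      echo "OK $file"
    fi
  done
  # Succeed only when no problems were found.
  [ "$problems" -eq 0 ]
}

# Example call (directory and node names are placeholders):
#   check_cbcollects ./cbcollects node1 node2 node3
```

A non-zero exit status means at least one node is missing a file or has an archive that cannot be read, so the set should be collected again rather than uploaded.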