As part of the Couchbase Server 7.1 release, the Couchbase Analytics Service is excited to announce support for High Availability (HA), which ensures users can access data within Analytics Service collections/datasets when one or more Analytics nodes are not available.

Customer pain point

The primary challenge arose when one or more Analytics nodes were failed over or taken down for scheduled maintenance, security patches, or node upgrades. The data residing on Analytics Service nodes was then not fully available for querying or reporting, which left business users unable to meet service level agreements (SLAs). Additionally, when the failed node came back online, it had to rebuild its collections/datasets, including indexes, from the underlying Data Service. This resulted in operational inefficiencies and delayed time to insights.

How does the High Availability work?

Each node running the Analytics Service has one or more data partitions. Data ingested by the Analytics Service is hash-partitioned across all data partitions. When data is ingested in each partition, it is initially stored in an in-memory b-tree. Once a certain memory threshold is reached, the b-tree is persisted to disk and is scheduled to be asynchronously replicated to one or more Analytics nodes (based on the number of replicas configured). When a node running the Analytics Service is failed over, one of its replicas is promoted to serve the partitions that were served by the failed over node. This will allow the Analytics service to continue to work after the failover. The portion of data that will have to be re-ingested from the Data Service will be determined by the state of the replica at the promotion time as follows:

Replica State | Data to re-ingest
All LSM components replicated and the failed over node had no in-memory data | None
All LSM components replicated and the failed over node had in-memory data | Only the in-memory data (similar to a node restart)
Some LSM components were not replicated | Start from last replicated LSM component
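
The table above can be read as a small decision function. The following is a minimal sketch in Python, not the Analytics Service's actual implementation; the names `Reingest` and `data_to_reingest` are hypothetical and exist only for illustration.

```python
from enum import Enum


class Reingest(Enum):
    """Hypothetical labels for the three outcomes listed in the table."""
    NONE = "none"
    IN_MEMORY_ONLY = "only the in-memory data"
    FROM_LAST_REPLICATED = "start from last replicated LSM component"


def data_to_reingest(all_components_replicated: bool,
                     had_in_memory_data: bool) -> Reingest:
    """Map the replica state at promotion time to the re-ingestion work."""
    if not all_components_replicated:
        # Some on-disk components never reached the replica, so ingestion
        # resumes from the last component that was replicated.
        return Reingest.FROM_LAST_REPLICATED
    if had_in_memory_data:
        # Everything on disk was replicated; only data that was still in
        # memory on the failed over node is missing.
        return Reingest.IN_MEMORY_ONLY
    return Reingest.NONE
```

For example, `data_to_reingest(True, False)` corresponds to the first row, where nothing has to be re-ingested from the Data Service.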

The Analytics service will continue to work in an unbalanced state until one of the following is performed:

    1. Node Recovery: The failed over node is resynced from the promoted replica, and then becomes the master for its storage partitions again.
    2. Node Removal: If the failed over node is removed, Analytics will redistribute the data among the remaining nodes in the cluster.
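
To make the failover and recovery flow concrete, here is a toy model in Python. It is only a sketch under stated assumptions: the `Cluster` class, its method names, and the use of CRC32 as the hash function are all hypothetical stand-ins, and real replication, resync, and data redistribution are not modeled.

```python
import zlib
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Cluster:
    """Toy model: which node serves each data partition."""
    masters: Dict[int, str]         # partition id -> node serving it now
    replicas: Dict[int, List[str]]  # partition id -> nodes holding a replica
    home: Dict[int, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Remember the original master of every partition for recovery.
        self.home = dict(self.masters)

    def partition_for(self, key: str) -> int:
        # ASSUMPTION: CRC32 stands in for the service's real hash function.
        return zlib.crc32(key.encode("utf-8")) % len(self.masters)

    def fail_over(self, node: str) -> None:
        """Promote a replica for every partition served by the failed node."""
        for pid, master in list(self.masters.items()):
            if master != node:
                continue
            candidates = [r for r in self.replicas.get(pid, []) if r != node]
            if not candidates:
                raise RuntimeError(f"partition {pid} has no replica to promote")
            self.masters[pid] = candidates[0]

    def recover(self, node: str) -> None:
        """After a resync, the node becomes master of its partitions again."""
        for pid, original in self.home.items():
            if original == node:
                self.masters[pid] = node


cluster = Cluster(
    masters={0: "node-a", 1: "node-a", 2: "node-b", 3: "node-b"},
    replicas={0: ["node-b"], 1: ["node-b"], 2: ["node-a"], 3: ["node-a"]},
)
cluster.fail_over("node-a")   # node-b now serves all four partitions
cluster.recover("node-a")     # node-a serves partitions 0 and 1 again
```

Between `fail_over` and `recover`, every partition is served by `node-b`, which is the unbalanced state described above.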

For HA to work, configure at least one replica.

The number of replicas can be configured in the server workbench under Settings, or by calling the REST API. A rebalance is required for the change to take effect.
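
As a rough illustration of the API route, the Python sketch below builds (but does not send) an HTTP request that sets the replica count. The endpoint path `/settings/analytics`, the port `8091`, and the form parameter `numReplicas` are assumptions that are not stated in this post, and `build_replica_request` is a hypothetical helper; check the Couchbase REST API documentation for your version before relying on any of them.

```python
import base64
import urllib.parse
import urllib.request


def build_replica_request(host: str, num_replicas: int,
                          user: str, password: str) -> urllib.request.Request:
    """Build a request that asks the cluster for `num_replicas` replicas."""
    if num_replicas < 1:
        raise ValueError("HA needs at least one replica")
    # ASSUMPTION: endpoint path, port, and parameter name are illustrative.
    url = f"http://{host}:8091/settings/analytics"
    body = urllib.parse.urlencode({"numReplicas": num_replicas}).encode("ascii")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, data=body, method="POST")
    request.add_header("Authorization", f"Basic {token}")
    return request


request = build_replica_request("localhost", 1, "Administrator", "password")
# Sending is left to the caller, for example: urllib.request.urlopen(request)
```

Even if such a call succeeds, the new replica count only takes effect after a rebalance.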

High availability business benefits

Now that we know how the High Availability capability works in the Analytics Service, here are the key benefits:

    1. Always available and always on real-time data with increased reliability
    2. Minimal disruption in time to insights while failed-over nodes are being recovered, rebuilt, and rebalanced
    3. No impact on time to insights for reports and analytical queries improving customer experience

Summary

I hope you are excited about this much-desired feature request for Couchbase Analytics to be highly available using analytics replicas. Now, your analytics data will be always on and always available to continuously query and analyze your near real-time data without disruption. 

Below is a list of resources to help you get started. We look forward to your feedback on the Couchbase Forums.

Resources

Author

Posted by Murtadha Al Hubail, Principal Software Engineer, Couchbase

Murtadha is a Principal Software Engineer working on Couchbase Analytics, focusing on its storage engine and high availability.
