In this post, we review some important considerations for planning for business continuity (BC) and disaster recovery (DR).

Business continuity needs careful consideration when using Couchbase as a core business service. Today’s focus is on the application layer: why services fail and what those failures mean for the business.

When using Couchbase as a persistent system of record, High Availability (HA), DR, and BC need to be carefully understood to ensure they meet the agreed service levels (SLAs).

In a world where service outages and downtime can severely impact the business, companies must ensure that they are robustly protected to minimize business impact, whether it’s to their internal or external stakeholders and customers.

Furthermore, databases are typically critical to the functioning of the business and central to a company’s application ecosphere. If the database is down, its failure will impact other services. The significance of this impact solidifies why you must protect against unexpected service outages.

Historically, businesses were happy with 99.9% service uptime (roughly 8.76 hours of downtime per year). Now organizations are looking for six nines or greater (99.9999%, or about 31.5 seconds of downtime per year). Companies could previously tolerate outages of several hours, but this is no longer the case, so understanding the business requirements is imperative.
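To make the "nines" concrete, here is a minimal sketch of the arithmetic that converts an availability target into a yearly downtime budget. The function name `downtime_budget_seconds` and the 365-day year are assumptions made for illustration, not part of any Couchbase tooling.

```python
# Hypothetical helper: converts an availability percentage into the
# maximum downtime per year that the target allows.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 (assumes a 365-day year)


def downtime_budget_seconds(availability_percent: float) -> float:
    """Return the allowed downtime per year, in seconds."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return (1 - availability_percent / 100) * SECONDS_PER_YEAR


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999, 99.9999):
        budget = downtime_budget_seconds(target)
        print(f"{target}% uptime allows {budget:,.1f} seconds of downtime per year")
```

Three nines works out to a little under nine hours per year, while six nines leaves roughly half a minute, which is why the availability target drives the rest of the design.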

Before designing a strategy to meet a business requirement, we first need to understand Service Level Agreements (SLAs) and how they are measured.

An SLA is a commitment to your customer to keep services running, or to get them running again, within an agreed timeframe.

Measuring SLAs against what matters

We also need to understand the metrics that availability and SLAs are generally measured against, and there are primarily two:

Recovery Point Objective (RPO)

“How much data can I afford to lose?”

    •   Expressed backward in time from the instant a failure occurs.
    •   Can be specified in seconds, minutes, hours, or days.

Recovery Time Objective (RTO)

“How long can I afford the service to be unavailable?”

    •   How long will it take for the data to be available again?
    •   Function of the extent to which the outage disrupts normal operations and the amount of revenue lost per unit of time as a result of the disaster.
    •   Can be specified in seconds, minutes, hours, or days.

So when we talk about HA/DR and BC, what are we looking to achieve? The goal is the capability to restore normal (or near-normal) business operations, from a critical business application perspective, after an incident that interrupts them. Essentially, this means meeting the desired RPO and RTO requirements.
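As an illustration of how the two metrics are measured, the sketch below checks a single incident against RPO and RTO targets. The `Incident` class, its field names, and the timestamps are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical record of one outage (all field names are assumptions)."""
    last_recoverable_write: datetime  # newest data that survived the failure
    failure_time: datetime            # instant the failure occurred
    service_restored: datetime        # instant the service was usable again


def achieved_rpo(incident: Incident) -> timedelta:
    # Data loss window, measured backward from the failure.
    return incident.failure_time - incident.last_recoverable_write


def achieved_rto(incident: Incident) -> timedelta:
    # Time the service was unavailable, measured forward from the failure.
    return incident.service_restored - incident.failure_time


def meets_objectives(incident: Incident, rpo: timedelta, rto: timedelta) -> bool:
    return achieved_rpo(incident) <= rpo and achieved_rto(incident) <= rto


if __name__ == "__main__":
    outage = Incident(
        last_recoverable_write=datetime(2024, 1, 1, 9, 59, 30),
        failure_time=datetime(2024, 1, 1, 10, 0, 0),
        service_restored=datetime(2024, 1, 1, 10, 4, 0),
    )
    print("Data lost:", achieved_rpo(outage))
    print("Downtime:", achieved_rto(outage))
    print("Within targets:", meets_objectives(
        outage, rpo=timedelta(minutes=1), rto=timedelta(minutes=5)))
```

Here the incident loses 30 seconds of data and takes four minutes to recover, so it would satisfy a one-minute RPO and a five-minute RTO but not tighter targets.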

Understand why a service fails

The reasons a service fails also need to be considered, as they determine how the service should be protected.

Each reason for application or service failure has different impacts and connotations, and these frequently require different solutions, considerations, and constructs to ensure complete protection.

Another critical consideration is the misconception that service outages only cause direct revenue loss; this is typically not the case, as many systems are not revenue generators. If we broaden this out, there are many more reasons to have business continuity solutions in place:

    • Reputational or brand damage
    • Loss of business to rival company or provider
    • Loss of productivity – teams unable to fulfill their functions and services internally
    • Financial penalties from regulatory boards – possibility of not being allowed to trade
    • Death! Hospital/medical systems failing causing cancellations of operations/treatments
    • Impact to other internal services

 

Mitigation options

So, what are the options to protect against and mitigate an application service outage?

    • Clustering – multiple nodes to avoid a single point of failure
    • Replication – ensuring applications and data are available in multiple locations and geographies
    • Backup – to recover from catastrophic incidents

Each of these options can help protect against service outages and speed the recovery of normal business services. Each also has different RPO and RTO implications, which need to be factored into the SLAs required by the business.

One of Couchbase’s key tenets, our DNA if you like, is that we are designed to be highly available, provide resilience and ensure that SLAs are met.

Couchbase offers all three of these solutions (clustering, replication, backups), which are fully architected and integrated to mitigate against service outages and minimize downtime.

Strategic availability 

Remember, choosing the correct availability strategy will have a big impact on availability and SLAs being met. It is crucial to understand and define the required SLAs. 

It is better to get the initial strategy right than to revisit it after a service outage.

You will need to be realistic about your recovery timeframes while considering the cost implications and who will fund this.

The first step is to understand the business objectives and application needs. From there, investigate what will meet your SLAs and the enterprise’s goals.

Next time, we will look at how Couchbase can make solutions highly available with clustering.

 

Author

Posted by Steve Grimwood, Solutions Engineer
