Ensuring High Availability with Automatic Failover for App Services

In modern applications, continuous connectivity is key, especially for mobile apps that rely on backend services. In this post, we’ll walk through a Python-based solution that monitors the health of your app service servers and automatically fails over to a secondary server when needed. The sample code polls HTTP health check endpoints and switches between WebSocket connection endpoints so that your application connects to a server that is passing its health checks.

Overview

The solution involves two types of endpoints:

1. Health check URLs
  - These endpoints (e.g., https://.../_ping) are polled using HTTP HEAD requests.
  - They determine if the app service server is healthy.
2. Connection endpoints
  - These are the WebSocket URLs (e.g., wss://.../primary) that your application uses to interact with the backend.
  - The active connection endpoint is updated based on the health check results.

If the primary server’s health check fails repeatedly (more than 9 consecutive times in the sample code) and the secondary server passes its own health check, the failover logic switches the application’s connection to the secondary server.
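
The failover rule described above can be sketched as a small state object. This is a minimal illustration rather than the code from this post: `FailoverState`, `record`, and `FAILURE_THRESHOLD` are hypothetical names, the threshold of 9 is assumed to match the sample's setting, and the alternate server's health is passed in as a boolean instead of being checked over the network.

```python
from dataclasses import dataclass

FAILURE_THRESHOLD = 9  # assumption: same threshold as the sample code


@dataclass
class FailoverState:
    """Tracks the active server and its run of consecutive failed checks."""
    active: str = "primary"
    consecutive_failures: int = 0

    def record(self, active_healthy: bool, alternate_healthy: bool) -> str:
        """Record one health-check result and return the server to use next."""
        if active_healthy:
            self.consecutive_failures = 0
            return self.active

        self.consecutive_failures += 1
        if self.consecutive_failures > FAILURE_THRESHOLD:
            # Only switch if the alternate server looks healthy.
            if alternate_healthy:
                self.active = "secondary" if self.active == "primary" else "primary"
            # Reset after every failover attempt, successful or not.
            self.consecutive_failures = 0
        return self.active
```

Keeping the decision separate from the network calls makes the rule easy to unit test: nine failed checks leave the active server unchanged, and the tenth triggers the switch when the alternate is healthy.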

The code in detail

Below is the complete code with inline comments and detailed explanations:

import logging
import threading
import requests
from time import sleep

# Configure logging to show time-stamped messages at INFO level.
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(levelname)s: %(message)s')

# --------------------------------------
# Health Check URLs (App Service Servers)
# --------------------------------------
# These URLs are used for health checking the servers by sending HEAD requests.
health_check_urls = {
    "primary": "https://XXXXXXXXXXXXXX.apps.cloud.couchbase.com:4984/_ping",
    "secondary": "https://XXXXXXXXXXXXXX.apps.cloud.couchbase.com:4984/_ping"
}

# -------------------------------------
# Connection Endpoints (WebSocket URLs)
# -------------------------------------
# These endpoints are what your application actually uses for connections.
connection_urls = {
    "primary": "wss://XXXXXXXXXXXXXX.apps.cloud.couchbase.com:4984/primary",
    "secondary": "wss://XXXXXXXXXXXXXX.apps.cloud.couchbase.com:4984/secondary"
}

# The variable `active_cluster` tracks which server is currently active.
active_cluster = "primary"

# This variable holds the actual WebSocket URL used by your application.
active_connection_url = connection_urls[active_cluster]

def is_cluster_healthy(url):
    """
    Perform a health check using a HEAD request against the provided URL.
    Returns True if the response status is 200; otherwise, returns False.
    Logs the status code and headers for troubleshooting.
    """
    try:
        response = requests.head(url, timeout=5)
        logging.info(f"Health Check Response for {url}")
        logging.info(f"  Status Code: {response.status_code}")
        logging.info("  Headers:")
        for header, value in response.headers.items():
            logging.info(f"    {header}: {value}")

        if response.status_code == 200:
            logging.info(f"{url} is healthy!")
            return True
        else:
            logging.warning(f"{url} might be unhealthy or unreachable.")
            return False
    except requests.exceptions.RequestException as e:
        logging.error(f"Health check failed for {url}: {e}")
        return False

def health_check_worker():
    """
    A background worker that checks the health of the active server every 3 seconds.
    If the active server fails more than 9 consecutive health checks,
    the worker attempts to switch to the other server.
    """
    global active_cluster
    global active_connection_url

    consecutive_failures = 0

    while True:
        sleep(3)  # Wait 3 seconds between checks.

        # Use the HTTP health check endpoint for the active cluster.
        current_health_url = health_check_urls[active_cluster]
        logging.info(f"Health check: Checking {active_cluster} at {current_health_url}...")

        if is_cluster_healthy(current_health_url):
            consecutive_failures = 0  # Reset counter if healthy.
        else:
            consecutive_failures += 1
            logging.warning(f"{active_cluster} health check failed {consecutive_failures} time(s).")

            # If failures exceed 9 consecutive attempts, try to fail over.
            if consecutive_failures > 9:
                logging.error(f"{active_cluster} is considered down. Attempting to fail over...")

                # Determine the new active cluster.
                new_cluster = "secondary" if active_cluster == "primary" else "primary"
                new_health_url = health_check_urls[new_cluster]

                # Check if the new cluster is healthy.
                if is_cluster_healthy(new_health_url):
                    active_cluster = new_cluster
                    # Update the connection endpoint.
                    active_connection_url = connection_urls[new_cluster]
                    logging.warning(f"Switched active cluster to {new_cluster}.")
                    logging.warning(f"New WebSocket connection endpoint: {active_connection_url}")
                else:
                    logging.critical("Both clusters appear to be down!")

                consecutive_failures = 0  # Reset the failure counter after the attempt.

def main():
    """
    Main function to start the health-check worker in a background thread.
    Keeps the script running indefinitely until interrupted.
    """
    thread = threading.Thread(target=health_check_worker, daemon=True)
    thread.start()

    logging.info("Health check worker started. Press Ctrl+C to exit.")
    logging.info(f"Application will initially connect to: {active_connection_url}")

    try:
        while True:
            sleep(1)  # Main thread remains alive.
    except KeyboardInterrupt:
        logging.info("Shutting down health check script.")

if __name__ == "__main__":
    main()
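
The script keeps the active endpoint in module-level globals that the worker thread rewrites. Application code that copies `active_connection_url` once at startup (or imports the name from another module) would not see a later failover. One possible way to share the value is a small lock-protected holder; `ActiveEndpoint`, `get`, and `switch_to` are hypothetical names used for illustration, and the URLs are placeholders.

```python
import threading


class ActiveEndpoint:
    """Thread-safe holder for the active cluster name and its WebSocket URL."""

    def __init__(self, connection_urls, initial="primary"):
        self._urls = dict(connection_urls)
        self._lock = threading.Lock()
        self._name = initial

    def get(self):
        """Return (cluster_name, websocket_url) for the current active cluster."""
        with self._lock:
            return self._name, self._urls[self._name]

    def switch_to(self, name):
        """Make `name` the active cluster; raises KeyError for unknown names."""
        if name not in self._urls:
            raise KeyError(name)
        with self._lock:
            self._name = name


# Placeholder URLs; substitute your own connection endpoints.
endpoint = ActiveEndpoint({
    "primary": "wss://primary.example.invalid:4984/primary",
    "secondary": "wss://secondary.example.invalid:4984/secondary",
})
```

Under this sketch, the health-check worker would call `endpoint.switch_to(new_cluster)` where it currently assigns the globals, and the application would call `endpoint.get()` each time it opens or reopens a connection.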

Key technical points

- Health checks on App Service Servers:
  The code separates the health-check endpoints (used for monitoring) from the connection endpoints (used by your application). This allows you to check server health independently while maintaining a stable connection endpoint.
- HTTP HEAD Requests:
  Using HEAD requests to the /_ping endpoint minimizes data transfer while still providing status codes and headers for diagnostics.
- Background Thread:
  The health_check_worker runs in its own daemon thread, allowing continuous health monitoring without blocking the main application thread.
- Failover Logic:
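  - A counter (consecutive_failures) tracks consecutive failed checks and resets to zero after any successful check.
  - If the count exceeds the threshold (more than 9 failures, so on the 10th consecutive failure), the script attempts a failover by checking the health of the alternate server. With a 3-second interval, that is at least about 30 seconds after the last successful check.
  - If the alternate server passes its health check, the active connection endpoint is updated; if it fails, the script logs a critical message and keeps the current endpoint. In both cases the counter is reset.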
- Logging:
  Detailed logging provides insights into the health check process, including HTTP response status, headers, and failover events. This aids in troubleshooting and monitoring.
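
To get a feel for how the check interval and failure threshold translate into detection time, the arithmetic can be sketched as below. `failover_window` and its parameters are hypothetical names. The upper estimate assumes each failed check takes at most `per_check_s` seconds, which is a simplification: the `timeout` value passed to `requests` limits the wait for a connection and for data separately, not the total duration of the call.

```python
def failover_window(interval_s, threshold, per_check_s):
    """Estimate seconds between the last healthy check and a failover attempt.

    Returns (minimum, estimated_maximum). The sample code fails over when
    consecutive failures exceed `threshold`, so threshold + 1 checks are needed.
    """
    checks_needed = threshold + 1
    minimum = checks_needed * interval_s                  # checks fail instantly
    maximum = checks_needed * (interval_s + per_check_s)  # every check is slow
    return minimum, maximum


# Values taken from the sample: 3-second interval, threshold of 9, timeout=5.
low, high = failover_window(interval_s=3, threshold=9, per_check_s=5)
print(f"Failover attempt after roughly {low} to {high} seconds")
```

If that window is too long for your users, lowering the interval or the threshold shortens it, at the cost of more frequent requests or a higher chance of failing over on a brief network blip.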

Adapting for your application

- You can translate and adapt this code to your preferred programming language, such as Swift or Kotlin, to fit your application’s needs.
- You might integrate this script or logic into your mobile code (iOS/Android) or a backend service that updates the active endpoint.
- If you are on iOS or Android, consider how often and where you run this code. For example, background tasks or push notifications can trigger health checks in a mobile context.
- If you have a microservice architecture, you might run this failover logic in a small service that exposes a current active URL to the mobile apps, so they always connect to the correct WSS endpoint.
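
The last option, a small service that publishes the current active URL, could look roughly like the following. This is a sketch under stated assumptions, using only the Python standard library: the `/active` route, the `state` dictionary, the function names, and the placeholder URLs are all hypothetical, and a real deployment would also need TLS and authentication.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical shared state; the health-check worker would update "active".
state = {
    "active": "primary",
    "urls": {
        "primary": "wss://primary.example.invalid:4984/primary",
        "secondary": "wss://secondary.example.invalid:4984/secondary",
    },
}


def active_endpoint_payload(current):
    """Serialize the active cluster name and its WebSocket URL as JSON bytes."""
    name = current["active"]
    return json.dumps({"active": name, "url": current["urls"][name]}).encode("utf-8")


class ActiveEndpointHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/active":  # hypothetical route name
            self.send_error(404)
            return
        body = active_endpoint_payload(state)
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port=8080):
    """Build (but do not start) the server; call serve_forever() to run it."""
    return HTTPServer(("0.0.0.0", port), ActiveEndpointHandler)
```

Running `make_server().serve_forever()` alongside the health-check worker would let mobile clients fetch `/active` before opening a WebSocket connection, so the failover decision is made in one place instead of on every device.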

Conclusion

This sample code provides a straightforward mechanism for improving availability by automatically failing over to a backup server when the primary becomes unreachable. Because the health checks are separate from the connection endpoints, the application can switch its WebSocket connection to a server that has just passed a health check.

In a production environment, you may need to adapt and extend the logic to suit your specific requirements, network conditions, and security policies.

Implementing this logic in your mobile application or backend service can greatly improve uptime and resilience, ensuring your users remain connected even during unexpected service interruptions.

Author

Posted by Nishant Bhatia - Cloud Architect
