
Disaster Recovery and Business Continuity for Cloud-Native Applications: A Comprehensive Strategy


In today’s digital landscape, ensuring business continuity and implementing robust disaster recovery (DR) strategies are critical to maintaining operational resilience. This guide explores how to design, implement, and maintain disaster recovery solutions for cloud-native applications, covering everything from backup strategies to automated failover mechanisms.

Understanding Disaster Recovery Fundamentals

Key Metrics and Objectives

Before implementing any DR strategy, it’s essential to understand the key metrics that define your recovery requirements:

  • Recovery Time Objective (RTO): Maximum acceptable downtime
  • Recovery Point Objective (RPO): Maximum acceptable data loss
  • Mean Time to Recovery (MTTR): Average time to restore service
  • Mean Time Between Failures (MTBF): Average time between system failures
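
As a concrete illustration of how these metrics relate, the short Go sketch below derives MTTR, MTBF, and the resulting availability from a list of incidents over an observation window; the incident data and helper names are purely illustrative:

// metrics_example.go (illustrative): deriving MTTR, MTBF and availability
package main

import (
    "fmt"
    "time"
)

// Incident records when a failure started and when service was restored.
type Incident struct {
    Start, Resolved time.Time
}

func mustParse(s string) time.Time {
    t, err := time.Parse(time.RFC3339, s)
    if err != nil {
        panic(err)
    }
    return t
}

func main() {
    period := 30 * 24 * time.Hour // observation window: 30 days
    incidents := []Incident{
        {Start: mustParse("2024-03-03T10:00:00Z"), Resolved: mustParse("2024-03-03T10:45:00Z")},
        {Start: mustParse("2024-03-17T02:10:00Z"), Resolved: mustParse("2024-03-17T02:25:00Z")},
    }

    var downtime time.Duration
    for _, inc := range incidents {
        downtime += inc.Resolved.Sub(inc.Start)
    }

    mttr := downtime / time.Duration(len(incidents))            // average time to restore service
    mtbf := (period - downtime) / time.Duration(len(incidents)) // average uptime between failures
    availability := float64(period-downtime) / float64(period) * 100

    fmt.Printf("MTTR: %s, MTBF: %s, availability: %.3f%%\n", mttr, mtbf, availability)
}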

The Mermaid diagram below maps common disaster types to the kinds of impact they typically cause and, in turn, to candidate recovery strategies:

graph TB
    subgraph "Disaster Types"
        A[Hardware Failures]
        B[Software Failures]
        C[Human Errors]
        D[Natural Disasters]
        E[Cyber Attacks]
        F[Network Outages]
    end
    
    subgraph "Impact Assessment"
        G[Service Disruption]
        H[Data Loss]
        I[Revenue Impact]
        J[Reputation Damage]
        K[Compliance Issues]
    end
    
    subgraph "Recovery Strategies"
        L[Backup & Restore]
        M[Pilot Light]
        N[Warm Standby]
        O[Multi-Site Active/Active]
    end
    
    A --> G
    B --> H
    C --> I
    D --> J
    E --> K
    F --> G
    
    G --> L
    H --> M
    I --> N
    J --> O
    K --> O

DR Strategy Classification

Different applications require different DR strategies based on their criticality and requirements:

| Strategy | RTO | RPO | Cost | Complexity | Use Case |
| --- | --- | --- | --- | --- | --- |
| Backup & Restore | Hours to days | Hours | Low | Low | Non-critical applications |
| Pilot Light | 10-30 minutes | Minutes | Medium | Medium | Important applications |
| Warm Standby | 1-10 minutes | Seconds | High | High | Critical applications |
| Multi-Site Active/Active | Seconds | Near-zero | Very high | Very high | Mission-critical applications |
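
One practical way to apply this classification is to encode it as a lookup that, given a service's required RTO and RPO, suggests the cheapest tier satisfying both. The Go sketch below is illustrative only; the thresholds simply mirror the table above:

// strategy_select.go (illustrative): pick the cheapest DR tier meeting RTO/RPO
package main

import (
    "fmt"
    "time"
)

// Strategy describes one DR tier and the worst-case RTO/RPO it offers.
type Strategy struct {
    Name string
    RTO  time.Duration // worst-case recovery time
    RPO  time.Duration // worst-case data loss
}

// tiers are ordered from cheapest to most expensive, mirroring the table above.
var tiers = []Strategy{
    {Name: "Backup & Restore", RTO: 24 * time.Hour, RPO: 4 * time.Hour},
    {Name: "Pilot Light", RTO: 30 * time.Minute, RPO: 5 * time.Minute},
    {Name: "Warm Standby", RTO: 10 * time.Minute, RPO: 30 * time.Second},
    {Name: "Multi-Site Active/Active", RTO: 30 * time.Second, RPO: time.Second},
}

// selectStrategy returns the first (cheapest) tier that meets both objectives.
func selectStrategy(requiredRTO, requiredRPO time.Duration) Strategy {
    for _, t := range tiers {
        if t.RTO <= requiredRTO && t.RPO <= requiredRPO {
            return t
        }
    }
    return tiers[len(tiers)-1] // fall back to the most resilient tier
}

func main() {
    s := selectStrategy(15*time.Minute, 1*time.Minute)
    fmt.Printf("suggested strategy: %s\n", s.Name) // suggested strategy: Warm Standby
}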

Comprehensive Backup Strategy

Kubernetes Backup with Velero

Velero is an open-source tool for backing up and restoring Kubernetes cluster resources and persistent volumes.

# velero/install.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: velero
---
apiVersion: v1
kind: Secret
metadata:
  name: cloud-credentials
  namespace: velero
type: Opaque
data:
  cloud: <base64-encoded-credentials>
---
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: aws-backup-location
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: company-velero-backups
    prefix: production-cluster
  config:
    region: us-west-2
    s3ForcePathStyle: "false"
---
apiVersion: velero.io/v1
kind: VolumeSnapshotLocation
metadata:
  name: aws-snapshot-location
  namespace: velero
spec:
  provider: aws
  config:
    region: us-west-2

Automated Backup Schedules

# velero/schedules/daily-backup.yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  template:
    includedNamespaces:
      - production
      - staging
    excludedNamespaces:
      - kube-system
      - velero
    includedResources:
      - "*"
    excludedResources:
      - events
      - events.events.k8s.io
    labelSelector:
      matchLabels:
        backup: "enabled"
    snapshotVolumes: true
    ttl: 720h  # 30 days retention
    storageLocation: aws-backup-location
    volumeSnapshotLocations:
      - aws-snapshot-location
---
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: weekly-full-backup
  namespace: velero
spec:
  schedule: "0 1 * * 0"  # Weekly on Sunday at 1 AM
  template:
    includedNamespaces:
      - "*"
    excludedNamespaces:
      - kube-system
    snapshotVolumes: true
    ttl: 2160h  # 90 days retention
    storageLocation: aws-backup-location

Database Backup Automation

#!/bin/bash
# scripts/database-backup.sh

set -euo pipefail

# Configuration
DB_HOST="${DB_HOST:-localhost}"
DB_PORT="${DB_PORT:-5432}"
DB_NAME="${DB_NAME:-production}"
DB_USER="${DB_USER:-postgres}"
BACKUP_BUCKET="${BACKUP_BUCKET:-company-db-backups}"
RETENTION_DAYS="${RETENTION_DAYS:-30}"
ENCRYPTION_KEY="${ENCRYPTION_KEY:-/etc/backup/encryption.key}"

# Generate backup filename with timestamp
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="postgresql_${DB_NAME}_${TIMESTAMP}.sql.gz.enc"
LOCAL_BACKUP="/tmp/${BACKUP_FILE}"

# Function to log messages
log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" >&2
}

# Function to cleanup on exit
cleanup() {
    if [[ -f "${LOCAL_BACKUP}" ]]; then
        rm -f "${LOCAL_BACKUP}"
    fi
}
trap cleanup EXIT

# Create database backup
log "Starting database backup for ${DB_NAME}"
# Dump in custom format (uncompressed), compress with gzip in the pipe, then encrypt
pg_dump -h "${DB_HOST}" -p "${DB_PORT}" -U "${DB_USER}" -d "${DB_NAME}" \
    --verbose --no-password --format=custom --compress=0 \
    | gzip \
    | openssl enc -aes-256-cbc -salt -out "${LOCAL_BACKUP}" -pass file:"${ENCRYPTION_KEY}"

# Verify backup file was created
if [[ ! -f "${LOCAL_BACKUP}" ]]; then
    log "ERROR: Backup file was not created"
    exit 1
fi

# Upload to S3
log "Uploading backup to S3"
aws s3 cp "${LOCAL_BACKUP}" "s3://${BACKUP_BUCKET}/postgresql/${BACKUP_FILE}" \
    --storage-class STANDARD_IA \
    --metadata "database=${DB_NAME},timestamp=${TIMESTAMP}"

# Verify upload
if aws s3 ls "s3://${BACKUP_BUCKET}/postgresql/${BACKUP_FILE}" > /dev/null; then
    log "Backup successfully uploaded to S3"
else
    log "ERROR: Failed to upload backup to S3"
    exit 1
fi

# Cleanup old backups
log "Cleaning up old backups (older than ${RETENTION_DAYS} days)"
CUTOFF_DATE=$(date -d "${RETENTION_DAYS} days ago" +%Y%m%d)
aws s3 ls "s3://${BACKUP_BUCKET}/postgresql/" | while read -r line; do
    BACKUP_KEY=$(echo "$line" | awk '{print $4}')
    BACKUP_DATE=$(echo "${BACKUP_KEY}" | grep -o '[0-9]\{8\}' | head -1 || true)
    if [[ -n "${BACKUP_DATE}" && "${BACKUP_DATE}" < "${CUTOFF_DATE}" ]]; then
        log "Deleting old backup: ${BACKUP_KEY}"
        aws s3 rm "s3://${BACKUP_BUCKET}/postgresql/${BACKUP_KEY}"
    fi
done

log "Database backup completed successfully"

Backup Verification and Testing

// backup/verify.go
package main

import (
    "context"
    "database/sql"
    "fmt"
    "io"
    "log"
    "os"
    "os/exec"
    "time"
    
    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/s3"
    _ "github.com/lib/pq"
)

type BackupVerifier struct {
    s3Client   *s3.Client
    bucket     string
    testDBConn *sql.DB
}

func NewBackupVerifier(bucket string, testDBURL string) (*BackupVerifier, error) {
    // Initialize AWS S3 client
    cfg, err := config.LoadDefaultConfig(context.TODO())
    if err != nil {
        return nil, fmt.Errorf("failed to load AWS config: %w", err)
    }
    
    s3Client := s3.NewFromConfig(cfg)
    
    // Connect to test database
    testDB, err := sql.Open("postgres", testDBURL)
    if err != nil {
        return nil, fmt.Errorf("failed to connect to test database: %w", err)
    }
    
    return &BackupVerifier{
        s3Client:   s3Client,
        bucket:     bucket,
        testDBConn: testDB,
    }, nil
}

func (bv *BackupVerifier) VerifyLatestBackup(ctx context.Context) error {
    // Find the latest backup
    latestBackup, err := bv.findLatestBackup(ctx)
    if err != nil {
        return fmt.Errorf("failed to find latest backup: %w", err)
    }
    
    log.Printf("Verifying backup: %s", latestBackup)
    
    // Download and decrypt backup
    localFile, err := bv.downloadBackup(ctx, latestBackup)
    if err != nil {
        return fmt.Errorf("failed to download backup: %w", err)
    }
    defer os.Remove(localFile)
    
    // Restore to test database
    if err := bv.restoreToTestDB(localFile); err != nil {
        return fmt.Errorf("failed to restore backup: %w", err)
    }
    
    // Verify data integrity
    if err := bv.verifyDataIntegrity(); err != nil {
        return fmt.Errorf("data integrity check failed: %w", err)
    }
    
    log.Printf("Backup verification completed successfully")
    return nil
}

func (bv *BackupVerifier) findLatestBackup(ctx context.Context) (string, error) {
    input := &s3.ListObjectsV2Input{
        Bucket: &bv.bucket,
        Prefix: aws.String("postgresql/"),
    }
    
    result, err := bv.s3Client.ListObjectsV2(ctx, input)
    if err != nil {
        return "", err
    }
    
    var latestKey string
    var latestTime time.Time
    
    for _, obj := range result.Contents {
        if obj.LastModified.After(latestTime) {
            latestTime = *obj.LastModified
            latestKey = *obj.Key
        }
    }
    
    if latestKey == "" {
        return "", fmt.Errorf("no backups found")
    }
    
    return latestKey, nil
}

func (bv *BackupVerifier) downloadBackup(ctx context.Context, key string) (string, error) {
    localFile := fmt.Sprintf("/tmp/backup_verify_%d.sql.gz.enc", time.Now().Unix())
    
    // Download from S3
    input := &s3.GetObjectInput{
        Bucket: &bv.bucket,
        Key:    &key,
    }
    
    result, err := bv.s3Client.GetObject(ctx, input)
    if err != nil {
        return "", err
    }
    defer result.Body.Close()
    
    // Save to local file
    file, err := os.Create(localFile)
    if err != nil {
        return "", err
    }
    defer file.Close()
    
    _, err = io.Copy(file, result.Body)
    if err != nil {
        return "", err
    }
    
    return localFile, nil
}

func (bv *BackupVerifier) restoreToTestDB(backupFile string) error {
    // Decrypt backup
    decryptedFile := backupFile + ".decrypted"
    cmd := exec.Command("openssl", "enc", "-aes-256-cbc", "-d", "-salt",
        "-in", backupFile, "-out", decryptedFile,
        "-pass", "file:/etc/backup/encryption.key")
    if err := cmd.Run(); err != nil {
        return fmt.Errorf("failed to decrypt backup: %w", err)
    }
    defer os.Remove(decryptedFile)
    
    // Decompress
    decompressedFile := decryptedFile + ".sql"
    cmd = exec.Command("gunzip", "-c", decryptedFile)
    output, err := os.Create(decompressedFile)
    if err != nil {
        return err
    }
    defer output.Close()
    defer os.Remove(decompressedFile)
    
    cmd.Stdout = output
    if err := cmd.Run(); err != nil {
        return fmt.Errorf("failed to decompress backup: %w", err)
    }
    
    // Restore to test database
    cmd = exec.Command("pg_restore", "-h", "test-db-host", "-U", "postgres",
        "-d", "test_restore", "--clean", "--if-exists", decompressedFile)
    if err := cmd.Run(); err != nil {
        return fmt.Errorf("failed to restore backup: %w", err)
    }
    
    return nil
}

func (bv *BackupVerifier) verifyDataIntegrity() error {
    // Perform basic data integrity checks
    queries := []string{
        "SELECT COUNT(*) FROM users",
        "SELECT COUNT(*) FROM orders",
        "SELECT COUNT(*) FROM products",
        "SELECT MAX(created_at) FROM audit_log",
    }
    
    for _, query := range queries {
        var result interface{}
        err := bv.testDBConn.QueryRow(query).Scan(&result)
        if err != nil {
            return fmt.Errorf("integrity check failed for query '%s': %w", query, err)
        }
        log.Printf("Integrity check passed: %s = %v", query, result)
    }
    
    return nil
}

func main() {
    verifier, err := NewBackupVerifier(
        "company-db-backups",
        "postgres://postgres:password@test-db-host:5432/test_restore?sslmode=disable",
    )
    if err != nil {
        log.Fatal(err)
    }
    
    ctx := context.Background()
    if err := verifier.VerifyLatestBackup(ctx); err != nil {
        log.Fatal(err)
    }
}

Multi-Region Deployment Strategy

Active-Passive Multi-Region Setup

# infrastructure/multi-region/primary-region.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: region-config
  namespace: kube-system
data:
  region: "us-west-2"
  role: "primary"
  failover_region: "us-east-1"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: application
  namespace: production
spec:
  replicas: 5
  selector:
    matchLabels:
      app: application
  template:
    metadata:
      labels:
        app: application
        region: primary
    spec:
      containers:
      - name: app
        image: myapp:latest
        env:
        - name: REGION
          value: "us-west-2"
        - name: DATABASE_URL
          value: "postgres://primary-db.us-west-2.rds.amazonaws.com:5432/production"
        - name: REDIS_URL
          value: "redis://primary-redis.us-west-2.cache.amazonaws.com:6379"
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5

Database Replication Configuration

# infrastructure/database/postgresql-primary.yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: postgresql-primary
  namespace: database
spec:
  instances: 3
  primaryUpdateStrategy: unsupervised
  
  postgresql:
    parameters:
      max_connections: "200"
      shared_buffers: "256MB"
      effective_cache_size: "1GB"
      wal_level: "replica"
      max_wal_senders: "10"
      max_replication_slots: "10"
      hot_standby: "on"
      
  bootstrap:
    initdb:
      database: production
      owner: app_user
      secret:
        name: postgresql-credentials
        
  storage:
    size: 100Gi
    storageClass: fast-ssd
    
  monitoring:
    enabled: true
    
  backup:
    retentionPolicy: "30d"
    barmanObjectStore:
      destinationPath: "s3://company-db-backups/postgresql"
      s3Credentials:
        accessKeyId:
          name: backup-credentials
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: backup-credentials
          key: SECRET_ACCESS_KEY
      wal:
        retention: "7d"
      data:
        retention: "30d"
---
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: postgresql-replica
  namespace: database
spec:
  instances: 2
  
  bootstrap:
    pg_basebackup:
      source: postgresql-primary
      
  externalClusters:
  - name: postgresql-primary
    connectionParameters:
      host: postgresql-primary-rw
      user: postgres
      dbname: postgres
    password:
      name: postgresql-credentials
      key: password

Cross-Region Data Synchronization

// sync/cross_region_sync.go
package main

import (
    "context"
    "database/sql"
    "encoding/json"
    "fmt"
    "log"
    "strings"
    "time"
    
    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/sqs"
    "github.com/aws/aws-sdk-go-v2/service/sqs/types"
    "github.com/go-redis/redis/v8"
    _ "github.com/lib/pq"
)

type CrossRegionSync struct {
    primaryDB   *sql.DB
    secondaryDB *sql.DB
    sqsClient   *sqs.Client
    redisClient *redis.Client
    queueURL    string
}

type SyncEvent struct {
    Type      string                 `json:"type"`
    Table     string                 `json:"table"`
    Operation string                 `json:"operation"`
    Data      map[string]interface{} `json:"data"`
    Timestamp time.Time              `json:"timestamp"`
    Region    string                 `json:"region"`
}

func NewCrossRegionSync(primaryDBURL, secondaryDBURL, queueURL string) (*CrossRegionSync, error) {
    // Connect to databases
    primaryDB, err := sql.Open("postgres", primaryDBURL)
    if err != nil {
        return nil, fmt.Errorf("failed to connect to primary DB: %w", err)
    }
    
    secondaryDB, err := sql.Open("postgres", secondaryDBURL)
    if err != nil {
        return nil, fmt.Errorf("failed to connect to secondary DB: %w", err)
    }
    
    // Initialize AWS SQS client
    cfg, err := config.LoadDefaultConfig(context.TODO())
    if err != nil {
        return nil, fmt.Errorf("failed to load AWS config: %w", err)
    }
    
    sqsClient := sqs.NewFromConfig(cfg)
    
    // Initialize Redis client
    redisClient := redis.NewClient(&redis.Options{
        Addr:     "redis-cluster.cache.amazonaws.com:6379",
        Password: "",
        DB:       0,
    })
    
    return &CrossRegionSync{
        primaryDB:   primaryDB,
        secondaryDB: secondaryDB,
        sqsClient:   sqsClient,
        redisClient: redisClient,
        queueURL:    queueURL,
    }, nil
}

func (crs *CrossRegionSync) StartSyncWorker(ctx context.Context) error {
    log.Println("Starting cross-region sync worker")
    
    for {
        select {
        case <-ctx.Done():
            return ctx.Err()
        default:
            if err := crs.processSyncEvents(ctx); err != nil {
                log.Printf("Error processing sync events: %v", err)
                time.Sleep(5 * time.Second)
            }
        }
    }
}

func (crs *CrossRegionSync) processSyncEvents(ctx context.Context) error {
    // Receive messages from SQS
    input := &sqs.ReceiveMessageInput{
        QueueUrl:            &crs.queueURL,
        MaxNumberOfMessages: 10,
        WaitTimeSeconds:     20,
        VisibilityTimeout:   300,
    }
    
    result, err := crs.sqsClient.ReceiveMessage(ctx, input)
    if err != nil {
        return fmt.Errorf("failed to receive messages: %w", err)
    }
    
    for _, message := range result.Messages {
        if err := crs.processSyncEvent(ctx, *message.Body); err != nil {
            log.Printf("Failed to process sync event: %v", err)
            continue
        }
        
        // Delete message after successful processing
        _, err := crs.sqsClient.DeleteMessage(ctx, &sqs.DeleteMessageInput{
            QueueUrl:      &crs.queueURL,
            ReceiptHandle: message.ReceiptHandle,
        })
        if err != nil {
            log.Printf("Failed to delete message: %v", err)
        }
    }
    
    return nil
}

func (crs *CrossRegionSync) processSyncEvent(ctx context.Context, messageBody string) error {
    var event SyncEvent
    if err := json.Unmarshal([]byte(messageBody), &event); err != nil {
        return fmt.Errorf("failed to unmarshal sync event: %w", err)
    }
    
    log.Printf("Processing sync event: %s %s on table %s", event.Operation, event.Type, event.Table)
    
    switch event.Type {
    case "database":
        return crs.syncDatabaseEvent(ctx, event)
    case "cache":
        return crs.syncCacheEvent(ctx, event)
    case "file":
        return crs.syncFileEvent(ctx, event)
    default:
        return fmt.Errorf("unknown sync event type: %s", event.Type)
    }
}

func (crs *CrossRegionSync) syncDatabaseEvent(ctx context.Context, event SyncEvent) error {
    switch event.Operation {
    case "INSERT":
        return crs.replicateInsert(ctx, event)
    case "UPDATE":
        return crs.replicateUpdate(ctx, event)
    case "DELETE":
        return crs.replicateDelete(ctx, event)
    default:
        return fmt.Errorf("unknown database operation: %s", event.Operation)
    }
}

func (crs *CrossRegionSync) replicateInsert(ctx context.Context, event SyncEvent) error {
    // Build INSERT query dynamically
    columns := make([]string, 0, len(event.Data))
    placeholders := make([]string, 0, len(event.Data))
    values := make([]interface{}, 0, len(event.Data))
    
    i := 1
    for column, value := range event.Data {
        columns = append(columns, column)
        placeholders = append(placeholders, fmt.Sprintf("$%d", i))
        values = append(values, value)
        i++
    }
    
    query := fmt.Sprintf(
        "INSERT INTO %s (%s) VALUES (%s) ON CONFLICT DO NOTHING",
        event.Table,
        strings.Join(columns, ", "),
        strings.Join(placeholders, ", "),
    )
    
    _, err := crs.secondaryDB.ExecContext(ctx, query, values...)
    if err != nil {
        return fmt.Errorf("failed to replicate insert: %w", err)
    }
    
    return nil
}
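
// The listing above references replicateUpdate, replicateDelete and syncFileEvent
// without defining them. The sketches below are illustrative stand-ins: they assume
// each replicated row carries an "id" primary-key column, and that file objects are
// handled by bucket-level replication rather than by this worker.
func (crs *CrossRegionSync) replicateUpdate(ctx context.Context, event SyncEvent) error {
    id, ok := event.Data["id"]
    if !ok {
        return fmt.Errorf("update event for table %s has no id column", event.Table)
    }
    
    setClauses := make([]string, 0, len(event.Data))
    values := make([]interface{}, 0, len(event.Data))
    i := 1
    for column, value := range event.Data {
        if column == "id" {
            continue
        }
        setClauses = append(setClauses, fmt.Sprintf("%s = $%d", column, i))
        values = append(values, value)
        i++
    }
    if len(setClauses) == 0 {
        return nil // nothing to update besides the key itself
    }
    values = append(values, id)
    
    query := fmt.Sprintf("UPDATE %s SET %s WHERE id = $%d",
        event.Table, strings.Join(setClauses, ", "), i)
    if _, err := crs.secondaryDB.ExecContext(ctx, query, values...); err != nil {
        return fmt.Errorf("failed to replicate update: %w", err)
    }
    return nil
}

func (crs *CrossRegionSync) replicateDelete(ctx context.Context, event SyncEvent) error {
    id, ok := event.Data["id"]
    if !ok {
        return fmt.Errorf("delete event for table %s has no id column", event.Table)
    }
    query := fmt.Sprintf("DELETE FROM %s WHERE id = $1", event.Table)
    if _, err := crs.secondaryDB.ExecContext(ctx, query, id); err != nil {
        return fmt.Errorf("failed to replicate delete: %w", err)
    }
    return nil
}

func (crs *CrossRegionSync) syncFileEvent(ctx context.Context, event SyncEvent) error {
    // Object storage is assumed to be mirrored by S3 cross-region replication,
    // so file events are only acknowledged and logged here.
    log.Printf("file sync event for %v acknowledged; relying on bucket replication", event.Data["key"])
    return nil
}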

func (crs *CrossRegionSync) syncCacheEvent(ctx context.Context, event SyncEvent) error {
    switch event.Operation {
    case "SET":
        key := event.Data["key"].(string)
        value := event.Data["value"]
        ttl := time.Duration(event.Data["ttl"].(float64)) * time.Second
        
        return crs.redisClient.Set(ctx, key, value, ttl).Err()
        
    case "DELETE":
        key := event.Data["key"].(string)
        return crs.redisClient.Del(ctx, key).Err()
        
    default:
        return fmt.Errorf("unknown cache operation: %s", event.Operation)
    }
}

func (crs *CrossRegionSync) PublishSyncEvent(ctx context.Context, event SyncEvent) error {
    event.Timestamp = time.Now()
    event.Region = "us-west-2" // Current region
    
    messageBody, err := json.Marshal(event)
    if err != nil {
        return fmt.Errorf("failed to marshal sync event: %w", err)
    }
    
    input := &sqs.SendMessageInput{
        QueueUrl:    &crs.queueURL,
        MessageBody: aws.String(string(messageBody)),
        MessageAttributes: map[string]types.MessageAttributeValue{
            "Type": {
                DataType:    aws.String("String"),
                StringValue: aws.String(event.Type),
            },
            "Table": {
                DataType:    aws.String("String"),
                StringValue: aws.String(event.Table),
            },
        },
    }
    
    _, err = crs.sqsClient.SendMessage(ctx, input)
    if err != nil {
        return fmt.Errorf("failed to publish sync event: %w", err)
    }
    
    return nil
}

func main() {
    ctx := context.Background()
    
    sync, err := NewCrossRegionSync(
        "postgres://user:pass@primary-db.us-west-2.rds.amazonaws.com:5432/production",
        "postgres://user:pass@secondary-db.us-east-1.rds.amazonaws.com:5432/production",
        "https://sqs.us-west-2.amazonaws.com/123456789012/cross-region-sync",
    )
    if err != nil {
        log.Fatal(err)
    }
    
    if err := sync.StartSyncWorker(ctx); err != nil {
        log.Fatal(err)
    }
}
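
On the write path, application code is expected to publish a SyncEvent for every change it wants mirrored to the secondary region. A minimal sketch of that call site, assuming it lives in the same package as the sync worker above (the table, columns, and values are illustrative):

// example (illustrative): publish a sync event after a successful local write
func recordOrder(ctx context.Context, crs *CrossRegionSync, db *sql.DB) error {
    // Commit locally first; only replicate once the primary write succeeds.
    _, err := db.ExecContext(ctx,
        "INSERT INTO orders (id, customer_id, total) VALUES ($1, $2, $3)",
        42, 7, 99.95)
    if err != nil {
        return err
    }
    
    return crs.PublishSyncEvent(ctx, SyncEvent{
        Type:      "database",
        Table:     "orders",
        Operation: "INSERT",
        Data: map[string]interface{}{
            "id":          42,
            "customer_id": 7,
            "total":       99.95,
        },
    })
}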

Automated Failover Mechanisms

Health Check and Failover Controller

// failover/controller.go
package main

import (
    "context"
    "fmt"
    "log"
    "net/http"
    "time"
    
    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/route53"
    "github.com/aws/aws-sdk-go-v2/service/route53/types"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

type FailoverController struct {
    k8sClient     kubernetes.Interface
    route53Client *route53.Client
    hostedZoneID  string
    domainName    string
    primaryIP     string
    secondaryIP   string
    healthChecks  []HealthCheck
}

type HealthCheck struct {
    Name     string
    URL      string
    Timeout  time.Duration
    Interval time.Duration
}

func NewFailoverController(hostedZoneID, domainName, primaryIP, secondaryIP string) (*FailoverController, error) {
    // Initialize Kubernetes client (named restConfig so it does not shadow the AWS "config" package used below)
    restConfig, err := rest.InClusterConfig()
    if err != nil {
        return nil, fmt.Errorf("failed to create k8s config: %w", err)
    }
    
    k8sClient, err := kubernetes.NewForConfig(restConfig)
    if err != nil {
        return nil, fmt.Errorf("failed to create k8s client: %w", err)
    }
    
    // Initialize AWS Route53 client
    cfg, err := config.LoadDefaultConfig(context.TODO())
    if err != nil {
        return nil, fmt.Errorf("failed to load AWS config: %w", err)
    }
    
    route53Client := route53.NewFromConfig(cfg)
    
    healthChecks := []HealthCheck{
        {
            Name:     "application-health",
            URL:      fmt.Sprintf("http://%s/health", primaryIP),
            Timeout:  5 * time.Second,
            Interval: 30 * time.Second,
        },
        {
            Name:     "database-health",
            URL:      fmt.Sprintf("http://%s/db-health", primaryIP),
            Timeout:  10 * time.Second,
            Interval: 60 * time.Second,
        },
    }
    
    return &FailoverController{
        k8sClient:     k8sClient,
        route53Client: route53Client,
        hostedZoneID:  hostedZoneID,
        domainName:    domainName,
        primaryIP:     primaryIP,
        secondaryIP:   secondaryIP,
        healthChecks:  healthChecks,
    }, nil
}

func (fc *FailoverController) Start(ctx context.Context) error {
    log.Println("Starting failover controller")
    
    ticker := time.NewTicker(30 * time.Second)
    defer ticker.Stop()
    
    for {
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-ticker.C:
            if err := fc.performHealthChecks(ctx); err != nil {
                log.Printf("Health check failed: %v", err)
            }
        }
    }
}

func (fc *FailoverController) performHealthChecks(ctx context.Context) error {
    allHealthy := true
    
    for _, check := range fc.healthChecks {
        healthy, err := fc.performHealthCheck(ctx, check)
        if err != nil {
            log.Printf("Health check %s failed: %v", check.Name, err)
            allHealthy = false
        } else if !healthy {
            log.Printf("Health check %s is unhealthy", check.Name)
            allHealthy = false
        }
    }
    
    // Get current DNS record
    currentRecord, err := fc.getCurrentDNSRecord(ctx)
    if err != nil {
        return fmt.Errorf("failed to get current DNS record: %w", err)
    }
    
    // Determine if failover is needed
    if !allHealthy && currentRecord == fc.primaryIP {
        log.Println("Primary region is unhealthy, initiating failover to secondary region")
        if err := fc.failoverToSecondary(ctx); err != nil {
            return fmt.Errorf("failover to secondary failed: %w", err)
        }
    } else if allHealthy && currentRecord == fc.secondaryIP {
        log.Println("Primary region is healthy, failing back to primary region")
        if err := fc.failbackToPrimary(ctx); err != nil {
            return fmt.Errorf("failback to primary failed: %w", err)
        }
    }
    
    return nil
}

func (fc *FailoverController) performHealthCheck(ctx context.Context, check HealthCheck) (bool, error) {
    client := &http.Client{
        Timeout: check.Timeout,
    }
    
    req, err := http.NewRequestWithContext(ctx, "GET", check.URL, nil)
    if err != nil {
        return false, err
    }
    
    resp, err := client.Do(req)
    if err != nil {
        return false, err
    }
    defer resp.Body.Close()
    
    return resp.StatusCode >= 200 && resp.StatusCode < 300, nil
}

func (fc *FailoverController) getCurrentDNSRecord(ctx context.Context) (string, error) {
    input := &route53.ListResourceRecordSetsInput{
        HostedZoneId:    &fc.hostedZoneID,
        StartRecordName: &fc.domainName,
        StartRecordType: types.RRTypeA,
    }
    
    result, err := fc.route53Client.ListResourceRecordSets(ctx, input)
    if err != nil {
        return "", err
    }
    
    for _, record := range result.ResourceRecordSets {
        if *record.Name == fc.domainName+"." && record.Type == types.RRTypeA {
            if len(record.ResourceRecords) > 0 {
                return *record.ResourceRecords[0].Value, nil
            }
        }
    }
    
    return "", fmt.Errorf("DNS record not found")
}

func (fc *FailoverController) failoverToSecondary(ctx context.Context) error {
    return fc.updateDNSRecord(ctx, fc.secondaryIP)
}

func (fc *FailoverController) failbackToPrimary(ctx context.Context) error {
    return fc.updateDNSRecord(ctx, fc.primaryIP)
}

func (fc *FailoverController) updateDNSRecord(ctx context.Context, newIP string) error {
    input := &route53.ChangeResourceRecordSetsInput{
        HostedZoneId: &fc.hostedZoneID,
        ChangeBatch: &types.ChangeBatch{
            Changes: []types.Change{
                {
                    Action: types.ChangeActionUpsert,
                    ResourceRecordSet: &types.ResourceRecordSet{
                        Name: &fc.domainName,
                        Type: types.RRTypeA,
                        TTL:  aws.Int64(60), // Low TTL for faster failover
                        ResourceRecords: []types.ResourceRecord{
                            {
                                Value: &newIP,
                            },
                        },
                    },
                },
            },
        },
    }
    
    result, err := fc.route53Client.ChangeResourceRecordSets(ctx, input)
    if err != nil {
        return fmt.Errorf("failed to update DNS record: %w", err)
    }
    
    log.Printf("DNS record updated to %s, change ID: %s", newIP, *result.ChangeInfo.Id)
    
    // Wait for change to propagate
    return fc.waitForDNSChange(ctx, *result.ChangeInfo.Id)
}

func (fc *FailoverController) waitForDNSChange(ctx context.Context, changeID string) error {
    input := &route53.GetChangeInput{
        Id: &changeID,
    }
    
    for {
        result, err := fc.route53Client.GetChange(ctx, input)
        if err != nil {
            return err
        }
        
        if result.ChangeInfo.Status == types.ChangeStatusInsync {
            log.Printf("DNS change %s is now in sync", changeID)
            return nil
        }
        
        log.Printf("Waiting for DNS change %s to propagate (status: %s)", changeID, result.ChangeInfo.Status)
        time.Sleep(10 * time.Second)
    }
}

func main() {
    ctx := context.Background()
    
    controller, err := NewFailoverController(
        "Z1234567890ABC",                    // Hosted Zone ID
        "api.company.com",                   // Domain name
        "1.2.3.4",                          // Primary IP
        "5.6.7.8",                          // Secondary IP
    )
    if err != nil {
        log.Fatal(err)
    }
    
    if err := controller.Start(ctx); err != nil {
        log.Fatal(err)
    }
}
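
The controller above probes /health and /db-health over plain HTTP, so every region's deployment needs to expose those endpoints. A minimal sketch of the application side, assuming a Go service with a PostgreSQL connection (the port and connection string are illustrative; only the two paths are dictated by the controller's health checks):

// health/main.go (illustrative)
package main

import (
    "context"
    "database/sql"
    "log"
    "net/http"
    "os"
    "time"
    
    _ "github.com/lib/pq"
)

// registerHealthHandlers wires up the endpoints the failover controller probes.
func registerHealthHandlers(mux *http.ServeMux, db *sql.DB) {
    // /health: the process is up and able to serve requests.
    mux.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusOK)
        w.Write([]byte("ok"))
    })
    
    // /db-health: the region's database answers a ping within a short deadline.
    mux.HandleFunc("/db-health", func(w http.ResponseWriter, r *http.Request) {
        ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
        defer cancel()
        if err := db.PingContext(ctx); err != nil {
            http.Error(w, "database unreachable", http.StatusServiceUnavailable)
            return
        }
        w.WriteHeader(http.StatusOK)
        w.Write([]byte("ok"))
    })
}

func main() {
    db, err := sql.Open("postgres", os.Getenv("DATABASE_URL"))
    if err != nil {
        log.Fatal(err)
    }
    
    mux := http.NewServeMux()
    registerHealthHandlers(mux, db)
    log.Fatal(http.ListenAndServe(":8080", mux))
}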

Testing and Validation

Disaster Recovery Testing Framework

#!/bin/bash
# scripts/dr-test.sh

set -euo pipefail

# Configuration
TEST_TYPE="${1:-full}"  # full, partial, network, database
ENVIRONMENT="${2:-staging}"
NOTIFICATION_WEBHOOK="${NOTIFICATION_WEBHOOK:-}"

# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color

# Logging function
log() {
    echo -e "[$(date '+%Y-%m-%d %H:%M:%S')] $1"
}

success() {
    log "${GREEN}$1${NC}"
}

warning() {
    log "${YELLOW}$1${NC}"
}

error() {
    log "${RED}$1${NC}"
}

# Notification function
notify() {
    local message="$1"
    local status="${2:-info}"
    
    if [[ -n "${NOTIFICATION_WEBHOOK}" ]]; then
        curl -X POST "${NOTIFICATION_WEBHOOK}" \
            -H "Content-Type: application/json" \
            -d "{\"text\":\"DR Test ${status}: ${message}\"}" \
            > /dev/null 2>&1 || true
    fi
}

# Test functions
test_backup_restore() {
    log "Testing backup and restore functionality..."
    
    # Create test data
    kubectl exec -n production deployment/database -- \
        psql -U postgres -d production -c \
        "INSERT INTO dr_test (id, data, created_at) VALUES (999999, 'dr-test-data', NOW());"
    
    # Create an on-demand backup from the daily Velero schedule
    velero backup create "manual-backup-$(date +%s)" \
        --from-schedule daily-backup
    
    # Wait for backup to complete
    sleep 60
    
    # Simulate data loss
    kubectl exec -n production deployment/database -- \
        psql -U postgres -d production -c \
        "DELETE FROM dr_test WHERE id = 999999;"
    
    # Restore from backup
    LATEST_BACKUP=$(velero backup get -o json | \
        jq -r '.items | sort_by(.metadata.creationTimestamp) | last | .metadata.name')
    velero restore create test-restore-$(date +%s) --from-backup "${LATEST_BACKUP}"
    
    # Verify restoration
    sleep 120
    RESTORED_DATA=$(kubectl exec -n production deployment/database -- \
        psql -U postgres -d production -t -c \
        "SELECT data FROM dr_test WHERE id = 999999;")
    
    if [[ "${RESTORED_DATA}" == *"dr-test-data"* ]]; then
        success "Backup and restore test passed"
        return 0
    else
        error "Backup and restore test failed"
        return 1
    fi
}

test_failover() {
    log "Testing automated failover..."
    
    # Get current DNS record
    CURRENT_IP=$(dig +short api.${ENVIRONMENT}.company.com)
    log "Current IP: ${CURRENT_IP}"
    
    # Simulate primary region failure
    kubectl patch deployment -n production application \
        -p '{"spec":{"replicas":0}}'
    
    # Wait for health checks to detect failure
    sleep 180
    
    # Check if DNS has been updated
    NEW_IP=$(dig +short api.${ENVIRONMENT}.company.com)
    
    if [[ "${NEW_IP}" != "${CURRENT_IP}" ]]; then
        success "Failover test passed - DNS updated to ${NEW_IP}"
        
        # Restore primary region
        kubectl patch deployment -n production application \
            -p '{"spec":{"replicas":3}}'
        
        return 0
    else
        error "Failover test failed - DNS not updated"
        
        # Restore primary region
        kubectl patch deployment -n production application \
            -p '{"spec":{"replicas":3}}'
        
        return 1
    fi
}

test_network_partition() {
    log "Testing network partition scenarios..."
    
    # Create network policy to simulate partition
    kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: simulate-partition
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: application
  policyTypes:
  - Egress
  egress:
  - to: []
    ports:
    - protocol: TCP
      port: 53
EOF
    
    # Wait for policy to take effect
    sleep 30
    
    # Check application health
    HEALTH_STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
        "http://api.${ENVIRONMENT}.company.com/health" || true)
    
    # Remove network policy
    kubectl delete networkpolicy simulate-partition -n production
    
    if [[ "${HEALTH_STATUS}" == "503" ]]; then
        success "Network partition test passed - application properly degraded"
        return 0
    else
        warning "Network partition test inconclusive - status: ${HEALTH_STATUS}"
        return 1
    fi
}

test_database_failover() {
    log "Testing database failover..."
    
    # Get the current CloudNativePG primary instance
    PRIMARY_POD=$(kubectl get clusters.postgresql.cnpg.io -n database postgresql-primary \
        -o jsonpath='{.status.currentPrimary}')
    log "Current primary instance: ${PRIMARY_POD}"
    
    # Simulate a primary failure by deleting the primary pod
    kubectl delete pod -n database "${PRIMARY_POD}" --wait=false
    
    # Wait for the operator to promote a replica
    sleep 120
    
    # Check whether a replica has been promoted
    NEW_PRIMARY=$(kubectl get clusters.postgresql.cnpg.io -n database postgresql-primary \
        -o jsonpath='{.status.currentPrimary}')
    
    if [[ "${NEW_PRIMARY}" != "${PRIMARY_POD}" ]]; then
        success "Database failover test passed - new primary: ${NEW_PRIMARY}"
        return 0
    else
        error "Database failover test failed"
        return 1
    fi
}

# Main test execution
main() {
    log "Starting DR test suite - Type: ${TEST_TYPE}, Environment: ${ENVIRONMENT}"
    notify "Starting DR test suite - Type: ${TEST_TYPE}, Environment: ${ENVIRONMENT}"
    
    local failed_tests=0
    local total_tests=0
    
    case "${TEST_TYPE}" in
        "full")
            tests=("test_backup_restore" "test_failover" "test_network_partition" "test_database_failover")
            ;;
        "partial")
            tests=("test_backup_restore" "test_failover")
            ;;
        "network")
            tests=("test_network_partition")
            ;;
        "database")
            tests=("test_database_failover")
            ;;
        *)
            error "Unknown test type: ${TEST_TYPE}"
            exit 1
            ;;
    esac
    
    for test in "${tests[@]}"; do
        total_tests=$((total_tests + 1))
        log "Running ${test}..."
        
        if ! ${test}; then
            failed_tests=$((failed_tests + 1))
        fi
        
        # Wait between tests
        sleep 30
    done
    
    # Summary
    log "DR Test Summary:"
    log "Total tests: ${total_tests}"
    log "Failed tests: ${failed_tests}"
    log "Success rate: $(( (total_tests - failed_tests) * 100 / total_tests ))%"
    
    if [[ ${failed_tests} -eq 0 ]]; then
        success "All DR tests passed!"
        notify "All DR tests passed!" "success"
        exit 0
    else
        error "${failed_tests} DR tests failed!"
        notify "${failed_tests} DR tests failed!" "error"
        exit 1
    fi
}

# Cleanup function
cleanup() {
    log "Cleaning up test resources..."
    kubectl delete networkpolicy simulate-partition -n production 2>/dev/null || true
    kubectl patch deployment -n production application -p '{"spec":{"replicas":3}}' 2>/dev/null || true
}

# Set up signal handlers
trap cleanup EXIT
trap 'cleanup; exit 130' INT
trap 'cleanup; exit 143' TERM

# Run main function
main "$@"

Conclusion

Implementing a comprehensive disaster recovery and business continuity strategy for cloud-native applications requires careful planning, robust automation, and regular testing. This guide has covered the essential components:

  1. Backup Strategies: Automated, encrypted, and regularly tested backups for all critical data
  2. Multi-Region Deployments: Active-passive and active-active configurations for geographic redundancy
  3. Automated Failover: Health monitoring and automatic traffic routing during failures
  4. Testing Framework: Regular validation of DR procedures and recovery capabilities

Key success factors for DR implementation:

  • Define clear RTO and RPO objectives based on business requirements
  • Automate everything possible to reduce human error during crisis situations
  • Test regularly and comprehensively to ensure procedures work when needed
  • Monitor and alert on all aspects of the DR infrastructure
  • Document and train team members on DR procedures

By following these practices and implementing the solutions outlined in this guide, organizations can achieve robust disaster recovery capabilities that ensure business continuity even in the face of significant infrastructure failures.
