Disaster Recovery and Business Continuity for Cloud-Native Applications: A Comprehensive Strategy
In today’s digital landscape, ensuring business continuity and implementing robust disaster recovery (DR) strategies are critical for maintaining operational resilience. This guide explores how to design, implement, and maintain disaster recovery solutions for cloud-native applications, covering everything from backup strategies to automated failover mechanisms.
Understanding Disaster Recovery Fundamentals
Key Metrics and Objectives
Before implementing any DR strategy, it’s essential to understand the key metrics that define your recovery requirements:
- Recovery Time Objective (RTO): Maximum acceptable downtime
- Recovery Point Objective (RPO): Maximum acceptable data loss
- Mean Time to Recovery (MTTR): Average time to restore service
- Mean Time Between Failures (MTBF): Average time between system failures
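To make these objectives concrete, a team might encode its targets in code and check that the backup cadence can actually satisfy the RPO. A minimal sketch with hypothetical names and values:
// dr/objectives.go (hypothetical) — relating backup cadence to the RPO target
package dr

import "time"

const (
    TargetRTO = 30 * time.Minute // maximum acceptable downtime
    TargetRPO = 1 * time.Hour    // maximum acceptable data loss
)

// MeetsRPO reports whether a given backup interval can satisfy the RPO:
// in the worst case, everything written since the last backup is lost,
// so the interval must not exceed the RPO target.
func MeetsRPO(backupInterval time.Duration) bool {
    return backupInterval <= TargetRPO
}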
The diagram below maps common disaster types to the business impact they typically cause and to the recovery strategies most often used to address them:
graph TB
subgraph "Disaster Types"
A[Hardware Failures]
B[Software Failures]
C[Human Errors]
D[Natural Disasters]
E[Cyber Attacks]
F[Network Outages]
end
subgraph "Impact Assessment"
G[Service Disruption]
H[Data Loss]
I[Revenue Impact]
J[Reputation Damage]
K[Compliance Issues]
end
subgraph "Recovery Strategies"
L[Backup & Restore]
M[Pilot Light]
N[Warm Standby]
O[Multi-Site Active/Active]
end
A --> G
B --> H
C --> I
D --> J
E --> K
F --> G
G --> L
H --> M
I --> N
J --> O
K --> O
DR Strategy Classification
The right DR strategy depends on each application’s criticality and availability requirements:
| Strategy | RTO | RPO | Cost | Complexity | Use Case |
|---|---|---|---|---|---|
| Backup & Restore | Hours to Days | Hours | Low | Low | Non-critical applications |
| Pilot Light | 10-30 minutes | Minutes | Medium | Medium | Important applications |
| Warm Standby | 1-10 minutes | Seconds | High | High | Critical applications |
| Multi-Site Active/Active | Seconds | Near-zero | Very High | Very High | Mission-critical applications |
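One way to make this classification actionable is to record each workload’s tier as a label, which later tooling can select on (the Velero schedules below, for example, only back up workloads labelled backup: "enabled"). A hypothetical sketch:
# Hypothetical: tag namespaces with a DR tier and opt workloads into backups
kubectl label namespace production dr-tier=warm-standby --overwrite
kubectl label namespace reporting dr-tier=backup-restore --overwrite
kubectl label deployment -n production application backup=enabled --overwrite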
Comprehensive Backup Strategy
Kubernetes Backup with Velero
Velero is an open-source tool for backing up and restoring Kubernetes cluster resources and persistent volumes.
# velero/install.yaml
apiVersion: v1
kind: Namespace
metadata:
name: velero
---
apiVersion: v1
kind: Secret
metadata:
name: cloud-credentials
namespace: velero
type: Opaque
data:
cloud: <base64-encoded-credentials>
---
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
name: aws-backup-location
namespace: velero
spec:
provider: aws
objectStorage:
bucket: company-velero-backups
prefix: production-cluster
config:
region: us-west-2
s3ForcePathStyle: "false"
---
apiVersion: velero.io/v1
kind: VolumeSnapshotLocation
metadata:
name: aws-snapshot-location
namespace: velero
spec:
provider: aws
config:
region: us-west-2
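The same setup can also be produced with the Velero CLI instead of raw manifests; a rough equivalent is shown below (the plugin version and local credentials file are assumptions):
# Hypothetical CLI install mirroring the manifests above
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.0 \
  --bucket company-velero-backups \
  --prefix production-cluster \
  --secret-file ./credentials-velero \
  --backup-location-config region=us-west-2 \
  --snapshot-location-config region=us-west-2

# Confirm the backup storage location is reachable
velero backup-location get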
Automated Backup Schedules
# velero/schedules/daily-backup.yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
name: daily-backup
namespace: velero
spec:
schedule: "0 2 * * *" # Daily at 2 AM
template:
includedNamespaces:
- production
- staging
excludedNamespaces:
- kube-system
- velero
includedResources:
- "*"
excludedResources:
- events
- events.events.k8s.io
labelSelector:
matchLabels:
backup: "enabled"
snapshotVolumes: true
ttl: 720h # 30 days retention
storageLocation: aws-backup-location
volumeSnapshotLocations:
- aws-snapshot-location
---
apiVersion: velero.io/v1
kind: Schedule
metadata:
name: weekly-full-backup
namespace: velero
spec:
schedule: "0 1 * * 0" # Weekly on Sunday at 1 AM
template:
includedNamespaces:
- "*"
excludedNamespaces:
- kube-system
snapshotVolumes: true
ttl: 2160h # 90 days retention
storageLocation: aws-backup-location
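Schedules cover the steady state, but the on-demand path matters just as much during an incident. Roughly, an ad-hoc backup and a namespace-scoped restore look like this (backup names are placeholders):
# Trigger an on-demand backup using the daily schedule's template
velero backup create manual-$(date +%Y%m%d%H%M) --from-schedule daily-backup

# Inspect it and restore a single namespace from it
velero backup describe manual-202401150900 --details
velero restore create --from-backup manual-202401150900 --include-namespaces production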
Database Backup Automation
#!/bin/bash
# scripts/database-backup.sh
set -euo pipefail
# Configuration
DB_HOST="${DB_HOST:-localhost}"
DB_PORT="${DB_PORT:-5432}"
DB_NAME="${DB_NAME:-production}"
DB_USER="${DB_USER:-postgres}"
BACKUP_BUCKET="${BACKUP_BUCKET:-company-db-backups}"
RETENTION_DAYS="${RETENTION_DAYS:-30}"
ENCRYPTION_KEY="${ENCRYPTION_KEY:-/etc/backup/encryption.key}"
# Generate backup filename with timestamp
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="postgresql_${DB_NAME}_${TIMESTAMP}.sql.gz.enc"
LOCAL_BACKUP="/tmp/${BACKUP_FILE}"
# Function to log messages
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" >&2
}
# Function to cleanup on exit
cleanup() {
if [[ -f "${LOCAL_BACKUP}" ]]; then
rm -f "${LOCAL_BACKUP}"
fi
}
trap cleanup EXIT
# Create database backup
log "Starting database backup for ${DB_NAME}"
pg_dump -h "${DB_HOST}" -p "${DB_PORT}" -U "${DB_USER}" -d "${DB_NAME}" \
--verbose --no-password --format=custom --compress=9 \
| gzip \
| openssl enc -aes-256-cbc -salt -in - -out "${LOCAL_BACKUP}" -pass file:"${ENCRYPTION_KEY}"
# Verify backup file was created
if [[ ! -f "${LOCAL_BACKUP}" ]]; then
log "ERROR: Backup file was not created"
exit 1
fi
# Upload to S3
log "Uploading backup to S3"
aws s3 cp "${LOCAL_BACKUP}" "s3://${BACKUP_BUCKET}/postgresql/${BACKUP_FILE}" \
--storage-class STANDARD_IA \
--metadata "database=${DB_NAME},timestamp=${TIMESTAMP}"
# Verify upload
if aws s3 ls "s3://${BACKUP_BUCKET}/postgresql/${BACKUP_FILE}" > /dev/null; then
log "Backup successfully uploaded to S3"
else
log "ERROR: Failed to upload backup to S3"
exit 1
fi
# Cleanup old backups
log "Cleaning up old backups (older than ${RETENTION_DAYS} days)"
CUTOFF_DATE=$(date -d "${RETENTION_DAYS} days ago" +%Y%m%d)
aws s3 ls "s3://${BACKUP_BUCKET}/postgresql/" | while read -r line; do
    BACKUP_KEY=$(echo "$line" | awk '{print $4}')
    BACKUP_DATE=$(echo "${BACKUP_KEY}" | grep -o '[0-9]\{8\}' | head -1 || true)
    # Skip listing entries without a parseable timestamp so pipefail doesn't abort the loop
    if [[ -z "${BACKUP_DATE}" ]]; then
        continue
    fi
    if [[ "${BACKUP_DATE}" < "${CUTOFF_DATE}" ]]; then
        log "Deleting old backup: ${BACKUP_KEY}"
        aws s3 rm "s3://${BACKUP_BUCKET}/postgresql/${BACKUP_KEY}"
    fi
done
log "Database backup completed successfully"
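Backups are only half of the story; the restore path should be scripted and rehearsed too. The sketch below is the rough inverse of the backup pipeline above (the bucket, key path, and connection variables reuse the backup script's names and defaults, and are assumptions for the target environment):
#!/bin/bash
# scripts/database-restore.sh (hypothetical counterpart to database-backup.sh)
set -euo pipefail

BACKUP_BUCKET="${BACKUP_BUCKET:-company-db-backups}"
ENCRYPTION_KEY="${ENCRYPTION_KEY:-/etc/backup/encryption.key}"
DB_HOST="${DB_HOST:-localhost}"
DB_PORT="${DB_PORT:-5432}"
DB_NAME="${DB_NAME:-production}"
DB_USER="${DB_USER:-postgres}"

BACKUP_KEY="$1"   # e.g. postgresql/postgresql_production_<timestamp>.sql.gz.enc
aws s3 cp "s3://${BACKUP_BUCKET}/${BACKUP_KEY}" /tmp/restore.sql.gz.enc

# Decrypt with the same key used by the backup script, then decompress
openssl enc -aes-256-cbc -d -salt \
  -in /tmp/restore.sql.gz.enc -out /tmp/restore.dump.gz \
  -pass file:"${ENCRYPTION_KEY}"
gunzip -f /tmp/restore.dump.gz

# Restore the custom-format dump into the target database
pg_restore -h "${DB_HOST}" -p "${DB_PORT}" -U "${DB_USER}" \
  -d "${DB_NAME}" --clean --if-exists /tmp/restore.dump

rm -f /tmp/restore.dump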
Backup Verification and Testing
// backup/verify.go
package main
import (
    "context"
    "database/sql"
    "fmt"
    "io"
    "log"
    "os"
    "os/exec"
    "time"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/s3"
    _ "github.com/lib/pq"
)
type BackupVerifier struct {
s3Client *s3.Client
bucket string
testDBConn *sql.DB
}
func NewBackupVerifier(bucket string, testDBURL string) (*BackupVerifier, error) {
// Initialize AWS S3 client
cfg, err := config.LoadDefaultConfig(context.TODO())
if err != nil {
return nil, fmt.Errorf("failed to load AWS config: %w", err)
}
s3Client := s3.NewFromConfig(cfg)
// Connect to test database
testDB, err := sql.Open("postgres", testDBURL)
if err != nil {
return nil, fmt.Errorf("failed to connect to test database: %w", err)
}
return &BackupVerifier{
s3Client: s3Client,
bucket: bucket,
testDBConn: testDB,
}, nil
}
func (bv *BackupVerifier) VerifyLatestBackup(ctx context.Context) error {
// Find the latest backup
latestBackup, err := bv.findLatestBackup(ctx)
if err != nil {
return fmt.Errorf("failed to find latest backup: %w", err)
}
log.Printf("Verifying backup: %s", latestBackup)
// Download and decrypt backup
localFile, err := bv.downloadBackup(ctx, latestBackup)
if err != nil {
return fmt.Errorf("failed to download backup: %w", err)
}
defer os.Remove(localFile)
// Restore to test database
if err := bv.restoreToTestDB(localFile); err != nil {
return fmt.Errorf("failed to restore backup: %w", err)
}
// Verify data integrity
if err := bv.verifyDataIntegrity(); err != nil {
return fmt.Errorf("data integrity check failed: %w", err)
}
log.Printf("Backup verification completed successfully")
return nil
}
func (bv *BackupVerifier) findLatestBackup(ctx context.Context) (string, error) {
input := &s3.ListObjectsV2Input{
Bucket: &bv.bucket,
Prefix: aws.String("postgresql/"),
}
result, err := bv.s3Client.ListObjectsV2(ctx, input)
if err != nil {
return "", err
}
var latestKey string
var latestTime time.Time
for _, obj := range result.Contents {
if obj.LastModified.After(latestTime) {
latestTime = *obj.LastModified
latestKey = *obj.Key
}
}
if latestKey == "" {
return "", fmt.Errorf("no backups found")
}
return latestKey, nil
}
func (bv *BackupVerifier) downloadBackup(ctx context.Context, key string) (string, error) {
localFile := fmt.Sprintf("/tmp/backup_verify_%d.sql.gz.enc", time.Now().Unix())
// Download from S3
input := &s3.GetObjectInput{
Bucket: &bv.bucket,
Key: &key,
}
result, err := bv.s3Client.GetObject(ctx, input)
if err != nil {
return "", err
}
defer result.Body.Close()
// Save to local file
file, err := os.Create(localFile)
if err != nil {
return "", err
}
defer file.Close()
_, err = io.Copy(file, result.Body)
if err != nil {
return "", err
}
return localFile, nil
}
func (bv *BackupVerifier) restoreToTestDB(backupFile string) error {
// Decrypt backup
decryptedFile := backupFile + ".decrypted"
cmd := exec.Command("openssl", "enc", "-aes-256-cbc", "-d", "-salt",
"-in", backupFile, "-out", decryptedFile,
"-pass", "file:/etc/backup/encryption.key")
if err := cmd.Run(); err != nil {
return fmt.Errorf("failed to decrypt backup: %w", err)
}
defer os.Remove(decryptedFile)
// Decompress
decompressedFile := decryptedFile + ".sql"
cmd = exec.Command("gunzip", "-c", decryptedFile)
output, err := os.Create(decompressedFile)
if err != nil {
return err
}
defer output.Close()
defer os.Remove(decompressedFile)
cmd.Stdout = output
if err := cmd.Run(); err != nil {
return fmt.Errorf("failed to decompress backup: %w", err)
}
// Restore to test database
cmd = exec.Command("pg_restore", "-h", "test-db-host", "-U", "postgres",
"-d", "test_restore", "--clean", "--if-exists", decompressedFile)
if err := cmd.Run(); err != nil {
return fmt.Errorf("failed to restore backup: %w", err)
}
return nil
}
func (bv *BackupVerifier) verifyDataIntegrity() error {
// Perform basic data integrity checks
queries := []string{
"SELECT COUNT(*) FROM users",
"SELECT COUNT(*) FROM orders",
"SELECT COUNT(*) FROM products",
"SELECT MAX(created_at) FROM audit_log",
}
for _, query := range queries {
var result interface{}
err := bv.testDBConn.QueryRow(query).Scan(&result)
if err != nil {
return fmt.Errorf("integrity check failed for query '%s': %w", query, err)
}
log.Printf("Integrity check passed: %s = %v", query, result)
}
return nil
}
func main() {
verifier, err := NewBackupVerifier(
"company-db-backups",
"postgres://postgres:password@test-db-host:5432/test_restore?sslmode=disable",
)
if err != nil {
log.Fatal(err)
}
ctx := context.Background()
if err := verifier.VerifyLatestBackup(ctx); err != nil {
log.Fatal(err)
}
}
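To catch silent corruption early, the verifier should run on a schedule shortly after the backup window. One lightweight option (the image name is a placeholder) is an imperative CronJob:
# Hypothetical: run the backup verifier nightly at 4 AM, after the 2 AM backups
kubectl create cronjob backup-verify \
  --namespace velero \
  --image=registry.example.com/backup-verifier:latest \
  --schedule="0 4 * * *" \
  -- /backup-verifier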
Multi-Region Deployment Strategy
Active-Passive Multi-Region Setup
# infrastructure/multi-region/primary-region.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: region-config
namespace: kube-system
data:
region: "us-west-2"
role: "primary"
failover_region: "us-east-1"
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: application
namespace: production
spec:
replicas: 5
selector:
matchLabels:
app: application
template:
metadata:
labels:
app: application
region: primary
spec:
containers:
- name: app
image: myapp:latest
env:
- name: REGION
value: "us-west-2"
- name: DATABASE_URL
value: "postgres://primary-db.us-west-2.rds.amazonaws.com:5432/production"
- name: REDIS_URL
value: "redis://primary-redis.us-west-2.cache.amazonaws.com:6379"
resources:
requests:
memory: "256Mi"
cpu: "250m"
limits:
memory: "512Mi"
cpu: "500m"
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 5
periodSeconds: 5
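The passive region applies the same manifests with role: "secondary", a pilot-light replica count, and connection strings pointing at the local replicas. Activation is then a matter of scaling up and flipping the role; a hypothetical promotion script (kubectl context names are assumptions):
#!/bin/bash
# scripts/activate-secondary.sh (hypothetical) — promote the us-east-1 standby
set -euo pipefail

kubectl --context us-east-1 -n production scale deployment/application --replicas=5
kubectl --context us-east-1 -n kube-system patch configmap region-config \
  --type merge -p '{"data":{"role":"primary"}}'
kubectl --context us-east-1 -n production rollout status deployment/application --timeout=300s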
Database Replication Configuration
# infrastructure/database/postgresql-primary.yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
name: postgresql-primary
namespace: database
spec:
instances: 3
primaryUpdateStrategy: unsupervised
postgresql:
parameters:
max_connections: "200"
shared_buffers: "256MB"
effective_cache_size: "1GB"
wal_level: "replica"
max_wal_senders: "10"
max_replication_slots: "10"
hot_standby: "on"
bootstrap:
initdb:
database: production
owner: app_user
secret:
name: postgresql-credentials
storage:
size: 100Gi
storageClass: fast-ssd
monitoring:
enabled: true
backup:
retentionPolicy: "30d"
barmanObjectStore:
destinationPath: "s3://company-db-backups/postgresql"
s3Credentials:
accessKeyId:
name: backup-credentials
key: ACCESS_KEY_ID
secretAccessKey:
name: backup-credentials
key: SECRET_ACCESS_KEY
wal:
retention: "7d"
data:
retention: "30d"
---
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
name: postgresql-replica
namespace: database
spec:
instances: 2
bootstrap:
pg_basebackup:
source: postgresql-primary
externalClusters:
- name: postgresql-primary
connectionParameters:
host: postgresql-primary-rw
user: postgres
dbname: postgres
password:
name: postgresql-credentials
key: password
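Whatever replication mechanism is used, replication lag on the standby is effectively your achievable RPO, so it should be measured continuously. A minimal probe against a replica instance (the pod name follows CloudNativePG's cluster-N convention and is an assumption):
#!/bin/bash
# Hypothetical replication-lag probe for the standby cluster
LAG=$(kubectl exec -n database postgresql-replica-1 -- \
  psql -U postgres -Atc \
  "SELECT COALESCE(EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()), 0);")

echo "Replication lag: ${LAG}s"
if (( $(echo "${LAG} > 300" | bc -l) )); then
  echo "WARNING: replication lag exceeds 5 minutes - RPO at risk"
fi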
Cross-Region Data Synchronization
// sync/cross_region_sync.go
package main
import (
    "context"
    "database/sql"
    "encoding/json"
    "fmt"
    "log"
    "strings"
    "time"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/sqs"
    "github.com/aws/aws-sdk-go-v2/service/sqs/types"
    "github.com/go-redis/redis/v8"
    _ "github.com/lib/pq"
)
type CrossRegionSync struct {
primaryDB *sql.DB
secondaryDB *sql.DB
sqsClient *sqs.Client
redisClient *redis.Client
queueURL string
}
type SyncEvent struct {
Type string `json:"type"`
Table string `json:"table"`
Operation string `json:"operation"`
Data map[string]interface{} `json:"data"`
Timestamp time.Time `json:"timestamp"`
Region string `json:"region"`
}
func NewCrossRegionSync(primaryDBURL, secondaryDBURL, queueURL string) (*CrossRegionSync, error) {
// Connect to databases
primaryDB, err := sql.Open("postgres", primaryDBURL)
if err != nil {
return nil, fmt.Errorf("failed to connect to primary DB: %w", err)
}
secondaryDB, err := sql.Open("postgres", secondaryDBURL)
if err != nil {
return nil, fmt.Errorf("failed to connect to secondary DB: %w", err)
}
// Initialize AWS SQS client
cfg, err := config.LoadDefaultConfig(context.TODO())
if err != nil {
return nil, fmt.Errorf("failed to load AWS config: %w", err)
}
sqsClient := sqs.NewFromConfig(cfg)
// Initialize Redis client
redisClient := redis.NewClient(&redis.Options{
Addr: "redis-cluster.cache.amazonaws.com:6379",
Password: "",
DB: 0,
})
return &CrossRegionSync{
primaryDB: primaryDB,
secondaryDB: secondaryDB,
sqsClient: sqsClient,
redisClient: redisClient,
queueURL: queueURL,
}, nil
}
func (crs *CrossRegionSync) StartSyncWorker(ctx context.Context) error {
log.Println("Starting cross-region sync worker")
for {
select {
case <-ctx.Done():
return ctx.Err()
default:
if err := crs.processSyncEvents(ctx); err != nil {
log.Printf("Error processing sync events: %v", err)
time.Sleep(5 * time.Second)
}
}
}
}
func (crs *CrossRegionSync) processSyncEvents(ctx context.Context) error {
// Receive messages from SQS
input := &sqs.ReceiveMessageInput{
QueueUrl: &crs.queueURL,
MaxNumberOfMessages: 10,
WaitTimeSeconds: 20,
VisibilityTimeout: 300,
}
result, err := crs.sqsClient.ReceiveMessage(ctx, input)
if err != nil {
return fmt.Errorf("failed to receive messages: %w", err)
}
for _, message := range result.Messages {
if err := crs.processSyncEvent(ctx, *message.Body); err != nil {
log.Printf("Failed to process sync event: %v", err)
continue
}
// Delete message after successful processing
_, err := crs.sqsClient.DeleteMessage(ctx, &sqs.DeleteMessageInput{
QueueUrl: &crs.queueURL,
ReceiptHandle: message.ReceiptHandle,
})
if err != nil {
log.Printf("Failed to delete message: %v", err)
}
}
return nil
}
func (crs *CrossRegionSync) processSyncEvent(ctx context.Context, messageBody string) error {
var event SyncEvent
if err := json.Unmarshal([]byte(messageBody), &event); err != nil {
return fmt.Errorf("failed to unmarshal sync event: %w", err)
}
log.Printf("Processing sync event: %s %s on table %s", event.Operation, event.Type, event.Table)
switch event.Type {
case "database":
return crs.syncDatabaseEvent(ctx, event)
case "cache":
return crs.syncCacheEvent(ctx, event)
case "file":
return crs.syncFileEvent(ctx, event)
default:
return fmt.Errorf("unknown sync event type: %s", event.Type)
}
}
func (crs *CrossRegionSync) syncDatabaseEvent(ctx context.Context, event SyncEvent) error {
switch event.Operation {
case "INSERT":
return crs.replicateInsert(ctx, event)
case "UPDATE":
return crs.replicateUpdate(ctx, event)
case "DELETE":
return crs.replicateDelete(ctx, event)
default:
return fmt.Errorf("unknown database operation: %s", event.Operation)
}
}
func (crs *CrossRegionSync) replicateInsert(ctx context.Context, event SyncEvent) error {
// Build INSERT query dynamically
columns := make([]string, 0, len(event.Data))
placeholders := make([]string, 0, len(event.Data))
values := make([]interface{}, 0, len(event.Data))
i := 1
for column, value := range event.Data {
columns = append(columns, column)
placeholders = append(placeholders, fmt.Sprintf("$%d", i))
values = append(values, value)
i++
}
query := fmt.Sprintf(
"INSERT INTO %s (%s) VALUES (%s) ON CONFLICT DO NOTHING",
event.Table,
strings.Join(columns, ", "),
strings.Join(placeholders, ", "),
)
_, err := crs.secondaryDB.ExecContext(ctx, query, values...)
if err != nil {
return fmt.Errorf("failed to replicate insert: %w", err)
}
return nil
}
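replicateUpdate and replicateDelete are dispatched above but not shown. Minimal sketches follow, under the assumption that every replicated row carries an id primary-key column in event.Data:
func (crs *CrossRegionSync) replicateUpdate(ctx context.Context, event SyncEvent) error {
    // Build "column = $n" assignments for every column except the primary key
    setClauses := make([]string, 0, len(event.Data))
    values := make([]interface{}, 0, len(event.Data))
    i := 1
    for column, value := range event.Data {
        if column == "id" {
            continue
        }
        setClauses = append(setClauses, fmt.Sprintf("%s = $%d", column, i))
        values = append(values, value)
        i++
    }
    values = append(values, event.Data["id"])
    query := fmt.Sprintf("UPDATE %s SET %s WHERE id = $%d",
        event.Table, strings.Join(setClauses, ", "), i)
    if _, err := crs.secondaryDB.ExecContext(ctx, query, values...); err != nil {
        return fmt.Errorf("failed to replicate update: %w", err)
    }
    return nil
}

func (crs *CrossRegionSync) replicateDelete(ctx context.Context, event SyncEvent) error {
    query := fmt.Sprintf("DELETE FROM %s WHERE id = $1", event.Table)
    if _, err := crs.secondaryDB.ExecContext(ctx, query, event.Data["id"]); err != nil {
        return fmt.Errorf("failed to replicate delete: %w", err)
    }
    return nil
}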
func (crs *CrossRegionSync) syncCacheEvent(ctx context.Context, event SyncEvent) error {
switch event.Operation {
case "SET":
key := event.Data["key"].(string)
value := event.Data["value"]
ttl := time.Duration(event.Data["ttl"].(float64)) * time.Second
return crs.redisClient.Set(ctx, key, value, ttl).Err()
case "DELETE":
key := event.Data["key"].(string)
return crs.redisClient.Del(ctx, key).Err()
default:
return fmt.Errorf("unknown cache operation: %s", event.Operation)
}
}
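syncFileEvent is also referenced by the dispatcher but not defined above. Object replication is usually better delegated to S3 Cross-Region Replication on the bucket itself, so a placeholder that simply acknowledges the event keeps the worker compiling until that is wired up:
func (crs *CrossRegionSync) syncFileEvent(ctx context.Context, event SyncEvent) error {
    // Assumes objects are copied by S3 Cross-Region Replication;
    // the event is acknowledged here only so the queue can be drained.
    log.Printf("file sync event for %v acknowledged (handled by bucket replication)", event.Data["key"])
    return nil
}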
func (crs *CrossRegionSync) PublishSyncEvent(ctx context.Context, event SyncEvent) error {
event.Timestamp = time.Now()
event.Region = "us-west-2" // Current region
messageBody, err := json.Marshal(event)
if err != nil {
return fmt.Errorf("failed to marshal sync event: %w", err)
}
input := &sqs.SendMessageInput{
QueueUrl: &crs.queueURL,
MessageBody: aws.String(string(messageBody)),
MessageAttributes: map[string]types.MessageAttributeValue{
"Type": {
DataType: aws.String("String"),
StringValue: aws.String(event.Type),
},
"Table": {
DataType: aws.String("String"),
StringValue: aws.String(event.Table),
},
},
}
_, err = crs.sqsClient.SendMessage(ctx, input)
if err != nil {
return fmt.Errorf("failed to publish sync event: %w", err)
}
return nil
}
func main() {
ctx := context.Background()
sync, err := NewCrossRegionSync(
"postgres://user:pass@primary-db.us-west-2.rds.amazonaws.com:5432/production",
"postgres://user:pass@secondary-db.us-east-1.rds.amazonaws.com:5432/production",
"https://sqs.us-west-2.amazonaws.com/123456789012/cross-region-sync",
)
if err != nil {
log.Fatal(err)
}
if err := sync.StartSyncWorker(ctx); err != nil {
log.Fatal(err)
}
}
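On the application side, the write path publishes an event after the local commit succeeds. A hypothetical call site (table and column names are illustrative):
// Hypothetical usage: publish a sync event after a successful local write
func publishOrderInsert(ctx context.Context, sync *CrossRegionSync, orderID int) {
    event := SyncEvent{
        Type:      "database",
        Table:     "orders",
        Operation: "INSERT",
        Data: map[string]interface{}{
            "id":     orderID,
            "status": "created",
        },
    }
    if err := sync.PublishSyncEvent(ctx, event); err != nil {
        // The local write has already committed; log and rely on later reconciliation
        log.Printf("failed to publish sync event: %v", err)
    }
}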
Automated Failover Mechanisms
Health Check and Failover Controller
// failover/controller.go
package main
import (
    "context"
    "fmt"
    "log"
    "net/http"
    "time"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/route53"
    "github.com/aws/aws-sdk-go-v2/service/route53/types"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)
type FailoverController struct {
k8sClient kubernetes.Interface
route53Client *route53.Client
hostedZoneID string
domainName string
primaryIP string
secondaryIP string
healthChecks []HealthCheck
}
type HealthCheck struct {
Name string
URL string
Timeout time.Duration
Interval time.Duration
}
func NewFailoverController(hostedZoneID, domainName, primaryIP, secondaryIP string) (*FailoverController, error) {
// Initialize Kubernetes client (restConfig avoids shadowing the AWS config package below)
restConfig, err := rest.InClusterConfig()
if err != nil {
return nil, fmt.Errorf("failed to create k8s config: %w", err)
}
k8sClient, err := kubernetes.NewForConfig(restConfig)
if err != nil {
return nil, fmt.Errorf("failed to create k8s client: %w", err)
}
// Initialize AWS Route53 client
cfg, err := config.LoadDefaultConfig(context.TODO())
if err != nil {
return nil, fmt.Errorf("failed to load AWS config: %w", err)
}
route53Client := route53.NewFromConfig(cfg)
healthChecks := []HealthCheck{
{
Name: "application-health",
URL: fmt.Sprintf("http://%s/health", primaryIP),
Timeout: 5 * time.Second,
Interval: 30 * time.Second,
},
{
Name: "database-health",
URL: fmt.Sprintf("http://%s/db-health", primaryIP),
Timeout: 10 * time.Second,
Interval: 60 * time.Second,
},
}
return &FailoverController{
k8sClient: k8sClient,
route53Client: route53Client,
hostedZoneID: hostedZoneID,
domainName: domainName,
primaryIP: primaryIP,
secondaryIP: secondaryIP,
healthChecks: healthChecks,
}, nil
}
func (fc *FailoverController) Start(ctx context.Context) error {
log.Println("Starting failover controller")
ticker := time.NewTicker(30 * time.Second)
defer ticker.Stop()
for {
select {
case <-ctx.Done():
return ctx.Err()
case <-ticker.C:
if err := fc.performHealthChecks(ctx); err != nil {
log.Printf("Health check failed: %v", err)
}
}
}
}
func (fc *FailoverController) performHealthChecks(ctx context.Context) error {
allHealthy := true
for _, check := range fc.healthChecks {
healthy, err := fc.performHealthCheck(ctx, check)
if err != nil {
log.Printf("Health check %s failed: %v", check.Name, err)
allHealthy = false
} else if !healthy {
log.Printf("Health check %s is unhealthy", check.Name)
allHealthy = false
}
}
// Get current DNS record
currentRecord, err := fc.getCurrentDNSRecord(ctx)
if err != nil {
return fmt.Errorf("failed to get current DNS record: %w", err)
}
// Determine if failover is needed
if !allHealthy && currentRecord == fc.primaryIP {
log.Println("Primary region is unhealthy, initiating failover to secondary region")
if err := fc.failoverToSecondary(ctx); err != nil {
return fmt.Errorf("failover to secondary failed: %w", err)
}
} else if allHealthy && currentRecord == fc.secondaryIP {
log.Println("Primary region is healthy, failing back to primary region")
if err := fc.failbackToPrimary(ctx); err != nil {
return fmt.Errorf("failback to primary failed: %w", err)
}
}
return nil
}
func (fc *FailoverController) performHealthCheck(ctx context.Context, check HealthCheck) (bool, error) {
client := &http.Client{
Timeout: check.Timeout,
}
req, err := http.NewRequestWithContext(ctx, "GET", check.URL, nil)
if err != nil {
return false, err
}
resp, err := client.Do(req)
if err != nil {
return false, err
}
defer resp.Body.Close()
return resp.StatusCode >= 200 && resp.StatusCode < 300, nil
}
func (fc *FailoverController) getCurrentDNSRecord(ctx context.Context) (string, error) {
input := &route53.ListResourceRecordSetsInput{
HostedZoneId: &fc.hostedZoneID,
StartRecordName: &fc.domainName,
StartRecordType: types.RRTypeA,
}
result, err := fc.route53Client.ListResourceRecordSets(ctx, input)
if err != nil {
return "", err
}
for _, record := range result.ResourceRecordSets {
if *record.Name == fc.domainName+"." && record.Type == types.RRTypeA {
if len(record.ResourceRecords) > 0 {
return *record.ResourceRecords[0].Value, nil
}
}
}
return "", fmt.Errorf("DNS record not found")
}
func (fc *FailoverController) failoverToSecondary(ctx context.Context) error {
return fc.updateDNSRecord(ctx, fc.secondaryIP)
}
func (fc *FailoverController) failbackToPrimary(ctx context.Context) error {
return fc.updateDNSRecord(ctx, fc.primaryIP)
}
func (fc *FailoverController) updateDNSRecord(ctx context.Context, newIP string) error {
input := &route53.ChangeResourceRecordSetsInput{
HostedZoneId: &fc.hostedZoneID,
ChangeBatch: &types.ChangeBatch{
Changes: []types.Change{
{
Action: types.ChangeActionUpsert,
ResourceRecordSet: &types.ResourceRecordSet{
Name: &fc.domainName,
Type: types.RRTypeA,
TTL: aws.Int64(60), // Low TTL for faster failover
ResourceRecords: []types.ResourceRecord{
{
Value: &newIP,
},
},
},
},
},
},
}
result, err := fc.route53Client.ChangeResourceRecordSets(ctx, input)
if err != nil {
return fmt.Errorf("failed to update DNS record: %w", err)
}
log.Printf("DNS record updated to %s, change ID: %s", newIP, *result.ChangeInfo.Id)
// Wait for change to propagate
return fc.waitForDNSChange(ctx, *result.ChangeInfo.Id)
}
func (fc *FailoverController) waitForDNSChange(ctx context.Context, changeID string) error {
input := &route53.GetChangeInput{
Id: &changeID,
}
for {
result, err := fc.route53Client.GetChange(ctx, input)
if err != nil {
return err
}
if result.ChangeInfo.Status == types.ChangeStatusInsync {
log.Printf("DNS change %s is now in sync", changeID)
return nil
}
log.Printf("Waiting for DNS change %s to propagate (status: %s)", changeID, result.ChangeInfo.Status)
time.Sleep(10 * time.Second)
}
}
func main() {
ctx := context.Background()
controller, err := NewFailoverController(
"Z1234567890ABC", // Hosted Zone ID
"api.company.com", // Domain name
"1.2.3.4", // Primary IP
"5.6.7.8", // Secondary IP
)
if err != nil {
log.Fatal(err)
}
if err := controller.Start(ctx); err != nil {
log.Fatal(err)
}
}
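A custom controller gives full control over the failover decision, but Route 53 can also perform the health checking and DNS switch natively via health checks attached to failover routing policies. A rough sketch with the AWS CLI (IDs, IPs, and the change-batch file are placeholders):
# Hypothetical: let Route 53 handle failover natively
aws route53 create-health-check \
  --caller-reference "api-primary-$(date +%s)" \
  --health-check-config "IPAddress=1.2.3.4,Port=8080,Type=HTTP,ResourcePath=/health,RequestInterval=30,FailureThreshold=3"

# Attach the health check to a PRIMARY failover record; a matching SECONDARY
# record pointing at 5.6.7.8 takes over automatically when the check fails.
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890ABC \
  --change-batch file://failover-records.json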
Testing and Validation
Disaster Recovery Testing Framework
#!/bin/bash
# scripts/dr-test.sh
set -euo pipefail
# Configuration
TEST_TYPE="${1:-full}" # full, partial, network, database
ENVIRONMENT="${2:-staging}"
NOTIFICATION_WEBHOOK="${NOTIFICATION_WEBHOOK:-}"
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color
# Logging function
log() {
echo -e "[$(date '+%Y-%m-%d %H:%M:%S')] $1"
}
success() {
log "${GREEN}✓ $1${NC}"
}
warning() {
log "${YELLOW}⚠ $1${NC}"
}
error() {
log "${RED}✗ $1${NC}"
}
# Notification function
notify() {
local message="$1"
local status="${2:-info}"
if [[ -n "${NOTIFICATION_WEBHOOK}" ]]; then
curl -X POST "${NOTIFICATION_WEBHOOK}" \
-H "Content-Type: application/json" \
-d "{\"text\":\"DR Test ${status}: ${message}\"}" \
> /dev/null 2>&1 || true
fi
}
# Test functions
test_backup_restore() {
log "Testing backup and restore functionality..."
# Create test data
kubectl exec -n production deployment/database -- \
psql -U postgres -d production -c \
"INSERT INTO dr_test (id, data, created_at) VALUES (999999, 'dr-test-data', NOW());"
# Create an on-demand backup from the Velero schedule's template
velero backup create manual-backup-$(date +%s) --from-schedule daily-backup
# Wait for backup to complete
sleep 60
# Simulate data loss
kubectl exec -n production deployment/database -- \
psql -U postgres -d production -c \
"DELETE FROM dr_test WHERE id = 999999;"
# Restore from backup
LATEST_BACKUP=$(velero backup get -o json | jq -r '.items[0].metadata.name')
velero restore create test-restore-$(date +%s) --from-backup "${LATEST_BACKUP}"
# Verify restoration
sleep 120
RESTORED_DATA=$(kubectl exec -n production deployment/database -- \
psql -U postgres -d production -t -c \
"SELECT data FROM dr_test WHERE id = 999999;")
if [[ "${RESTORED_DATA}" == *"dr-test-data"* ]]; then
success "Backup and restore test passed"
return 0
else
error "Backup and restore test failed"
return 1
fi
}
test_failover() {
log "Testing automated failover..."
# Get current DNS record
CURRENT_IP=$(dig +short api.${ENVIRONMENT}.company.com)
log "Current IP: ${CURRENT_IP}"
# Simulate primary region failure
kubectl patch deployment -n production application \
-p '{"spec":{"replicas":0}}'
# Wait for health checks to detect failure
sleep 180
# Check if DNS has been updated
NEW_IP=$(dig +short api.${ENVIRONMENT}.company.com)
if [[ "${NEW_IP}" != "${CURRENT_IP}" ]]; then
success "Failover test passed - DNS updated to ${NEW_IP}"
# Restore primary region
kubectl patch deployment -n production application \
-p '{"spec":{"replicas":3}}'
return 0
else
error "Failover test failed - DNS not updated"
# Restore primary region
kubectl patch deployment -n production application \
-p '{"spec":{"replicas":3}}'
return 1
fi
}
test_network_partition() {
log "Testing network partition scenarios..."
# Create network policy to simulate partition
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: simulate-partition
namespace: production
spec:
podSelector:
matchLabels:
app: application
policyTypes:
- Egress
egress:
- to: []
ports:
- protocol: TCP
port: 53
EOF
# Wait for policy to take effect
sleep 30
# Check application health
HEALTH_STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
http://api.${ENVIRONMENT}.company.com/health || echo "000")
# Remove network policy
kubectl delete networkpolicy simulate-partition -n production
if [[ "${HEALTH_STATUS}" == "503" ]]; then
success "Network partition test passed - application properly degraded"
return 0
else
warning "Network partition test inconclusive - status: ${HEALTH_STATUS}"
return 1
fi
}
test_database_failover() {
    log "Testing database failover..."
    # Get the current primary instance of the CloudNativePG cluster
    PRIMARY_POD=$(kubectl get clusters.postgresql.cnpg.io -n database postgresql-primary \
        -o jsonpath='{.status.currentPrimary}')
    log "Current primary instance: ${PRIMARY_POD}"
    # Simulate a primary failure by deleting its pod
    kubectl delete pod -n database "${PRIMARY_POD}" --grace-period=0 --force
    # Wait for the operator to promote a replica
    sleep 120
    # Check whether a different instance has been promoted
    NEW_PRIMARY=$(kubectl get clusters.postgresql.cnpg.io -n database postgresql-primary \
        -o jsonpath='{.status.currentPrimary}')
    if [[ "${NEW_PRIMARY}" != "${PRIMARY_POD}" ]]; then
        success "Database failover test passed - new primary: ${NEW_PRIMARY}"
        return 0
    else
        error "Database failover test failed"
        return 1
    fi
}
# Main test execution
main() {
log "Starting DR test suite - Type: ${TEST_TYPE}, Environment: ${ENVIRONMENT}"
notify "Starting DR test suite - Type: ${TEST_TYPE}, Environment: ${ENVIRONMENT}"
local failed_tests=0
local total_tests=0
case "${TEST_TYPE}" in
"full")
tests=("test_backup_restore" "test_failover" "test_network_partition" "test_database_failover")
;;
"partial")
tests=("test_backup_restore" "test_failover")
;;
"network")
tests=("test_network_partition")
;;
"database")
tests=("test_database_failover")
;;
*)
error "Unknown test type: ${TEST_TYPE}"
exit 1
;;
esac
for test in "${tests[@]}"; do
total_tests=$((total_tests + 1))
log "Running ${test}..."
if ! ${test}; then
failed_tests=$((failed_tests + 1))
fi
# Wait between tests
sleep 30
done
# Summary
log "DR Test Summary:"
log "Total tests: ${total_tests}"
log "Failed tests: ${failed_tests}"
log "Success rate: $(( (total_tests - failed_tests) * 100 / total_tests ))%"
if [[ ${failed_tests} -eq 0 ]]; then
success "All DR tests passed!"
notify "All DR tests passed!" "success"
exit 0
else
error "${failed_tests} DR tests failed!"
notify "${failed_tests} DR tests failed!" "error"
exit 1
fi
}
# Cleanup function
cleanup() {
log "Cleaning up test resources..."
kubectl delete networkpolicy simulate-partition -n production 2>/dev/null || true
kubectl patch deployment -n production application -p '{"spec":{"replicas":3}}' 2>/dev/null || true
}
# Set up signal handlers
trap cleanup EXIT
trap 'cleanup; exit 130' INT
trap 'cleanup; exit 143' TERM
# Run main function
main "$@"
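A DR test suite only builds confidence if it runs on a cadence rather than ad hoc. For example, a cron entry on a bastion host or CI runner (paths are placeholders) could run the partial suite weekly and the full suite quarterly:
# Hypothetical crontab entries for recurring DR tests
0 3 * * 1    /opt/dr/scripts/dr-test.sh partial staging
0 2 1 */3 *  /opt/dr/scripts/dr-test.sh full staging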
Conclusion
Implementing a comprehensive disaster recovery and business continuity strategy for cloud-native applications requires careful planning, robust automation, and regular testing. This guide has covered the essential components:
- Backup Strategies: Automated, encrypted, and regularly tested backups for all critical data
- Multi-Region Deployments: Active-passive and active-active configurations for geographic redundancy
- Automated Failover: Health monitoring and automatic traffic routing during failures
- Testing Framework: Regular validation of DR procedures and recovery capabilities
Key success factors for DR implementation:
- Define clear RTO and RPO objectives based on business requirements
- Automate everything possible to reduce human error during crisis situations
- Test regularly and comprehensively to ensure procedures work when needed
- Monitor and alert on all aspects of the DR infrastructure
- Document and train team members on DR procedures
By following these practices and implementing the solutions outlined in this guide, organizations can achieve robust disaster recovery capabilities that ensure business continuity even in the face of significant infrastructure failures.