灾难恢复测试 disaster-recovery-testing

灾难恢复测试是一套系统性的测试流程,用于验证数据备份和恢复程序的有效性,确保在灾难发生时能够迅速恢复业务运营,关键指标包括恢复时间目标(RTO)和恢复点目标(RPO)。

云安全 0 次安装 0 次浏览 更新于 3/3/2026

name: 灾难恢复测试 description: 执行全面的灾难恢复测试,验证恢复程序,并从DR演习中记录经验教训。

灾难恢复测试

概览

实施系统性的灾难恢复测试,以验证恢复程序,测量RTO/RPO,识别差距,并确保团队对实际事件的准备情况。

何时使用

  • 年度DR演习
  • 基础设施变更
  • 新服务部署
  • 合规要求
  • 团队培训
  • 恢复程序验证
  • 跨区域故障转移测试

实施示例

1. DR测试计划和执行

# dr-test-plan.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: dr-test-procedures
  namespace: operations
data:
  dr-test-plan.md: |
    # 灾难恢复测试计划

    ## 测试目标
    - 验证备份恢复程序
    - 验证故障转移机制
    - 测试DNS故障转移
    - 验证恢复后的数据完整性
    - 测量RTO和RPO
    - 培训事件响应团队

    ## 测试前检查清单
    - [ ] 通知利益相关者
    - [ ] 安排4-6小时窗口
    - [ ] 禁用告警以防止噪声
    - [ ] 备份生产数据
    - [ ] 确保DR环境是隔离的
    - [ ] 准备好回滚计划

    ## 测试范围
    - 主数据库故障转移到备用
    - 应用程序故障转移到DR站点
    - DNS解析更新
    - 负载均衡器健康检查
    - 数据同步验证

    ## 成功标准
    - RTO: < 1小时
    - RPO: < 15分钟
    - 零数据丢失
    - 所有服务运行
    - 告警功能正常

    ## 测试后活动
    - 记录时间线
    - 识别差距
    - 更新程序
    - 安排事后分析
    - 更新团队文档

---
apiVersion: batch/v1
kind: Job
metadata:
  name: dr-test-executor
  namespace: operations
spec:
  template:
    spec:
      serviceAccountName: dr-test-sa
      containers:
        - name: executor
          image: alpine:latest
          env:
            - name: TEST_ID
              value: "dr-test-$(date +%s)"
            - name: BACKUP_BUCKET
              value: "s3://my-backups"
            - name: DR_NAMESPACE
              value: "dr-test"
          command:
            - sh
            - -c
            - |
              apk add --no-cache aws-cli kubectl jq postgresql-client mysql-client

              echo "Starting DR Test: $TEST_ID"

              # 第1步:创建测试命名空间
              echo "Creating isolated test environment..."
              kubectl create namespace "$DR_NAMESPACE" --dry-run=client -o yaml | kubectl apply -f -

              # 第2步:从备份恢复数据库
              echo "Restoring database from latest backup..."
              LATEST_BACKUP=$(aws s3 ls "$BACKUP_BUCKET/databases/" | \
                sort | tail -n 1 | awk '{print $4}')

              aws s3 cp "$BACKUP_BUCKET/databases/$LATEST_BACKUP" - | \
                gunzip | psql postgres://user:pass@dr-db:5432/testdb

              # 第3步:将应用程序部署到DR命名空间
              echo "Deploying application to DR environment..."
              kubectl set image deployment/myapp \
                myapp=myrepo/myapp:production \
                -n "$DR_NAMESPACE"

              # 第4步:运行健康检查
              echo "Running health checks..."
              for i in {1..30}; do
                if curl -sf http://myapp-dr/health > /dev/null; then
                  echo "Health check passed"
                  break
                fi
                echo "Waiting for service to be healthy... ($i/30)"
                sleep 10
              done

              # 第5步:运行烟雾测试
              echo "Running smoke tests..."
              kubectl exec -it deployment/myapp -n "$DR_NAMESPACE" -- \
                npm run test:smoke || exit 1

              # 第6步:验证数据完整性
              echo "Validating data integrity..."
              PROD_RECORD_COUNT=$(psql postgres://user:pass@prod-db:5432/mydb \
                -t -c "SELECT COUNT(*) FROM users;")
              DR_RECORD_COUNT=$(psql postgres://user:pass@dr-db:5432/testdb \
                -t -c "SELECT COUNT(*) FROM users;")

              if [ "$PROD_RECORD_COUNT" -eq "$DR_RECORD_COUNT" ]; then
                echo "Data integrity verified"
              else
                echo "Data integrity check failed"
                exit 1
              fi

              # 第7步:记录指标
              echo "Recording DR test metrics..."
              kubectl logs deployment/myapp -n "$DR_NAMESPACE" | \
                grep "startup_time" | jq '.' > /tmp/dr-metrics-$TEST_ID.json

              echo "DR Test Complete: $TEST_ID"

          restartPolicy: Never

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: dr-test-sa
  namespace: operations

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: dr-test
rules:
  - apiGroups: [""]
    resources: ["namespaces"]
    verbs: ["create", "get", "list"]
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["create", "get", "list", "patch", "set"]
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list"]
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["create", "get"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: dr-test
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: dr-test
subjects:
  - kind: ServiceAccount
    name: dr-test-sa
    namespace: operations

2. DR测试脚本

#!/bin/bash
# execute-dr-test.sh - 全面DR测试执行

set -euo pipefail

TEST_ID="dr-test-$(date +%Y%m%d-%H%M%S)"
LOG_FILE="/tmp/dr-test-${TEST_ID}.log"
METRICS_FILE="/tmp/dr-metrics-${TEST_ID}.json"

# Logging
exec 1> >(tee -a "$LOG_FILE")
exec 2>&1

log_info() {
    echo "[INFO] $(date '+%Y-%m-%d %H:%M:%S') $1"
}

log_error() {
    echo "[ERROR] $(date '+%Y-%m-%d %H:%M:%S') $1"
}

# 开始时间
START_TIME=$(date +%s)

log_info "Starting DR Test: $TEST_ID"

# 禁用生产监控
log_info "Disabling production alerts..."
aws sns set-topic-attributes \
    --topic-arn "arn:aws:sns:us-east-1:123456789012:prod-alerts" \
    --attribute-name DisplayName \
    --attribute-value "DR Test - Alerts Disabled"

# 第1阶段:备份验证
log_info "Phase 1: Validating backups..."
if ! aws s3 ls s3://my-backups/databases/ | grep -q "sql.gz"; then
    log_error "No valid backups found"
    exit 1
fi

# 第2阶段:环境设置
log_info "Phase 2: Setting up DR environment..."
LATEST_BACKUP=$(aws s3 ls s3://my-backups/databases/ | \
    sort | tail -n 1 | awk '{print $4}')

log_info "Using backup: $LATEST_BACKUP"
aws s3 cp "s3://my-backups/databases/$LATEST_BACKUP" - | gunzip > /tmp/restore.sql

# 第3阶段:数据库恢复
log_info "Phase 3: Restoring database..."
psql -h dr-db.internal -U postgres -d postgres -f /tmp/restore.sql > /dev/null 2>&1

# 第4阶段:应用程序部署
log_info "Phase 4: Deploying application..."
kubectl create namespace dr-test --dry-run=client -o yaml | kubectl apply -f -
kubectl apply -f dr-deployment.yaml -n dr-test
kubectl rollout status deployment/myapp -n dr-test --timeout=10m

# 第5阶段:健康检查
log_info "Phase 5: Running health checks..."
HEALTH_CHECK_START=$(date +%s)

for i in {1..60}; do
    if curl -sf --max-time 5 http://myapp-dr.internal/health > /dev/null 2>&1; then
        HEALTH_CHECK_TIME=$(($(date +%s) - HEALTH_CHECK_START))
        log_info "Health check passed in ${HEALTH_CHECK_TIME}s"
        break
    fi
    if [ $i -eq 60 ]; then
        log_error "Health check timeout"
        exit 1
    fi
    sleep 10
done

# 第6阶段:数据完整性
log_info "Phase 6: Validating data integrity..."
PROD_HASH=$(psql -h prod-db.internal -U postgres -d mydb -t -c \
    "SELECT md5(string_agg(CAST(id AS text), ',')) FROM users ORDER BY id;")
DR_HASH=$(psql -h dr-db.internal -U postgres -d mydb -t -c \
    "SELECT md5(string_agg(CAST(id AS text), ',')) FROM users ORDER BY id;")

if [ "$PROD_HASH" = "$DR_HASH" ]; then
    log_info "Data integrity verified"
else
    log_error "Data integrity check failed: $PROD_HASH != $DR_HASH"
fi

# 第7阶段:烟雾测试
log_info "Phase 7: Running smoke tests..."
kubectl exec -it deployment/myapp -n dr-test -- npm run test:smoke || \
    log_error "Smoke tests failed"

# 记录指标
END_TIME=$(date +%s)
TOTAL_TIME=$((END_TIME - START_TIME))
RTO=$TOTAL_TIME
RPO=$(date -d "$(aws s3api head-object --bucket my-backups --key databases/$LATEST_BACKUP --query 'LastModified' --output text)" +%s)

log_info "DR Test Complete"
log_info "Total time: ${TOTAL_TIME}s"
log_info "RTO: ${RTO}s (target: 3600s)"
log_info "RPO: $(date -d @$RPO)"

# 生成报告
cat > "$METRICS_FILE" <<EOF
{
  "test_id": "$TEST_ID",
  "start_time": $START_TIME,
  "end_time": $END_TIME,
  "rto_seconds": $RTO,
  "rpo_timestamp": $RPO,
  "data_integrity": "PASS",
  "health_check": "PASS",
  "smoke_tests": "PASS"
}
EOF

log_info "Metrics saved to: $METRICS_FILE"

# 重新启用监控
log_info "Re-enabling production alerts..."
aws sns set-topic-attributes \
    --topic-arn "arn:aws:sns:us-east-1:123456789012:prod-alerts" \
    --attribute-name DisplayName \
    --attribute-value "Production Alerts"

log_info "Test artifacts: $LOG_FILE, $METRICS_FILE"

3. DR测试自动化

# scheduled-dr-tests.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: quarterly-dr-test
  namespace: operations
spec:
  # 每季度的第一个星期一的凌晨2点运行
  schedule: "0 2 2-8 1,4,7,10 MON"
  jobTemplate:
    spec:
      backoffLimit: 0
      template:
        spec:
          serviceAccountName: dr-test-sa
          containers:
            - name: dr-test
              image: myrepo/dr-test:latest
              command:
                - /usr/local/bin/execute-dr-test.sh
              env:
                - name: SLACK_WEBHOOK
                  valueFrom:
                    secretKeyRef:
                      name: dr-notifications
                      key: slack-webhook
                - name: TEST_MODE
                  value: "full"
          restartPolicy: Never

最佳实践

✅ DO

  • 定期安排DR测试
  • 提前记录程序
  • 在隔离环境中测试
  • 测量实际RTO/RPO
  • 涉及所有团队
  • 自动化验证
  • 记录发现
  • 根据结果更新程序

❌ DON’T

  • 跳过DR测试
  • 在工作时间测试
  • 对生产环境进行测试
  • 忽略测试失败
  • 忽视测试后分析
  • 忘记重新启用监控
  • 使用过时的备份流程
  • 仅每年测试一次

DR测试级别

  • 桌面演练:文件和讨论
  • 模拟:受控的部分故障转移
  • 完整DR:完整系统故障转移
  • 持续:持续的影子操作

关键指标

  • RTO:恢复时间目标
  • RPO:恢复点目标
  • MTPD:平均检测时间
  • MTTR:平均恢复时间

资源