Backup and Disaster Recovery Skill (backup-disaster-recovery)

A skill covering how to implement backup strategies, disaster recovery plans, and data restoration procedures, spanning data protection, business continuity, and rapid recovery from infrastructure failures. Keywords: backup strategy, disaster recovery, data protection, business continuity, infrastructure failure recovery.



Backup and Disaster Recovery

Overview

Design and implement comprehensive backup and disaster recovery strategies to ensure data protection, business continuity, and rapid recovery from infrastructure failures.

When to Use

  • Data protection and compliance
  • Business continuity planning
  • Disaster recovery planning
  • Point-in-time recovery (see the WAL-archiving sketch after this list)
  • Cross-region failover
  • Data migration
  • Compliance and audit requirements
  • Recovery time objective (RTO) optimization
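
Point-in-time recovery needs more than the nightly dumps shown below: the database must archive its WAL continuously so it can be replayed to an arbitrary moment between full backups. A minimal sketch of the PostgreSQL side, assuming the same s3://my-backups bucket used throughout this skill:

# postgresql.conf - continuous WAL archiving for PITR (illustrative)
wal_level = replica                                      # enough WAL detail for archiving
archive_mode = on
archive_command = 'aws s3 cp %p s3://my-backups/wal/%f'  # %p = WAL path, %f = file name

Recovery then means restoring the most recent full backup and replaying archived WAL up to a recovery_target_time.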

Implementation Examples

1. Database Backup Configuration

# postgres-backup-cronjob.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: backup-script
  namespace: databases
data:
  backup.sh: |
    #!/bin/bash
    set -euo pipefail

    BACKUP_DIR="/backups/postgresql"
    RETENTION_DAYS=30
    DB_HOST="${POSTGRES_HOST}"
    DB_PORT="${POSTGRES_PORT:-5432}"
    DB_USER="${POSTGRES_USER}"
    DB_PASSWORD="${POSTGRES_PASSWORD}"

    export PGPASSWORD="$DB_PASSWORD"

    # Create the backup directory
    mkdir -p "$BACKUP_DIR"

    # Full backup
    BACKUP_FILE="$BACKUP_DIR/full-$(date +%Y%m%d-%H%M%S).sql"
    echo "Starting backup to $BACKUP_FILE"
    pg_dump -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" -v \
      --format=plain --no-owner --no-privileges > "$BACKUP_FILE"

    # Compress the backup
    gzip "$BACKUP_FILE"
    echo "Backup compressed: ${BACKUP_FILE}.gz"

    # Upload to S3
    aws s3 cp "${BACKUP_FILE}.gz" \
      "s3://my-backups/postgres/$(date +%Y/%m/%d)/"

    # Prune local copies older than 7 days (S3 holds the long-term retention)
    find "$BACKUP_DIR" -type f -mtime +7 -delete

    # Verify the backup by restoring it into a scratch database
    # (pg_restore cannot read plain-format dumps, so pipe through psql)
    dropdb --if-exists -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" test_restore
    createdb -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" test_restore
    if gunzip -c "${BACKUP_FILE}.gz" | \
       psql -v ON_ERROR_STOP=1 -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" \
         -d test_restore --single-transaction -q >/dev/null; then
      echo "Backup verification successful"
    fi
    dropdb -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" test_restore

    echo "Backup complete: ${BACKUP_FILE}.gz"

---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
  namespace: databases
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: backup-sa
          containers:
            - name: backup
              image: postgres:15-alpine
              env:
                - name: POSTGRES_HOST
                  valueFrom:
                    secretKeyRef:
                      name: postgres-credentials
                      key: host
                - name: POSTGRES_USER
                  valueFrom:
                    secretKeyRef:
                      name: postgres-credentials
                      key: username
                - name: POSTGRES_PASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: postgres-credentials
                      key: password
                - name: AWS_ACCESS_KEY_ID
                  valueFrom:
                    secretKeyRef:
                      name: aws-credentials
                      key: access-key
                - name: AWS_SECRET_ACCESS_KEY
                  valueFrom:
                    secretKeyRef:
                      name: aws-credentials
                      key: secret-key
              volumeMounts:
                - name: backup-script
                  mountPath: /backup
                - name: backup-storage
                  mountPath: /backups
              command:
                - sh
                - -c
                - apk add --no-cache aws-cli && bash /backup/backup.sh
          volumes:
            - name: backup-script
              configMap:
                name: backup-script
                defaultMode: 0755
            - name: backup-storage
              emptyDir:
                sizeLimit: 100Gi
          restartPolicy: OnFailure
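
With successfulJobsHistoryLimit and failedJobsHistoryLimit both set to 3, the last few runs stay inspectable. A quick manual check that recent backup Jobs actually succeeded (in practice this signal should feed an alert, per the best practices below):

kubectl get jobs -n databases --sort-by=.metadata.creationTimestamp
kubectl logs -n databases job/postgres-backup-29012345   # illustrative Job name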

2. Disaster Recovery Plan Template

# disaster-recovery-plan.yaml
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: dr-procedures
  namespace: operations
data:
  dr-runbook.md: |
    # Disaster Recovery Runbook

    ## RTO and RPO Targets
    - RTO (Recovery Time Objective): 4 hours
    - RPO (Recovery Point Objective): 1 hour

    ## Pre-Disaster Checklist
    - [ ] Verify backups are current
    - [ ] Test the backup restore process
    - [ ] Verify DR-site resources are provisioned
    - [ ] Confirm failover DNS is configured

    ## Primary Region Failure

    ### Detection (0-15 minutes)
    - Alerting detects the primary region is down
    - An incident commander is declared
    - Open a war room in Slack #incidents

    ### Initial Actions (15-30 minutes)
    - Verify the primary region is actually down
    - Check standby systems in the secondary region
    - Verify the timestamp of the latest backup

    ### Failover Procedure (30 minutes to 2 hours)
    1. Verify backup integrity
    2. Restore databases from the latest backup
    3. Update application configuration
    4. Execute DNS failover to the secondary region
    5. Verify application health

    ### Recovery Steps
    1. Restore from backup: `restore-backup.sh --backup-id=latest`
    2. Update DNS: `aws route53 change-resource-record-sets --cli-input-json file://failover.json`
    3. Verify: `curl https://myapp.com/health`
    4. Run smoke tests
    5. Monitor error rates and performance

    ## Post-Disaster
    - Document the timeline and root-cause analysis
    - Update the runbook
    - Schedule a postmortem
    - Re-test backups

---
apiVersion: v1
kind: Secret
metadata:
  name: dr-credentials
  namespace: operations
type: Opaque
stringData:
  backup_aws_access_key: "AKIAIOSFODNN7EXAMPLE"
  backup_aws_secret_key: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
  dr_site_password: "secure-password-here"
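
The key values above are AWS's documented example credentials, left in as placeholders; committing real secrets to a manifest in version control defeats the purpose of the plan. One option, sketched here with assumed environment variable names, is to create the Secret out-of-band instead:

kubectl create secret generic dr-credentials \
  --namespace operations \
  --from-literal=backup_aws_access_key="$BACKUP_AWS_ACCESS_KEY" \
  --from-literal=backup_aws_secret_key="$BACKUP_AWS_SECRET_KEY" \
  --from-literal=dr_site_password="$DR_SITE_PASSWORD"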

3. Backup and Restore Script

#!/bin/bash
# backup-restore.sh - complete backup and restore tooling

set -euo pipefail

BACKUP_BUCKET="s3://my-backups"
BACKUP_RETENTION_DAYS=30
TIMESTAMP=$(date +%Y%m%d_%H%M%S)

# Colors
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m'

log_info() {
    echo -e "${GREEN}[INFO]${NC} $1"
}

log_error() {
    echo -e "${RED}[ERROR]${NC} $1"
}

log_warn() {
    echo -e "${YELLOW}[WARN]${NC} $1"
}

# Backup function
backup_all() {
    local environment=$1
    log_info "Starting backup for $environment environment"

    # Back up databases
    log_info "Backing up databases..."
    for db in myapp_db analytics_db; do
        local backup_file="$BACKUP_BUCKET/$environment/databases/${db}-${TIMESTAMP}.sql.gz"
        pg_dump "$db" | gzip | aws s3 cp - "$backup_file"
        log_info "Backed up $db to $backup_file"
    done

    # Back up Kubernetes resources
    log_info "Backing up Kubernetes resources..."
    kubectl get all,configmap,secret,ingress,pvc -A -o yaml | \
        gzip | aws s3 cp - "$BACKUP_BUCKET/$environment/kubernetes-${TIMESTAMP}.yaml.gz"
    log_info "Kubernetes resources backed up"

    # Back up persistent volumes
    log_info "Backing up persistent volumes..."
    # Assumes a helper pod ("backup-pod") that mounts each PVC at /data;
    # exec without -it, since allocating a TTY would corrupt the piped tar stream
    for pvc in $(kubectl get pvc -A -o name); do
        local pvc_name
        pvc_name=$(echo "$pvc" | cut -d'/' -f2)
        log_info "Backing up PVC: $pvc_name"
        kubectl exec -n default backup-pod -- \
            tar czf - /data | aws s3 cp - "$BACKUP_BUCKET/$environment/volumes/${pvc_name}-${TIMESTAMP}.tar.gz"
    done

    log_info "All backups completed successfully"
}

# Restore function
restore_all() {
    local environment=$1
    local backup_date=$2

    log_warn "Restoring from backup date: $backup_date"
    read -p "Are you sure? (yes/no): " confirm
    if [ "$confirm" != "yes" ]; then
        log_error "Restore cancelled"
        exit 1
    fi

    # Restore databases
    log_info "Restoring databases..."
    for db in myapp_db analytics_db; do
        local backup_file="$BACKUP_BUCKET/$environment/databases/${db}-${backup_date}.sql.gz"
        log_info "Restoring $db from $backup_file"
        aws s3 cp "$backup_file" - | gunzip | psql "$db"
    done

    # Restore Kubernetes resources
    log_info "Restoring Kubernetes resources..."
    local k8s_backup="$BACKUP_BUCKET/$environment/kubernetes-${backup_date}.yaml.gz"
    aws s3 cp "$k8s_backup" - | gunzip | kubectl apply -f -

    log_info "Restore completed successfully"
}

# Test restore
test_restore() {
    local environment=$1

    log_info "Testing restore procedure..."

    # Find the latest backup
    local latest_backup=$(aws s3 ls "$BACKUP_BUCKET/$environment/databases/" | \
        sort | tail -n 1 | awk '{print $4}')

    if [ -z "$latest_backup" ]; then
        log_error "No backups found"
        exit 1
    fi

    log_info "Testing restore from: $latest_backup"

    # Create a scratch database; capture the name once, since calling
    # date twice would create and restore into two different databases
    local test_db="test_restore_$(date +%s)"
    psql -c "CREATE DATABASE $test_db;"

    # Download and restore
    aws s3 cp "$BACKUP_BUCKET/$environment/databases/$latest_backup" - | \
        gunzip | psql "$test_db"

    # Drop the scratch database after a successful restore
    psql -c "DROP DATABASE $test_db;"

    log_info "Test restore successful"
}

# List backups
list_backups() {
    local environment=$1
    log_info "Available backups for $environment:"
    aws s3 ls "$BACKUP_BUCKET/$environment/" --recursive | grep -E "\.sql\.gz|\.yaml\.gz|\.tar\.gz"
}

# Clean up old backups. find(1) cannot walk an s3:// URI, so list the
# objects and delete those whose LastModified date is past the retention window.
cleanup_old_backups() {
    local environment=$1
    log_info "Cleaning up backups older than $BACKUP_RETENTION_DAYS days"

    local cutoff
    cutoff=$(date -d "-$BACKUP_RETENTION_DAYS days" +%Y-%m-%d)  # GNU date
    aws s3 ls "$BACKUP_BUCKET/$environment/" --recursive | \
        awk -v cutoff="$cutoff" '$1 < cutoff {print $4}' | \
        while read -r key; do aws s3 rm "$BACKUP_BUCKET/$key"; done
    log_info "Cleanup completed"
}

# Main entry point
main() {
    case "${1:-}" in
        backup)
            backup_all "${2:-production}"
            ;;
        restore)
            restore_all "${2:-production}" "${3:-}"
            ;;
        test)
            test_restore "${2:-production}"
            ;;
        list)
            list_backups "${2:-production}"
            ;;
        cleanup)
            cleanup_old_backups "${2:-production}"
            ;;
        *)
            echo "Usage: $0 {backup|restore|test|list|cleanup} [environment] [backup-date]"
            exit 1
            ;;
    esac
}

main "$@"

4. Cross-Region Failover

# route53-failover.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: failover-config
  namespace: operations
data:
  failover.sh: |
    #!/bin/bash
    set -euo pipefail

    PRIMARY_REGION="us-east-1"
    SECONDARY_REGION="us-west-2"
    DOMAIN="myapp.com"
    HOSTED_ZONE_ID="Z1234567890ABC"

    echo "Initiating failover to $SECONDARY_REGION"

    # Get the primary region's load balancer endpoint
    PRIMARY_ENDPOINT=$(aws elbv2 describe-load-balancers \
      --region "$PRIMARY_REGION" \
      --query 'LoadBalancers[0].DNSName' \
      --output text)

    # Get the secondary region's load balancer endpoint
    SECONDARY_ENDPOINT=$(aws elbv2 describe-load-balancers \
      --region "$SECONDARY_REGION" \
      --query 'LoadBalancers[0].DNSName' \
      --output text)

    # Update the Route 53 failover record sets. Note: alias records must not
    # carry a TTL, and AliasTarget.HostedZoneId is the per-region ELB zone ID
    # (not this hosted zone), so it differs between primary and secondary.
    aws route53 change-resource-record-sets \
      --hosted-zone-id "$HOSTED_ZONE_ID" \
      --change-batch '{
        "Changes": [
          {
            "Action": "UPSERT",
            "ResourceRecordSet": {
              "Name": "'$DOMAIN'",
              "Type": "A",
              "TTL": 60,
              "SetIdentifier": "Primary",
              "Failover": "PRIMARY",
              "AliasTarget": {
                "HostedZoneId": "Z35SXDOTRQ7X7K",
                "DNSName": "'$PRIMARY_ENDPOINT'",
                "EvaluateTargetHealth": true
              }
            }
          },
          {
            "Action": "UPSERT",
            "ResourceRecordSet": {
              "Name": "'$DOMAIN'",
              "Type": "A",
              "TTL": 60,
              "SetIdentifier": "Secondary",
              "Failover": "SECONDARY",
              "AliasTarget": {
                "HostedZoneId": "Z35SXDOTRQ7X7K",
                "DNSName": "'$SECONDARY_ENDPOINT'",
                "EvaluateTargetHealth": false
              }
            }
          }
        ]
      }'

    echo "Failover completed"

Best Practices

✅ DO

  • Test backups regularly
  • Use multiple backup locations
  • Automate backups
  • Document recovery procedures
  • Test failover procedures regularly
  • Monitor backup completion
  • Use immutable backups (see the hardening sketch after this list)
  • Encrypt backup data at rest and in transit
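
The last two DO items can be enforced at the bucket level. A hedged sketch reusing the my-backups bucket from the examples above: S3 Object Lock makes uploaded objects undeletable for the retention window, and default KMS encryption covers data at rest (the key alias is an assumption):

# Immutability: Object Lock must be enabled when the bucket is created
aws s3api create-bucket --bucket my-backups --region us-east-1 \
  --object-lock-enabled-for-bucket
aws s3api put-object-lock-configuration --bucket my-backups \
  --object-lock-configuration \
  '{"ObjectLockEnabled": "Enabled", "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 30}}}'

# Encryption at rest: default server-side encryption with a KMS key
aws s3api put-bucket-encryption --bucket my-backups \
  --server-side-encryption-configuration \
  '{"Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "aws:kms", "KMSMasterKeyID": "alias/backup-key"}}]}'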

❌ DON’T

  • Rely on a single backup location
  • Ignore backup failures
  • Store backups alongside production data
  • Skip testing recovery procedures
  • Compress backups so aggressively that restores blow past recovery-speed targets
  • Forget to verify backup integrity
  • Store encryption keys together with the backups
  • Assume backups work without ever verifying them
