Backup and Disaster Recovery
Overview
Design and implement a comprehensive backup and disaster recovery strategy to ensure data protection, business continuity, and rapid recovery from infrastructure failures.
When to Use
- Data protection and compliance
- Business continuity planning
- Disaster recovery planning
- Point-in-time recovery
- Cross-region failover
- Data migration
- Compliance and audit requirements
- Recovery Time Objective (RTO) optimization
Implementation Examples
1. Database Backup Configuration
# postgres-backup-cronjob.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: backup-script
  namespace: databases
data:
  backup.sh: |
    #!/bin/bash
    set -euo pipefail

    BACKUP_DIR="/backups/postgresql"
    LOCAL_RETENTION_DAYS=7   # local copies only; S3 holds the long-term history
    DB_HOST="${POSTGRES_HOST}"
    DB_PORT="${POSTGRES_PORT:-5432}"
    DB_USER="${POSTGRES_USER}"
    DB_PASSWORD="${POSTGRES_PASSWORD}"
    export PGPASSWORD="$DB_PASSWORD"

    # Create the backup directory
    mkdir -p "$BACKUP_DIR"

    # Full backup
    BACKUP_FILE="$BACKUP_DIR/full-$(date +%Y%m%d-%H%M%S).sql"
    echo "Starting backup to $BACKUP_FILE"
    pg_dump -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" -v \
      --format=plain --no-owner --no-privileges > "$BACKUP_FILE"

    # Compress the backup
    gzip "$BACKUP_FILE"
    echo "Backup compressed: ${BACKUP_FILE}.gz"

    # Upload to S3
    aws s3 cp "${BACKUP_FILE}.gz" \
      "s3://my-backups/postgres/$(date +%Y/%m/%d)/"

    # Remove old local backups
    find "$BACKUP_DIR" -type f -mtime +"$LOCAL_RETENTION_DAYS" -delete

    # Verify the backup by restoring it into a scratch database
    # (plain-format dumps are restored with psql, not pg_restore)
    createdb -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" test_restore
    if gunzip -c "${BACKUP_FILE}.gz" | \
        psql -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" -d test_restore \
          --single-transaction >/dev/null; then
      echo "Backup verification successful"
    fi
    dropdb -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" test_restore

    echo "Backup complete: ${BACKUP_FILE}.gz"
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
  namespace: databases
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: backup-sa
          containers:
          - name: backup
            image: postgres:15-alpine
            env:
            - name: POSTGRES_HOST
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials
                  key: host
            - name: POSTGRES_USER
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials
                  key: username
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials
                  key: password
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: aws-credentials
                  key: access-key
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: aws-credentials
                  key: secret-key
            volumeMounts:
            - name: backup-script
              mountPath: /backup
            - name: backup-storage
              mountPath: /backups
            command:
            - sh
            - -c
            - apk add --no-cache aws-cli && bash /backup/backup.sh
          volumes:
          - name: backup-script
            configMap:
              name: backup-script
              defaultMode: 0755
          - name: backup-storage
            emptyDir:
              sizeLimit: 100Gi
          restartPolicy: OnFailure
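The backup script above uploads each compressed dump under a date-partitioned prefix (`postgres/YYYY/MM/DD/`). Restore tooling needs to reconstruct that key from a filename alone; a small pure-bash sketch of the mapping, assuming the bucket and prefix from the example ConfigMap:

```shell
#!/bin/bash
# Map a backup filename (full-YYYYmmdd-HHMMSS.sql.gz, as produced by
# backup.sh above) to the S3 key the script would have uploaded it to.
# Bucket and prefix are the ones assumed in the example ConfigMap.
s3_key_for_backup() {
  local file=$1
  local stamp=${file#full-}   # strip the "full-" prefix
  stamp=${stamp%%-*}          # keep the YYYYmmdd portion
  echo "s3://my-backups/postgres/${stamp:0:4}/${stamp:4:2}/${stamp:6:2}/$file"
}
```

For example, `s3_key_for_backup full-20240131-020000.sql.gz` prints `s3://my-backups/postgres/2024/01/31/full-20240131-020000.sql.gz`.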
2. Disaster Recovery Plan Template
# disaster-recovery-plan.yaml
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: dr-procedures
  namespace: operations
data:
  dr-runbook.md: |
    # Disaster Recovery Runbook

    ## RTO and RPO Targets
    - RTO (Recovery Time Objective): 4 hours
    - RPO (Recovery Point Objective): 1 hour

    ## Pre-Disaster Checklist
    - [ ] Verify backups are current
    - [ ] Test the backup restore process
    - [ ] Verify DR-site resources are provisioned
    - [ ] Confirm failover DNS is configured

    ## Primary Region Failure

    ### Detection (0-15 minutes)
    - Alerting detects the primary region is down
    - An incident commander is declared
    - Open a war room in Slack #incidents

    ### Initial Actions (15-30 minutes)
    - Confirm the primary region is actually down
    - Check standby systems in the secondary region
    - Verify the timestamp of the latest backup

    ### Failover Procedure (30 minutes - 2 hours)
    1. Verify backup integrity
    2. Restore the database from the latest backup
    3. Update application configuration
    4. Fail DNS over to the secondary region
    5. Verify application health

    ### Recovery Steps
    1. Restore from backup: `restore-backup.sh --backup-id=latest`
    2. Update DNS: `aws route53 change-resource-record-sets --cli-input-json file://failover.json`
    3. Verify: `curl https://myapp.com/health`
    4. Run smoke tests
    5. Monitor error rates and performance

    ## Post-Disaster
    - Document the timeline and root-cause analysis (RCA)
    - Update the runbook
    - Schedule a postmortem review
    - Re-test backups
---
apiVersion: v1
kind: Secret
metadata:
  name: dr-credentials
  namespace: operations
type: Opaque
stringData:
  # Placeholder values only -- inject real credentials from your secret store
  backup_aws_access_key: "AKIAIOSFODNN7EXAMPLE"
  backup_aws_secret_key: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
  dr_site_password: "secure-password-here"
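The 1-hour RPO above is only meaningful if something continuously checks backup freshness. A minimal bash sketch of that check, taking epoch timestamps as arguments so it stays independent of where the "latest backup" time comes from (the threshold matches the runbook's stated target):

```shell
#!/bin/bash
# Report whether the runbook's 1-hour RPO target is currently met, given the
# epoch time of the newest backup and the current epoch time.
RPO_SECONDS=3600   # 1 hour, per the runbook target

check_rpo() {
  local last_backup_epoch=$1 now_epoch=$2
  local age=$(( now_epoch - last_backup_epoch ))
  if (( age <= RPO_SECONDS )); then
    echo "RPO met (backup age: ${age}s)"
  else
    echo "RPO VIOLATED (backup age: ${age}s)"
    return 1
  fi
}
```

Wired into monitoring (e.g. called with the modification time of the newest S3 object and `date +%s`), a non-zero exit becomes the alert for the detection phase of the runbook.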
3. Backup and Restore Script
#!/bin/bash
# backup-restore.sh - all-in-one backup and restore tool
set -euo pipefail

BACKUP_BUCKET="s3://my-backups"
BACKUP_RETENTION_DAYS=30
TIMESTAMP=$(date +%Y%m%d_%H%M%S)

# Colors
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m'

log_info() {
  echo -e "${GREEN}[INFO]${NC} $1"
}

log_error() {
  echo -e "${RED}[ERROR]${NC} $1"
}

log_warn() {
  echo -e "${YELLOW}[WARN]${NC} $1"
}

# Back everything up
backup_all() {
  local environment=$1
  log_info "Starting backup for $environment environment"

  # Back up databases
  log_info "Backing up databases..."
  for db in myapp_db analytics_db; do
    local backup_file="$BACKUP_BUCKET/$environment/databases/${db}-${TIMESTAMP}.sql.gz"
    pg_dump "$db" | gzip | aws s3 cp - "$backup_file"
    log_info "Backed up $db to $backup_file"
  done

  # Back up Kubernetes resources
  log_info "Backing up Kubernetes resources..."
  kubectl get all,configmap,secret,ingress,pvc -A -o yaml | \
    gzip | aws s3 cp - "$BACKUP_BUCKET/$environment/kubernetes-${TIMESTAMP}.yaml.gz"
  log_info "Kubernetes resources backed up"

  # Back up volumes
  log_info "Backing up persistent volumes..."
  for pvc in $(kubectl get pvc -A -o name); do
    local pvc_name=$(echo "$pvc" | cut -d'/' -f2)
    log_info "Backing up PVC: $pvc_name"
    # Assumes a helper pod named backup-pod that mounts the PVC at /data;
    # -t/-i are omitted because this runs non-interactively
    kubectl exec -n default backup-pod -- \
      tar czf - /data | aws s3 cp - "$BACKUP_BUCKET/$environment/volumes/${pvc_name}-${TIMESTAMP}.tar.gz"
  done
  log_info "All backups completed successfully"
}

# Restore everything
restore_all() {
  local environment=$1
  local backup_date=$2
  log_warn "Restoring from backup date: $backup_date"
  read -r -p "Are you sure? (yes/no): " confirm
  if [ "$confirm" != "yes" ]; then
    log_error "Restore cancelled"
    exit 1
  fi

  # Restore databases
  log_info "Restoring databases..."
  for db in myapp_db analytics_db; do
    local backup_file="$BACKUP_BUCKET/$environment/databases/${db}-${backup_date}.sql.gz"
    log_info "Restoring $db from $backup_file"
    aws s3 cp "$backup_file" - | gunzip | psql "$db"
  done

  # Restore Kubernetes resources
  log_info "Restoring Kubernetes resources..."
  local k8s_backup="$BACKUP_BUCKET/$environment/kubernetes-${backup_date}.yaml.gz"
  aws s3 cp "$k8s_backup" - | gunzip | kubectl apply -f -
  log_info "Restore completed successfully"
}

# Test the restore path
test_restore() {
  local environment=$1
  log_info "Testing restore procedure..."

  # Find the most recent backup
  local latest_backup
  latest_backup=$(aws s3 ls "$BACKUP_BUCKET/$environment/databases/" | \
    sort | tail -n 1 | awk '{print $4}')
  if [ -z "$latest_backup" ]; then
    log_error "No backups found"
    exit 1
  fi
  log_info "Testing restore from: $latest_backup"

  # Create a scratch database; the name is captured once so the create,
  # restore, and drop steps all refer to the same database
  local test_db="test_restore_$(date +%s)"
  psql -c "CREATE DATABASE $test_db;"

  # Download and restore
  aws s3 cp "$BACKUP_BUCKET/$environment/databases/$latest_backup" - | \
    gunzip | psql "$test_db"

  # Clean up the scratch database
  psql -c "DROP DATABASE $test_db;"
  log_info "Test restore successful"
}

# List backups
list_backups() {
  local environment=$1
  log_info "Available backups for $environment:"
  aws s3 ls "$BACKUP_BUCKET/$environment/" --recursive | grep -E "\.sql\.gz|\.yaml\.gz|\.tar\.gz"
}

# Remove old backups. find(1) cannot operate on S3 URLs, so list the bucket
# and delete objects whose listing date falls before the retention cutoff.
cleanup_old_backups() {
  local environment=$1
  log_info "Cleaning up backups older than $BACKUP_RETENTION_DAYS days"
  local cutoff
  cutoff=$(date -d "-${BACKUP_RETENTION_DAYS} days" +%Y-%m-%d)  # GNU date
  aws s3 ls "$BACKUP_BUCKET/$environment/" --recursive | \
  while read -r obj_date obj_time obj_size obj_key; do
    if [[ "$obj_date" < "$cutoff" ]]; then
      aws s3 rm "$BACKUP_BUCKET/$obj_key"
    fi
  done
  log_info "Cleanup completed"
}

# Entry point
main() {
  case "${1:-}" in
    backup)
      backup_all "${2:-production}"
      ;;
    restore)
      restore_all "${2:-production}" "${3:-}"
      ;;
    test)
      test_restore "${2:-production}"
      ;;
    list)
      list_backups "${2:-production}"
      ;;
    cleanup)
      cleanup_old_backups "${2:-production}"
      ;;
    *)
      echo "Usage: $0 {backup|restore|test|list|cleanup} [environment] [backup-date]"
      exit 1
      ;;
  esac
}

main "$@"
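The cleanup function compares S3 object dates against a cutoff as plain strings, which is safe because ISO `YYYY-MM-DD` dates sort lexicographically in chronological order. The predicate, factored out so it can be checked without touching S3:

```shell
#!/bin/bash
# True (exit 0) when object_date falls strictly before cutoff_date.
# Both arguments are ISO YYYY-MM-DD strings, so a plain string comparison
# orders them chronologically.
is_older_than() {
  local object_date=$1 cutoff_date=$2
  [[ "$object_date" < "$cutoff_date" ]]
}
```

The cutoff itself comes from GNU `date` (`date -d "-30 days" +%Y-%m-%d`); on BSD/macOS the equivalent is `date -v-30d +%Y-%m-%d`.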
4. Cross-Region Failover
# route53-failover.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: failover-config
  namespace: operations
data:
  failover.sh: |
    #!/bin/bash
    set -euo pipefail

    PRIMARY_REGION="us-east-1"
    SECONDARY_REGION="us-west-2"
    DOMAIN="myapp.com"
    HOSTED_ZONE_ID="Z1234567890ABC"

    echo "Initiating failover to $SECONDARY_REGION"

    # Look up the primary endpoint
    PRIMARY_ENDPOINT=$(aws elbv2 describe-load-balancers \
      --region "$PRIMARY_REGION" \
      --query 'LoadBalancers[0].DNSName' \
      --output text)

    # Look up the secondary endpoint
    SECONDARY_ENDPOINT=$(aws elbv2 describe-load-balancers \
      --region "$SECONDARY_REGION" \
      --query 'LoadBalancers[0].DNSName' \
      --output text)

    # Upsert the Route 53 failover record pair. Alias records carry no TTL,
    # and each AliasTarget HostedZoneId must be the ELB hosted zone ID for
    # that load balancer's own region (shown here with the same example ID
    # for both; substitute the correct value per region).
    aws route53 change-resource-record-sets \
      --hosted-zone-id "$HOSTED_ZONE_ID" \
      --change-batch '{
        "Changes": [
          {
            "Action": "UPSERT",
            "ResourceRecordSet": {
              "Name": "'$DOMAIN'",
              "Type": "A",
              "SetIdentifier": "Primary",
              "Failover": "PRIMARY",
              "AliasTarget": {
                "HostedZoneId": "Z35SXDOTRQ7X7K",
                "DNSName": "'$PRIMARY_ENDPOINT'",
                "EvaluateTargetHealth": true
              }
            }
          },
          {
            "Action": "UPSERT",
            "ResourceRecordSet": {
              "Name": "'$DOMAIN'",
              "Type": "A",
              "SetIdentifier": "Secondary",
              "Failover": "SECONDARY",
              "AliasTarget": {
                "HostedZoneId": "Z35SXDOTRQ7X7K",
                "DNSName": "'$SECONDARY_ENDPOINT'",
                "EvaluateTargetHealth": false
              }
            }
          }
        ]
      }'
    echo "Failover completed"
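Route 53 failover routing reduces to: answer with the primary endpoint while its health check passes, otherwise the secondary. A pure-bash sketch of that decision (the endpoint names are hypothetical), useful for reasoning about failover behavior offline:

```shell
#!/bin/bash
# Mirror of Route 53's failover choice: primary while healthy, else secondary.
# primary_healthy ("true"/"false") stands in for the outcome that
# EvaluateTargetHealth observes on the primary record.
select_endpoint() {
  local primary_healthy=$1 primary=$2 secondary=$3
  if [ "$primary_healthy" = "true" ]; then
    echo "$primary"
  else
    echo "$secondary"
  fi
}
```

This is also why `EvaluateTargetHealth` is `true` only on the primary record: the secondary is the answer of last resort and is served even if its own health is unknown.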
Best Practices
✅ DO
- Test backups regularly
- Use multiple backup locations
- Automate backups
- Document recovery procedures
- Test failover procedures regularly
- Monitor backup completion
- Use immutable backups
- Encrypt backup data at rest and in transit
❌ DON'T
- Rely on a single backup location
- Ignore backup failures
- Store backups alongside production data
- Skip testing restore procedures
- Over-compress backups at the expense of restore speed
- Forget to verify backup integrity
- Store encryption keys with the backups
- Assume backups just work
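One way to satisfy "encrypt backup data at rest and in transit" is client-side encryption before upload, so objects remain opaque even if the bucket is exposed. A sketch using `openssl enc`; the passphrase-file path is an assumption, and per the DON'T list it must live outside the backup bucket:

```shell
#!/bin/bash
# Encrypt/decrypt a backup file with a passphrase file: AES-256-CBC with
# PBKDF2 key derivation and a random salt. The passphrase file (keyfile)
# is kept out of band, never alongside the backups themselves.
encrypt_backup() {
  local in=$1 out=$2 keyfile=$3
  openssl enc -aes-256-cbc -pbkdf2 -salt -in "$in" -out "$out" -pass "file:$keyfile"
}

decrypt_backup() {
  local in=$1 out=$2 keyfile=$3
  openssl enc -d -aes-256-cbc -pbkdf2 -in "$in" -out "$out" -pass "file:$keyfile"
}
```

In the backup scripts above this would slot in between `gzip` and `aws s3 cp`; server-side options (S3 SSE-KMS, bucket policies enforcing TLS) complement rather than replace it.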