云存储优化
概览
跨多个云服务提供商使用压缩、智能分层、数据分区和生命周期管理来优化云存储成本和性能。在保持可访问性和合规性要求的同时降低存储成本。
使用场景
- 降低存储成本
- 优化数据访问模式
- 实施分层存储策略
- 归档历史数据
- 提高数据检索性能
- 管理合规性要求
- 组织大型数据集
- 优化数据湖和数据仓库
实施示例
1. AWS S3存储优化
# Enable S3 Intelligent-Tiering: objects under data/ move to Archive Access
# after 90 days without access and Deep Archive Access after 180 days.
aws s3api put-bucket-intelligent-tiering-configuration \
--bucket my-bucket \
--id OptimizedStorage \
--intelligent-tiering-configuration '{
"Id": "OptimizedStorage",
"Filter": {"Prefix": "data/"},
"Status": "Enabled",
"Tierings": [
{
"Days": 90,
"AccessTier": "ARCHIVE_ACCESS"
},
{
"Days": 180,
"AccessTier": "DEEP_ARCHIVE_ACCESS"
}
]
}'
# List existing metrics configurations to review storage usage
aws s3api list-bucket-metrics-configurations --bucket my-bucket
# Enable bucket-wide CloudWatch request metrics.
# NOTE(review): the original comment said "enable S3 Select", but this command
# configures request metrics, not S3 Select — comment corrected.
aws s3api put-bucket-metrics-configuration \
--bucket my-bucket \
--id EntireBucket \
--metrics-configuration '{
"Id": "EntireBucket",
"Filter": {"Prefix": ""}
}'
# Bulk processing (e.g. tagging) with S3 Batch Operations.
# NOTE(review): as written this is incomplete — `create-job` also requires
# --role-arn and --priority, and --operation expects JSON such as
# '{"LambdaInvoke": {"FunctionArn": "arn:aws:lambda:..."}}' — verify before use.
aws s3control create-job \
--account-id ACCOUNT_ID \
--operation LambdaInvoke \
--manifest '{
"Spec": {"Format": "S3BatchOperations_CSV_20180820"},
"Location": "s3://my-bucket/manifest.csv"
}' \
--report '{
"Bucket": "s3://my-bucket/reports/",
"Prefix": "batch-operation-",
"Format": "Report_CSV_20180820",
"Enabled": true,
"ReportScope": "AllTasks"
}'
2. 数据压缩和分区策略
# Python数据优化
import boto3
import gzip
import json
from datetime import datetime
import pandas as pd
class StorageOptimizer:
    """Optimize S3 storage for one bucket: compression, Parquet partitioning,
    usage analysis, and lifecycle policies.

    The boto3 S3 client is created once in ``__init__`` and reused by all
    methods.
    """

    def __init__(self, bucket_name):
        """Bind the optimizer to *bucket_name*."""
        self.s3_client = boto3.client('s3')
        self.bucket = bucket_name

    def compress_and_upload(self, file_path, key):
        """Gzip-compress a local file and upload it to S3 as ``<key>.gz``.

        BUG FIX: the original wrapped the already-open plain file in
        ``gzip.open(f_in, 'rb')``, which tries to *decompress* the stream
        (and fails on non-gzip input). The raw bytes must be compressed
        instead.
        """
        with open(file_path, 'rb') as f_in:
            body = gzip.compress(f_in.read())
        self.s3_client.put_object(
            Bucket=self.bucket,
            Key=f'{key}.gz',
            Body=body,
            ContentEncoding='gzip',
            ServerSideEncryption='AES256'
        )

    def partition_csv_data(self, csv_path, partition_columns):
        """Partition a CSV by date and by ``partition_columns[0]``, uploading
        each partition as snappy-compressed Parquet under a Hive-style layout
        (``data/date=<d>/category=<v>/data.parquet``).

        Assumes the CSV has a 'date' column parseable by pandas — TODO confirm.
        """
        df = pd.read_csv(csv_path)
        df['date'] = pd.to_datetime(df['date'])
        for date, date_group in df.groupby(df['date'].dt.date):
            for partition_val, partition_group in date_group.groupby(partition_columns[0]):
                # Parquet is columnar and compresses far better than CSV.
                file_key = f"data/date={date}/category={partition_val}/data.parquet"
                local_path = f"/tmp/{partition_val}.parquet"
                partition_group.to_parquet(
                    local_path,
                    compression='snappy',
                    index=False
                )
                self.upload_parquet_file(local_path, file_key)

    def upload_parquet_file(self, local_path, s3_key):
        """Upload a local Parquet file, encrypted, into Intelligent-Tiering."""
        with open(local_path, 'rb') as data:
            self.s3_client.put_object(
                Bucket=self.bucket,
                Key=s3_key,
                Body=data.read(),
                ContentType='application/octet-stream',
                ServerSideEncryption='AES256',
                StorageClass='INTELLIGENT_TIERING'
            )

    def analyze_storage_patterns(self):
        """Scan all objects under ``data/`` and return summary statistics:
        total size, file count, counts per extension, and objects older than
        90 days (archive candidates).

        BUG FIX: the original issued one unpaginated ``list_objects_v2`` call,
        which returns at most 1000 keys; a paginator walks the full listing.
        """
        stats = {
            'total_size': 0,
            'file_count': 0,
            'by_extension': {},
            'old_files': []
        }
        paginator = self.s3_client.get_paginator('list_objects_v2')
        for page in paginator.paginate(Bucket=self.bucket, Prefix='data/'):
            for obj in page.get('Contents', []):
                size = obj['Size']
                key = obj['Key']
                modified = obj['LastModified']
                stats['total_size'] += size
                stats['file_count'] += 1
                ext = key.split('.')[-1]
                stats['by_extension'][ext] = stats['by_extension'].get(ext, 0) + 1
                # Flag objects untouched for more than 90 days.
                days_old = (datetime.now(modified.tzinfo) - modified).days
                if days_old > 90:
                    stats['old_files'].append({
                        'key': key,
                        'size': size,
                        'days_old': days_old
                    })
        return stats

    def implement_lifecycle_optimization(self):
        """Apply a comprehensive lifecycle policy to the bucket.

        BUG FIXES: ``put_bucket_lifecycle_configuration`` expects
        ``NoncurrentVersionTransitions`` as a *list*; the original used the
        singular key from the legacy API, which fails boto3 parameter
        validation. The cleanup rule also lacked a ``Filter``, which S3
        rejects as MalformedXML — an empty filter (apply to all keys) added.
        """
        lifecycle_config = {
            'Rules': [
                # Recent data: noncurrent versions to Standard-IA after 30 days.
                {
                    'Id': 'KeepRecentStandard',
                    'Status': 'Enabled',
                    'Filter': {'Prefix': 'data/'},
                    'NoncurrentVersionTransitions': [
                        {
                            'NoncurrentDays': 30,
                            'StorageClass': 'STANDARD_IA'
                        }
                    ]
                },
                # Archive prefix: step down IA -> Glacier -> Deep Archive.
                {
                    'Id': 'ArchiveOldData',
                    'Status': 'Enabled',
                    'Filter': {'Prefix': 'archive/'},
                    'Transitions': [
                        {'Days': 30, 'StorageClass': 'STANDARD_IA'},
                        {'Days': 90, 'StorageClass': 'GLACIER'},
                        {'Days': 180, 'StorageClass': 'DEEP_ARCHIVE'}
                    ],
                    'Expiration': {
                        'Days': 2555  # ~7 years retention
                    }
                },
                # Abort incomplete multipart uploads after a week.
                {
                    'Id': 'CleanupIncompleteUploads',
                    'Status': 'Enabled',
                    'Filter': {},
                    'AbortIncompleteMultipartUpload': {
                        'DaysAfterInitiation': 7
                    }
                }
            ]
        }
        self.s3_client.put_bucket_lifecycle_configuration(
            Bucket=self.bucket,
            LifecycleConfiguration=lifecycle_config
        )
3. Terraform多云存储配置
# storage-optimization.tf
# AWS S3 with Intelligent-Tiering: objects move to Archive Access after
# 90 days without access, Deep Archive Access after 180 days.
resource "aws_s3_bucket" "data_lake" {
bucket = "my-data-lake-${data.aws_caller_identity.current.account_id}"
}
resource "aws_s3_bucket_intelligent_tiering_configuration" "archive" {
bucket = aws_s3_bucket.data_lake.id
name = "archive-tiering"
tiering {
access_tier = "ARCHIVE_ACCESS"
days = 90
}
tiering {
access_tier = "DEEP_ARCHIVE_ACCESS"
days = 180
}
status = "Enabled"
}
# Azure Blob Storage with a lifecycle management policy:
# base blobs cool after 30 days, archive after 90, delete after ~7 years.
resource "azurerm_storage_account" "data_lake" {
name = "mydatalake"
resource_group_name = azurerm_resource_group.main.name
location = azurerm_resource_group.main.location
account_tier = "Standard"
account_replication_type = "LRS"
access_tier = "Hot"
}
resource "azurerm_storage_management_policy" "data_lifecycle" {
storage_account_id = azurerm_storage_account.data_lake.id
rule {
name = "ArchiveOldBlobs"
enabled = true
filters {
prefix_match = ["data/"]
blob_index_match {
name = "age-days"
operation = "=="
value = "90"
}
}
actions {
base_blob {
tier_to_cool_after_days_since_modification_greater_than = 30
tier_to_archive_after_days_since_modification_greater_than = 90
delete_after_days_since_modification_greater_than = 2555
}
snapshot {
delete_after_days_since_creation_greater_than = 90
}
version {
tier_to_cool_after_days_since_creation_greater_than = 30
tier_to_archive_after_days_since_creation_greater_than = 90
delete_after_days_since_creation_greater_than = 365
}
}
}
}
# GCP Cloud Storage with lifecycle rules: Nearline at 30 days, Coldline at
# 90 days, delete at ~7 years; non-live objects capped at 3 newer versions.
resource "google_storage_bucket" "data_lake" {
name = "my-data-lake-${data.google_client_config.current.project}"
location = "US"
uniform_bucket_level_access = true
storage_class = "STANDARD"
lifecycle_rule {
action {
type = "SetStorageClass"
storage_class = "NEARLINE"
}
condition {
age = 30
}
}
lifecycle_rule {
action {
type = "SetStorageClass"
storage_class = "COLDLINE"
}
condition {
age = 90
}
}
lifecycle_rule {
action {
type = "Delete"
}
condition {
age = 2555
}
}
lifecycle_rule {
action {
type = "Delete"
}
condition {
num_newer_versions = 3
is_live = false
}
}
}
# Provider data sources used in the bucket names above.
data "aws_caller_identity" "current" {}
data "google_client_config" "current" {}
4. 数据湖分区策略
# 为数据湖优化分区
def create_partitioned_data_lake(source_file, bucket, format='parquet'):
    """Rewrite a Parquet file as a Hive-partitioned layout in S3.

    Partitions by year/month/day of the 'date' column plus 'region', writing
    one snappy-compressed Parquet file per partition under
    ``s3://<bucket>/data/year=YYYY/month=MM/day=DD/region=<r>/data.parquet``.

    NOTE(review): ``format`` shadows the builtin and is currently unused —
    only Parquet output is produced; kept for interface compatibility.
    Assumes df['date'] is datetime-typed in the source file — TODO confirm.

    :param source_file: path/URI of the source Parquet file
    :param bucket: destination S3 bucket name
    :param format: output format placeholder (only 'parquet' supported)
    """
    import pyarrow.parquet as pq
    import pyarrow as pa
    source_table = pq.read_table(source_file)
    df = source_table.to_pandas()
    # One multi-key groupby replaces the original quadruple-nested groupby,
    # which re-grouped the same rows once per partition level.
    grouped = df.groupby([
        df['date'].dt.year,
        df['date'].dt.month,
        df['date'].dt.day,
        df['region'],
    ])
    for (year, month, day, region), part in grouped:
        # Hive-style partition path (zero-padded month/day for lexical sort).
        path = f"s3://{bucket}/data/year={year}/month={month:02d}/day={day:02d}/region={region}"
        # Snappy-compressed Parquet with dictionary encoding.
        pq.write_table(
            pa.Table.from_pandas(part),
            f"{path}/data.parquet",
            compression='snappy',
            use_dictionary=True
        )
最佳实践
✅ 执行
- 使用Parquet或ORC格式进行分析
- 实施分层存储策略
- 按时间和可查询维度分区数据
- 为关键数据启用版本控制
- 使用压缩(gzip、snappy、brotli)
- 定期监控存储成本
- 实施数据生命周期策略
- 归档不常访问的数据
❌ 不要
- 存储未压缩的数据
- 长期保留原始日志
- 忽略存储优化
- 仅使用热存储层
- 存储重复数据
- 忘记删除旧的测试数据
成本优化提示
- 使用智能分层以应对变化的访问模式
- 归档超过90天的数据
- 使用云服务提供商的等效冷存储
- 删除不完整的多部分上传
- 使用云工具监控使用情况
- 在大上传前估算成本