PyTorch Model Trainer (pytorch-trainer)

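PyTorch Model Trainer is a skill for training deep learning models. It provides custom training loops, gradient management (clipping and accumulation), GPU optimization, and mixed-precision training. It also covers learning rate scheduling, checkpoint management, multi-GPU distributed training, and early stopping, and it integrates with experiment tracking systems. It is intended for machine learning engineers and AI researchers working on model training, AutoML pipeline orchestration, and distributed training jobs. Keywords: PyTorch training, deep learning, model training, GPU optimization, gradient management, mixed precision, distributed training, AutoML.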

Category: Deep Learning. Updated 2/23/2026.

name: pytorch-trainer
description: PyTorch model training skill with custom training loops, gradient management, and GPU optimization.
allowed-tools:

  • Read
  • Write
  • Bash
  • Glob
  • Grep

pytorch-trainer

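Overview

A PyTorch model training skill covering custom training loops, gradient management, GPU optimization, and integration with experiment tracking systems.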

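Capabilities

  • Custom training loop execution
  • Learning rate scheduling (StepLR, CosineAnnealingLR, OneCycleLR, etc.)
  • Gradient clipping and accumulation
  • Mixed-precision training (AMP)
  • Checkpoint management and resumption
  • DataLoader optimization
  • Multi-GPU training (DataParallel, DistributedDataParallel)
  • Early stopping with a patience value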
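
To make the gradient clipping and accumulation items above concrete, here is a minimal sketch of one training epoch. The function name `train_one_epoch`, its parameters, and the synthetic data are assumptions made for illustration; the skill's actual loop is not shown in this document. Mixed precision and multi-GPU handling are left out to keep the example short, and the final accumulation group is scaled the same way as the others even if it holds fewer batches.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def train_one_epoch(model, loader, optimizer, loss_fn,
                    accumulation_steps=1, max_grad_norm=None):
    """Run one epoch with gradient accumulation and optional norm clipping.

    Returns (mean loss per batch, number of optimizer steps taken).
    """
    model.train()
    optimizer.zero_grad()
    total_loss, optimizer_steps = 0.0, 0
    num_batches = len(loader)

    for i, (inputs, targets) in enumerate(loader, start=1):
        loss = loss_fn(model(inputs), targets)
        # Scale so the accumulated gradient averages over the micro-batches.
        (loss / accumulation_steps).backward()
        total_loss += loss.item()

        # Step every `accumulation_steps` batches, and on the final batch.
        if i % accumulation_steps == 0 or i == num_batches:
            if max_grad_norm is not None:
                nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
            optimizer.step()
            optimizer.zero_grad()
            optimizer_steps += 1

    return total_loss / num_batches, optimizer_steps


# Tiny synthetic run on CPU; the data and model are placeholders.
torch.manual_seed(0)
features = torch.randn(64, 10)
labels = torch.randint(0, 2, (64,))
loader = DataLoader(TensorDataset(features, labels), batch_size=8)

model = nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
mean_loss, steps = train_one_epoch(
    model, loader, optimizer, nn.CrossEntropyLoss(),
    accumulation_steps=4, max_grad_norm=1.0,
)
print(f"mean loss {mean_loss:.4f}, optimizer steps {steps}")
```

With 8 batches and `accumulation_steps=4`, the optimizer steps twice, so the effective batch size is 32 while only 8 samples are held in memory per forward pass.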

Target Workflows

  • Model training pipelines with experiment tracking
  • Distributed training orchestration
  • AutoML pipeline orchestration

Tools and Libraries

  • PyTorch
  • PyTorch Lightning (optional)
  • torchvision, torchaudio, torchtext
  • CUDA Toolkit

Input Schema

{
  "type": "object",
  "required": ["modelPath", "dataConfig", "trainingConfig"],
  "properties": {
    "modelPath": {
      "type": "string",
      "description": "Path to the model definition file"
    },
    "dataConfig": {
      "type": "object",
      "properties": {
        "trainPath": { "type": "string" },
        "valPath": { "type": "string" },
        "batchSize": { "type": "integer" },
        "numWorkers": { "type": "integer" }
      }
    },
    "trainingConfig": {
      "type": "object",
      "properties": {
        "epochs": { "type": "integer" },
        "learningRate": { "type": "number" },
        "optimizer": { "type": "string" },
        "scheduler": { "type": "string" },
        "mixedPrecision": { "type": "boolean" },
        "gradientClipping": { "type": "number" },
        "gradientAccumulation": { "type": "integer" }
      }
    },
    "checkpointConfig": {
      "type": "object",
      "properties": {
        "saveDir": { "type": "string" },
        "saveEvery": { "type": "integer" },
        "resumeFrom": { "type": "string" }
      }
    }
  }
}
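
As a rough illustration of how a caller might check an input against the schema above before starting a run, the following sketch verifies the three required fields and the types of the nested properties. The helper `validate_input` and the `FIELD_TYPES` table are hypothetical, written for this example only. This is a simplified check and not a full JSON Schema validator.

```python
REQUIRED = ("modelPath", "dataConfig", "trainingConfig")

# Expected Python types for the nested properties listed in the schema.
FIELD_TYPES = {
    "dataConfig": {"trainPath": str, "valPath": str,
                   "batchSize": int, "numWorkers": int},
    "trainingConfig": {"epochs": int, "learningRate": (int, float),
                       "optimizer": str, "scheduler": str,
                       "mixedPrecision": bool,
                       "gradientClipping": (int, float),
                       "gradientAccumulation": int},
    "checkpointConfig": {"saveDir": str, "saveEvery": int,
                         "resumeFrom": str},
}


def validate_input(config):
    """Return a list of problems found in an input dict (empty if none)."""
    problems = [f"missing required field: {key}"
                for key in REQUIRED if key not in config]
    if "modelPath" in config and not isinstance(config["modelPath"], str):
        problems.append("modelPath must be a string")

    for section, fields in FIELD_TYPES.items():
        values = config.get(section, {})
        if not isinstance(values, dict):
            problems.append(f"{section} must be an object")
            continue
        for name, expected in fields.items():
            if name not in values:
                continue  # nested properties are optional in the schema
            value = values[name]
            # bool is a subclass of int in Python, so reject it explicitly
            # wherever the schema asks for a number or integer.
            is_stray_bool = isinstance(value, bool) and expected is not bool
            if is_stray_bool or not isinstance(value, expected):
                problems.append(f"{section}.{name} has an unexpected type")
    return problems


# Example: a config that omits trainingConfig and passes batchSize as text.
print(validate_input({
    "modelPath": "models/resnet.py",
    "dataConfig": {"batchSize": "32"},
}))
```

Returning a list of problems instead of raising on the first one lets a caller report every issue in a single pass.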

Output Schema

{
  "type": "object",
  "required": ["status", "metrics", "checkpointPath"],
  "properties": {
    "status": {
      "type": "string",
      "enum": ["success", "error", "early_stopped"]
    },
    "metrics": {
      "type": "object",
      "properties": {
        "trainLoss": { "type": "number" },
        "valLoss": { "type": "number" },
        "trainAccuracy": { "type": "number" },
        "valAccuracy": { "type": "number" },
        "epochsTrained": { "type": "integer" },
        "trainingTime": { "type": "number" }
      }
    },
    "checkpointPath": {
      "type": "string"
    },
    "learningCurve": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "epoch": { "type": "integer" },
          "trainLoss": { "type": "number" },
          "valLoss": { "type": "number" }
        }
      }
    }
  }
}
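
The `early_stopped` status and the `learningCurve` array can be illustrated with a small patience-based early stopping sketch. The `EarlyStopping` class, the `summarize` helper, the loss values, and the checkpoint path are all assumptions made for this example. The input schema above does not define a patience field, so it is passed in directly here.

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def summarize(history, patience, checkpoint_path):
    """Replay (train_loss, val_loss) pairs and build an output-shaped dict.

    Assumes `history` holds at least one epoch.
    """
    stopper = EarlyStopping(patience=patience)
    curve, status = [], "success"
    for epoch, (train_loss, val_loss) in enumerate(history, start=1):
        curve.append({"epoch": epoch, "trainLoss": train_loss,
                      "valLoss": val_loss})
        if stopper.step(val_loss):
            status = "early_stopped"
            break
    last = curve[-1]
    return {
        "status": status,
        "metrics": {"trainLoss": last["trainLoss"],
                    "valLoss": last["valLoss"],
                    "epochsTrained": last["epoch"]},
        "checkpointPath": checkpoint_path,  # hypothetical path
        "learningCurve": curve,
    }


# Made-up losses: validation loss improves for two epochs, then worsens.
history = [(0.9, 0.8), (0.7, 0.6), (0.6, 0.65), (0.5, 0.7), (0.4, 0.75)]
result = summarize(history, patience=2, checkpoint_path="checkpoints/best.pt")
print(result["status"], result["metrics"]["epochsTrained"])
# prints: early_stopped 4
```

Validation loss last improves at epoch 2, so with a patience of 2 the run stops after epoch 4 and the fifth entry is never consumed.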

Usage Example

{
  kind: 'skill',
  title: 'Train a PyTorch model',
  skill: {
    name: 'pytorch-trainer',
    context: {
      modelPath: 'models/resnet.py',
      dataConfig: {
        trainPath: 'data/train',
        valPath: 'data/val',
        batchSize: 32,
        numWorkers: 4
      },
      trainingConfig: {
        epochs: 100,
        learningRate: 0.001,
        optimizer: 'AdamW',
        scheduler: 'cosine',
        mixedPrecision: true,
        gradientClipping: 1.0
      }
    }
  }
}
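
One way the `trainingConfig` values in this example could be turned into PyTorch objects is sketched below. The mapping from the strings `'AdamW'` and `'cosine'` to `torch.optim.AdamW` and `CosineAnnealingLR` is an assumption, as is the helper name `build_optimizer_and_scheduler`. The document does not specify how the skill interprets these strings.

```python
import torch
from torch import nn

# Assumed name-to-class mapping; the skill's real lookup is not documented.
OPTIMIZERS = {
    "AdamW": torch.optim.AdamW,
    "Adam": torch.optim.Adam,
    "SGD": torch.optim.SGD,
}


def build_optimizer_and_scheduler(model, training_config):
    """Create an optimizer and an optional LR scheduler from a config dict."""
    name = training_config.get("optimizer", "AdamW")
    if name not in OPTIMIZERS:
        raise ValueError(f"unsupported optimizer: {name}")
    optimizer = OPTIMIZERS[name](
        model.parameters(), lr=training_config["learningRate"]
    )

    kind = training_config.get("scheduler")
    if kind is None:
        scheduler = None
    elif kind == "cosine":
        # Anneal over the whole run, stepping the scheduler once per epoch.
        scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
            optimizer, T_max=training_config["epochs"]
        )
    else:
        raise ValueError(f"unsupported scheduler: {kind}")
    return optimizer, scheduler


# The same values as the usage example, applied to a placeholder model.
training_config = {
    "epochs": 100,
    "learningRate": 0.001,
    "optimizer": "AdamW",
    "scheduler": "cosine",
}
model = nn.Linear(10, 2)
optimizer, scheduler = build_optimizer_and_scheduler(model, training_config)
print(type(optimizer).__name__, type(scheduler).__name__)
```

An explicit dictionary is used instead of looking names up on `torch.optim` dynamically, so an unexpected string fails with a clear error.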