name: error-recovery
description: Strategies for handling subagent failures, including retry logic and escalation patterns.
allowed-tools: Read, Task

Error Recovery Skill

Patterns for handling subagent failures gracefully, with appropriate retry strategies.

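When to Load This Skill

You are spawning subagents that may fail
A subagent returned an error or unexpected output
You need to decide whether to retry, escalate, or abort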

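Failure Categories

Category	Symptoms	Strategy
Transient	Timeout, malformed output, parse error	Simple retry
Missing context	"I don't have enough information", unclear task	Context enhancement
Complexity	Partial completion, scope creep, drifting off-topic	Scope reduction
Boundary/contract	`status: blocked`, boundary violation, contract change	Escalation
Fatal	Repeated failures (3+), fundamental misunderstanding	Abort and report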
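The category table can be encoded as a plain lookup so that an executor always picks exactly one strategy per category. This is a minimal sketch; the enum member names and the labels `escalation` and `abort_and_report` are assumptions (only `simple_retry`, `context_enhancement` and `scope_reduction` appear in this skill's state files).

```python
from enum import Enum


class FailureCategory(Enum):
    TRANSIENT = "transient"
    MISSING_CONTEXT = "missing_context"
    COMPLEXITY = "complexity"
    BOUNDARY_CONTRACT = "boundary_contract"
    FATAL = "fatal"


# One recovery strategy per category, mirroring the table above.
# "escalation" and "abort_and_report" are assumed labels.
STRATEGY_FOR_CATEGORY = {
    FailureCategory.TRANSIENT: "simple_retry",
    FailureCategory.MISSING_CONTEXT: "context_enhancement",
    FailureCategory.COMPLEXITY: "scope_reduction",
    FailureCategory.BOUNDARY_CONTRACT: "escalation",
    FailureCategory.FATAL: "abort_and_report",
}


def strategy_for(category: FailureCategory) -> str:
    """Return the recovery strategy name for a failure category."""
    return STRATEGY_FOR_CATEGORY[category]
```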

Retry Strategies

Strategy 1: Simple Retry

For transient failures. Re-send the same prompt, retrying up to 3 times.

# Track the retry count
attempts: 0
max_attempts: 3

# On failure
IF attempts < max_attempts:
  attempts += 1
  Task(same_subagent_type, same_model, same_prompt)
ELSE:
  Mark as FAILED and move on

When to use:

The output is malformed or truncated
A timeout occurred
The agent returned an empty or invalid response
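The retry pseudocode above can be sketched in Python as follows. `run_task` is a hypothetical stand-in for `Task(same_subagent_type, same_model, same_prompt)`, assumed to return the output string, or `None` when the result is empty, malformed, or timed out. As in the pseudocode, the counter tracks retries, so a task that never succeeds is called once plus three retries.

```python
from typing import Callable, Optional


def simple_retry(
    run_task: Callable[[], Optional[str]], max_attempts: int = 3
) -> Optional[str]:
    """Run the task, then re-run the identical call up to max_attempts times.

    run_task is a hypothetical callable wrapping the subagent call;
    it returns None for an empty, malformed, or timed-out result.
    """
    attempts = 0
    result = run_task()
    while result is None and attempts < max_attempts:
        attempts += 1
        result = run_task()  # same subagent type, model, and prompt
    return result  # None means: mark as FAILED and move on
```

Retrying the unchanged prompt is only worthwhile for transient failures; if the same error repeats, a different strategy is more likely to help.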

Strategy 2: Context Enhancement

Add more information to help the agent succeed.

Task(
  subagent_type: "implementer",
  model: "sonnet",
  prompt: |
    ## PREVIOUS ATTEMPT FAILED

    Error: {error_message}
    Output received: {partial_output}

    ## ADDITIONAL CONTEXT

    Here is more information that may help you succeed:
    - Relevant file: @{additional_file_path}
    - Pattern to follow: {example_pattern}
    - Specific guidance: {clarification}

    ## ORIGINAL TASK

    {original_task_description}

    Output to: {output_path}
)

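When to use:

The agent says "I don't understand" or "the requirements are unclear"
The agent made an incorrect assumption
The agent asks questions in its output

Context to add:

Related code files the agent may need
Similar implementations to use as examples
Explicit clarification of ambiguous points
Error messages from the previous attempt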
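The enhanced prompt can be assembled mechanically from the previous failure and the extra context. This is a sketch that follows the section headings of the Task prompt above; the function name and parameters are hypothetical, and optional items are simply omitted when empty.

```python
def build_enhanced_prompt(
    original_task: str,
    output_path: str,
    error_message: str,
    partial_output: str,
    extra_files: list[str],
    example_pattern: str = "",
    clarification: str = "",
) -> str:
    """Wrap the original task with the previous failure and extra context."""
    lines = [
        "## PREVIOUS ATTEMPT FAILED",
        "",
        f"Error: {error_message}",
        f"Output received: {partial_output}",
        "",
        "## ADDITIONAL CONTEXT",
        "",
    ]
    # Each related file is referenced with the @path convention.
    lines += [f"- Relevant file: @{path}" for path in extra_files]
    if example_pattern:
        lines.append(f"- Pattern to follow: {example_pattern}")
    if clarification:
        lines.append(f"- Specific guidance: {clarification}")
    lines += ["", "## ORIGINAL TASK", "", original_task, "",
              f"Output to: {output_path}"]
    return "\n".join(lines)
```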

Strategy 3: Scope Reduction

Break the failed task into smaller, more manageable pieces.

# Original task failed
Task: "Implement full authentication system"

# Decompose into subtasks
Task(implementer, "Implement password hashing utility")
Task(implementer, "Implement session token generation")
Task(implementer, "Implement login endpoint")
Task(implementer, "Implement logout endpoint")

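When to use:

The agent completed part of the work and then failed
The task description is too broad
The agent drifted off-topic
The output shows confusion about scope

Decomposition guidelines:

Each subtask should be completable on its own
Each subtask should have clear boundaries
Subtasks with no dependencies between them can run in parallel
Recombine the outputs once all subtasks complete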
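Deciding which subtasks can run in parallel amounts to a layered topological sort: each "wave" contains the subtasks whose dependencies were all finished in earlier waves. This is a sketch, assuming dependencies are given as a dict of subtask name to prerequisite names; the dependency map in the usage below is a hypothetical reading of the authentication example.

```python
def parallel_waves(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group subtasks into waves that can each run in parallel.

    deps maps each subtask to the set of subtasks it depends on.
    Raises ValueError if the dependencies contain a cycle or name
    a subtask that is not in deps.
    """
    remaining = {task: set(d) for task, d in deps.items()}
    done: set[str] = set()
    waves: list[list[str]] = []
    while remaining:
        # Ready = every dependency already completed in an earlier wave.
        ready = sorted(t for t, d in remaining.items() if d <= done)
        if not ready:
            raise ValueError(f"unsatisfiable dependencies: {sorted(remaining)}")
        waves.append(ready)
        done.update(ready)
        for t in ready:
            del remaining[t]
    return waves
```

For example, if login needs hashing and session tokens, and logout needs session tokens, the first wave is `["password_hashing", "session_tokens"]` and the second is `["login", "logout"]`. Python's standard `graphlib.TopologicalSorter` offers the same wave-by-wave pattern through `get_ready()` and `done()`.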

Strategy 4: Escalation

Route the problem to a specialized agent for resolution.

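# For boundary violations
Task(
  subagent_type: "contract-resolver",
  model: "sonnet",
  prompt: |
    Task is blocked due to a boundary/contract issue.

    Blocked task output: memory/tasks/{task_id}/output.json
    Blocked reason: {blocked_reason}
    Current contracts: {contract_paths}

    Analyze the impact and propose a resolution.
    Output to: memory/contracts/resolution_{task_id}.json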
)

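Escalation paths:

Failure type	Escalate to	Action
`blocked_reason: boundary_violation`	contract-resolver	Expand the boundaries or redesign
`blocked_reason: contract_change`	contract-resolver	Amend the contract, revalidate dependencies
`blocked_reason: dependency_issue`	executor (self)	Re-check dependency status
Repeated implementation failures	architect	Reconsider the design approach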
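The escalation table can be expressed as a routing function. This is a sketch under two assumptions: "repeated" implementation failures means 3 or more (matching the 3+ threshold used elsewhere in this skill), and an unrecognized reason returns `None` so the caller can fall back to context enhancement, as the decision tree suggests for unknown errors.

```python
from typing import Optional

# blocked_reason -> (escalation target, action), mirroring the table above.
ESCALATION_PATHS = {
    "boundary_violation": ("contract-resolver", "expand the boundaries or redesign"),
    "contract_change": ("contract-resolver", "amend the contract, revalidate dependencies"),
    "dependency_issue": ("executor", "re-check dependency status"),
}


def escalation_target(
    blocked_reason: Optional[str], failed_implementations: int = 0
) -> Optional[str]:
    """Return who should handle the task, or None if no path applies."""
    if blocked_reason in ESCALATION_PATHS:
        return ESCALATION_PATHS[blocked_reason][0]
    if failed_implementations >= 3:  # assumed threshold for "repeated"
        return "architect"
    return None  # caller falls back to Strategy 2
```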

Strategy 5: Abort with Report

When recovery is impossible, fail gracefully.

{
  "tasks": [
    {
      "id": "{task_id}",
      "status": "failed",
      "failure_reason": "{specific reason}",
      "attempts_made": 3,
      "recovery_attempted": [
        {"strategy": "simple_retry", "result": "same_error"},
        {"strategy": "context_enhancement", "result": "different_error"},
        {"strategy": "scope_reduction", "result": "subtasks_also_failed"}
      ],
      "recommendation": "Task may need architectural redesign"
    }
  ]
}

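When to use:

3 or more retry attempts have failed
All strategies have failed
The requirements are fundamentally misunderstood
The task is effectively impossible under the given constraints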
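The abort report can be generated from the recovery history rather than written by hand. This is a minimal sketch; the function name is hypothetical, and setting `attempts_made` to the number of strategies tried is an assumption (the example report shows 3 attempts alongside 3 strategies).

```python
import json


def abort_report(
    task_id: str,
    reason: str,
    history: list[tuple[str, str]],
    recommendation: str = "Task may need architectural redesign",
) -> str:
    """Build the JSON failure report for a task that could not be recovered.

    history holds (strategy, result) pairs in the order they were tried.
    """
    report = {
        "tasks": [
            {
                "id": task_id,
                "status": "failed",
                "failure_reason": reason,
                "attempts_made": len(history),  # assumed: one per strategy
                "recovery_attempted": [
                    {"strategy": s, "result": r} for s, r in history
                ],
                "recommendation": recommendation,
            }
        ]
    }
    return json.dumps(report, indent=2)
```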

Decision Tree

On Subagent Failure:
│
├─ Is output malformed/empty/timeout?
│  └─ YES → Strategy 1: Simple Retry (up to 3x)
│
├─ Did agent say "unclear" or ask questions?
│  └─ YES → Strategy 2: Context Enhancement
│
├─ Did agent complete partial work?
│  └─ YES → Strategy 3: Scope Reduction
│
├─ Is status "blocked" with boundary/contract reason?
│  └─ YES → Strategy 4: Escalation to contract-resolver
│
├─ Have we tried 3+ strategies already?
│  └─ YES → Strategy 5: Abort with Report
│
└─ Unknown error
   └─ Try Strategy 2 first, then escalate

Retry State Tracking

Track retry attempts in the execution state file:

{
  "tasks": [
    {
      "id": "task-001",
      "status": "running",
      "attempts": 2,
      "last_error": "Timeout after 120s",
      "retry_strategy": "simple_retry"
    },
    {
      "id": "task-002",
      "status": "running",
      "attempts": 1,
      "last_error": "Needs access to src/config/db.ts",
      "retry_strategy": "context_enhancement",
      "context_added": ["src/config/db.ts", "src/types/config.ts"]
    }
  ]
}
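Updating that state after each failed attempt can be sketched as a small function over the parsed state (the `{"tasks": [...]}` shape shown above). The function name and the rule that `context_added` accumulates across rounds without duplicates are assumptions; reading and writing the state file itself is left out.

```python
from typing import Optional


def update_retry_state(
    state: dict,
    task_id: str,
    error: str,
    strategy: str,
    context_added: Optional[list[str]] = None,
) -> dict:
    """Record one more failed attempt for task_id; mutates and returns state."""
    task = next((t for t in state["tasks"] if t["id"] == task_id), None)
    if task is None:
        raise KeyError(task_id)
    task["attempts"] = task.get("attempts", 0) + 1
    task["last_error"] = error
    task["retry_strategy"] = strategy
    if context_added:
        # Keep every file added across context-enhancement rounds, once each.
        existing = task.setdefault("context_added", [])
        for path in context_added:
            if path not in existing:
                existing.append(path)
    return state
```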

Integration with the Executor Loop

# Enhanced execution loop
WHILE tasks remain incomplete:
  1. Read state file
  2. Find ready tasks
  3. Spawn ready tasks
  4. Check completed tasks:
     FOR each completed task:
       IF status == pre_complete:
         spawn verifier
       ELIF status == blocked:
         apply Strategy 4 (Escalation)
       ELIF status == failed:
         determine_failure_category()
         apply_appropriate_strategy()
         update_retry_state()
  5. Update state file
  6. IF all verified: EXIT
  7. IF all failed with no recovery: EXIT with failure report
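Steps 4, 6 and 7 of the loop can be sketched as two pure functions, leaving the actual spawning to the executor. Assumptions: a finished, checked task has status `verified`; a task that has run out of recovery options carries a hypothetical `recoverable: false` flag; and the returned action labels are illustrative only.

```python
from typing import Optional


def handle_completed(task: dict) -> str:
    """Step 4: decide what to do with one task the loop has picked up."""
    status = task["status"]
    if status == "pre_complete":
        return "spawn_verifier"
    if status == "blocked":
        return "escalate"  # Strategy 4
    if status == "failed":
        # Classify the failure, apply a strategy, update retry state.
        return "recover"
    return "wait"  # still running: nothing to do yet


def exit_reason(tasks: list[dict]) -> Optional[str]:
    """Steps 6 and 7: return why the loop should stop, or None to continue."""
    if all(t["status"] == "verified" for t in tasks):
        return "all_verified"
    unfinished = [t for t in tasks if t["status"] != "verified"]
    if all(t["status"] == "failed" and not t.get("recoverable", True)
           for t in unfinished):
        return "failure_report"
    return None
```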

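Principles

Fail fast, recover smart - don't retry blindly; analyze why it failed first
Preserve partial work - if an agent finished 50% of the task, don't throw that work away
Escalate early - boundary/contract issues need a resolver, not retries
Track everything - log every attempt so it can be reviewed in the reflection phase
Know when to quit - 3 failed strategies means abort; don't loop forever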