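name: evaluating-code-models
description: Evaluates code generation models on HumanEval, MBPP, MultiPL-E, and 15+ benchmarks using the pass@k metric. Use it to benchmark code models, compare coding ability, test multi-language support, or measure code generation quality. This is the BigCode project's industry-standard harness, used by HuggingFace leaderboards.
version: 1.0.0
author: Orchestra Research
license: MIT
tags: [Evaluation, Code Generation, HumanEval, MBPP, MultiPL-E, Pass@k, BigCode, Benchmarking, Code Models]
dependencies: [bigcode-evaluation-harness, transformers>=4.25.1, accelerate>=0.13.2, datasets>=2.6.1]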

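BigCode Evaluation Harness - Code Model Benchmarking

Quick Start

The BigCode Evaluation Harness evaluates code generation models on 15+ benchmarks, including HumanEval, MBPP, and MultiPL-E (18 languages).

Installation: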

git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git
cd bigcode-evaluation-harness
pip install -e .
accelerate config

Evaluate on HumanEval:

accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks humaneval \
  --max_length_generation 512 \
  --temperature 0.2 \
  --n_samples 20 \
  --batch_size 10 \
  --allow_code_execution \
  --save_generations

List available tasks:

python -c "from bigcode_eval.tasks import ALL_TASKS; print(ALL_TASKS)"

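Common Workflows

Workflow 1: Standard Code Benchmark Evaluation

Evaluate a model on the core code benchmarks (HumanEval, MBPP, HumanEval+).

Checklist:

Code benchmark evaluation:
- [ ] Step 1: Choose a benchmark suite
- [ ] Step 2: Configure the model and generation parameters
- [ ] Step 3: Run the evaluation with code execution
- [ ] Step 4: Analyze the pass@k results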

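Step 1: Choose a benchmark suite

Python code generation (most common):

HumanEval: 164 handwritten problems, function completion
HumanEval+: the same 164 problems with 80× more tests (stricter)
MBPP: 500 crowd-sourced problems, entry-level difficulty
MBPP+: 399 curated problems with 35× more tests

Multi-language (18 languages):

MultiPL-E: HumanEval/MBPP translated to C++, Java, JavaScript, Go, Rust, and more

Advanced:

APPS: 10,000 problems (introductory/interview/competition level)
DS-1000: 1,000 data science problems covering 7 libraries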

Step 2: Configure the model and generation parameters

# Standard HuggingFace model
accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks humaneval \
  --max_length_generation 512 \
  --temperature 0.2 \
  --do_sample True \
  --n_samples 200 \
  --batch_size 50 \
  --allow_code_execution

# Quantized model (4-bit)
accelerate launch main.py \
  --model codellama/CodeLlama-34b-hf \
  --tasks humaneval \
  --load_in_4bit \
  --max_length_generation 512 \
  --allow_code_execution

# Custom/private model
accelerate launch main.py \
  --model /path/to/my-code-model \
  --tasks humaneval \
  --trust_remote_code \
  --use_auth_token \
  --allow_code_execution

Step 3: Run the evaluation

# Full evaluation with pass@k estimates (k=1,10,100)
accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks humaneval \
  --temperature 0.8 \
  --n_samples 200 \
  --batch_size 50 \
  --allow_code_execution \
  --save_generations \
  --metric_output_path results/starcoder2-humaneval.json

Step 4: Analyze the results

The results are written to results/starcoder2-humaneval.json:

{
  "humaneval": {
    "pass@1": 0.354,
    "pass@10": 0.521,
    "pass@100": 0.689
  },
  "config": {
    "model": "bigcode/starcoder2-7b",
    "temperature": 0.8,
    "n_samples": 200
  }
}
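
To make the pass@k numbers above easier to interpret, here is a minimal sketch of the standard unbiased pass@k estimator (the one introduced with HumanEval). It assumes the harness reports the per-problem average of this quantity; the function names `pass_at_k` and `mean_pass_at_k` are illustrative and are not part of the harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: number of samples generated for the problem
    c: number of those samples that passed every test
    k: sampling budget being estimated
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    # 1 - P(all k drawn samples fail) = 1 - C(n - c, k) / C(n, k)
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given one (n, c) pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Two hypothetical problems: 1 of 2 samples passed, then 2 of 2 passed.
    print(mean_pass_at_k([(2, 1), (2, 2)], k=1))
```

Because the estimator needs n >= k, a run with `--n_samples 20` can report pass@1 and pass@10 but not pass@100.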

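Workflow 2: Multi-Language Evaluation (MultiPL-E)

Evaluate code generation across 18 programming languages.

Checklist:

Multi-language evaluation:
- [ ] Step 1: Generate solutions on the host
- [ ] Step 2: Run the evaluation in Docker (safe execution)
- [ ] Step 3: Compare across languages

Step 1: Generate solutions on the host

# Generate without executing (safe)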
accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks multiple-py,multiple-js,multiple-java,multiple-cpp \
  --max_length_generation 650 \
  --temperature 0.8 \
  --n_samples 50 \
  --batch_size 50 \
  --generation_only \
  --save_generations \
  --save_generations_path generations_multi.json
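
Before moving the saved generations into a container, it can help to sanity-check the file. The sketch below assumes the file is a JSON list with one entry per problem, each entry being a list of generated strings (one per sample); that layout is an assumption about the harness output, and `summarize_generations` and `load_generations` are hypothetical helper names.

```python
import json
from collections import Counter


def load_generations(path):
    """Read a generations file; assumed layout is list[list[str]]."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def summarize_generations(generations):
    """Return (num_problems, samples-per-problem histogram, num_empty_samples)."""
    per_problem = Counter(len(samples) for samples in generations)
    empty = sum(1 for samples in generations for s in samples if not s.strip())
    return len(generations), per_problem, empty


if __name__ == "__main__":
    # Tiny in-memory stand-in for a real file; with real output, use
    # load_generations(<path written by your run>) instead.
    demo = [["def f(): return 1", ""], ["def g(): pass", "def g(): return 2"]]
    n, per_problem, empty = summarize_generations(demo)
    print(f"{n} problems, samples per problem: {dict(per_problem)}, empty: {empty}")
```

A histogram with more than one key, or a sample count that differs from `--n_samples`, usually means the generation run was interrupted or configured differently from the evaluation run.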

Step 2: Evaluate in a Docker container

# Pull the MultiPL-E Docker image
docker pull ghcr.io/bigcode-project/evaluation-harness-multiple

# Run the evaluation inside the container (using the full name of the pulled image)
docker run -v $(pwd)/generations_multi.json:/app/generations.json:ro \
  -it ghcr.io/bigcode-project/evaluation-harness-multiple python3 main.py \
  --model bigcode/starcoder2-7b \
  --tasks multiple-py,multiple-js,multiple-java,multiple-cpp \
  --load_generations_path /app/generations.json \
  --allow_code_execution \
  --n_samples 50

Supported languages: Python, JavaScript, Java, C++, Go, Rust, TypeScript, C#, PHP, Ruby, Swift, Kotlin, Scala, Perl, Julia, Lua, R, Racket
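
For the cross-language comparison (Step 3 in the checklist), one possible approach is to rank the `multiple-*` tasks by score. This sketch assumes the metrics JSON has the same shape as the HumanEval results file shown in Workflow 1 (one key per task plus a `config` key); `rank_languages` is a hypothetical helper, and the numbers in the demo are placeholders, not real results.

```python
def rank_languages(metrics, k="pass@1"):
    """Return [(language_suffix, score)] sorted best-first for multiple-* tasks."""
    prefix = "multiple-"
    rows = [
        (task[len(prefix):], scores[k])
        for task, scores in metrics.items()
        # Skip non-task keys such as "config" and tasks missing the metric.
        if task.startswith(prefix) and k in scores
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    demo = {
        "multiple-py": {"pass@1": 0.30},
        "multiple-js": {"pass@1": 0.40},
        "multiple-java": {"pass@1": 0.20},
        "config": {"model": "bigcode/starcoder2-7b"},
    }
    for lang, score in rank_languages(demo):
        print(f"{lang:<6}{score:.3f}")
```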

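Workflow 3: Instruction-Tuned Model Evaluation

Evaluate chat/instruction models with the appropriate prompt format.

Checklist:

Instruction model evaluation:
- [ ] Step 1: Use instruction-tuned tasks
- [ ] Step 2: Configure instruction tokens
- [ ] Step 3: Run the evaluation

Step 1: Choose an instruction task

instruct-humaneval: HumanEval with instruction prompts
humanevalsynthesize-{lang}: HumanEvalPack synthesis tasks

Step 2: Configure instruction tokens

# For models with a chat template (e.g. CodeLlama-Instruct)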
accelerate launch main.py \
  --model codellama/CodeLlama-7b-Instruct-hf \
  --tasks instruct-humaneval \
  --instruction_tokens "<s>[INST],</s>,[/INST]" \
  --max_length_generation 512 \
  --allow_code_execution
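
The `--instruction_tokens` value is a comma-separated triple. As a rough illustration, the sketch below splits it into a user token, an end token, and an assistant token, then wraps an instruction with them. The assembly order shown is an assumption and the harness may build its prompts differently; `parse_instruction_tokens` and `build_prompt` are hypothetical names.

```python
def parse_instruction_tokens(spec):
    """Split 'user,end,assistant' into its three tokens."""
    parts = spec.split(",")
    if len(parts) != 3:
        raise ValueError("expected exactly three comma-separated tokens")
    user_token, end_token, assistant_token = parts
    return user_token, end_token, assistant_token


def build_prompt(instruction, spec):
    """Wrap an instruction as user_token + instruction + end_token + assistant_token."""
    user_token, end_token, assistant_token = parse_instruction_tokens(spec)
    return f"{user_token}{instruction}{end_token}{assistant_token}"


if __name__ == "__main__":
    spec = "<s>[INST],</s>,[/INST]"
    print(build_prompt("Write a function that reverses a string.", spec))
```

Printing the assembled prompt this way is a quick check that the tokens match the format described on the model card before spending GPU time on a full run.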

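Step 3: HumanEvalPack evaluation for instruction models

# Test code synthesis (HumanEvalPack covers 6 languages; this run uses Python and JavaScript)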
accelerate launch main.py \
  --model codellama/CodeLlama-7b-Instruct-hf \
  --tasks humanevalsynthesize-python,humanevalsynthesize-js \
  --prompt instruct \
  --max_length_generation 512 \
  --allow_code_execution

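Workflow 4: Comparing Multiple Models

Run the same benchmark suite against several models to compare them.

Step 1: Create an evaluation script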

#!/bin/bash
# eval_models.sh

MODELS=(
  "bigcode/starcoder2-7b"
  "codellama/CodeLlama-7b-hf"
  "deepseek-ai/deepseek-coder-6.7b-base"
)
TASKS="humaneval,mbpp"

# Make sure the output directory exists before the first run
mkdir -p results

for model in "${MODELS[@]}"; do
  # Replace "/" so the model ID can be used as a file name
  model_name=$(echo "$model" | tr '/' '-')
  echo "Evaluating $model"

  accelerate launch main.py \
    --model "$model" \
    --tasks "$TASKS" \
    --temperature 0.2 \
    --n_samples 20 \
    --batch_size 20 \
    --allow_code_execution \
    --metric_output_path "results/${model_name}.json"
done

Step 2: Generate a comparison table

import json
import pandas as pd  # df.to_markdown() also needs the optional "tabulate" package

# File stems as written by eval_models.sh ("/" in the model ID replaced with "-")
models = ["bigcode-starcoder2-7b", "codellama-CodeLlama-7b-hf", "deepseek-ai-deepseek-coder-6.7b-base"]
results = []

for model in models:
    with open(f"results/{model}.json") as f:
        data = json.load(f)
    # Collect pass@1 for both tasks, formatted to three decimals
    results.append({
        "Model": model,
        "HumanEval pass@1": f"{data['humaneval']['pass@1']:.3f}",
        "MBPP pass@1": f"{data['mbpp']['pass@1']:.3f}"
    })

df = pd.DataFrame(results)
print(df.to_markdown(index=False))

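When to Use vs. Alternatives

Use the BigCode Evaluation Harness when you:

Are evaluating code generation models specifically
Need multi-language evaluation (18 languages via MultiPL-E)
Want to test functional correctness through unit tests (pass@k)
Are benchmarking for the BigCode/HuggingFace leaderboards
Want to evaluate fill-in-the-middle (FIM) capability

Use an alternative when you need:

lm-evaluation-harness: general LLM benchmarks (MMLU, GSM8K, HellaSwag)
EvalPlus: stricter HumanEval+/MBPP+ with more test cases
SWE-bench: real-world GitHub issue resolution
LiveCodeBench: contamination-free, continuously updated problems
CodeXGLUE: code understanding tasks (clone detection, defect prediction)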

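Supported Benchmarks

Benchmark	Problems	Languages	Metric	Use case
HumanEval	164	Python	pass@k	Standard code completion
HumanEval+	164	Python	pass@k	Stricter evaluation (80× tests)
MBPP	500	Python	pass@k	Entry-level problems
MBPP+	399	Python	pass@k	Stricter evaluation (35× tests)
MultiPL-E	164×18	18 languages	pass@k	Multi-language evaluation
APPS	10,000	Python	pass@k	Competition level
DS-1000	1,000	Python	pass@k	Data science (pandas, numpy, etc.)
HumanEvalPack	164×3×6	6 languages	pass@k	Synthesis/fixing/explanation
Mercury	1,889	Python	Efficiency	Computational efficiency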

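Common Issues

Issue: Results differ from those reported in papers

Check these factors:

# 1. Verify n_samples (n must be at least k; 200 samples are typical for stable pass@k estimates)
--n_samples 200

# 2. Check the temperature (0.2 is near-greedy; 0.8 gives more diverse samples)
--temperature 0.8

# 3. Verify that the task name matches exactly
--tasks humaneval  # not "human_eval" or "HumanEval"

# 4. Check max_length_generation
--max_length_generation 512  # increase for longer problems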
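
Task-name typos can be caught before launching a run. The following sketch checks a `--tasks` string against a list of known names and suggests close matches using the standard library's `difflib`. `check_task_names` is a hypothetical helper; in a real setup the known names could come from `ALL_TASKS` in `bigcode_eval.tasks`, while the demo uses a short hand-written list.

```python
from difflib import get_close_matches


def check_task_names(requested, known_tasks):
    """Return {unknown_name: [suggestions]} for a comma-separated task string."""
    known = list(known_tasks)
    problems = {}
    for name in requested.split(","):
        name = name.strip()
        if name not in known:
            # Lowercase the query so pure casing mistakes still find a match.
            problems[name] = get_close_matches(name.lower(), known, n=3, cutoff=0.6)
    return problems


if __name__ == "__main__":
    known = ["humaneval", "mbpp"]  # stand-in for the real task list
    for bad, suggestions in check_task_names("humaneval,human_eval,HumanEval", known).items():
        print(f"unknown task {bad!r}; did you mean {suggestions}?")
```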

Issue: CUDA out of memory

# Use quantization
--load_in_8bit
# or
--load_in_4bit

# Reduce the batch size
--batch_size 1

# Set a memory limit
--max_memory_per_gpu "20GiB"

Issue: Code execution hangs or times out

Use Docker for safe execution:

# Generate on the host (no execution)
--generation_only --save_generations

# Evaluate in Docker
docker run ... --allow_code_execution --load_generations_path ...
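
To illustrate why generated code needs a hard time limit, here is a minimal sketch that runs a Python snippet in a child process and reports whether it passed, failed, or timed out. It is not the harness's executor and it is not a sandbox (the child process can still touch the file system and network), so Docker remains the safer option; `run_with_timeout` is a hypothetical helper.

```python
import subprocess
import sys


def run_with_timeout(code, timeout_s=5.0):
    """Run a Python snippet in a child process.

    Returns 'passed' (exit code 0), 'failed' (non-zero exit), or 'timeout'.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            timeout=timeout_s,  # the child is killed once this expires
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    return "passed" if proc.returncode == 0 else "failed"


if __name__ == "__main__":
    print(run_with_timeout("assert sum([1, 2, 3]) == 6"))
    print(run_with_timeout("assert False"))
    print(run_with_timeout("while True: pass", timeout_s=1.0))
```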

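Issue: Instruction models score low

Make sure the instruction format is correct:

# Use an instruction-specific task
--tasks instruct-humaneval

# Set the instruction tokens for the model
--instruction_tokens "<s>[INST],</s>,[/INST]"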

Issue: MultiPL-E languages fail

Use the dedicated Docker image:

docker pull ghcr.io/bigcode-project/evaluation-harness-multiple

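Command Reference

Parameter	Default	Description
`--model`	-	HuggingFace model ID or local path
`--tasks`	-	Comma-separated task names
`--n_samples`	1	Samples per problem (use 200 for pass@k)
`--temperature`	0.2	Sampling temperature
`--max_length_generation`	512	Maximum number of tokens (prompt + generation)
`--batch_size`	1	Batch size per GPU
`--allow_code_execution`	False	Enable code execution (required to run the tests)
`--generation_only`	False	Generate without evaluating
`--load_generations_path`	-	Load pre-generated solutions
`--save_generations`	False	Save the generated code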
`--metric_output_path`	evaluation_results.json	Output file for metrics
`--load_in_8bit`	False	8-bit quantization
`--load_in_4bit`	False	4-bit quantization
`--trust_remote_code`	False	Allow custom model code
`--precision`	fp32	Model precision (fp32/fp16/bf16)
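
When runs are scripted from Python rather than bash, the flags in the table can be assembled from a dictionary. This is a sketch under the assumption that boolean switches such as `--allow_code_execution` take no value; `build_command` is a hypothetical helper, and flags that expect an explicit value (for example `--do_sample True`) should be passed as the string "True".

```python
import shlex


def build_command(options):
    """Build an `accelerate launch main.py` argument list from a dict of flags."""
    argv = ["accelerate", "launch", "main.py"]
    for name, value in options.items():
        flag = f"--{name}"
        if value is True:
            argv.append(flag)  # bare switch, no value
        elif value is False or value is None:
            continue  # leave disabled switches out entirely
        else:
            argv.extend([flag, str(value)])
    return argv


if __name__ == "__main__":
    argv = build_command({
        "model": "bigcode/starcoder2-7b",
        "tasks": "humaneval",
        "n_samples": 20,
        "allow_code_execution": True,
        "save_generations": False,
    })
    # Print a copy-pasteable shell command; subprocess.run(argv) would execute it.
    print(shlex.join(argv))
```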

Hardware Requirements

Model size	VRAM (fp16)	VRAM (4-bit)	Time (HumanEval, n=200)
7B	14GB	6GB	~30 min (A100)
13B	26GB	10GB	~1 hour (A100)
34B	68GB	20GB	~2 hours (A100)
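
The fp16 column matches a simple rule of thumb: two bytes per parameter for the weights. The sketch below applies that rule for several precisions. It estimates weight memory only, so activations, the KV cache, and quantization overhead come on top, which is presumably why the 4-bit column in the table is higher than 0.5 bytes per parameter would suggest. `weight_memory_gb` and the `BYTES_PER_PARAM` table are illustrative, not part of the harness.

```python
# Approximate storage per parameter for each precision (weights only).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "4bit": 0.5}


def weight_memory_gb(num_params, precision):
    """Memory needed for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    for size in (7e9, 13e9, 34e9):
        fp16 = weight_memory_gb(size, "fp16")
        four_bit = weight_memory_gb(size, "4bit")
        print(f"{size / 1e9:.0f}B: fp16 weights {fp16:.1f} GB, 4-bit weights {four_bit:.1f} GB")
```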

Resources

GitHub: https://github.com/bigcode-project/bigcode-evaluation-harness
Documentation: https://github.com/bigcode-project/bigcode-evaluation-harness/tree/main/docs
BigCode leaderboard: https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard
HumanEval dataset: https://huggingface.co/datasets/openai/openai_humaneval
MultiPL-E: https://github.com/nuprl/MultiPL-E