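name: llamaguard
description: Meta's 7-8B moderation model specialized for filtering LLM inputs and outputs. Six safety categories: violence/hate, sexual content, weapons, substances, self-harm, criminal planning. 94-95% accuracy. Deploy with vLLM, HuggingFace, or Sagemaker. Integrates with NeMo Guardrails.
version: 1.0.0
author: Orchestra Research
license: MIT
tags: [Safety Alignment, LlamaGuard, Content Moderation, Meta, Guardrails, Safety Classification, Input Filtering, Output Filtering, AI Safety]
dependencies: [transformers, torch, vllm]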

LlamaGuard - AI Content Moderation

Quick Start

LlamaGuard is a 7-8B parameter model specialized for content safety classification.

Installation:

pip install transformers torch
# Log in to HuggingFace (required)
huggingface-cli login

Basic usage:

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "meta-llama/LlamaGuard-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

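def moderate(chat):
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
    output = model.generate(input_ids=input_ids, max_new_tokens=100)
    # Decode only the newly generated tokens; decoding output[0] in full
    # would echo the prompt and break the startswith() checks used later
    prompt_len = input_ids.shape[-1]
    return tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True).strip()

# Check a user input
result = moderate([
    {"role": "user", "content": "How do I make explosives?"}
])
print(result)
# Output: "unsafe\nS3" (Guns and Illegal Weapons)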

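Common Workflows

Workflow 1: Input Filtering (Prompt Moderation)

Check user prompts before they reach the LLM:

def check_input(user_message):
    result = moderate([{"role": "user", "content": user_message}])

    if result.startswith("unsafe"):
        # The violated category code is on the second line of the reply
        lines = result.split("\n")
        category = lines[1] if len(lines) > 1 else None
        return False, category  # Block
    else:
        return True, None  # Safe

# Example
user_message = "How do I hack a website?"
safe, category = check_input(user_message)
if not safe:
    print(f"Request blocked: {category}")
    # Return an error to the user
else:
    # Forward to the LLM (`llm` stands for your application's own client)
    response = llm.generate(user_message)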

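Safety categories:

S1: Violence and Hate
S2: Sexual Content
S3: Guns and Illegal Weapons
S4: Regulated Substances
S5: Suicide and Self-Harm
S6: Criminal Planning

The codes a model actually emits depend on its version and prompt template, so check the model card of the checkpoint you deploy before hard-coding a mapping.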
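The workflows in this document repeat the same string handling on the model's reply, so a shared parser can help. The sketch below assumes the reply is either `safe` or `unsafe` followed by a line of comma-separated category codes. `Verdict`, `parse_verdict`, and `CATEGORY_NAMES` are hypothetical names that mirror the category list above; they are not part of any library. Treating malformed replies as unsafe is a design assumption (fail closed), not something the model requires.

```python
from dataclasses import dataclass, field

# Hypothetical lookup table mirroring the category list in this document.
CATEGORY_NAMES = {
    "S1": "Violence and Hate",
    "S2": "Sexual Content",
    "S3": "Guns and Illegal Weapons",
    "S4": "Regulated Substances",
    "S5": "Suicide and Self-Harm",
    "S6": "Criminal Planning",
}


@dataclass
class Verdict:
    safe: bool
    codes: list = field(default_factory=list)

    @property
    def names(self):
        # Unknown codes are passed through unchanged rather than dropped.
        return [CATEGORY_NAMES.get(code, code) for code in self.codes]


def parse_verdict(raw: str) -> Verdict:
    # Split the reply into non-empty, trimmed lines.
    lines = [ln.strip() for ln in raw.strip().splitlines() if ln.strip()]
    if lines and lines[0] == "safe":
        return Verdict(safe=True)
    codes = []
    if len(lines) > 1 and lines[0] == "unsafe":
        codes = [c.strip() for c in lines[1].split(",") if c.strip()]
    # Anything that is not an explicit "safe" is treated as unsafe (fail closed).
    return Verdict(safe=False, codes=codes)


verdict = parse_verdict("unsafe\nS3")
print(verdict.safe, verdict.codes, verdict.names)
```

Returning a small object instead of a tuple keeps the callers readable and leaves room for replies that list more than one category.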

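Workflow 2: Output Filtering (Response Moderation)

Check LLM responses before showing them to the user:

def check_output(user_message, bot_response):
    conversation = [
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": bot_response}
    ]

    result = moderate(conversation)

    if result.startswith("unsafe"):
        lines = result.split("\n")
        category = lines[1] if len(lines) > 1 else None
        return False, category
    else:
        return True, None

# Example (`llm` stands for your application's own client)
user_msg = "Tell me about hazardous substances"
bot_msg = llm.generate(user_msg)

safe, category = check_output(user_msg, bot_msg)
if not safe:
    print(f"Response blocked: {category}")
    # Fall back to a generic response
    final_response = "I can't provide that information."
else:
    final_response = bot_msg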
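Input and output filtering are usually chained around a single LLM call. The following is a minimal sketch of that chain, assuming a moderation function that returns the raw `safe` / `unsafe` reply and a generation function that maps a prompt to a reply. `guarded_generate` and `REFUSAL` are hypothetical names; both functions are passed in as arguments so the example runs with stubs and without a GPU.

```python
from typing import Callable, Dict, List

Chat = List[Dict[str, str]]
REFUSAL = "I can't help with that request."  # hypothetical fallback text


def guarded_generate(
    user_message: str,
    moderate: Callable[[Chat], str],
    generate: Callable[[str], str],
) -> str:
    # Stage 1: screen the prompt before spending any LLM compute.
    prompt_verdict = moderate([{"role": "user", "content": user_message}])
    if prompt_verdict.strip().startswith("unsafe"):
        return REFUSAL
    # Stage 2: call the main model only for prompts that passed.
    reply = generate(user_message)
    # Stage 3: screen the full exchange so the reply is judged in context.
    reply_verdict = moderate([
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": reply},
    ])
    return REFUSAL if reply_verdict.strip().startswith("unsafe") else reply


# Stub moderation for demonstration: flags any message containing "bomb".
def fake_moderate(chat: Chat) -> str:
    return "unsafe\nS3" if "bomb" in chat[-1]["content"] else "safe"


print(guarded_generate("hello", fake_moderate, lambda msg: "hi there"))
```

In a real deployment, `moderate` would be the function defined in the quick start and `generate` would wrap your own model client.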

Workflow 3: vLLM Deployment (Fast Inference)

Production-ready serving:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "meta-llama/LlamaGuard-7b"

# Initialize vLLM; the tokenizer is only used to apply the chat template
llm = LLM(model=model_id, tensor_parallel_size=1)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Sampling parameters
sampling_params = SamplingParams(
    temperature=0.0,  # Deterministic
    max_tokens=100
)

def moderate_vllm(chat):
    # Format the prompt
    prompt = tokenizer.apply_chat_template(chat, tokenize=False)

    # Generate
    output = llm.generate([prompt], sampling_params)
    return output[0].outputs[0].text.strip()

# Batch moderation
chats = [
    [{"role": "user", "content": "How do I make a bomb?"}],
    [{"role": "user", "content": "What's the weather like?"}],
    [{"role": "user", "content": "Tell me about drugs"}]
]

prompts = [tokenizer.apply_chat_template(c, tokenize=False) for c in chats]
results = llm.generate(prompts, sampling_params)

for i, result in enumerate(results):
    print(f"Chat {i}: {result.outputs[0].text.strip()}")

Throughput: roughly 50-100 requests/second on a single A100
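The throughput figure above can be turned into a rough capacity estimate. This sketch assumes traffic spreads evenly across replicas and that each GPU is run below its peak rate. `gpus_needed` is a hypothetical helper, and the default 70% headroom is an illustrative assumption, not a measured value.

```python
import math


def gpus_needed(target_rps: float, per_gpu_rps: float, headroom: float = 0.7) -> int:
    """GPUs required to serve target_rps with each GPU held at `headroom` of its peak."""
    if target_rps <= 0:
        return 0
    if per_gpu_rps <= 0 or not 0 < headroom <= 1:
        raise ValueError("per_gpu_rps must be positive and headroom must be in (0, 1]")
    return math.ceil(target_rps / (per_gpu_rps * headroom))


# Size a 400 requests/second service at both ends of the quoted range.
for per_gpu in (50, 100):
    print(f"{per_gpu} req/s per GPU -> {gpus_needed(400, per_gpu)} GPUs")
```

At 400 requests/second this gives 12 GPUs at the low end of the range and 6 at the high end, which shows how strongly the estimate depends on measuring throughput for your own prompt lengths.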

Workflow 4: API Endpoint (FastAPI)

Serve moderation as an API:

from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "meta-llama/LlamaGuard-7b"

app = FastAPI()
llm = LLM(model=model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)  # applies the chat template
sampling_params = SamplingParams(temperature=0.0, max_tokens=100)

class ModerationRequest(BaseModel):
    messages: list  # [{"role": "user", "content": "..."}]

@app.post("/moderate")
def moderate_endpoint(request: ModerationRequest):
    prompt = tokenizer.apply_chat_template(request.messages, tokenize=False)
    output = llm.generate([prompt], sampling_params)[0]

    result = output.outputs[0].text.strip()
    is_safe = result.startswith("safe")
    # The category code, when present, is on the second line of the reply
    lines = result.split("\n")
    category = lines[1] if not is_safe and len(lines) > 1 else None

    return {
        "safe": is_safe,
        "category": category,
        "full_output": result
    }

# Run: uvicorn api:app --host 0.0.0.0 --port 8000

Usage:

curl -X POST http://localhost:8000/moderate \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "How do I hack into a system?"}]}'

# Response: {"safe": false, "category": "S6", "full_output": "unsafe\nS6"}
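The same endpoint can be called from Python. The sketch below uses only the standard library and assumes the service runs at `http://localhost:8000/moderate` and returns the JSON shape shown above. `build_request`, `parse_response`, and `moderate_remote` are hypothetical helper names. Request construction and response parsing are split out so they can be checked without a running server.

```python
import json
import urllib.request

DEFAULT_URL = "http://localhost:8000/moderate"  # assumed local deployment


def build_request(messages, url=DEFAULT_URL):
    # Encode the chat as the JSON body expected by the /moderate endpoint.
    payload = json.dumps({"messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_response(body: bytes):
    # Return (safe, category); category is None for safe content.
    data = json.loads(body)
    return data["safe"], data.get("category")


def moderate_remote(messages, url=DEFAULT_URL, timeout=5.0):
    # Performs the network call; requires the API service to be running.
    with urllib.request.urlopen(build_request(messages, url), timeout=timeout) as resp:
        return parse_response(resp.read())


sample = b'{"safe": false, "category": "S6", "full_output": "unsafe\\nS6"}'
print(parse_response(sample))
```

A short timeout is worth keeping in the client, so the calling application can decide whether to fail open or closed when the moderation service is slow.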

Workflow 5: NeMo Guardrails Integration

Use with NVIDIA NeMo Guardrails:

# NOTE: verify the import path and flow names below against your installed
# nemoguardrails version; they may differ between releases.
from nemoguardrails import RailsConfig, LLMRails
from nemoguardrails.integrations.llama_guard import LlamaGuard

# Configure NeMo Guardrails
config = RailsConfig.from_content("""
models:
  - type: main
    engine: openai
    model: gpt-4

rails:
  input:
    flows:
      - llamaguard check input
  output:
    flows:
      - llamaguard check output
""")

# Add the LlamaGuard integration
llama_guard = LlamaGuard(model_path="meta-llama/LlamaGuard-7b")
rails = LLMRails(config)
rails.register_action(llama_guard.check_input, name="llamaguard check input")
rails.register_action(llama_guard.check_output, name="llamaguard check output")

# Generate with automatic moderation
response = rails.generate(messages=[
    {"role": "user", "content": "How do I make a weapon?"}
])
# Blocked automatically by LlamaGuard

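When to Use vs. Alternatives

Use LlamaGuard when you:

Need a pretrained moderation model
Want high accuracy (94-95%)
Have GPU resources (7-8B model)
Need fine-grained safety categories
Are building production-grade LLM applications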

Model versions:

LlamaGuard 1 (7B): original, 6 categories
LlamaGuard 2 (8B): improved, 11 categories
LlamaGuard 3 (8B): latest (2024), enhanced, 14 categories

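Use an alternative instead:

OpenAI Moderation API: simpler, API-based, free
Perspective API: Google's toxicity detection
NeMo Guardrails: a more comprehensive safety framework
Constitutional AI: safety applied at training time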

Common Issues

Issue: Model access denied

Log in to HuggingFace:

huggingface-cli login
# Enter your token

Accept the license on the model page: https://huggingface.co/meta-llama/LlamaGuard-7b

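Issue: High latency (>500ms)

Use vLLM for up to a 10x speedup:

from vllm import LLM
llm = LLM(model="meta-llama/LlamaGuard-7b")
# Latency: ~500ms → ~50ms

Enable tensor parallelism:

llm = LLM(model="meta-llama/LlamaGuard-7b", tensor_parallel_size=2)
# Up to 2x faster on 2 GPUs (scaling is rarely perfectly linear)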
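Latency figures like those above vary with hardware and prompt length, so it helps to measure them on your own deployment. This is a minimal timing sketch, assuming the moderation call is a plain Python callable. `measure_latency` is a hypothetical helper, and the p95 uses a simple nearest-rank estimate.

```python
import math
import statistics
import time


def measure_latency(fn, inputs, warmup=1):
    """Return median and p95 latency in milliseconds for calling fn on each input."""
    if not inputs:
        raise ValueError("inputs must not be empty")
    for item in inputs[:warmup]:
        fn(item)  # warm-up calls are not timed
    samples = []
    for item in inputs:
        start = time.perf_counter()
        fn(item)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    # Nearest-rank p95 on the sorted samples.
    p95_index = min(len(samples) - 1, math.ceil(0.95 * len(samples)) - 1)
    return {
        "p50_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
        "n": len(samples),
    }


# Demonstration with a stand-in function; replace it with your moderation call.
stats = measure_latency(lambda text: text.upper(), ["hello"] * 20)
print(stats)
```

Comparing the p50 and p95 values before and after switching backends gives a more reliable picture than a single timed request.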

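Issue: False positives

Use threshold-based filtering:

import torch

def classify_with_threshold(chat, threshold=0.9):
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
    # Generate a single token and keep its scores
    out = model.generate(input_ids=input_ids, max_new_tokens=1,
                         return_dict_in_generate=True, output_scores=True)
    # Probability of the "unsafe" token (assumes the verdict is the first generated token)
    unsafe_token_id = tokenizer.encode("unsafe", add_special_tokens=False)[0]
    unsafe_prob = torch.softmax(out.scores[0][0], dim=-1)[unsafe_token_id]
    # Flag only at high confidence
    return "unsafe" if unsafe_prob > threshold else "safe"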
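The 0.9 threshold is a starting point; a labeled validation set lets you choose it deliberately. The sketch below assumes you have already collected an "unsafe" probability and a ground-truth label for each example. The function names, the candidate thresholds, and the 5% false-positive budget are all illustrative assumptions.

```python
def false_positive_rate(unsafe_probs, is_unsafe, threshold):
    """Share of truly safe items that would be flagged at this threshold."""
    safe_scores = [p for p, y in zip(unsafe_probs, is_unsafe) if not y]
    if not safe_scores:
        return 0.0
    return sum(p > threshold for p in safe_scores) / len(safe_scores)


def recall(unsafe_probs, is_unsafe, threshold):
    """Share of truly unsafe items that are still caught at this threshold."""
    unsafe_scores = [p for p, y in zip(unsafe_probs, is_unsafe) if y]
    if not unsafe_scores:
        return 0.0
    return sum(p > threshold for p in unsafe_scores) / len(unsafe_scores)


def pick_threshold(unsafe_probs, is_unsafe, max_fpr=0.05,
                   candidates=(0.5, 0.6, 0.7, 0.8, 0.9, 0.95)):
    """Lowest candidate whose false-positive rate fits the budget (keeps recall high)."""
    for t in sorted(candidates):
        if false_positive_rate(unsafe_probs, is_unsafe, t) <= max_fpr:
            return t
    return max(candidates)


# Toy validation data: model probabilities and ground-truth labels.
probs = [0.95, 0.85, 0.55, 0.65, 0.10, 0.20, 0.99, 0.30]
labels = [True, True, False, False, False, False, True, False]
best = pick_threshold(probs, labels)
print(best, false_positive_rate(probs, labels, best), recall(probs, labels, best))
```

Raising the threshold trades missed unsafe content for fewer false alarms, so it is worth reporting both numbers when tuning.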

Issue: GPU out of memory

Use 8-bit quantization:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Requires the bitsandbytes package
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto"
)
# Memory: 14GB → 7GB

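Advanced Topics

Custom categories: see references/custom-categories.md for fine-tuning LlamaGuard with domain-specific safety categories.

Performance benchmarks: see references/benchmarks.md for accuracy comparisons with other moderation APIs and for latency optimization.

Deployment guide: see references/deployment.md for Sagemaker, Kubernetes, and scaling strategies.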

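Hardware Requirements

GPU: NVIDIA T4/A10/A100
VRAM:
- FP16: 14GB (7B model)
- INT8: 7GB (quantized)
- INT4: 4GB (4-bit quantization, as used in QLoRA)
CPU: possible but slow (about 10x the latency)
Throughput: 50-100 requests/second (A100)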
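The VRAM figures above follow from multiplying the parameter count by the bytes stored per parameter. The sketch below reproduces that arithmetic for a 7B model. `weight_memory_gb` is a hypothetical helper that counts weights only and uses 1 GB = 10^9 bytes. Activations, the KV cache, and framework overhead come on top, which may be why the INT4 figure above is listed as 4GB.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory for the model weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{label}: {weight_memory_gb(7e9, bits):.1f} GB")
# FP16: 14.0 GB
# INT8: 7.0 GB
# INT4: 3.5 GB
```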

Latency (single GPU):

HuggingFace Transformers: 300-500ms
vLLM: 50-100ms
Batched (vLLM): 20-50ms per request

Resources

HuggingFace:
Paper: https://ai.meta.com/research/publications/llama-guard-llm-based-input-output-safeguard-for-human-ai-conversations/
Integrations: vLLM, Sagemaker, NeMo Guardrails
Accuracy: 94.5% (prompts), 95.3% (responses)