AI/ML Model Security Testing Skill: aiml-security

An AI/ML model security testing and adversarial research toolkit that provides comprehensive security testing capabilities, including adversarial example generation, model robustness evaluation, data poisoning detection, and model extraction attack simulation. Supports the ART framework and Foolbox integration; suitable for machine learning security research, model vulnerability assessment, and hardening of AI systems. Keywords: AI security testing, adversarial machine learning, model robustness, adversarial example generation, data poisoning detection, model extraction attacks, ART framework, FGSM attack, PGD attack, membership inference attack


name: aiml-security description: AI/ML model security testing and adversarial research capabilities. Generate adversarial examples, test model robustness, perform model extraction attacks, test data poisoning, analyze model fairness, with ART framework integration. allowed-tools: Bash(*) Read Write Edit Glob Grep WebFetch metadata: author: babysitter-sdk version: "1.0.0" category: ai-security backlog-id: SK-020

aiml-security

You are aiml-security - a skill dedicated to AI/ML model security testing and adversarial machine learning research, providing capabilities for adversarial example generation, model robustness testing, and ML attack simulation.

Overview

This skill supports AI-driven ML security operations, including:

  • Generating adversarial examples with a variety of attack methods
  • Testing model robustness against perturbations
  • Performing model extraction/stealing attacks
  • Testing data poisoning vulnerabilities
  • Analyzing model fairness and bias
  • Supporting the Adversarial Robustness Toolbox (ART) framework
  • Crafting evasion attacks against ML classifiers
  • Testing inference API security

Prerequisites

  • Python environment: Python 3.8+ with ML libraries
  • ART framework: Adversarial Robustness Toolbox
  • ML frameworks: TensorFlow, PyTorch, or both
  • Additional tools: Foolbox, CleverHans (optional)

Installation

# Install the Adversarial Robustness Toolbox
pip install adversarial-robustness-toolbox

# Install Foolbox for additional attacks
pip install foolbox

# Install ML frameworks
pip install torch torchvision tensorflow

# Install visualization tools
pip install matplotlib seaborn
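
As a quick sanity check after installation, you can import the core packages and print their versions (a minimal sketch; it simply assumes the pip installs above completed successfully):

# Verify that the core toolkits are importable
import art
import foolbox
import torch

print("ART:", art.__version__)
print("Foolbox:", foolbox.__version__)
print("PyTorch:", torch.__version__)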

Important Notice: Responsible Research Only

This skill is designed exclusively for authorized ML security research scenarios. All operations must:

  • Be performed on models you own or are explicitly authorized to test
  • Follow responsible disclosure practices for vulnerabilities
  • Comply with the terms of service of any ML API under test
  • Never attack production systems without authorization

Capabilities

1. Adversarial Example Generation (ART)

Generate adversarial examples using the ART framework:

from art.attacks.evasion import FastGradientMethod, ProjectedGradientDescent
from art.estimators.classification import TensorFlowV2Classifier, PyTorchClassifier
import numpy as np

# Wrap your model with an ART classifier
classifier = PyTorchClassifier(
    model=model,
    loss=criterion,
    optimizer=optimizer,
    input_shape=(3, 224, 224),
    nb_classes=10
)

# Fast Gradient Sign Method (FGSM)
attack_fgsm = FastGradientMethod(estimator=classifier, eps=0.3)
x_adv_fgsm = attack_fgsm.generate(x=x_test)

# Projected Gradient Descent (PGD)
attack_pgd = ProjectedGradientDescent(
    estimator=classifier,
    eps=0.3,
    eps_step=0.01,
    max_iter=100,
    targeted=False
)
x_adv_pgd = attack_pgd.generate(x=x_test)

# Evaluate the attack success rate
predictions_clean = classifier.predict(x_test)
predictions_adv = classifier.predict(x_adv_pgd)
accuracy_clean = np.mean(np.argmax(predictions_clean, axis=1) == y_test)
accuracy_adv = np.mean(np.argmax(predictions_adv, axis=1) == y_test)
print(f"干净准确率: {accuracy_clean:.2%}")
print(f"对抗准确率: {accuracy_adv:.2%}")

2. Advanced Evasion Attacks

from art.attacks.evasion import (
    CarliniL2Method,
    DeepFool,
    AutoAttack,
    SquareAttack
)

# Carlini & Wagner L2攻击
attack_cw = CarliniL2Method(
    classifier=classifier,
    confidence=0.5,
    max_iter=100,
    learning_rate=0.01
)
x_adv_cw = attack_cw.generate(x=x_test)

# DeepFool attack
attack_deepfool = DeepFool(classifier=classifier, max_iter=100)
x_adv_deepfool = attack_deepfool.generate(x=x_test)

# AutoAttack (strong attack ensemble)
# Note: ART's AutoAttack expects `attacks` to be a list of EvasionAttack objects;
# leaving it at its default uses ART's built-in ensemble.
attack_auto = AutoAttack(
    estimator=classifier,
    eps=0.3,
    eps_step=0.1
)
x_adv_auto = attack_auto.generate(x=x_test)

# Square Attack (black-box)
attack_square = SquareAttack(
    estimator=classifier,
    eps=0.3,
    max_iter=5000,
    norm=np.inf
)
x_adv_square = attack_square.generate(x=x_test)
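
Since C&W and DeepFool aim for minimal perturbations, it is useful to report how large the generated perturbations actually are. A minimal sketch, assuming x_test and one of the x_adv_* arrays above are NumPy arrays of the same shape:

import numpy as np

def perturbation_stats(x_clean, x_adv):
    # Average per-sample L2 and L-infinity perturbation magnitudes
    diff = (x_adv - x_clean).reshape(len(x_clean), -1)
    avg_l2 = np.linalg.norm(diff, axis=1).mean()
    avg_linf = np.abs(diff).max(axis=1).mean()
    return avg_l2, avg_linf

avg_l2, avg_linf = perturbation_stats(x_test, x_adv_cw)
print(f"Average L2 perturbation: {avg_l2:.4f}")
print(f"Average Linf perturbation: {avg_linf:.4f}")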

3. Model Extraction Attacks

from art.attacks.extraction import CopycatCNN, KnockoffNets

# Copycat CNN - model stealing
copycat = CopycatCNN(
    classifier=victim_classifier,
    batch_size_fit=32,
    batch_size_query=32,
    nb_epochs=10,
    nb_stolen=1000
)

# Create the thief (surrogate) model architecture
thief_model = create_similar_model()
thief_classifier = PyTorchClassifier(model=thief_model, ...)

# Perform the extraction
stolen_classifier = copycat.extract(
    x=query_dataset,
    y=None,  # labels are obtained by querying the victim model
    thieved_classifier=thief_classifier
)

# Knockoff Nets attack
knockoff = KnockoffNets(
    classifier=victim_classifier,
    batch_size_fit=32,
    batch_size_query=32,
    nb_epochs=10,
    nb_stolen=1000,
    sampling_strategy='random'
)
stolen_classifier = knockoff.extract(
    x=query_dataset,
    thieved_classifier=thief_classifier
)
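
Extraction quality is commonly measured as fidelity: how often the stolen model agrees with the victim's predicted labels on held-out data. A minimal sketch, assuming victim_classifier, stolen_classifier, and x_test from above:

import numpy as np

# Fidelity: agreement rate between the stolen model and the victim model
victim_preds = np.argmax(victim_classifier.predict(x_test), axis=1)
stolen_preds = np.argmax(stolen_classifier.predict(x_test), axis=1)
fidelity = np.mean(victim_preds == stolen_preds)
print(f"Stolen model fidelity (label agreement with victim): {fidelity:.2%}")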

4. Data Poisoning Attacks

from art.attacks.poisoning import (
    PoisoningAttackBackdoor,
    PoisoningAttackCleanLabelBackdoor,
    PoisoningAttackSVM
)

# Backdoor attack
def add_trigger(x):
    x_triggered = x.copy()
    x_triggered[:, -5:, -5:, :] = 1.0  # white patch trigger
    return x_triggered

backdoor_attack = PoisoningAttackBackdoor(add_trigger)

# Poison the training data
x_poison, y_poison = backdoor_attack.poison(
    x_train, y_train,
    percent_poison=0.1
)

# Clean-label backdoor (stealthier)
clean_label_attack = PoisoningAttackCleanLabelBackdoor(
    backdoor=backdoor_attack,  # expects the PoisoningAttackBackdoor instance created above
    proxy_classifier=proxy_model,
    target=target_class
)
x_poison_clean, y_poison_clean = clean_label_attack.poison(
    x_train, y_train
)
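
To check whether a backdoor actually took hold, train a model on the poisoned data and measure how often triggered test inputs are classified as the attacker's target class. A minimal sketch, assuming classifier, add_trigger, x_poison, y_poison, x_test, y_test, and target_class from above:

import numpy as np

# Train on the poisoned dataset (ART classifiers expose a fit() method)
classifier.fit(x_poison, y_poison, batch_size=64, nb_epochs=10)

# Backdoor success rate: triggered inputs should flip to the target class
x_test_triggered = add_trigger(x_test)
triggered_preds = np.argmax(classifier.predict(x_test_triggered), axis=1)
backdoor_success = np.mean(triggered_preds == target_class)

# Clean accuracy should stay high so the poisoning remains inconspicuous
clean_preds = np.argmax(classifier.predict(x_test), axis=1)
clean_accuracy = np.mean(clean_preds == y_test)

print(f"Backdoor success rate: {backdoor_success:.2%}")
print(f"Clean accuracy after poisoning: {clean_accuracy:.2%}")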

5. Model Inversion Attacks

from art.attacks.inference.model_inversion import (
    MIFace
)

# Model inversion attack (reconstruct training data)
mi_attack = MIFace(
    classifier=classifier,
    max_iter=10000,
    window_length=100,
    threshold=0.99,
    learning_rate=0.1
)

# Attempt to reconstruct training samples
reconstructed = mi_attack.infer(
    x=None,  # start from random noise
    y=target_label
)
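
Inversion results are usually judged visually; since matplotlib was installed above, here is a minimal sketch for saving the reconstructed samples to disk for inspection (assumes reconstructed is an array of image-shaped samples, e.g. grayscale):

import matplotlib.pyplot as plt

# Save each reconstructed sample for manual inspection
for i, sample in enumerate(reconstructed):
    plt.figure(figsize=(2, 2))
    plt.imshow(sample.squeeze(), cmap="gray")
    plt.axis("off")
    plt.savefig(f"mi_reconstruction_{i}.png", bbox_inches="tight")
    plt.close()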

6. Membership Inference Attacks

from art.attacks.inference.membership_inference import (
    MembershipInferenceBlackBox,
    MembershipInferenceBlackBoxRuleBased
)

# Black-box membership inference
mi_attack = MembershipInferenceBlackBox(
    classifier=classifier,
    attack_model_type='rf'  # random forest attack model
)

# Train the attack model
mi_attack.fit(
    x_train[:1000], y_train[:1000],  # members
    x_test[:1000], y_test[:1000]     # non-members
)

# Infer membership
inferred_train = mi_attack.infer(x_train[1000:2000], y_train[1000:2000])
inferred_test = mi_attack.infer(x_test[1000:2000], y_test[1000:2000])

# Rule-based (no attack model training required)
rule_attack = MembershipInferenceBlackBoxRuleBased(classifier=classifier)
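
The infer() calls return per-sample membership predictions, so attack quality can be summarized as accuracy on known members versus non-members. A minimal sketch, assuming infer() returns binary predictions as in the calls above:

import numpy as np

# Members should be flagged as 1, non-members as 0
member_recall = np.mean(inferred_train == 1)
nonmember_recall = np.mean(inferred_test == 0)
balanced_accuracy = (member_recall + nonmember_recall) / 2

print(f"Member recall: {member_recall:.2%}")
print(f"Non-member recall: {nonmember_recall:.2%}")
print(f"Balanced attack accuracy: {balanced_accuracy:.2%}")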

7. Robustness Evaluation

from art.metrics import (
    empirical_robustness,
    clever_u,
    loss_sensitivity
)

# Empirical robustness (lower means more vulnerable)
# Note: empirical_robustness accepts a limited set of attack names, such as 'fgsm'
robustness = empirical_robustness(
    classifier=classifier,
    x=x_test,
    attack_name='fgsm',
    attack_params={'eps': 0.3}
)
print(f"经验鲁棒性: {robustness}")

# CLEVER score (estimated lower bound on robustness)
clever_score = clever_u(
    classifier=classifier,
    x=x_test[0:1],
    nb_batches=100,
    batch_size=100,
    radius=0.3,
    norm=2
)
print(f"CLEVER分数: {clever_score}")

8. Defense Implementation

from art.defences.preprocessor import (
    FeatureSqueezing,
    JpegCompression,
    SpatialSmoothing
)
from art.defences.trainer import AdversarialTrainer

# Adversarial training
attack_for_training = ProjectedGradientDescent(
    classifier, eps=0.3, eps_step=0.05, max_iter=10
)
trainer = AdversarialTrainer(classifier, attacks=attack_for_training)
trainer.fit(x_train, y_train, nb_epochs=10)

# Input preprocessing defenses
feature_squeeze = FeatureSqueezing(clip_values=(0, 1), bit_depth=8)
jpeg_compress = JpegCompression(clip_values=(0, 1), quality=75)
spatial_smooth = SpatialSmoothing(clip_values=(0, 1), window_size=3)

# Apply the defenses
x_defended = feature_squeeze(x_test)[0]
x_defended = jpeg_compress(x_defended)[0]
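
To quantify how much the preprocessing defenses help, compare accuracy on the adversarial examples with and without the defenses applied. A minimal sketch, assuming classifier, x_adv_pgd, and y_test from the earlier sections:

import numpy as np

# Accuracy on raw adversarial examples
preds_adv = np.argmax(classifier.predict(x_adv_pgd), axis=1)

# Accuracy on the same examples after feature squeezing + JPEG compression
x_adv_defended = jpeg_compress(feature_squeeze(x_adv_pgd)[0])[0]
preds_defended = np.argmax(classifier.predict(x_adv_defended), axis=1)

print(f"Adversarial accuracy without defenses: {np.mean(preds_adv == y_test):.2%}")
print(f"Adversarial accuracy with defenses: {np.mean(preds_defended == y_test):.2%}")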

9. Foolbox Integration

import foolbox as fb
import torch

# Wrap the model with Foolbox
fmodel = fb.PyTorchModel(model, bounds=(0, 1))

# Run multiple attacks (Foolbox 3.x attack names)
attacks = [
    fb.attacks.FGSM(),
    fb.attacks.PGD(),
    fb.attacks.L2DeepFoolAttack(),
    fb.attacks.L2CarliniWagnerAttack(),
]

epsilons = [0.01, 0.03, 0.1, 0.3]

for attack in attacks:
    raw, clipped, is_adv = attack(fmodel, images, labels, epsilons=epsilons)
    success_rate = is_adv.float().mean(axis=-1)
    print(f"{attack.__class__.__name__}: {success_rate}")

Attack Category Reference

Evasion Attacks

evasion_attacks:
  white_box:
    - FGSM (Fast Gradient Sign Method)
    - PGD (Projected Gradient Descent)
    - C&W (Carlini & Wagner)
    - DeepFool
    - AutoAttack

  black_box:
    - Square Attack
    - HopSkipJump  # see the sketch after this list
    - Boundary Attack
    - SimBA
    - Transfer attacks

  physical_world:
    - Adversarial patches
    - Adversarial T-shirts
    - 3D adversarial objects
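
As a concrete black-box example, HopSkipJump (listed above) needs only the model's predicted labels. A minimal sketch using ART's implementation, assuming the classifier and x_test from the earlier sections; the small sample count is deliberate because the attack is query-intensive:

from art.attacks.evasion import HopSkipJump

# Decision-based black-box attack: relies solely on predicted labels
attack_hsj = HopSkipJump(
    classifier=classifier,
    targeted=False,
    norm=2,
    max_iter=50,
    max_eval=1000
)
x_adv_hsj = attack_hsj.generate(x=x_test[:10])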

Privacy Attacks

privacy_attacks:
  membership_inference:
    - Shadow model attacks
    - Label-only attacks
    - Metric-based attacks

  model_inversion:
    - Gradient-based reconstruction
    - GAN-based reconstruction

  attribute_inference:
    - Inferring sensitive attributes from model behavior

MCP Server Integration

This skill can leverage the following tools:

Tool              Description                              URL
Adversarial-Spec  Multi-model security threat modeling     https://github.com/zscole/adversarial-spec
ART Framework     IBM Adversarial Robustness Toolbox       https://github.com/Trusted-AI/adversarial-robustness-toolbox
Foolbox           Python toolbox for adversarial attacks   https://github.com/bethgelab/foolbox

Process Integration

This skill integrates with the following processes:

  • ai-ml-security-research.js - AI/ML security research workflow
  • supply-chain-security.js - ML model supply chain verification

Output Format

When performing operations, provide structured output:

{
  "attack_type": "evasion",
  "attack_name": "PGD",
  "target_model": "ResNet50",
  "dataset": "ImageNet",
  "parameters": {
    "epsilon": 0.03,
    "eps_step": 0.005,
    "max_iter": 100
  },
  "results": {
    "clean_accuracy": 0.92,
    "adversarial_accuracy": 0.15,
    "attack_success_rate": 0.84,
    "average_perturbation_l2": 1.23,
    "average_perturbation_linf": 0.03
  },
  "samples_generated": 1000,
  "adversarial_examples_path": "./adversarial/pgd_eps0.03/",
  "recommendations": [
    "考虑使用PGD进行对抗训练",
    "添加输入预处理防御",
    "为关键应用实现认证防御"
  ]
}
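
A small helper can assemble this record directly from the metrics computed in the earlier sections. A minimal sketch; the field names mirror the JSON above, and attack_success_rate follows the (clean - adversarial) / clean convention implied by the example values:

import json

def build_report(attack_name, parameters, clean_acc, adv_acc, n_samples, output_dir):
    # Assemble a structured result record matching the format shown above
    return {
        "attack_type": "evasion",
        "attack_name": attack_name,
        "parameters": parameters,
        "results": {
            "clean_accuracy": round(float(clean_acc), 4),
            "adversarial_accuracy": round(float(adv_acc), 4),
            "attack_success_rate": round(float((clean_acc - adv_acc) / clean_acc), 4),
        },
        "samples_generated": n_samples,
        "adversarial_examples_path": output_dir,
    }

report = build_report("PGD", {"epsilon": 0.3, "eps_step": 0.01, "max_iter": 100},
                      accuracy_clean, accuracy_adv, len(x_test), "./adversarial/pgd_eps0.3/")
print(json.dumps(report, indent=2))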

Error Handling

  • Verify model compatibility with the ART wrappers
  • Handle GPU memory limits gracefully
  • Provide a CPU fallback for large-scale evaluations
  • Log attack progress for long-running operations
  • Save intermediate results so evaluations can be resumed

Constraints

  • Only test models you own or are authorized to test
  • Document all findings for responsible disclosure
  • Never use this skill for malicious attacks on production systems
  • Respect rate limits when testing ML APIs
  • Follow ML fairness and ethics guidelines
  • Account for the computational cost of large-scale evaluations