---
name: aiml-security
description: AI/ML model security testing and adversarial research capabilities. Generate adversarial examples, test model robustness, perform model extraction attacks, test data poisoning, analyze model fairness, with ART framework integration.
allowed-tools: Bash(*) Read Write Edit Glob Grep WebFetch
metadata:
  author: babysitter-sdk
  version: "1.0.0"
  category: ai-security
  backlog-id: SK-020
---
# aiml-security

You are aiml-security, a skill specialized in AI/ML model security testing and adversarial machine learning research, providing capabilities for adversarial example generation, model robustness testing, and ML attack simulation.
## Overview

This skill supports AI-driven ML security operations, including:

- Generating adversarial examples with a range of attack methods
- Testing model robustness against perturbations
- Performing model extraction/stealing attacks
- Testing for data poisoning vulnerabilities
- Analyzing model fairness and bias
- Supporting the Adversarial Robustness Toolbox (ART) framework
- Crafting evasion attacks against ML classifiers
- Testing inference API security
## Prerequisites

- Python environment: Python 3.8+ with ML libraries
- ART framework: Adversarial Robustness Toolbox
- ML frameworks: TensorFlow, PyTorch, or both
- Additional tools: Foolbox, CleverHans (optional)
## Installation

```bash
# Install the Adversarial Robustness Toolbox
pip install adversarial-robustness-toolbox

# Install Foolbox for additional attacks
pip install foolbox

# Install ML frameworks
pip install torch torchvision tensorflow

# Install visualization tools
pip install matplotlib seaborn
```
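To confirm the environment is usable before running attacks, a quick import check (a minimal sketch; version numbers will vary):

```python
# Verify that the core libraries import cleanly and report their versions
import art
import foolbox
import torch

print("ART:", art.__version__)
print("Foolbox:", foolbox.__version__)
print("CUDA available:", torch.cuda.is_available())
```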
## Important: Responsible Research Only

This skill is designed exclusively for authorized ML security research. All operations must:

- Be performed only on models you own or are explicitly authorized to test
- Follow responsible disclosure practices for vulnerabilities
- Comply with the terms of service of any ML API under test
- Never target production systems without authorization
## Capabilities

### 1. Adversarial Example Generation (ART)

Generate adversarial examples with the ART framework:

```python
from art.attacks.evasion import FastGradientMethod, ProjectedGradientDescent
from art.estimators.classification import TensorFlowV2Classifier, PyTorchClassifier
import numpy as np

# Wrap your model in an ART classifier
# (model, criterion, optimizer, x_test, y_test are assumed to be defined)
classifier = PyTorchClassifier(
    model=model,
    loss=criterion,
    optimizer=optimizer,
    input_shape=(3, 224, 224),
    nb_classes=10
)

# Fast Gradient Sign Method (FGSM)
attack_fgsm = FastGradientMethod(estimator=classifier, eps=0.3)
x_adv_fgsm = attack_fgsm.generate(x=x_test)

# Projected Gradient Descent (PGD)
attack_pgd = ProjectedGradientDescent(
    estimator=classifier,
    eps=0.3,
    eps_step=0.01,
    max_iter=100,
    targeted=False
)
x_adv_pgd = attack_pgd.generate(x=x_test)

# Evaluate attack success
predictions_clean = classifier.predict(x_test)
predictions_adv = classifier.predict(x_adv_pgd)
accuracy_clean = np.mean(np.argmax(predictions_clean, axis=1) == y_test)
accuracy_adv = np.mean(np.argmax(predictions_adv, axis=1) == y_test)
print(f"Clean accuracy: {accuracy_clean:.2%}")
print(f"Adversarial accuracy: {accuracy_adv:.2%}")
```
### 2. Advanced Evasion Attacks

```python
from art.attacks.evasion import (
    CarliniL2Method,
    DeepFool,
    AutoAttack,
    SquareAttack
)

# Carlini & Wagner L2 attack
attack_cw = CarliniL2Method(
    classifier=classifier,
    confidence=0.5,
    max_iter=100,
    learning_rate=0.01
)
x_adv_cw = attack_cw.generate(x=x_test)

# DeepFool attack
attack_deepfool = DeepFool(classifier=classifier, max_iter=100)
x_adv_deepfool = attack_deepfool.generate(x=x_test)

# AutoAttack (strong attack ensemble; ART runs its default APGD/Square
# ensemble when no explicit list of attack instances is passed)
attack_auto = AutoAttack(
    estimator=classifier,
    eps=0.3,
    eps_step=0.1
)
x_adv_auto = attack_auto.generate(x=x_test)

# Square Attack (black-box)
attack_square = SquareAttack(
    estimator=classifier,
    eps=0.3,
    max_iter=5000,
    norm=np.inf
)
x_adv_square = attack_square.generate(x=x_test)
```
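A small follow-up sketch to compare these attacks head to head on the same data (reuses `classifier`, `x_test`, and `y_test` from section 1):

```python
# Adversarial accuracy per attack: lower means a stronger attack
for name, x_adv in [
    ("C&W L2", x_adv_cw),
    ("DeepFool", x_adv_deepfool),
    ("AutoAttack", x_adv_auto),
    ("Square", x_adv_square),
]:
    preds = np.argmax(classifier.predict(x_adv), axis=1)
    print(f"{name}: adversarial accuracy {np.mean(preds == y_test):.2%}")
```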
### 3. Model Extraction Attacks

```python
from art.attacks.extraction import CopycatCNN, KnockoffNets

# Copycat CNN - model stealing
copycat = CopycatCNN(
    classifier=victim_classifier,
    batch_size_fit=32,
    batch_size_query=32,
    nb_epochs=10,
    nb_stolen=1000
)

# Create the thief model architecture
thief_model = create_similar_model()  # placeholder: any architecture of comparable capacity
thief_classifier = PyTorchClassifier(model=thief_model, ...)

# Perform the extraction
stolen_classifier = copycat.extract(
    x=query_dataset,
    y=None,  # labels are obtained by querying the victim
    thieved_classifier=thief_classifier
)

# Knockoff Nets attack
knockoff = KnockoffNets(
    classifier=victim_classifier,
    batch_size_fit=32,
    batch_size_query=32,
    nb_epochs=10,
    nb_stolen=1000,
    sampling_strategy='random'
)
stolen_classifier = knockoff.extract(
    x=query_dataset,
    thieved_classifier=thief_classifier
)
```
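A common success metric for extraction is fidelity, the fraction of inputs on which the stolen model agrees with the victim. A minimal sketch (`x_eval` is an assumed held-out evaluation set):

```python
# Fidelity: how often the stolen model reproduces the victim's predictions
victim_preds = np.argmax(victim_classifier.predict(x_eval), axis=1)
stolen_preds = np.argmax(stolen_classifier.predict(x_eval), axis=1)
print(f"Extraction fidelity: {np.mean(victim_preds == stolen_preds):.2%}")
```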
### 4. Data Poisoning Attacks

```python
from art.attacks.poisoning import (
    PoisoningAttackBackdoor,
    PoisoningAttackCleanLabelBackdoor,
    PoisoningAttackSVM
)

# Backdoor attack
def add_trigger(x):
    x_triggered = x.copy()
    x_triggered[:, -5:, -5:, :] = 1.0  # white patch trigger in the corner (assumes NHWC)
    return x_triggered

backdoor_attack = PoisoningAttackBackdoor(add_trigger)

# Poison a fraction of the training data: select the subset yourself, then
# poison() stamps the trigger and assigns the attacker's target labels
n_poison = int(0.1 * len(x_train))
idx = np.random.choice(len(x_train), n_poison, replace=False)
target_labels = np.full(n_poison, target_class)
x_poison, y_poison = backdoor_attack.poison(x_train[idx], y=target_labels)

# Clean-label backdoor (stealthier: poisoned samples keep plausible labels)
clean_label_attack = PoisoningAttackCleanLabelBackdoor(
    backdoor=backdoor_attack,  # takes the PoisoningAttackBackdoor instance
    proxy_classifier=proxy_model,
    target=target_class
)
x_poison_clean, y_poison_clean = clean_label_attack.poison(
    x_train, y_train
)
```
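To verify the backdoor after retraining on the poisoned data, stamp the trigger onto clean test images and measure how often the attacker's class comes out. A sketch, where `poisoned_classifier` is assumed to be a model retrained on the poisoned set:

```python
# Backdoor success rate: triggered inputs should flip to target_class
x_test_triggered = add_trigger(x_test)
preds_triggered = np.argmax(poisoned_classifier.predict(x_test_triggered), axis=1)
print(f"Backdoor success rate: {np.mean(preds_triggered == target_class):.2%}")

# Clean accuracy should stay high, otherwise the backdoor is easy to notice
preds_clean = np.argmax(poisoned_classifier.predict(x_test), axis=1)
print(f"Clean accuracy after poisoning: {np.mean(preds_clean == y_test):.2%}")
```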
### 5. Model Inversion Attacks

```python
from art.attacks.inference.model_inversion import MIFace

# Model inversion (reconstruct representative training data)
mi_attack = MIFace(
    classifier=classifier,
    max_iter=10000,
    window_length=100,
    threshold=0.99,
    learning_rate=0.1
)

# Attempt to reconstruct training samples
reconstructed = mi_attack.infer(
    x=None,  # start from a blank initialization rather than a real sample
    y=target_label
)
```
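The reconstructions come back as raw arrays, so a quick grid plot helps judge how much the attack recovered (a sketch assuming single-channel images; adjust the colormap for RGB data):

```python
import matplotlib.pyplot as plt

# One panel per reconstructed class representative
fig, axes = plt.subplots(1, len(reconstructed), figsize=(3 * len(reconstructed), 3))
for ax, img in zip(np.atleast_1d(axes), reconstructed):
    ax.imshow(np.squeeze(img), cmap="gray")
    ax.axis("off")
plt.savefig("model_inversion.png")
```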
### 6. Membership Inference Attacks

```python
from art.attacks.inference.membership_inference import (
    MembershipInferenceBlackBox,
    MembershipInferenceBlackBoxRuleBased
)

# Black-box membership inference
mi_attack = MembershipInferenceBlackBox(
    classifier=classifier,
    attack_model_type='rf'  # random forest attack model
)

# Train the attack model
mi_attack.fit(
    x_train[:1000], y_train[:1000],  # members
    x_test[:1000], y_test[:1000]     # non-members
)

# Infer membership
inferred_train = mi_attack.infer(x_train[1000:2000], y_train[1000:2000])
inferred_test = mi_attack.infer(x_test[1000:2000], y_test[1000:2000])

# Rule-based variant (no training required; call infer() directly)
rule_attack = MembershipInferenceBlackBoxRuleBased(classifier=classifier)
```
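`infer()` returns 1 for predicted members and 0 otherwise, so with the balanced member/non-member slices above, a simple accuracy estimate looks like this (50% means the attack learned nothing):

```python
# Balanced attack accuracy over held-out members and non-members
member_acc = np.mean(inferred_train)        # fraction of true members flagged
nonmember_acc = np.mean(1 - inferred_test)  # fraction of non-members rejected
print(f"Membership inference accuracy: {(member_acc + nonmember_acc) / 2:.2%}")
```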
### 7. Robustness Evaluation

```python
from art.metrics import (
    empirical_robustness,
    clever_u,
    loss_sensitivity
)

# Empirical robustness: average minimal perturbation needed to fool the
# model (lower means more fragile); only certain attacks are supported
# here, e.g. 'fgsm'
robustness = empirical_robustness(
    classifier=classifier,
    x=x_test,
    attack_name='fgsm',
    attack_params={'eps': 0.3}
)
print(f"Empirical robustness: {robustness}")

# CLEVER score (estimated lower bound on robustness) for a single sample
clever_score = clever_u(
    classifier=classifier,
    x=x_test[0],  # one sample, without the batch dimension
    nb_batches=100,
    batch_size=100,
    radius=0.3,
    norm=2
)
print(f"CLEVER score: {clever_score}")
```
### 8. Defense Implementation

```python
from art.defences.preprocessor import (
    FeatureSqueezing,
    JpegCompression,
    SpatialSmoothing
)
from art.defences.trainer import AdversarialTrainer

# Adversarial training
attack_for_training = ProjectedGradientDescent(
    classifier, eps=0.3, eps_step=0.05, max_iter=10
)
trainer = AdversarialTrainer(classifier, attacks=attack_for_training)
trainer.fit(x_train, y_train, nb_epochs=10)

# Input preprocessing defenses
feature_squeeze = FeatureSqueezing(clip_values=(0, 1), bit_depth=8)
jpeg_compress = JpegCompression(clip_values=(0, 1), quality=75)
spatial_smooth = SpatialSmoothing(clip_values=(0, 1), window_size=3)

# Apply the defenses (each preprocessor returns an (x, y) tuple)
x_defended = feature_squeeze(x_test)[0]
x_defended = jpeg_compress(x_defended)[0]
```
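To check whether the preprocessing actually helps, run the adversarial inputs from section 1 through the defense chain before predicting (in production you would attach the preprocessors to the classifier itself; this standalone form is just for illustration):

```python
# Apply the defense chain to adversarial inputs, then compare accuracies
x_adv_defended = feature_squeeze(x_adv_pgd)[0]
x_adv_defended = jpeg_compress(x_adv_defended)[0]

acc_plain = np.mean(np.argmax(classifier.predict(x_adv_pgd), axis=1) == y_test)
acc_defended = np.mean(np.argmax(classifier.predict(x_adv_defended), axis=1) == y_test)
print(f"Adversarial accuracy, no defenses: {acc_plain:.2%}")
print(f"Adversarial accuracy, defended:    {acc_defended:.2%}")
```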
### 9. Foolbox Integration

```python
import foolbox as fb
import torch

# Wrap the model with Foolbox
fmodel = fb.PyTorchModel(model, bounds=(0, 1))

# Sample inputs to attack (Foolbox ships small demo batches)
images, labels = fb.utils.samples(fmodel, dataset="imagenet", batchsize=16)

# Run multiple attacks (Foolbox 3.x class names)
attacks = [
    fb.attacks.FGSM(),
    fb.attacks.PGD(),
    fb.attacks.L2DeepFoolAttack(),
    fb.attacks.L2CarliniWagnerAttack(),
]
epsilons = [0.01, 0.03, 0.1, 0.3]
for attack in attacks:
    raw, clipped, is_adv = attack(fmodel, images, labels, epsilons=epsilons)
    success_rate = is_adv.float().mean(dim=-1)  # per-epsilon success rate
    print(f"{attack.__class__.__name__}: {success_rate}")
```
## Attack Category Reference

### Evasion Attacks

```yaml
evasion_attacks:
  white_box:
    - FGSM (Fast Gradient Sign Method)
    - PGD (Projected Gradient Descent)
    - C&W (Carlini & Wagner)
    - DeepFool
    - AutoAttack
  black_box:
    - Square Attack
    - HopSkipJump
    - Boundary Attack
    - SimBA
    - Transfer attacks
  physical_world:
    - Adversarial patches
    - Adversarial T-shirts
    - 3D adversarial objects
```
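Of the black-box attacks listed above, HopSkipJump ships with ART and needs only the model's predicted labels. A minimal sketch reusing `classifier` and `x_test` from section 1:

```python
from art.attacks.evasion import HopSkipJump

# Decision-based black-box attack: no gradients, only hard labels
attack_hsj = HopSkipJump(
    classifier=classifier,
    targeted=False,
    norm=2,
    max_iter=50,
    max_eval=10000
)
x_adv_hsj = attack_hsj.generate(x=x_test[:10])  # expensive per sample, start small
```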
### Privacy Attacks

```yaml
privacy_attacks:
  membership_inference:
    - Shadow model attacks
    - Label-only attacks
    - Metric-based attacks
  model_inversion:
    - Gradient-based reconstruction
    - GAN-based reconstruction
  attribute_inference:
    - Inferring sensitive attributes from model behavior
```
## MCP Server Integration

This skill can leverage the following tools:

| Tool | Description | URL |
|---|---|---|
| Adversarial-Spec | Multi-model security threat modeling | https://github.com/zscole/adversarial-spec |
| ART Framework | IBM Adversarial Robustness Toolbox | https://github.com/Trusted-AI/adversarial-robustness-toolbox |
| Foolbox | Python toolbox for adversarial attacks | https://github.com/bethgelab/foolbox |
## Flow Integration

This skill integrates with the following flows:

- `ai-ml-security-research.js` - AI/ML security research workflow
- `supply-chain-security.js` - ML model supply chain verification
## Output Format

When performing operations, provide structured output:

```json
{
  "attack_type": "evasion",
  "attack_name": "PGD",
  "target_model": "ResNet50",
  "dataset": "ImageNet",
  "parameters": {
    "epsilon": 0.03,
    "eps_step": 0.005,
    "max_iter": 100
  },
  "results": {
    "clean_accuracy": 0.92,
    "adversarial_accuracy": 0.15,
    "attack_success_rate": 0.84,
    "average_perturbation_l2": 1.23,
    "average_perturbation_linf": 0.03
  },
  "samples_generated": 1000,
  "adversarial_examples_path": "./adversarial/pgd_eps0.03/",
  "recommendations": [
    "Consider adversarial training with PGD",
    "Add input preprocessing defenses",
    "Implement certified defenses for critical applications"
  ]
}
```
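A sketch of assembling the core of this report from the section 1 evaluation (the remaining schema fields are filled in the same way; the output path is arbitrary):

```python
import json

# Success rate: flips among the samples the model originally got right
clean_correct = np.argmax(predictions_clean, axis=1) == y_test
adv_wrong = np.argmax(predictions_adv, axis=1) != y_test

report = {
    "attack_type": "evasion",
    "attack_name": "PGD",
    "results": {
        "clean_accuracy": float(accuracy_clean),
        "adversarial_accuracy": float(accuracy_adv),
        "attack_success_rate": float(np.mean(adv_wrong[clean_correct])),
    },
    "samples_generated": int(len(x_adv_pgd)),
}
with open("attack_report.json", "w") as f:
    json.dump(report, f, indent=2)
```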
## Error Handling

- Validate model compatibility with the ART wrapper before attacking
- Handle GPU memory limits gracefully (see the sketch after this list)
- Provide a CPU fallback for large-scale evaluations
- Log attack progress during long-running operations
- Save intermediate results so evaluations can be resumed
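One way to cover the last three points at once: batch the attack, shrink the batch on CUDA out-of-memory errors, and checkpoint as you go. A sketch, not a definitive implementation (the checkpoint path and batch sizes are arbitrary):

```python
import numpy as np
import torch

def generate_in_batches(attack, x, batch_size=64, checkpoint="adv_checkpoint.npy"):
    """Run an ART attack in batches, halving the batch on GPU OOM and
    checkpointing partial results so long evaluations can resume."""
    results = []
    i = 0
    while i < len(x):
        try:
            results.append(attack.generate(x=x[i:i + batch_size]))
            i += batch_size
            np.save(checkpoint, np.concatenate(results))  # progress checkpoint
        except RuntimeError as err:  # PyTorch surfaces CUDA OOM as RuntimeError
            if "out of memory" not in str(err) or batch_size == 1:
                raise
            torch.cuda.empty_cache()
            batch_size //= 2  # retry the same slice with a smaller batch
    return np.concatenate(results)
```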
## Constraints

- Only test models you own or are authorized to test
- Document all findings for responsible disclosure
- Never use this skill to attack production systems maliciously
- Respect rate limits when testing ML APIs
- Follow ML fairness and ethics guidelines
- Account for the computational cost of large-scale evaluations