name: blip-2-vision-language
description: Vision-language pre-training framework that bridges frozen image encoders and large language models. Use when you need image captioning, visual question answering, image-text retrieval, or multimodal chat with state-of-the-art zero-shot performance.
version: 1.0.0
author: Orchestra Research
license: MIT
tags: [multimodal, vision-language, image-captioning, VQA, zero-shot]
dependencies: [transformers>=4.30.0, torch>=1.10.0, Pillow]
BLIP-2: Vision-Language Pre-training

A comprehensive guide to vision-language tasks with Salesforce's BLIP-2, which combines a frozen image encoder with a large language model.
When to Use BLIP-2

Use BLIP-2 when you:
- Need high-quality image captions with natural descriptions
- Are building a visual question answering (VQA) system
- Require zero-shot image-text understanding without task-specific training
- Want to leverage LLM reasoning for vision tasks
- Are building multimodal conversational AI
- Need image-text retrieval or matching
Key features:
- Q-Former architecture: a lightweight query transformer bridges vision and language
- Frozen-backbone efficiency: no fine-tuning of the large vision/language models
- Multiple LLM backends: OPT (2.7B, 6.7B) and FlanT5 (XL, XXL)
- Zero-shot capability: strong performance without task-specific training
- Efficient training: only the Q-Former (~188M parameters) is trained
- State-of-the-art results: beats much larger models on VQA benchmarks
Use an alternative instead:
- LLaVA: for instruction-following multimodal chat
- InstructBLIP: for improved instruction following (BLIP-2's successor)
- GPT-4V/Claude 3: for production multimodal chat (proprietary)
- CLIP: for simple image-text similarity without generation
- Flamingo: for few-shot visual learning
Quick Start

Installation

# HuggingFace Transformers (recommended)
pip install transformers accelerate torch Pillow

# Or the LAVIS library (Salesforce's official library)
pip install salesforce-lavis
Basic Image Captioning
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Load model and processor
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load an image
image = Image.open("photo.jpg").convert("RGB")

# Generate a caption
inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=50)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(caption)
Visual Question Answering

# Ask a question about the image
question = "What color is this car?"
inputs = processor(images=image, text=question, return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=50)
answer = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(answer)
Using the LAVIS Library

import torch
from lavis.models import load_model_and_preprocess
from PIL import Image

# Load the model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip2_opt",
    model_type="pretrain_opt2.7b",
    is_eval=True,
    device=device
)

# Preprocess the image
image = Image.open("photo.jpg").convert("RGB")
image = vis_processors["eval"](image).unsqueeze(0).to(device)

# Captioning
caption = model.generate({"image": image})
print(caption)

# VQA
question = txt_processors["eval"]("What is in this image?")
answer = model.generate({"image": image, "prompt": question})
print(answer)
Core Concepts

Architecture Overview

BLIP-2 architecture:

┌─────────────────────────────────────────────────────────────┐
│                          Q-Former                           │
│  ┌─────────────────────────────────────────────────────┐    │
│  │       Learned queries (32 queries × 768 dims)       │    │
│  └────────────────────────┬────────────────────────────┘    │
│                           │                                 │
│  ┌────────────────────────▼────────────────────────────┐    │
│  │         Cross-attention with image features         │    │
│  └────────────────────────┬────────────────────────────┘    │
│                           │                                 │
│  ┌────────────────────────▼────────────────────────────┐    │
│  │         Self-attention layers (transformer)         │    │
│  └────────────────────────┬────────────────────────────┘    │
└───────────────────────────┼─────────────────────────────────┘
                            │
┌───────────────────────────▼─────────────────────────────────┐
│   Frozen vision encoder     │          Frozen LLM           │
│   (ViT-G/14 from EVA-CLIP)  │         (OPT or FlanT5)       │
└─────────────────────────────────────────────────────────────┘
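The data flow above can be sketched as a tensor-shape walk-through with random stand-in weights (a minimal illustration, not the real implementation; the 1408-dim ViT-g patch features and the 2560-dim OPT-2.7B hidden size are assumptions taken from the published model configs):

```python
import numpy as np

# Frozen ViT output for one 224x224 image: 256 patches + CLS -> [1, 257, 1408]
image_feats = np.random.randn(1, 257, 1408)

# Q-Former: 32 learned queries of dimension 768
queries = np.random.randn(1, 32, 768)

# Cross-attention, sketched as projection + softmax-weighted sum over patches
W_k = np.random.randn(1408, 768)
keys = image_feats @ W_k                            # [1, 257, 768]
attn = queries @ keys.transpose(0, 2, 1)            # [1, 32, 257] attention logits
attn = np.exp(attn - attn.max(-1, keepdims=True))
attn /= attn.sum(-1, keepdims=True)                 # softmax over image patches
query_out = attn @ keys                             # [1, 32, 768] Q-Former output

# Linear projection into the LLM embedding space (2560 for OPT-2.7B)
W_proj = np.random.randn(768, 2560)
llm_tokens = query_out @ W_proj                     # [1, 32, 2560]

print(llm_tokens.shape)  # (1, 32, 2560)
```

The 32 projected query outputs act as soft visual prompt tokens prepended to the frozen LLM's input.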
Model Variants

| Model | LLM backend | Size | Use case |
|---|---|---|---|
| `blip2-opt-2.7b` | OPT-2.7B | ~4GB | General captioning, VQA |
| `blip2-opt-6.7b` | OPT-6.7B | ~8GB | Better reasoning |
| `blip2-flan-t5-xl` | FlanT5-XL | ~5GB | Instruction following |
| `blip2-flan-t5-xxl` | FlanT5-XXL | ~13GB | Best quality |
Q-Former Components

| Component | Description | Parameters |
|---|---|---|
| Learned queries | Fixed set of learnable embeddings | 32 × 768 |
| Image transformer | Cross-attention with visual features | ~108M |
| Text transformer | Text self-attention | ~108M |
| Linear projection | Maps to the LLM dimension | Varies |
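As a sanity check on the first row, the learned queries contribute only a tiny fraction of the trainable weights (the image and text transformers share self-attention layers, which is why the total is ~188M rather than the naive 216M sum):

```python
# Learned queries: 32 query vectors, each of dimension 768
num_queries, hidden_dim = 32, 768
query_params = num_queries * hidden_dim
print(query_params)  # 24576 parameters (~25K, negligible next to the ~188M total)
```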
Advanced Usage

Batch Processing

from PIL import Image
import torch

# Load multiple images
images = [Image.open(f"image_{i}.jpg").convert("RGB") for i in range(4)]
questions = [
    "What does this image show?",
    "Describe the scene.",
    "What is the dominant color?",
    "Are there any people in this image?"
]

# Process the batch
inputs = processor(
    images=images,
    text=questions,
    return_tensors="pt",
    padding=True
).to("cuda", torch.float16)

# Generate
generated_ids = model.generate(**inputs, max_new_tokens=50)
answers = processor.batch_decode(generated_ids, skip_special_tokens=True)
for q, a in zip(questions, answers):
    print(f"Q: {q}\nA: {a}\n")
Controlling Generation

# Tune generation parameters
generated_ids = model.generate(
    **inputs,
    max_new_tokens=100,
    min_length=20,
    num_beams=5,              # beam search
    no_repeat_ngram_size=2,   # avoid repetition
    top_p=0.9,                # nucleus sampling
    temperature=0.7,          # creativity
    do_sample=True,           # enable sampling
)

# For deterministic output
generated_ids = model.generate(
    **inputs,
    max_new_tokens=50,
    num_beams=5,
    do_sample=False,
)
Memory Optimization

# 8-bit quantization
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-6.7b",
    quantization_config=quantization_config,
    device_map="auto"
)

# 4-bit quantization (more aggressive)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xxl",
    quantization_config=quantization_config,
    device_map="auto"
)
Image-Text Matching

# ITM (image-text matching) with LAVIS
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip2_image_text_matching",
    model_type="pretrain",
    is_eval=True,
    device=device
)

raw_image = Image.open("photo.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
text = txt_processors["eval"]("a dog sitting on the grass")

# Get the matching score
itm_output = model({"image": image, "text_input": text}, match_head="itm")
itm_scores = torch.nn.functional.softmax(itm_output, dim=1)
print(f"Match probability: {itm_scores[:, 1].item():.3f}")
Feature Extraction

# Extract image features with the Q-Former
from lavis.models import load_model_and_preprocess

model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_feature_extractor",
    model_type="pretrain",
    is_eval=True,
    device=device
)

raw_image = Image.open("photo.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# Get features
features = model.extract_features({"image": image}, mode="image")
image_embeds = features.image_embeds          # shape: [1, 32, 768]
image_features = features.image_embeds_proj   # projected for matching
Common Workflows

Workflow 1: Image Captioning Pipeline

import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

class ImageCaptioner:
    def __init__(self, model_name="Salesforce/blip2-opt-2.7b"):
        self.processor = Blip2Processor.from_pretrained(model_name)
        self.model = Blip2ForConditionalGeneration.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto"
        )

    def caption(self, image_path: str, prompt: str = None) -> str:
        image = Image.open(image_path).convert("RGB")
        if prompt:
            inputs = self.processor(images=image, text=prompt, return_tensors="pt")
        else:
            inputs = self.processor(images=image, return_tensors="pt")
        inputs = inputs.to("cuda", torch.float16)
        generated_ids = self.model.generate(
            **inputs,
            max_new_tokens=50,
            num_beams=5
        )
        return self.processor.decode(generated_ids[0], skip_special_tokens=True)

    def caption_batch(self, image_paths: list, prompt: str = None) -> list:
        images = [Image.open(p).convert("RGB") for p in image_paths]
        if prompt:
            inputs = self.processor(
                images=images,
                text=[prompt] * len(images),
                return_tensors="pt",
                padding=True
            )
        else:
            inputs = self.processor(images=images, return_tensors="pt", padding=True)
        inputs = inputs.to("cuda", torch.float16)
        generated_ids = self.model.generate(**inputs, max_new_tokens=50)
        return self.processor.batch_decode(generated_ids, skip_special_tokens=True)

# Usage
captioner = ImageCaptioner()

# Single image
caption = captioner.caption("photo.jpg")
print(f"Caption: {caption}")

# With a style prompt
caption = captioner.caption("photo.jpg", "a detailed description of")
print(f"Detailed: {caption}")

# Batch processing
captions = captioner.caption_batch(["img1.jpg", "img2.jpg", "img3.jpg"])
for i, cap in enumerate(captions):
    print(f"Image {i+1}: {cap}")
Workflow 2: Visual Question Answering System

class VisualQA:
    def __init__(self, model_name="Salesforce/blip2-flan-t5-xl"):
        self.processor = Blip2Processor.from_pretrained(model_name)
        self.model = Blip2ForConditionalGeneration.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto"
        )
        self.current_image = None

    def set_image(self, image_path: str):
        """Load an image to ask multiple questions about."""
        self.current_image = Image.open(image_path).convert("RGB")

    def ask(self, question: str) -> str:
        """Ask a question about the current image."""
        if self.current_image is None:
            raise ValueError("No image set. Call set_image() first.")
        # Format the question for FlanT5
        prompt = f"Question: {question} Answer:"
        inputs = self.processor(
            images=self.current_image,
            text=prompt,
            return_tensors="pt"
        ).to("cuda", torch.float16)
        generated_ids = self.model.generate(
            **inputs,
            max_new_tokens=50,
            num_beams=5
        )
        return self.processor.decode(generated_ids[0], skip_special_tokens=True)

    def ask_multiple(self, questions: list) -> dict:
        """Ask several questions about the current image."""
        return {q: self.ask(q) for q in questions}

# Usage
vqa = VisualQA()
vqa.set_image("scene.jpg")

# Ask questions
print(vqa.ask("What objects are in this image?"))
print(vqa.ask("What is the weather like?"))
print(vqa.ask("How many people are there?"))

# Batched questions
results = vqa.ask_multiple([
    "What is the main subject?",
    "What is the dominant color?",
    "Is this indoors or outdoors?"
])
Workflow 3: Image Search/Retrieval

import torch
import numpy as np
from PIL import Image
from lavis.models import load_model_and_preprocess

class ImageSearchEngine:
    def __init__(self):
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model, self.vis_processors, self.txt_processors = load_model_and_preprocess(
            name="blip2_feature_extractor",
            model_type="pretrain",
            is_eval=True,
            device=self.device
        )
        self.image_features = []
        self.image_paths = []

    def index_images(self, image_paths: list):
        """Build an index from a list of images."""
        self.image_paths = image_paths
        for path in image_paths:
            image = Image.open(path).convert("RGB")
            image = self.vis_processors["eval"](image).unsqueeze(0).to(self.device)
            with torch.no_grad():
                features = self.model.extract_features({"image": image}, mode="image")
            # Use the projected features for matching
            self.image_features.append(
                features.image_embeds_proj.mean(dim=1).cpu().numpy()
            )
        self.image_features = np.vstack(self.image_features)

    def search(self, query: str, top_k: int = 5) -> list:
        """Search indexed images with a text query."""
        # Get text features
        text = self.txt_processors["eval"](query)
        text_input = {"text_input": [text]}
        with torch.no_grad():
            text_features = self.model.extract_features(text_input, mode="text")
        text_embeds = text_features.text_embeds_proj[:, 0].cpu().numpy()
        # Compute similarities
        similarities = np.dot(self.image_features, text_embeds.T).squeeze()
        top_indices = np.argsort(similarities)[::-1][:top_k]
        return [(self.image_paths[i], similarities[i]) for i in top_indices]

# Usage
engine = ImageSearchEngine()
engine.index_images(["img1.jpg", "img2.jpg", "img3.jpg", ...])

# Search
results = engine.search("a sunset over the ocean", top_k=5)
for path, score in results:
    print(f"{path}: {score:.3f}")
Output Formats

Generation Output

# generate() returns token IDs directly
generated_ids = model.generate(**inputs, max_new_tokens=50)
# shape: [batch_size, sequence_length]

# Decode to text
text = processor.batch_decode(generated_ids, skip_special_tokens=True)
# returns: a list of strings

Feature Extraction Output

# Q-Former outputs
features = model.extract_features({"image": image}, mode="image")
features.image_embeds        # [B, 32, 768] - Q-Former output
features.image_embeds_proj   # [B, 32, 256] - projected for matching
features.text_embeds         # [B, seq_len, 768] - text features
features.text_embeds_proj    # [B, 256] - projected text (CLS)
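To score an image against a text with these projected features, BLIP-2's contrastive head takes the maximum cosine similarity over the 32 query embeddings. A minimal numpy sketch, with random tensors standing in for the real `image_embeds_proj` / `text_embeds_proj` values:

```python
import numpy as np

def itc_score(image_embeds_proj: np.ndarray, text_embed_proj: np.ndarray) -> float:
    """Max cosine similarity between 32 projected query embeddings and a text embedding."""
    img = image_embeds_proj / np.linalg.norm(image_embeds_proj, axis=-1, keepdims=True)  # [32, 256]
    txt = text_embed_proj / np.linalg.norm(text_embed_proj)                              # [256]
    return float((img @ txt).max())  # best-matching query wins

# Random stand-ins with the documented shapes (single image, batch dim dropped)
score = itc_score(np.random.randn(32, 256), np.random.randn(256))
print(f"{score:.3f}")
```

Taking the max (rather than mean-pooling, as the search workflow above does for simplicity) lets different queries specialize in different image regions.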
Performance Optimization

GPU Memory Requirements

| Model | FP16 VRAM | INT8 VRAM | INT4 VRAM |
|---|---|---|---|
| blip2-opt-2.7b | ~8GB | ~5GB | ~3GB |
| blip2-opt-6.7b | ~16GB | ~9GB | ~5GB |
| blip2-flan-t5-xl | ~10GB | ~6GB | ~4GB |
| blip2-flan-t5-xxl | ~26GB | ~14GB | ~8GB |
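The FP16 column roughly follows from weight count × 2 bytes per parameter (a back-of-envelope sketch; the per-component parameter counts in the comment are approximations):

```python
def fp16_vram_gb(total_params_billion: float) -> float:
    """Weight-only footprint at 2 bytes per FP16 parameter."""
    return total_params_billion * 1e9 * 2 / 1024**3

# blip2-opt-2.7b: ~1.0B (ViT-g) + ~0.19B (Q-Former) + 2.7B (OPT) ≈ 3.9B parameters
print(f"{fp16_vram_gb(3.9):.1f} GB")  # ~7.3 GB of weights; activations and KV cache push this toward ~8 GB
```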
Speed Optimization

# Use Flash Attention if available
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # requires flash-attn
    device_map="auto"
)

# Compile the model (PyTorch 2.0+)
model = torch.compile(model)

# Use smaller images if quality allows
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
# default is 224x224, which is optimal
Common Issues

| Issue | Solution |
|---|---|
| CUDA OOM | Use INT8/INT4 quantization or a smaller model |
| Slow generation | Use greedy decoding; reduce max_new_tokens |
| Poor captions | Try a FlanT5 variant; use a prompt |
| Hallucination | Lower the temperature; use beam search |
| Wrong answers | Rephrase the question; provide context |
References

Resources
- Paper: https://arxiv.org/abs/2301.12597
- GitHub (LAVIS): https://github.com/salesforce/LAVIS
- HuggingFace: https://huggingface.co/Salesforce/blip2-opt-2.7b
- Demo: https://huggingface.co/spaces/Salesforce/BLIP2
- InstructBLIP: https://arxiv.org/abs/2305.06500 (successor)