Skip to content

6.1 视觉模型

学习时长:3-4 周

视觉模型是多模态 AI 应用的核心组件,使 AI 能够"看懂"图像和视频。本节覆盖从图像分类到视觉语言模型(VLM)的完整技术栈。


6.1.1 计算机视觉基础

核心任务分类

任务类型描述输入输出典型应用
图像分类判断图像属于哪个类别图像类别标签商品识别、医疗诊断
目标检测定位并识别图像中的物体图像边界框 + 类别自动驾驶、安防监控
语义分割像素级分类图像分割掩码医学影像、卫星图像
实例分割区分同类别的不同实例图像实例掩码机器人抓取、人群计数
图像生成从文本/噪声生成图像文本/噪声图像DALL-E、Midjourney
图像理解描述图像内容图像文本描述视觉问答、图像字幕

视觉模型发展历程

2012: AlexNet(深度学习视觉元年)

2014: VGG、GoogLeNet(更深的网络)

2015: ResNet(残差连接,突破深度限制)

2017: Transformer 提出(为 ViT 奠基)

2020: Vision Transformer(ViT)(纯 Attention 架构)

2021: CLIP(视觉-语言对齐)

2022: Stable Diffusion(文生图爆发)

2023: GPT-4V、LLaVA(视觉语言模型)

2024: Qwen-VL、InternVL(开源 VLM 崛起)

6.1.2 图像分类与特征提取

1. 使用预训练模型进行图像分类

方法 1:使用 Transformers 库(推荐)

python
# pip install transformers pillow torch

from transformers import AutoImageProcessor, AutoModelForImageClassification
from PIL import Image
import torch

class ImageClassifier:
    """图像分类器"""
    
    def __init__(self, model_name: str = "microsoft/resnet-50"):
        """
        Args:
            model_name: 预训练模型名称
                - microsoft/resnet-50: ResNet-50(ImageNet)
                - google/vit-base-patch16-224: Vision Transformer
                - facebook/convnext-base-224: ConvNeXt
        """
        self.processor = AutoImageProcessor.from_pretrained(model_name)
        self.model = AutoModelForImageClassification.from_pretrained(model_name)
        self.model.eval()
    
    def classify(self, image_path: str, top_k: int = 5) -> list[dict]:
        """
        分类图像
        
        Args:
            image_path: 图像路径
            top_k: 返回前 k 个预测结果
        
        Returns:
            [{"label": "猫", "score": 0.95}, ...]
        """
        # 1. 加载图像
        image = Image.open(image_path).convert("RGB")
        
        # 2. 预处理
        inputs = self.processor(images=image, return_tensors="pt")
        
        # 3. 推理
        with torch.no_grad():
            outputs = self.model(**inputs)
            logits = outputs.logits
        
        # 4. 解析结果
        probs = torch.nn.functional.softmax(logits, dim=-1)[0]
        top_probs, top_indices = torch.topk(probs, top_k)
        
        results = []
        for prob, idx in zip(top_probs, top_indices):
            label = self.model.config.id2label[idx.item()]
            results.append({
                "label": label,
                "score": prob.item()
            })
        
        return results

# 使用示例
classifier = ImageClassifier(model_name="microsoft/resnet-50")

# 分类图像
results = classifier.classify("cat.jpg", top_k=3)

print("🖼️  图像分类结果:")
for i, result in enumerate(results, 1):
    print(f"{i}. {result['label']}: {result['score']:.2%}")

输出示例

🖼️  图像分类结果:
1. tabby cat: 68.50%
2. Egyptian cat: 18.30%
3. tiger cat: 9.20%

方法 2:使用 timm 库(更多模型选择)

python
# pip install timm

import timm
from PIL import Image
import torch
from timm.data import resolve_data_config
from timm.data.transforms_factory import create_transform

class TimmClassifier:
    """基于 timm 的图像分类器"""
    
    def __init__(self, model_name: str = "resnet50.a1_in1k"):
        """
        Args:
            model_name: timm 模型名称
                - resnet50.a1_in1k: ResNet-50
                - vit_base_patch16_224.augreg_in21k_ft_in1k: ViT-Base
                - convnext_base.fb_in22k_ft_in1k: ConvNeXt-Base
                - efficientnet_b0.ra_in1k: EfficientNet-B0
        """
        self.model = timm.create_model(model_name, pretrained=True)
        self.model.eval()
        
        # 获取预处理配置
        config = resolve_data_config({}, model=self.model)
        self.transform = create_transform(**config)
    
    def classify(self, image_path: str, top_k: int = 5) -> list[dict]:
        """分类图像"""
        # 加载并预处理
        image = Image.open(image_path).convert("RGB")
        input_tensor = self.transform(image).unsqueeze(0)
        
        # 推理
        with torch.no_grad():
            output = self.model(input_tensor)
            probs = torch.nn.functional.softmax(output, dim=1)[0]
        
        # 解析结果
        top_probs, top_indices = torch.topk(probs, top_k)
        
        # 获取 ImageNet 类别名称
        labels = timm.data.ImageNetInfo().index_to_label_name
        
        results = []
        for prob, idx in zip(top_probs, top_indices):
            label = labels[idx.item()]
            results.append({
                "label": label,
                "score": prob.item()
            })
        
        return results

# 使用示例
classifier = TimmClassifier(model_name="vit_base_patch16_224.augreg_in21k_ft_in1k")
results = classifier.classify("dog.jpg")

for result in results:
    print(f"• {result['label']}: {result['score']:.2%}")

2. 特征提取(用于相似度搜索)

python
import torch
import torch.nn.functional as F
from transformers import AutoImageProcessor, AutoModel
from PIL import Image
import numpy as np

class ImageFeatureExtractor:
    """图像特征提取器"""
    
    def __init__(self, model_name: str = "facebook/dinov2-base"):
        """
        Args:
            model_name: 特征提取模型
                - facebook/dinov2-base: DINOv2(推荐)
                - openai/clip-vit-base-patch32: CLIP
                - google/vit-base-patch16-224: ViT
        """
        self.processor = AutoImageProcessor.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)
        self.model.eval()
    
    def extract_features(self, image_path: str) -> np.ndarray:
        """
        提取图像特征向量
        
        Returns:
            特征向量(768 维或其他维度)
        """
        # 加载图像
        image = Image.open(image_path).convert("RGB")
        
        # 预处理
        inputs = self.processor(images=image, return_tensors="pt")
        
        # 提取特征
        with torch.no_grad():
            outputs = self.model(**inputs)
            # 使用 [CLS] token 的输出作为图像特征
            features = outputs.last_hidden_state[:, 0, :]
        
        # 归一化(用于余弦相似度计算)
        features = F.normalize(features, p=2, dim=1)
        
        return features.cpu().numpy()[0]
    
    def compute_similarity(
        self, 
        image1_path: str, 
        image2_path: str
    ) -> float:
        """
        计算两张图像的相似度
        
        Returns:
            相似度分数(0-1,越高越相似)
        """
        feat1 = self.extract_features(image1_path)
        feat2 = self.extract_features(image2_path)
        
        # 余弦相似度
        similarity = np.dot(feat1, feat2)
        
        return float(similarity)

# 使用示例
extractor = ImageFeatureExtractor()

# 提取特征
features = extractor.extract_features("image1.jpg")
print(f"特征维度: {features.shape}")  # (768,)

# 计算相似度
similarity = extractor.compute_similarity("cat1.jpg", "cat2.jpg")
print(f"相似度: {similarity:.4f}")  # 0.8523

3. 以图搜图系统

python
# pip install faiss-cpu

import faiss
import numpy as np
from pathlib import Path
from typing import List, Tuple
import pickle

class ImageSearchEngine:
    """以图搜图引擎"""
    
    def __init__(self, feature_extractor: ImageFeatureExtractor):
        self.extractor = feature_extractor
        self.index = None
        self.image_paths = []
    
    def build_index(self, image_dir: str):
        """
        构建图像索引
        
        Args:
            image_dir: 图像目录路径
        """
        print(f"🔨 构建索引:{image_dir}")
        
        # 1. 收集所有图像
        image_dir = Path(image_dir)
        image_files = list(image_dir.glob("*.jpg")) + \
                     list(image_dir.glob("*.png")) + \
                     list(image_dir.glob("*.jpeg"))
        
        print(f"📁 找到 {len(image_files)} 张图像")
        
        # 2. 提取特征
        features_list = []
        for i, img_path in enumerate(image_files, 1):
            if i % 100 == 0:
                print(f"  处理进度: {i}/{len(image_files)}")
            
            try:
                features = self.extractor.extract_features(str(img_path))
                features_list.append(features)
                self.image_paths.append(str(img_path))
            except Exception as e:
                print(f"  ⚠️  跳过 {img_path.name}: {e}")
        
        # 3. 构建 FAISS 索引
        features_matrix = np.array(features_list).astype('float32')
        dimension = features_matrix.shape[1]
        
        # 使用内积索引(因为特征已归一化,内积等于余弦相似度)
        self.index = faiss.IndexFlatIP(dimension)
        self.index.add(features_matrix)
        
        print(f"✅ 索引构建完成:{len(self.image_paths)} 张图像")
    
    def search(
        self, 
        query_image: str, 
        top_k: int = 5
    ) -> List[Tuple[str, float]]:
        """
        搜索相似图像
        
        Args:
            query_image: 查询图像路径
            top_k: 返回前 k 个结果
        
        Returns:
            [(图像路径, 相似度分数), ...]
        """
        if self.index is None:
            raise ValueError("请先调用 build_index() 构建索引")
        
        # 提取查询图像特征
        query_features = self.extractor.extract_features(query_image)
        query_features = query_features.reshape(1, -1).astype('float32')
        
        # 搜索
        distances, indices = self.index.search(query_features, top_k)
        
        # 整理结果
        results = []
        for dist, idx in zip(distances[0], indices[0]):
            results.append((self.image_paths[idx], float(dist)))
        
        return results
    
    def save_index(self, save_path: str):
        """保存索引"""
        faiss.write_index(self.index, f"{save_path}.index")
        with open(f"{save_path}.paths", "wb") as f:
            pickle.dump(self.image_paths, f)
        print(f"💾 索引已保存:{save_path}")
    
    def load_index(self, load_path: str):
        """加载索引"""
        self.index = faiss.read_index(f"{load_path}.index")
        with open(f"{load_path}.paths", "rb") as f:
            self.image_paths = pickle.load(f)
        print(f"📂 索引已加载:{len(self.image_paths)} 张图像")

# 使用示例
extractor = ImageFeatureExtractor()
search_engine = ImageSearchEngine(extractor)

# 构建索引
search_engine.build_index("./image_dataset")

# 保存索引
search_engine.save_index("./image_index")

# 搜索相似图像
results = search_engine.search("query.jpg", top_k=5)

print("\n🔍 搜索结果:")
for i, (path, score) in enumerate(results, 1):
    print(f"{i}. {Path(path).name}: {score:.4f}")

输出示例

🔨 构建索引:./image_dataset
📁 找到 1000 张图像
  处理进度: 100/1000
  处理进度: 200/1000
  ...
✅ 索引构建完成:1000 张图像
💾 索引已保存:./image_index

🔍 搜索结果:
1. cat_001.jpg: 0.9823
2. cat_045.jpg: 0.9156
3. cat_089.jpg: 0.8934
4. cat_123.jpg: 0.8712
5. kitten_007.jpg: 0.8501

6.1.3 目标检测

1. 使用 YOLO 进行目标检测

python
# pip install ultralytics

from ultralytics import YOLO
from PIL import Image
import cv2
import numpy as np

class ObjectDetector:
    """目标检测器(基于 YOLO)"""
    
    def __init__(self, model_size: str = "yolov8n"):
        """
        Args:
            model_size: 模型大小
                - yolov8n: Nano(最快)
                - yolov8s: Small
                - yolov8m: Medium
                - yolov8l: Large
                - yolov8x: XLarge(最准)
        """
        self.model = YOLO(f"{model_size}.pt")
    
    def detect(
        self, 
        image_path: str, 
        conf_threshold: float = 0.25,
        save_path: str = None
    ) -> list[dict]:
        """
        检测图像中的物体
        
        Args:
            image_path: 图像路径
            conf_threshold: 置信度阈值
            save_path: 保存标注图像的路径(可选)
        
        Returns:
            [{"class": "person", "confidence": 0.95, "bbox": [x1, y1, x2, y2]}, ...]
        """
        # 推理
        results = self.model(image_path, conf=conf_threshold)
        
        # 解析结果
        detections = []
        for result in results:
            boxes = result.boxes
            for box in boxes:
                detection = {
                    "class": result.names[int(box.cls)],
                    "confidence": float(box.conf),
                    "bbox": box.xyxy[0].tolist()  # [x1, y1, x2, y2]
                }
                detections.append(detection)
        
        # 保存标注图像
        if save_path:
            annotated = results[0].plot()
            cv2.imwrite(save_path, annotated)
            print(f"💾 标注图像已保存:{save_path}")
        
        return detections
    
    def detect_video(
        self, 
        video_path: str, 
        output_path: str,
        conf_threshold: float = 0.25
    ):
        """
        检测视频中的物体
        
        Args:
            video_path: 视频路径
            output_path: 输出视频路径
            conf_threshold: 置信度阈值
        """
        # 打开视频
        cap = cv2.VideoCapture(video_path)
        
        # 获取视频信息
        fps = int(cap.get(cv2.CAP_PROP_FPS))
        width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
        height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
        total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        
        # 创建输出视频
        fourcc = cv2.VideoWriter_fourcc(*'mp4v')
        out = cv2.VideoWriter(output_path, fourcc, fps, (width, height))
        
        print(f"🎥 处理视频:{total_frames} 帧")
        
        frame_count = 0
        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break
            
            # 检测
            results = self.model(frame, conf=conf_threshold, verbose=False)
            annotated = results[0].plot()
            
            # 写入输出
            out.write(annotated)
            
            frame_count += 1
            if frame_count % 30 == 0:
                print(f"  进度: {frame_count}/{total_frames}")
        
        cap.release()
        out.release()
        
        print(f"✅ 视频处理完成:{output_path}")

# 使用示例
detector = ObjectDetector(model_size="yolov8n")

# 检测图像
detections = detector.detect("street.jpg", save_path="street_annotated.jpg")

print("🎯 检测结果:")
for det in detections:
    print(f"• {det['class']}: {det['confidence']:.2%} at {det['bbox']}")

# 检测视频
detector.detect_video("traffic.mp4", "traffic_annotated.mp4")

输出示例

🎯 检测结果:
• person: 95.30% at [120.5, 200.3, 250.8, 450.2]
• car: 89.20% at [300.1, 250.5, 500.3, 400.7]
• bicycle: 78.50% at [50.2, 300.1, 150.6, 420.3]
💾 标注图像已保存:street_annotated.jpg

2. 使用 Transformers 进行目标检测

python
from transformers import DetrImageProcessor, DetrForObjectDetection
from PIL import Image, ImageDraw, ImageFont
import torch

class TransformersDetector:
    """基于 Transformers 的目标检测器"""
    
    def __init__(self, model_name: str = "facebook/detr-resnet-50"):
        """
        Args:
            model_name: 模型名称
                - facebook/detr-resnet-50: DETR
                - microsoft/conditional-detr-resnet-50: Conditional DETR
        """
        self.processor = DetrImageProcessor.from_pretrained(model_name)
        self.model = DetrForObjectDetection.from_pretrained(model_name)
        self.model.eval()
    
    def detect(
        self, 
        image_path: str, 
        threshold: float = 0.9
    ) -> list[dict]:
        """检测物体"""
        # 加载图像
        image = Image.open(image_path).convert("RGB")
        
        # 预处理
        inputs = self.processor(images=image, return_tensors="pt")
        
        # 推理
        with torch.no_grad():
            outputs = self.model(**inputs)
        
        # 后处理
        target_sizes = torch.tensor([image.size[::-1]])
        results = self.processor.post_process_object_detection(
            outputs, 
            target_sizes=target_sizes, 
            threshold=threshold
        )[0]
        
        # 解析结果
        detections = []
        for score, label, box in zip(
            results["scores"], 
            results["labels"], 
            results["boxes"]
        ):
            detections.append({
                "class": self.model.config.id2label[label.item()],
                "confidence": score.item(),
                "bbox": box.tolist()
            })
        
        return detections
    
    def visualize(
        self, 
        image_path: str, 
        detections: list[dict],
        save_path: str
    ):
        """可视化检测结果"""
        image = Image.open(image_path).convert("RGB")
        draw = ImageDraw.Draw(image)
        
        for det in detections:
            bbox = det["bbox"]
            label = f"{det['class']}: {det['confidence']:.2f}"
            
            # 绘制边界框
            draw.rectangle(bbox, outline="red", width=3)
            
            # 绘制标签
            draw.text((bbox[0], bbox[1] - 10), label, fill="red")
        
        image.save(save_path)
        print(f"💾 可视化结果已保存:{save_path}")

# 使用示例
detector = TransformersDetector()
detections = detector.detect("image.jpg", threshold=0.9)

print("🎯 检测到的物体:")
for det in detections:
    print(f"• {det['class']}: {det['confidence']:.2%}")

detector.visualize("image.jpg", detections, "result.jpg")

6.1.4 视觉语言模型(VLM)

1. 使用 CLIP 进行零样本分类

python
# pip install transformers pillow torch

from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch

class CLIPClassifier:
    """CLIP 零样本图像分类器"""
    
    def __init__(self, model_name: str = "openai/clip-vit-base-patch32"):
        """
        Args:
            model_name: CLIP 模型
                - openai/clip-vit-base-patch32: CLIP ViT-B/32
                - openai/clip-vit-large-patch14: CLIP ViT-L/14
        """
        self.processor = CLIPProcessor.from_pretrained(model_name)
        self.model = CLIPModel.from_pretrained(model_name)
        self.model.eval()
    
    def classify(
        self, 
        image_path: str, 
        candidate_labels: list[str]
    ) -> list[dict]:
        """
        零样本分类
        
        Args:
            image_path: 图像路径
            candidate_labels: 候选类别列表
        
        Returns:
            [{"label": "猫", "score": 0.95}, ...]
        """
        # 加载图像
        image = Image.open(image_path).convert("RGB")
        
        # 构建文本提示
        texts = [f"a photo of a {label}" for label in candidate_labels]
        
        # 预处理
        inputs = self.processor(
            text=texts, 
            images=image, 
            return_tensors="pt", 
            padding=True
        )
        
        # 推理
        with torch.no_grad():
            outputs = self.model(**inputs)
            logits_per_image = outputs.logits_per_image
            probs = logits_per_image.softmax(dim=1)[0]
        
        # 解析结果
        results = []
        for label, prob in zip(candidate_labels, probs):
            results.append({
                "label": label,
                "score": prob.item()
            })
        
        # 按分数排序
        results.sort(key=lambda x: x["score"], reverse=True)
        
        return results
    
    def search_text_in_images(
        self, 
        image_paths: list[str], 
        text_query: str
    ) -> list[tuple[str, float]]:
        """
        文本搜索图像
        
        Args:
            image_paths: 图像路径列表
            text_query: 文本查询
        
        Returns:
            [(图像路径, 相似度分数), ...]
        """
        # 加载所有图像
        images = [Image.open(path).convert("RGB") for path in image_paths]
        
        # 预处理
        inputs = self.processor(
            text=[text_query], 
            images=images, 
            return_tensors="pt", 
            padding=True
        )
        
        # 推理
        with torch.no_grad():
            outputs = self.model(**inputs)
            logits_per_image = outputs.logits_per_image
            probs = logits_per_image.softmax(dim=0)[:, 0]
        
        # 整理结果
        results = list(zip(image_paths, probs.tolist()))
        results.sort(key=lambda x: x[1], reverse=True)
        
        return results

# 使用示例
clip = CLIPClassifier()

# 零样本分类
labels = ["cat", "dog", "bird", "car", "tree"]
results = clip.classify("animal.jpg", candidate_labels=labels)

print("🏷️  分类结果:")
for result in results:
    print(f"• {result['label']}: {result['score']:.2%}")

# 文本搜索图像
image_list = ["img1.jpg", "img2.jpg", "img3.jpg"]
search_results = clip.search_text_in_images(image_list, "a cat sleeping on a sofa")

print("\n🔍 搜索结果:")
for path, score in search_results:
    print(f"• {path}: {score:.4f}")

输出示例

🏷️  分类结果:
• cat: 78.50%
• dog: 12.30%
• bird: 5.20%
• car: 2.80%
• tree: 1.20%

🔍 搜索结果:
• img2.jpg: 0.8923
• img1.jpg: 0.3456
• img3.jpg: 0.1234

2. 使用 LLaVA 进行视觉问答

python
# pip install transformers torch pillow

from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
from PIL import Image
import torch

class VisionQA:
    """视觉问答系统(基于 LLaVA)"""
    
    def __init__(self, model_name: str = "llava-hf/llava-v1.6-mistral-7b-hf"):
        """
        Args:
            model_name: VLM 模型
                - llava-hf/llava-v1.6-mistral-7b-hf: LLaVA-1.6 (7B)
                - llava-hf/llava-v1.6-vicuna-13b-hf: LLaVA-1.6 (13B)
        """
        self.processor = LlavaNextProcessor.from_pretrained(model_name)
        self.model = LlavaNextForConditionalGeneration.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto"
        )
    
    def answer(
        self, 
        image_path: str, 
        question: str,
        max_new_tokens: int = 200
    ) -> str:
        """
        回答关于图像的问题
        
        Args:
            image_path: 图像路径
            question: 问题
            max_new_tokens: 最大生成长度
        
        Returns:
            答案文本
        """
        # 加载图像
        image = Image.open(image_path).convert("RGB")
        
        # 构建提示
        prompt = f"USER: <image>\n{question}\nASSISTANT:"
        
        # 预处理
        inputs = self.processor(
            text=prompt, 
            images=image, 
            return_tensors="pt"
        ).to(self.model.device)
        
        # 生成答案
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=False
            )
        
        # 解码
        answer = self.processor.decode(
            outputs[0], 
            skip_special_tokens=True
        )
        
        # 提取 ASSISTANT 后的内容
        answer = answer.split("ASSISTANT:")[-1].strip()
        
        return answer
    
    def describe_image(self, image_path: str) -> str:
        """生成图像描述"""
        return self.answer(
            image_path, 
            "Describe this image in detail."
        )
    
    def count_objects(self, image_path: str, object_name: str) -> str:
        """计数图像中的物体"""
        return self.answer(
            image_path,
            f"How many {object_name}s are in this image?"
        )

# 使用示例
vqa = VisionQA()

# 视觉问答
answer = vqa.answer("kitchen.jpg", "What is on the table?")
print(f"Q: What is on the table?")
print(f"A: {answer}\n")

# 图像描述
description = vqa.describe_image("landscape.jpg")
print(f"图像描述:\n{description}\n")

# 物体计数
count = vqa.count_objects("parking_lot.jpg", "car")
print(f"Q: How many cars?")
print(f"A: {count}")

输出示例

Q: What is on the table?
A: There is a laptop, a coffee cup, and some books on the table.

图像描述:
This image shows a beautiful mountain landscape during sunset. The sky is painted with vibrant orange and pink hues, while snow-capped peaks stand majestically in the background. In the foreground, there's a serene lake reflecting the colorful sky.

Q: How many cars?
A: There are 7 cars visible in the parking lot.

3. 使用 Qwen-VL 进行多模态理解

python
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
import torch

class QwenVL:
    """Qwen-VL 多模态模型"""
    
    def __init__(self, model_name: str = "Qwen/Qwen-VL-Chat"):
        """
        Args:
            model_name: 模型名称
                - Qwen/Qwen-VL-Chat: Qwen-VL 对话版
                - Qwen/Qwen-VL: Qwen-VL 基础版
        """
        self.tokenizer = AutoTokenizer.from_pretrained(
            model_name, 
            trust_remote_code=True
        )
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            device_map="auto",
            trust_remote_code=True,
            torch_dtype=torch.float16
        ).eval()
    
    def chat(
        self, 
        image_path: str, 
        query: str,
        history: list = None
    ) -> tuple[str, list]:
        """
        多轮对话
        
        Args:
            image_path: 图像路径
            query: 用户问题
            history: 对话历史
        
        Returns:
            (回答, 更新后的历史)
        """
        if history is None:
            history = []
        
        # 构建输入(Qwen-VL 特殊格式)
        query_with_image = f"<img>{image_path}</img>\n{query}"
        
        # 生成回答
        response, history = self.model.chat(
            self.tokenizer,
            query=query_with_image,
            history=history
        )
        
        return response, history
    
    def analyze_image(self, image_path: str) -> dict:
        """
        全面分析图像
        
        Returns:
            {
                "description": "图像描述",
                "objects": ["物体1", "物体2"],
                "scene": "场景类型",
                "text": "图像中的文字"
            }
        """
        results = {}
        
        # 1. 图像描述
        desc, _ = self.chat(image_path, "详细描述这张图片")
        results["description"] = desc
        
        # 2. 物体识别
        objects, _ = self.chat(image_path, "列出图片中的所有物体")
        results["objects"] = objects
        
        # 3. 场景识别
        scene, _ = self.chat(image_path, "这是什么场景?")
        results["scene"] = scene
        
        # 4. OCR(如果有文字)
        text, _ = self.chat(image_path, "图片中有什么文字?")
        results["text"] = text
        
        return results

# 使用示例
qwen_vl = QwenVL()

# 单轮问答
response, history = qwen_vl.chat("product.jpg", "这个产品是什么?")
print(f"Q: 这个产品是什么?")
print(f"A: {response}\n")

# 多轮对话
response, history = qwen_vl.chat(
    "product.jpg", 
    "它的主要特点是什么?", 
    history=history
)
print(f"Q: 它的主要特点是什么?")
print(f"A: {response}\n")

# 全面分析
analysis = qwen_vl.analyze_image("complex_scene.jpg")
print("📊 图像分析:")
for key, value in analysis.items():
    print(f"• {key}: {value}")

6.1.5 图像生成(文生图)

1. 使用 Stable Diffusion

python
# pip install diffusers transformers accelerate

from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler
import torch

class TextToImage:
    """文生图系统(Stable Diffusion)"""
    
    def __init__(
        self, 
        model_id: str = "stabilityai/stable-diffusion-2-1"
    ):
        """
        Args:
            model_id: 模型 ID
                - stabilityai/stable-diffusion-2-1: SD 2.1
                - runwayml/stable-diffusion-v1-5: SD 1.5
                - stabilityai/stable-diffusion-xl-base-1.0: SDXL
        """
        self.pipe = StableDiffusionPipeline.from_pretrained(
            model_id,
            torch_dtype=torch.float16,
            safety_checker=None  # 禁用安全检查器(可选)
        )
        
        # 使用更快的调度器
        self.pipe.scheduler = DPMSolverMultistepScheduler.from_config(
            self.pipe.scheduler.config
        )
        
        # 移动到 GPU
        self.pipe = self.pipe.to("cuda")
        
        # 启用内存优化
        self.pipe.enable_attention_slicing()
    
    def generate(
        self,
        prompt: str,
        negative_prompt: str = None,
        num_images: int = 1,
        num_inference_steps: int = 30,
        guidance_scale: float = 7.5,
        width: int = 512,
        height: int = 512,
        seed: int = None
    ) -> list:
        """
        生成图像
        
        Args:
            prompt: 正向提示词
            negative_prompt: 负向提示词
            num_images: 生成图像数量
            num_inference_steps: 推理步数(越多越精细,但越慢)
            guidance_scale: 引导强度(越高越符合提示词)
            width, height: 图像尺寸
            seed: 随机种子(用于复现)
        
        Returns:
            PIL Image 列表
        """
        # 设置随机种子
        if seed is not None:
            generator = torch.Generator(device="cuda").manual_seed(seed)
        else:
            generator = None
        
        # 生成
        with torch.autocast("cuda"):
            images = self.pipe(
                prompt=prompt,
                negative_prompt=negative_prompt,
                num_images_per_prompt=num_images,
                num_inference_steps=num_inference_steps,
                guidance_scale=guidance_scale,
                width=width,
                height=height,
                generator=generator
            ).images
        
        return images

# 使用示例
t2i = TextToImage()

# 生成图像
prompt = "a beautiful sunset over the ocean, golden hour, photorealistic, 4k"
negative_prompt = "ugly, blurry, low quality, distorted"

images = t2i.generate(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_images=2,
    num_inference_steps=50,
    guidance_scale=7.5,
    seed=42
)

# 保存图像
for i, image in enumerate(images):
    image.save(f"generated_{i}.png")
    print(f"💾 已保存:generated_{i}.png")

2. 图像编辑(Inpainting)

python
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image
import torch

class ImageInpainter:
    """图像修复/编辑"""
    
    def __init__(self):
        self.pipe = StableDiffusionInpaintPipeline.from_pretrained(
            "stabilityai/stable-diffusion-2-inpainting",
            torch_dtype=torch.float16
        ).to("cuda")
    
    def inpaint(
        self,
        image_path: str,
        mask_path: str,
        prompt: str,
        save_path: str = "inpainted.png"
    ):
        """
        修复图像
        
        Args:
            image_path: 原始图像
            mask_path: 掩码图像(白色区域将被重绘)
            prompt: 重绘内容的描述
            save_path: 保存路径
        """
        # 加载图像和掩码
        image = Image.open(image_path).convert("RGB")
        mask = Image.open(mask_path).convert("L")
        
        # 生成
        result = self.pipe(
            prompt=prompt,
            image=image,
            mask_image=mask,
            num_inference_steps=50
        ).images[0]
        
        result.save(save_path)
        print(f"💾 修复结果已保存:{save_path}")

# 使用示例
inpainter = ImageInpainter()
inpainter.inpaint(
    image_path="photo.jpg",
    mask_path="mask.png",
    prompt="a red sports car",
    save_path="edited.png"
)

6.1.6 实战案例:智能图像分析系统

python
"""
综合案例:智能图像分析系统
功能:
1. 物体检测
2. 场景分类
3. 视觉问答
4. 图像描述生成
"""

from typing import Dict, List
import json

class IntelligentImageAnalyzer:
    """智能图像分析系统"""
    
    def __init__(self):
        print("🔧 初始化模型...")
        
        # 1. 物体检测
        self.detector = ObjectDetector(model_size="yolov8n")
        
        # 2. 场景分类
        self.classifier = ImageClassifier(model_name="microsoft/resnet-50")
        
        # 3. 视觉问答
        self.vqa = VisionQA()
        
        print("✅ 初始化完成")
    
    def analyze(self, image_path: str) -> Dict:
        """
        全面分析图像
        
        Returns:
            {
                "objects": [...],
                "scene": {...},
                "description": "...",
                "qa_results": {...}
            }
        """
        print(f"\n📸 分析图像:{image_path}\n")
        results = {}
        
        # 1. 物体检测
        print("🎯 步骤 1/4:物体检测")
        objects = self.detector.detect(image_path, conf_threshold=0.5)
        results["objects"] = objects
        print(f"✅ 检测到 {len(objects)} 个物体")
        
        # 2. 场景分类
        print("\n🏷️  步骤 2/4:场景分类")
        scene = self.classifier.classify(image_path, top_k=3)
        results["scene"] = scene
        print(f"✅ 场景:{scene[0]['label']} ({scene[0]['score']:.2%})")
        
        # 3. 图像描述
        print("\n📝 步骤 3/4:生成描述")
        description = self.vqa.describe_image(image_path)
        results["description"] = description
        print(f"✅ 描述:{description[:100]}...")
        
        # 4. 自动问答
        print("\n❓ 步骤 4/4:自动问答")
        qa_results = self._auto_qa(image_path, objects)
        results["qa_results"] = qa_results
        print(f"✅ 回答了 {len(qa_results)} 个问题")
        
        return results
    
    def _auto_qa(
        self, 
        image_path: str, 
        detected_objects: List[Dict]
    ) -> Dict[str, str]:
        """基于检测结果自动生成问答"""
        qa_results = {}
        
        # 问题 1:主要物体是什么?
        if detected_objects:
            main_object = detected_objects[0]["class"]
            qa_results["main_object"] = f"The main object is a {main_object}"
        
        # 问题 2:有多少人?
        person_count = sum(1 for obj in detected_objects if obj["class"] == "person")
        if person_count > 0:
            qa_results["person_count"] = f"There are {person_count} person(s)"
        
        # 问题 3:室内还是室外?
        answer = self.vqa.answer(image_path, "Is this indoors or outdoors?")
        qa_results["location_type"] = answer
        
        return qa_results
    
    def generate_report(self, image_path: str, save_path: str = None):
        """生成分析报告"""
        results = self.analyze(image_path)
        
        # 生成 Markdown 报告
        report = f"""# 图像分析报告

## 基本信息
- 图像路径:{image_path}

## 物体检测
检测到 {len(results['objects'])} 个物体:

"""
        for i, obj in enumerate(results['objects'][:10], 1):
            report += f"{i}. **{obj['class']}** (置信度: {obj['confidence']:.2%})\n"
        
        report += f"""

## 场景分类
- 主要场景:**{results['scene'][0]['label']}** ({results['scene'][0]['score']:.2%})
- 次要场景:{results['scene'][1]['label']} ({results['scene'][1]['score']:.2%})

## 图像描述
{results['description']}

## 自动问答
"""
        for question, answer in results['qa_results'].items():
            report += f"- **{question}**: {answer}\n"
        
        # 保存报告
        if save_path:
            with open(save_path, "w", encoding="utf-8") as f:
                f.write(report)
            print(f"\n💾 报告已保存:{save_path}")
        
        return report

# 使用示例
analyzer = IntelligentImageAnalyzer()

# 分析图像
report = analyzer.generate_report(
    image_path="test_image.jpg",
    save_path="analysis_report.md"
)

print("\n" + "="*60)
print(report)

6.1.7 学习资源

推荐课程

  • Stanford CS231n: 计算机视觉经典课程
  • Fast.ai Practical Deep Learning: 实战导向
  • DeepLearning.AI Computer Vision: 吴恩达视觉课程

开源项目

  • Ultralytics YOLOv8: 最先进的目标检测
  • Segment Anything (SAM): Meta 的分割模型
  • Grounding DINO: 开放词汇目标检测

论文阅读

实战练习

  1. 构建商品识别系统(分类 + 检测)
  2. 开发智能监控系统(视频目标检测)
  3. 实现以图搜图引擎(FAISS + 特征提取)
  4. 构建图像问答机器人(VLM + RAG)

关键要点

  • 预训练模型优先:不要从头训练,使用 HuggingFace 模型
  • 选择合适的模型:速度 vs 精度的权衡
  • 批量处理优化:使用 DataLoader 提升吞吐量
  • GPU 加速必备:视觉模型计算密集,CPU 太慢
  • 关注最新进展:VLM 领域快速发展,持续学习

常见错误

  • ❌ 忽视图像预处理(尺寸、归一化)
  • ❌ 置信度阈值设置不当
  • ❌ 显存溢出(batch size 过大)
  • ❌ 忽视推理速度优化

下一步 学完本节后,结合语音模型(6.2)和多模态融合(6.3),你将能够构建完整的多模态 AI 应用。

坚持是一种品格