多模态提示词工程

Vision Prompting、音频输入、图文联合推理与多模态 Prompt 设计模式 | 2026-02


一、多模态 LLM 的能力边界

多模态大模型(如 GPT-4o、Claude 3.5、Gemini 2.0)能同时处理文本、图像、音频甚至视频。但"能处理"不等于"能处理好"。多模态提示词工程的核心挑战在于:如何让模型在不同模态之间建立正确的关联,而不是各看各的。

Multimodal Input Processing:

  Image ──────┐
              │
  Text  ──────┼──> [Multimodal Encoder] ──> [Unified Representation] ──> [Decoder] ──> Output
              │
  Audio ──────┘

Challenge: The model must ALIGN information across modalities
  - Image shows a chart -> Text asks about the trend -> Model must link both
  - Audio contains speech -> Text asks for summary -> Model must transcribe & analyze
  - Multiple images -> Text asks to compare -> Model must cross-reference
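上面的对齐挑战落到 API 层面,第一步就是在文本中为每个输入显式指明角色。下面是一个最小示意(消息结构沿用本文通篇使用的 OpenAI Chat 格式;此处只构造消息体,不调用 API,图像数据为占位符):

```python
# 对齐原则的最小示例:在文本里说明图像是什么、问题指向哪张图,
# 让模型把图像和问题关联起来,而不是各自独立描述。
aligned_message = {
    "role": "user",
    "content": [
        {"type": "text", "text": (
            "Image 1 is a quarterly revenue chart. "
            "Answer the question below based ONLY on Image 1:\n"
            "Question: Does the trend support the claim that Q3 revenue grew?"
        )},
        {"type": "image_url", "image_url": {
            "url": "data:image/png;base64,<BASE64_PLACEHOLDER>",
        }},
    ],
}
```

对比之下,只发一张图加一句 "分析一下" 的提示,模型往往只会独立描述图像内容。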

二、Vision Prompting 技术

2.1 图像理解的提示策略

| 策略 | 描述 | 适用场景 | 效果 |
|------|------|----------|------|
| 直接提问 | 简单描述需求 | 通用场景 | 基础 |
| 区域引导 | 指定图像区域 | 细节分析 | 较好 |
| 对比分析 | 多图比较 | 差异检测 | |
| 结构化提取 | 要求 JSON 输出 | 数据抽取 | 最佳 |
| Chain-of-Sight | 分步观察推理 | 复杂图像 | 最佳 |
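表中的"区域引导"策略可以在预处理阶段实现:先在图上画出关注区域,再在提示词中引用该标记。下面是一个基于 PIL 的示意实现(函数名与坐标均为示例):

```python
from PIL import Image, ImageDraw

def annotate_region(image_path: str,
                    box: tuple[int, int, int, int],
                    out_path: str) -> str:
    """在关注区域画红框;之后提示词可写
    "只描述红框内的内容" 来实现区域引导。"""
    img = Image.open(image_path).convert("RGB")
    ImageDraw.Draw(img).rectangle(box, outline=(255, 0, 0), width=4)
    img.save(out_path)
    return out_path
```

随后把标注后的图像与引用红框的问题一起发送即可。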

2.2 Vision Prompt 设计模式

import base64
from openai import OpenAI

client = OpenAI()

def encode_image(image_path: str) -> str:
    """Encode image to base64 for API call."""
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

# Pattern 1: Direct Question (simple, often insufficient)
def simple_vision_query(image_path: str, question: str) -> str:
    base64_image = encode_image(image_path)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{base64_image}",
                    "detail": "high",  # "low" | "high" | "auto"
                }},
            ],
        }],
    )
    return response.choices[0].message.content

# Pattern 2: Structured Extraction (production-grade)
EXTRACTION_PROMPT = """Analyze this image and extract information in the following JSON format:

{
  "image_type": "chart|photo|screenshot|document|diagram",
  "main_content": "One-sentence description of what the image shows",
  "extracted_data": {
    // For charts: {"title": "", "x_axis": "", "y_axis": "", "data_points": [...]}
    // For documents: {"text_content": "", "layout": ""}
    // For screenshots: {"app_name": "", "ui_elements": [...], "state": ""}
    // For diagrams: {"components": [...], "relationships": [...]}
  },
  "quality_issues": ["blur", "crop", "low_resolution"],  // if any
  "confidence": 0.0  // 0-1, your confidence in the extraction
}

Be precise. If you cannot read something clearly, mark confidence lower
and note it in quality_issues. Never fabricate data."""

def structured_image_extraction(image_path: str) -> dict:
    base64_image = encode_image(image_path)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": EXTRACTION_PROMPT},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{base64_image}",
                    "detail": "high",
                }},
            ],
        }],
        response_format={"type": "json_object"},
    )
    import json
    return json.loads(response.choices[0].message.content)

2.3 Chain-of-Sight(分步视觉推理)

# Chain-of-Sight: Guide the model to observe systematically
# before making conclusions

CHAIN_OF_SIGHT_PROMPT = """Analyze this image step by step:

## Step 1: Overview
Describe the overall scene/content in one sentence.

## Step 2: Key Elements
List all significant visual elements you can identify:
- Element 1: [what it is, where it is, any text/numbers]
- Element 2: ...

## Step 3: Relationships
How do the elements relate to each other?
- Spatial relationships (above, below, inside, connected)
- Logical relationships (cause-effect, part-whole, sequence)

## Step 4: Details
For each key element, zoom in and describe:
- Colors, sizes, proportions
- Any text content (transcribe exactly)
- Any numbers or data (extract precisely)

## Step 5: Interpretation
Based on all observations above, answer: {question}

Important: Base your answer ONLY on what you can actually see.
If something is unclear, say so rather than guessing."""

def chain_of_sight_analysis(
    image_path: str, question: str,
) -> str:
    """Use Chain-of-Sight for thorough image analysis."""
    base64_image = encode_image(image_path)
    prompt = CHAIN_OF_SIGHT_PROMPT.replace("{question}", question)

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{base64_image}",
                    "detail": "high",
                }},
            ],
        }],
        temperature=0.2,  # Lower temperature for analytical tasks
        max_tokens=4096,
    )
    return response.choices[0].message.content

三、多图推理

3.1 多图比较模式

# Multi-image comparison: before/after, A/B, sequence

COMPARISON_PROMPT = """You are given {n} images to compare.

## Task
{task_description}

## Analysis Framework
For each image:
1. Describe what you see
2. Note unique features

Then compare across all images:
3. Similarities (what's the same)
4. Differences (what changed or differs)
5. Ranking or conclusion based on the comparison criteria

## Output Format
{output_format}
"""

def compare_images(
    image_paths: list[str],
    task: str,
    output_format: str = "Structured comparison table in markdown",
) -> str:
    """Compare multiple images with structured analysis."""
    content = [
        {"type": "text", "text": COMPARISON_PROMPT.format(
            n=len(image_paths),
            task_description=task,
            output_format=output_format,
        )},
    ]

    for i, path in enumerate(image_paths):
        base64_img = encode_image(path)
        content.append({"type": "text", "text": f"\n--- Image {i+1} ---"})
        content.append({"type": "image_url", "image_url": {
            "url": f"data:image/png;base64,{base64_img}",
            "detail": "high",
        }})

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
        max_tokens=4096,
    )
    return response.choices[0].message.content

# Example: UI design comparison
result = compare_images(
    image_paths=["design_v1.png", "design_v2.png", "design_v3.png"],
    task="Compare these three UI design variants for a dashboard. "
         "Evaluate: visual hierarchy, information density, color usage, "
         "and overall usability.",
    output_format="Markdown table with scores (1-10) per dimension, "
                  "plus a recommendation.",
)

3.2 图像序列理解

# Sequential image understanding (e.g., step-by-step instructions,
# UI flow analysis, video frame analysis)

SEQUENCE_PROMPT = """These {n} images form a sequence (ordered from first to last).

## Task
Analyze this sequence and:
1. Describe what happens at each step
2. Identify the overall process/workflow
3. Note any issues, missing steps, or improvements

## Context
{context}

## Output
Provide a structured step-by-step analysis with:
- Step number
- What's shown
- What action is being performed
- Any observations or issues
"""

def analyze_image_sequence(
    image_paths: list[str],
    context: str = "User interaction flow",
) -> str:
    """Analyze a sequence of images as an ordered flow."""
    content = [
        {"type": "text", "text": SEQUENCE_PROMPT.format(
            n=len(image_paths),
            context=context,
        )},
    ]

    for i, path in enumerate(image_paths):
        base64_img = encode_image(path)
        content.append({"type": "text", "text": f"\n[Step {i+1}]"})
        content.append({"type": "image_url", "image_url": {
            "url": f"data:image/png;base64,{base64_img}",
            "detail": "high",
        }})

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
        max_tokens=4096,
    )
    return response.choices[0].message.content

四、音频输入与语音提示

4.1 音频处理能力矩阵

| 模型 | 音频输入 | 实时语音 | 语言数 | 特殊能力 |
|------|----------|----------|--------|----------|
| GPT-4o | 是 | 是(Realtime API) | 50+ | 语气/情感识别 |
| Gemini 2.0 | 是 | 是(Live API) | 100+ | 长音频理解 |
| Claude 3.5 | 否(需先转文本) | 否 | N/A | 仅文本分析 |
| Whisper v3 | 转录专用 | 流式 | 99 | 最佳转录质量 |

4.2 音频提示设计

# Audio input with GPT-4o (native audio understanding)

def analyze_audio_with_context(
    audio_path: str,
    analysis_prompt: str,
) -> str:
    """Analyze audio with GPT-4o's native audio understanding."""
    with open(audio_path, "rb") as f:
        audio_data = base64.b64encode(f.read()).decode("utf-8")

    # The input_audio block currently accepts only "mp3" and "wav";
    # convert other formats (m4a/ogg/flac) before calling the API
    ext = audio_path.rsplit(".", 1)[-1].lower()
    if ext not in ("mp3", "wav"):
        raise ValueError(f"Unsupported audio format: {ext}; convert to mp3/wav first")

    response = client.chat.completions.create(
        model="gpt-4o-audio-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": analysis_prompt},
                {"type": "input_audio", "input_audio": {
                    "data": audio_data,
                    "format": ext,
                }},
            ],
        }],
    )
    return response.choices[0].message.content


# Pattern: Meeting analysis with structured extraction
MEETING_ANALYSIS_PROMPT = """Analyze this meeting recording and extract:

1. **Participants**: Who spoke (identify by voice characteristics if names not mentioned)
2. **Key Topics**: Main discussion points (ordered by time)
3. **Decisions Made**: Any decisions reached
4. **Action Items**: Tasks assigned (who, what, when)
5. **Sentiment**: Overall meeting tone and any notable emotional moments
6. **Unresolved**: Topics that need follow-up

Output as structured JSON matching this schema:
{
  "duration_estimate": "HH:MM",
  "participants": [{"id": "Speaker_1", "name": "if_known"}],
  "topics": [{"title": "", "duration": "", "summary": ""}],
  "decisions": [{"decision": "", "context": ""}],
  "action_items": [{"assignee": "", "task": "", "deadline": ""}],
  "sentiment": {"overall": "", "notable_moments": []},
  "unresolved": [""]
}"""

result = analyze_audio_with_context(
    audio_path="meeting_recording.mp3",
    analysis_prompt=MEETING_ANALYSIS_PROMPT,
)

4.3 Whisper + LLM 两阶段管线

# Two-stage pipeline: Whisper transcription + LLM analysis
# More reliable than native audio for long recordings


def two_stage_audio_analysis(
    audio_path: str,
    analysis_prompt: str,
    language: str = "zh",
) -> dict:
    """Stage 1: Transcribe with Whisper. Stage 2: Analyze with LLM."""

    # Stage 1: Transcription
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=f,
            language=language,
            response_format="verbose_json",  # Includes timestamps
            timestamp_granularities=["segment"],
        )

    # Build timestamped transcript (verbose_json returns segment objects,
    # accessed by attribute, not by dict key)
    timestamped_text = "\n".join(
        f"[{seg.start:.1f}s - {seg.end:.1f}s] {seg.text}"
        for seg in transcript.segments
    )

    # Stage 2: LLM Analysis
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are an expert audio content analyst."},
            {"role": "user", "content": f"""Timestamped transcript:

{timestamped_text}

---

{analysis_prompt}"""},
        ],
        temperature=0.3,
    )

    return {
        "transcript": timestamped_text,
        "analysis": response.choices[0].message.content,
        "duration_seconds": transcript.segments[-1]["end"] if transcript.segments else 0,
    }

五、图文联合推理

5.1 图文一致性检验

# Verify that image and text descriptions match
# Use case: product listing QA, document verification

CONSISTENCY_CHECK_PROMPT = """You are a quality assurance specialist.

## Task
Check if the image content matches the text description.

## Text Description
{text_description}

## Analysis Requirements
1. List all claims made in the text
2. For each claim, verify if the image supports it:
   - CONFIRMED: Image clearly shows this
   - CONTRADICTED: Image shows the opposite
   - UNVERIFIABLE: Cannot determine from the image
   - PARTIALLY: Image shows something related but not exact

3. Overall consistency score (0-100)
4. Critical mismatches that could mislead users

Output as JSON:
{
  "claims": [
    {"text": "", "status": "", "evidence": ""}
  ],
  "consistency_score": 0,
  "critical_mismatches": [],
  "recommendation": "approve|revise|reject"
}"""

def check_text_image_consistency(
    image_path: str,
    text_description: str,
) -> dict:
    """Check if image matches text description."""
    base64_image = encode_image(image_path)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                # Use .replace() here: the prompt contains literal JSON
                # braces that would break str.format()
                {"type": "text", "text": CONSISTENCY_CHECK_PROMPT.replace(
                    "{text_description}", text_description,
                )},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{base64_image}",
                    "detail": "high",
                }},
            ],
        }],
        response_format={"type": "json_object"},
    )
    import json
    return json.loads(response.choices[0].message.content)

5.2 文档理解(OCR + 语义分析)

# Document understanding: combining OCR capability with semantic analysis

DOCUMENT_UNDERSTANDING_PROMPT = """Analyze this document image comprehensively.

## Extraction Requirements

### Layout Analysis
- Document type (invoice, contract, form, letter, report, etc.)
- Page structure (headers, sections, tables, signatures)
- Language(s) present

### Content Extraction
- All text content (preserve original formatting where possible)
- Table data (as structured arrays)
- Key-value pairs (form fields)
- Dates, amounts, reference numbers

### Semantic Analysis
- Document purpose/intent
- Key entities (people, organizations, addresses)
- Important terms or conditions
- Status indicators (approved, pending, rejected)

### Quality Assessment
- OCR confidence (any unclear/ambiguous text)
- Missing information
- Potential issues

Output as structured JSON with sections for each analysis type."""

def understand_document(image_path: str) -> dict:
    """Full document understanding pipeline."""
    base64_image = encode_image(image_path)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are an expert document analyst. "
             "Extract ALL text precisely. Never fabricate content."},
            {"role": "user", "content": [
                {"type": "text", "text": DOCUMENT_UNDERSTANDING_PROMPT},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{base64_image}",
                    "detail": "high",
                }},
            ]},
        ],
        response_format={"type": "json_object"},
        temperature=0,  # Zero temperature for extraction tasks
    )
    import json
    return json.loads(response.choices[0].message.content)

六、多模态提示词设计原则

6.1 核心设计模式

Pattern 1: Text-First (image as supplement)
  [Detailed text instruction] + [Image for reference]
  Best for: Tasks where text defines the requirement, image provides context

Pattern 2: Image-First (text as query)
  [Image as primary input] + [Short text question]
  Best for: Image analysis, OCR, visual QA

Pattern 3: Parallel Modalities (equal weight)
  [Image A] + [Image B] + [Text comparing both]
  Best for: Comparison, verification, change detection

Pattern 4: Sequential Modalities (chain)
  [Image] -> [Extract text] -> [Analyze text] -> [Generate image]
  Best for: Complex pipelines, iterative refinement

Pattern 5: Grounded Generation (text anchored to image regions)
  [Image with annotations] + [Text referencing annotations]
  Best for: Spatial reasoning, UI analysis, medical imaging
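Pattern 4 的"链式"组合可以抽象成一个简单的管线:每一步的输出作为下一步的输入。下面用纯函数组合做一个示意(示例中的 stage 为占位实现,实际场景中会替换为前文的视觉提取与文本分析调用):

```python
from typing import Any, Callable

def run_chain(stages: list[Callable[[Any], Any]], initial: Any) -> Any:
    """按顺序执行各模态处理阶段,逐级传递输出。"""
    result = initial
    for stage in stages:
        result = stage(result)
    return result

# 占位 stage:真实管线中分别是图像提取与 LLM 文本分析
extract = lambda image: {"text": f"ocr({image})"}       # image -> text
analyze = lambda data: f"summary of {data['text']}"     # text -> analysis

print(run_chain([extract, analyze], "report.png"))
# -> summary of ocr(report.png)
```

链式管线的好处是每一步可以单独评估、缓存与替换。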

6.2 最佳实践清单

| 实践 | 说明 | 常见错误 |
|------|------|----------|
| 明确模态角色 | 告诉模型每个输入的用途 | 不说图片是什么,期望模型猜 |
| 使用 high detail | 精细任务用 detail: high | 默认 auto 导致细节丢失 |
| 图像预处理 | 裁剪、增强、标注关注区域 | 发送整张复杂图片问局部问题 |
| 分步推理 | 先描述再分析再回答 | 直接问复杂问题 |
| 多图标注 | 给每张图编号并在文本中引用 | 多图但不说哪张是哪张 |
| 输出格式约束 | 结构化 JSON 优于自由文本 | 让模型自由发挥 |
| 温度控制 | 提取任务 temp=0,创意任务 temp>0.5 | 提取任务用高温度 |
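清单中的几条实践可以直接固化成请求构造函数。下面的 helper 为示意实现(不属于任何 SDK),同时应用"明确模态角色""多图标注""high detail"与"温度控制"四条:

```python
def build_vision_request(question: str,
                         image_urls: list[str],
                         task: str = "extraction") -> dict:
    """按最佳实践清单构造多模态请求体(示意实现)。"""
    content = [{"type": "text", "text": question}]
    for i, url in enumerate(image_urls, start=1):
        # 多图标注:给每张图编号,文本中可用 "Image N" 引用
        content.append({"type": "text", "text": f"--- Image {i} ---"})
        content.append({"type": "image_url", "image_url": {
            "url": url, "detail": "high",  # 精细任务用 high detail
        }})
    return {
        "model": "gpt-4o",
        "messages": [{"role": "user", "content": content}],
        # 提取任务 temperature=0,创意任务用更高温度
        "temperature": 0.0 if task == "extraction" else 0.7,
    }
```

把清单固化为代码后,新同事就不会在提取任务里误用高温度或漏掉图像编号。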

6.3 图像输入优化

from PIL import Image
import io

def optimize_image_for_api(
    image_path: str,
    max_dimension: int = 2048,
    quality: int = 85,
    target_format: str = "JPEG",
) -> str:
    """Optimize image before sending to API.

    Why: API charges by token, and image tokens depend on resolution.
    GPT-4o: low detail = 85 tokens, high detail = 85 + 170 * tiles
    Each tile = 512x512 pixels

    Optimization strategy:
    - Resize to reduce unnecessary tile count
    - Compress to reduce base64 size
    - Crop to focus on relevant region
    """
    img = Image.open(image_path)

    # Resize if too large (preserve aspect ratio)
    if max(img.size) > max_dimension:
        ratio = max_dimension / max(img.size)
        new_size = (int(img.size[0] * ratio), int(img.size[1] * ratio))
        img = img.resize(new_size, Image.LANCZOS)

    # Convert to RGB if necessary (PNG with alpha -> JPEG)
    if img.mode in ("RGBA", "P"):
        background = Image.new("RGB", img.size, (255, 255, 255))
        if img.mode == "RGBA":
            background.paste(img, mask=img.split()[3])
        else:
            background.paste(img)
        img = background

    # Compress and encode
    buffer = io.BytesIO()
    img.save(buffer, format=target_format, quality=quality)
    return base64.b64encode(buffer.getvalue()).decode("utf-8")

def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate token cost for an image (OpenAI pricing model)."""
    if detail == "low":
        return 85

    # High detail: scale to fit 2048x2048, then count 512x512 tiles
    scale = min(2048 / max(width, height), 1.0)
    w, h = int(width * scale), int(height * scale)

    # Scale shortest side to 768
    short_scale = 768 / min(w, h)
    if short_scale < 1:
        w, h = int(w * short_scale), int(h * short_scale)

    # Count tiles
    tiles_w = (w + 511) // 512
    tiles_h = (h + 511) // 512
    total_tiles = tiles_w * tiles_h

    return 85 + 170 * total_tiles
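用上面的规则手算一个例子:1024x1024 的图像在 high detail 下已在 2048x2048 以内,无需第一次缩放;最短边缩放到 768 后变为 768x768,覆盖 2x2 个 512px tile:

```python
# 1024x1024, high detail 的 token 成本推算
tiles_per_side = (768 + 511) // 512   # = 2
total_tiles = tiles_per_side ** 2     # = 4
tokens = 85 + 170 * total_tiles
print(tokens)  # 765
```

与 estimate_image_tokens(1024, 1024, "high") 的返回值一致。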

七、Gemini 多模态特性

7.1 Gemini 长上下文多模态

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Gemini 2.0: up to 1M tokens, supports image/audio/video natively
model = genai.GenerativeModel("gemini-2.0-flash")

# Upload video for analysis
video_file = genai.upload_file("presentation.mp4")

# Wait for processing
import time
while video_file.state.name == "PROCESSING":
    time.sleep(5)
    video_file = genai.get_file(video_file.name)

# Analyze with multimodal prompt
response = model.generate_content([
    video_file,
    """Analyze this presentation video:

    1. Slide-by-slide summary (with timestamps)
    2. Speaker's key arguments
    3. Data/charts mentioned (extract key numbers)
    4. Audience engagement indicators (if visible)
    5. Improvement suggestions

    Focus on factual extraction, not interpretation.""",
])

print(response.text)

7.2 多模态安全边界

# Safety considerations for multimodal prompts

SAFETY_GUIDELINES = """
Multimodal Safety Checklist:

1. PII Detection
   - Scan images for faces, IDs, documents with personal info
   - Never extract or store PII without explicit consent
   - Blur/redact before processing when possible

2. Content Verification
   - Images can be manipulated (deepfakes, edited screenshots)
   - Never treat image content as ground truth for critical decisions
   - Cross-reference with other data sources

3. Prompt Injection via Images
   - Images can contain text that acts as instructions
   - OCR content should be treated as UNTRUSTED DATA
   - Never execute instructions found in images

4. Copyright
   - Extracted text may be copyrighted
   - Generated descriptions of copyrighted images: OK
   - Reproducing copyrighted content from images: NOT OK

5. Bias
   - Vision models have known biases in face analysis
   - Avoid demographic classification tasks
   - Be explicit about uncertainty in visual judgments
"""

八、多模态评估指标

8.1 评估维度

| 维度 | 指标 | 测量方法 |
|------|------|----------|
| 视觉准确性 | 物体识别正确率 | 标注数据集对比 |
| 文本提取 | OCR 字符准确率 | Levenshtein 距离 |
| 空间推理 | 位置关系正确率 | 人工标注 |
| 跨模态对齐 | 图文一致性 | LLM-as-Judge |
| 幻觉率 | 图中不存在的描述 | 人工审核 |
| 数值精度 | 图表数据提取准确率 | 精确匹配 |

8.2 评估实现

from dataclasses import dataclass

@dataclass
class MultimodalEvalResult:
    visual_accuracy: float      # 0-1
    text_extraction_cer: float  # Character Error Rate, lower is better
    spatial_reasoning: float    # 0-1
    cross_modal_alignment: float  # 0-1
    hallucination_rate: float   # 0-1, lower is better
    numerical_precision: float  # 0-1

async def evaluate_multimodal_prompt(
    prompt_template: str,
    test_set: list[dict],
    model: str = "gpt-4o",
) -> MultimodalEvalResult:
    """Evaluate a multimodal prompt against a labeled test set.

    Assumes two async helpers defined elsewhere:
    - run_multimodal_prompt(template, image_path, model) -> structured dict
    - judge_alignment(output, ground_truth, image_path) -> float in [0, 1]

    Each test case should have:
    - image_path: path to test image
    - ground_truth: expected structured output
    - task_type: "ocr" | "spatial" | "counting" | "numerical" | "comparison"
    """
    results = {
        "visual_correct": 0,
        "text_cer_sum": 0.0,
        "spatial_correct": 0,
        "alignment_scores": [],
        "hallucinations": 0,
        "numerical_correct": 0,
        "total": len(test_set),
    }

    for case in test_set:
        output = await run_multimodal_prompt(
            prompt_template, case["image_path"], model,
        )
        gt = case["ground_truth"]

        # Score based on task type
        if case["task_type"] == "ocr":
            cer = character_error_rate(output.get("text", ""), gt["text"])
            results["text_cer_sum"] += cer
        elif case["task_type"] == "spatial":
            if output.get("position") == gt["position"]:
                results["spatial_correct"] += 1
        elif case["task_type"] == "counting":
            if output.get("count") == gt["count"]:
                results["visual_correct"] += 1
            if output.get("count", 0) > gt["count"]:
                results["hallucinations"] += 1

        # Cross-modal alignment via LLM judge
        alignment = await judge_alignment(output, gt, case["image_path"])
        results["alignment_scores"].append(alignment)

    n = results["total"]
    return MultimodalEvalResult(
        visual_accuracy=results["visual_correct"] / n,
        text_extraction_cer=results["text_cer_sum"] / n,
        spatial_reasoning=results["spatial_correct"] / n,
        cross_modal_alignment=sum(results["alignment_scores"]) / n,
        hallucination_rate=results["hallucinations"] / n,
        numerical_precision=results["numerical_correct"] / n,
    )

def character_error_rate(predicted: str, reference: str) -> float:
    """Calculate Character Error Rate using edit distance."""
    import Levenshtein  # third-party: pip install Levenshtein
    if not reference:
        return 0.0 if not predicted else 1.0
    return Levenshtein.distance(predicted, reference) / len(reference)

九、总结

多模态提示词工程的核心不是"把图片丢给模型",而是精心设计模态之间的协同方式。关键原则:

  1. 明确每个模态的角色:是数据源、是参考、还是约束条件
  2. 分步观察优于直接提问:Chain-of-Sight 对复杂图像效果显著
  3. 预处理决定上限:图像裁剪、增强、标注比调 prompt 更有效
  4. 评估要覆盖幻觉:多模态幻觉比纯文本更隐蔽,必须专门检测
  5. 成本意识:高分辨率图像 token 消耗大,按需选择 detail 级别

Maurice | maurice_wen@proton.me