Multimodal Prompt Engineering
灵阙教研团队 | ~15 min read | Updated 2026-02-28
Vision Prompting, audio input, joint image-text reasoning, and multimodal prompt design patterns | 2026-02
1. Capability Boundaries of Multimodal LLMs
Multimodal LLMs (e.g., GPT-4o, Claude 3.5, Gemini 2.0) can take text, images, audio, and even video in a single request. But "can process" does not mean "processes well": the core challenge of multimodal prompt engineering is getting the model to build the right associations across modalities instead of looking at each input in isolation.
Multimodal Input Processing:
Image ──────┐
│
Text ──────┼──> [Multimodal Encoder] ──> [Unified Representation] ──> [Decoder] ──> Output
│
Audio ──────┘
Challenge: The model must ALIGN information across modalities
- Image shows a chart -> Text asks about the trend -> Model must link both
- Audio contains speech -> Text asks for summary -> Model must transcribe & analyze
- Multiple images -> Text asks to compare -> Model must cross-reference
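One concrete way to encourage alignment is to name each input's role in the text itself, so the question unambiguously refers to a specific modality. A minimal sketch of such an interleaved message payload (the URL and wording are illustrative, not from the original):

```python
# Minimal sketch: interleave text labels with each modality so the
# question explicitly refers to a named input (placeholder URL below).
def build_aligned_content(image_url: str, question: str) -> list[dict]:
    return [
        {"type": "text", "text": "Image 1 is a sales chart for Q3."},
        {"type": "image_url", "image_url": {"url": image_url}},
        {"type": "text", "text": f"Referring to Image 1 only: {question}"},
    ]

content = build_aligned_content(
    "https://example.com/chart.png",  # placeholder
    "Which month shows the steepest growth?",
)
```

The label before the image and the back-reference after it give the model an explicit anchor, which matters most when several images share one request.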
2. Vision Prompting Techniques
2.1 Prompting Strategies for Image Understanding
| Strategy | Description | Best for | Effectiveness |
|---|---|---|---|
| Direct question | State the request plainly | General use | Baseline |
| Region guidance | Point at a specific image region | Detail analysis | Better |
| Comparative analysis | Compare multiple images | Difference detection | Good |
| Structured extraction | Require JSON output | Data extraction | Best |
| Chain-of-Sight | Observe step by step before reasoning | Complex images | Best |
2.2 Vision Prompt Design Patterns
import base64
from openai import OpenAI
client = OpenAI()
def encode_image(image_path: str) -> str:
"""Encode image to base64 for API call."""
with open(image_path, "rb") as f:
return base64.b64encode(f.read()).decode("utf-8")
# Pattern 1: Direct Question (simple, often insufficient)
def simple_vision_query(image_path: str, question: str) -> str:
base64_image = encode_image(image_path)
response = client.chat.completions.create(
model="gpt-4o",
messages=[{
"role": "user",
"content": [
{"type": "text", "text": question},
{"type": "image_url", "image_url": {
"url": f"data:image/png;base64,{base64_image}",
"detail": "high", # "low" | "high" | "auto"
}},
],
}],
)
return response.choices[0].message.content
# Pattern 2: Structured Extraction (production-grade)
EXTRACTION_PROMPT = """Analyze this image and extract information in the following JSON format:
{
"image_type": "chart|photo|screenshot|document|diagram",
"main_content": "One-sentence description of what the image shows",
"extracted_data": {
// For charts: {"title": "", "x_axis": "", "y_axis": "", "data_points": [...]}
// For documents: {"text_content": "", "layout": ""}
// For screenshots: {"app_name": "", "ui_elements": [...], "state": ""}
// For diagrams: {"components": [...], "relationships": [...]}
},
"quality_issues": ["blur", "crop", "low_resolution"], // if any
"confidence": 0.0 // 0-1, your confidence in the extraction
}
Be precise. If you cannot read something clearly, mark confidence lower
and note it in quality_issues. Never fabricate data."""
def structured_image_extraction(image_path: str) -> dict:
base64_image = encode_image(image_path)
response = client.chat.completions.create(
model="gpt-4o",
messages=[{
"role": "user",
"content": [
{"type": "text", "text": EXTRACTION_PROMPT},
{"type": "image_url", "image_url": {
"url": f"data:image/png;base64,{base64_image}",
"detail": "high",
}},
],
}],
response_format={"type": "json_object"},
)
import json
return json.loads(response.choices[0].message.content)
2.3 Chain-of-Sight (Step-by-Step Visual Reasoning)
# Chain-of-Sight: Guide the model to observe systematically
# before making conclusions
CHAIN_OF_SIGHT_PROMPT = """Analyze this image step by step:
## Step 1: Overview
Describe the overall scene/content in one sentence.
## Step 2: Key Elements
List all significant visual elements you can identify:
- Element 1: [what it is, where it is, any text/numbers]
- Element 2: ...
## Step 3: Relationships
How do the elements relate to each other?
- Spatial relationships (above, below, inside, connected)
- Logical relationships (cause-effect, part-whole, sequence)
## Step 4: Details
For each key element, zoom in and describe:
- Colors, sizes, proportions
- Any text content (transcribe exactly)
- Any numbers or data (extract precisely)
## Step 5: Interpretation
Based on all observations above, answer: {question}
Important: Base your answer ONLY on what you can actually see.
If something is unclear, say so rather than guessing."""
def chain_of_sight_analysis(
    image_path: str, question: str,
) -> str:
    """Use Chain-of-Sight for thorough image analysis."""
    base64_image = encode_image(image_path)
    # .replace() instead of .format(): the prompt body contains no other
    # placeholders, and this avoids escaping issues
    prompt = CHAIN_OF_SIGHT_PROMPT.replace("{question}", question)
    # Note: `client` is the sync OpenAI() client defined above, so this
    # is a plain (non-async) call
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{base64_image}",
                    "detail": "high",
                }},
            ],
        }],
        temperature=0.2,  # lower temperature for analytical tasks
        max_tokens=4096,
    )
    return response.choices[0].message.content
3. Multi-Image Reasoning
3.1 Multi-Image Comparison
# Multi-image comparison: before/after, A/B, sequence
COMPARISON_PROMPT = """You are given {n} images to compare.
## Task
{task_description}
## Analysis Framework
For each image:
1. Describe what you see
2. Note unique features
Then compare across all images:
3. Similarities (what's the same)
4. Differences (what changed or differs)
5. Ranking or conclusion based on the comparison criteria
## Output Format
{output_format}
"""
def compare_images(
    image_paths: list[str],
    task: str,
    output_format: str = "Structured comparison table in markdown",
) -> str:
    """Compare multiple images with structured analysis."""
    content = [
        {"type": "text", "text": COMPARISON_PROMPT.format(
            n=len(image_paths),
            task_description=task,
            output_format=output_format,
        )},
    ]
    # Label each image in text so the model can reference them by number
    for i, path in enumerate(image_paths):
        base64_img = encode_image(path)
        content.append({"type": "text", "text": f"\n--- Image {i+1} ---"})
        content.append({"type": "image_url", "image_url": {
            "url": f"data:image/png;base64,{base64_img}",
            "detail": "high",
        }})
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
        max_tokens=4096,
    )
    return response.choices[0].message.content
# Example: UI design comparison
result = compare_images(
    image_paths=["design_v1.png", "design_v2.png", "design_v3.png"],
    task="Compare these three UI design variants for a dashboard. "
         "Evaluate: visual hierarchy, information density, color usage, "
         "and overall usability.",
    output_format="Markdown table with scores (1-10) per dimension, "
                  "plus a recommendation.",
)
3.2 Understanding Image Sequences
# Sequential image understanding (e.g., step-by-step instructions,
# UI flow analysis, video frame analysis)
SEQUENCE_PROMPT = """These {n} images form a sequence (ordered from first to last).
## Task
Analyze this sequence and:
1. Describe what happens at each step
2. Identify the overall process/workflow
3. Note any issues, missing steps, or improvements
## Context
{context}
## Output
Provide a structured step-by-step analysis with:
- Step number
- What's shown
- What action is being performed
- Any observations or issues
"""
def analyze_image_sequence(
    image_paths: list[str],
    context: str = "User interaction flow",
) -> str:
    """Analyze a sequence of images as an ordered flow."""
    content = [
        {"type": "text", "text": SEQUENCE_PROMPT.format(
            n=len(image_paths),
            context=context,
        )},
    ]
    # Tag each frame with its step number before appending it
    for i, path in enumerate(image_paths):
        base64_img = encode_image(path)
        content.append({"type": "text", "text": f"\n[Step {i+1}]"})
        content.append({"type": "image_url", "image_url": {
            "url": f"data:image/png;base64,{base64_img}",
            "detail": "high",
        }})
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
        max_tokens=4096,
    )
    return response.choices[0].message.content
4. Audio Input and Voice Prompting
4.1 Audio Capability Matrix
| Model | Audio input | Real-time voice | Languages | Notable capabilities |
|---|---|---|---|---|
| GPT-4o | Yes | Yes (Realtime API) | 50+ | Tone/emotion recognition |
| Gemini 2.0 | Yes | Yes (Live API) | 100+ | Long-audio understanding |
| Claude 3.5 | No (transcribe first) | No | N/A | Text-only analysis |
| Whisper v3 | Transcription only | Streaming | 99 | Best transcription quality |
4.2 Audio Prompt Design
# Audio input with GPT-4o (native audio understanding)
def analyze_audio_with_context(
    audio_path: str,
    analysis_prompt: str,
) -> str:
    """Analyze audio with GPT-4o's native audio understanding."""
    with open(audio_path, "rb") as f:
        audio_data = base64.b64encode(f.read()).decode("utf-8")
    # The chat completions input_audio block accepts "wav" and "mp3";
    # convert other formats (m4a, ogg, flac, ...) before calling the API
    ext = audio_path.rsplit(".", 1)[-1].lower()
    if ext not in ("wav", "mp3"):
        raise ValueError(f"Unsupported input_audio format: {ext}")
    response = client.chat.completions.create(
        model="gpt-4o-audio-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": analysis_prompt},
                {"type": "input_audio", "input_audio": {
                    "data": audio_data,
                    "format": ext,
                }},
            ],
        }],
    )
    return response.choices[0].message.content
# Pattern: Meeting analysis with structured extraction
MEETING_ANALYSIS_PROMPT = """Analyze this meeting recording and extract:
1. **Participants**: Who spoke (identify by voice characteristics if names not mentioned)
2. **Key Topics**: Main discussion points (ordered by time)
3. **Decisions Made**: Any decisions reached
4. **Action Items**: Tasks assigned (who, what, when)
5. **Sentiment**: Overall meeting tone and any notable emotional moments
6. **Unresolved**: Topics that need follow-up
Output as structured JSON matching this schema:
{
"duration_estimate": "HH:MM",
"participants": [{"id": "Speaker_1", "name": "if_known"}],
"topics": [{"title": "", "duration": "", "summary": ""}],
"decisions": [{"decision": "", "context": ""}],
"action_items": [{"assignee": "", "task": "", "deadline": ""}],
"sentiment": {"overall": "", "notable_moments": []},
"unresolved": [""]
}"""
result = analyze_audio_with_context(
    audio_path="meeting_recording.mp3",
    analysis_prompt=MEETING_ANALYSIS_PROMPT,
)
4.3 Whisper + LLM Two-Stage Pipeline
# Two-stage pipeline: Whisper transcription + LLM analysis
# More reliable than native audio for long recordings
from openai import OpenAI
client = OpenAI()
def two_stage_audio_analysis(
    audio_path: str,
    analysis_prompt: str,
    language: str = "zh",
) -> dict:
    """Stage 1: Transcribe with Whisper. Stage 2: Analyze with LLM."""
    # Stage 1: transcription
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=f,
            language=language,
            response_format="verbose_json",  # includes timestamps
            timestamp_granularities=["segment"],
        )
    # Build a timestamped transcript. The SDK returns segments as
    # objects, so use attribute access rather than dict indexing.
    segments = transcript.segments or []
    timestamped_text = "\n".join(
        f"[{seg.start:.1f}s - {seg.end:.1f}s] {seg.text}"
        for seg in segments
    )
    # Stage 2: LLM analysis
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are an expert audio content analyst."},
            {"role": "user", "content": f"""Timestamped transcript:
{timestamped_text}
---
{analysis_prompt}"""},
        ],
        temperature=0.3,
    )
    return {
        "transcript": timestamped_text,
        "analysis": response.choices[0].message.content,
        "duration_seconds": segments[-1].end if segments else 0,
    }
5. Joint Image-Text Reasoning
5.1 Image-Text Consistency Checking
# Verify that image and text descriptions match
# Use case: product listing QA, document verification
CONSISTENCY_CHECK_PROMPT = """You are a quality assurance specialist.
## Task
Check if the image content matches the text description.
## Text Description
{text_description}
## Analysis Requirements
1. List all claims made in the text
2. For each claim, verify if the image supports it:
- CONFIRMED: Image clearly shows this
- CONTRADICTED: Image shows the opposite
- UNVERIFIABLE: Cannot determine from the image
- PARTIALLY: Image shows something related but not exact
3. Overall consistency score (0-100)
4. Critical mismatches that could mislead users
Output as JSON:
{
"claims": [
{"text": "", "status": "", "evidence": ""}
],
"consistency_score": 0,
"critical_mismatches": [],
"recommendation": "approve|revise|reject"
}"""
def check_text_image_consistency(
    image_path: str,
    text_description: str,
) -> dict:
    """Check if image matches text description."""
    base64_image = encode_image(image_path)
    # The prompt template contains literal JSON braces, so .format()
    # would raise; fill the placeholder with .replace() instead
    prompt = CONSISTENCY_CHECK_PROMPT.replace(
        "{text_description}", text_description,
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{base64_image}",
                    "detail": "high",
                }},
            ],
        }],
        response_format={"type": "json_object"},
    )
    import json
    return json.loads(response.choices[0].message.content)
5.2 Document Understanding (OCR + Semantic Analysis)
# Document understanding: combining OCR capability with semantic analysis
DOCUMENT_UNDERSTANDING_PROMPT = """Analyze this document image comprehensively.
## Extraction Requirements
### Layout Analysis
- Document type (invoice, contract, form, letter, report, etc.)
- Page structure (headers, sections, tables, signatures)
- Language(s) present
### Content Extraction
- All text content (preserve original formatting where possible)
- Table data (as structured arrays)
- Key-value pairs (form fields)
- Dates, amounts, reference numbers
### Semantic Analysis
- Document purpose/intent
- Key entities (people, organizations, addresses)
- Important terms or conditions
- Status indicators (approved, pending, rejected)
### Quality Assessment
- OCR confidence (any unclear/ambiguous text)
- Missing information
- Potential issues
Output as structured JSON with sections for each analysis type."""
def understand_document(image_path: str) -> dict:
    """Full document understanding pipeline."""
    base64_image = encode_image(image_path)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are an expert document analyst. "
                "Extract ALL text precisely. Never fabricate content."},
            {"role": "user", "content": [
                {"type": "text", "text": DOCUMENT_UNDERSTANDING_PROMPT},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{base64_image}",
                    "detail": "high",
                }},
            ]},
        ],
        response_format={"type": "json_object"},
        temperature=0,  # zero temperature for extraction tasks
    )
    import json
    return json.loads(response.choices[0].message.content)
6. Multimodal Prompt Design Principles
6.1 Core Design Patterns
Pattern 1: Text-First (image as supplement)
[Detailed text instruction] + [Image for reference]
Best for: Tasks where text defines the requirement, image provides context
Pattern 2: Image-First (text as query)
[Image as primary input] + [Short text question]
Best for: Image analysis, OCR, visual QA
Pattern 3: Parallel Modalities (equal weight)
[Image A] + [Image B] + [Text comparing both]
Best for: Comparison, verification, change detection
Pattern 4: Sequential Modalities (chain)
[Image] -> [Extract text] -> [Analyze text] -> [Generate image]
Best for: Complex pipelines, iterative refinement
Pattern 5: Grounded Generation (text anchored to image regions)
[Image with annotations] + [Text referencing annotations]
Best for: Spatial reasoning, UI analysis, medical imaging
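Pattern 5 can be made concrete by burning numbered region markers into the image before sending it, then referencing those markers in the text. A sketch using Pillow; the region coordinates and marker style are illustrative assumptions, not a fixed API:

```python
from PIL import Image, ImageDraw

def annotate_regions(img: Image.Image, regions: dict[int, tuple]) -> Image.Image:
    """Draw a numbered box around each region so the prompt can refer to
    'region 1', 'region 2', ... unambiguously (grounded generation)."""
    annotated = img.copy()
    draw = ImageDraw.Draw(annotated)
    for num, (x0, y0, x1, y1) in regions.items():
        draw.rectangle((x0, y0, x1, y1), outline="red", width=3)
        draw.text((x0 + 4, y0 + 4), str(num), fill="red")
    return annotated

# Illustrative usage on a synthetic image (coordinates are made up)
img = Image.new("RGB", (400, 300), "white")
annotated = annotate_regions(img, {1: (10, 10, 120, 90), 2: (200, 150, 380, 280)})
prompt = "In region 1, transcribe the header text; in region 2, describe the chart."
```

The annotated image then replaces the raw one in the request, so the text and the pixels share a common coordinate vocabulary.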
6.2 Best-Practices Checklist
| Practice | Explanation | Common mistake |
|---|---|---|
| Define each modality's role | Tell the model what each input is for | Expecting the model to guess what the image is |
| Use high detail | Use detail: high for fine-grained tasks | Default auto drops fine details |
| Preprocess images | Crop, enhance, annotate regions of interest | Sending a whole complex image for a local question |
| Reason step by step | Describe, then analyze, then answer | Asking the complex question directly |
| Label multiple images | Number each image and reference it in text | Multiple images with no indication of which is which |
| Constrain output format | Structured JSON beats free-form text | Letting the model improvise |
| Control temperature | temp=0 for extraction, temp>0.5 for creative tasks | High temperature on extraction tasks |
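The image-preprocessing practice is worth making concrete: when the question concerns one area, cropping to that area both removes distractors and cuts token cost. A minimal sketch (the crop box is an illustrative assumption):

```python
from PIL import Image

def crop_to_region(image_or_path, box: tuple[int, int, int, int]) -> Image.Image:
    """Crop to the region of interest before encoding for the API.
    box = (left, upper, right, lower) in pixels."""
    img = image_or_path if isinstance(image_or_path, Image.Image) \
        else Image.open(image_or_path)
    return img.crop(box)

# Illustrative: keep only the top-right quadrant of a 1000x800 image
full = Image.new("RGB", (1000, 800), "white")
roi = crop_to_region(full, (500, 0, 1000, 400))
```

The cropped region can then be passed through the same encode-and-send path as a full image, usually at a fraction of the tile count.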
6.3 Image Input Optimization
from PIL import Image
import io
def optimize_image_for_api(
image_path: str,
max_dimension: int = 2048,
quality: int = 85,
target_format: str = "JPEG",
) -> str:
"""Optimize image before sending to API.
Why: API charges by token, and image tokens depend on resolution.
GPT-4o: low detail = 85 tokens, high detail = 85 + 170 * tiles
Each tile = 512x512 pixels
Optimization strategy:
- Resize to reduce unnecessary tile count
- Compress to reduce base64 size
- Crop to focus on relevant region
"""
img = Image.open(image_path)
# Resize if too large (preserve aspect ratio)
if max(img.size) > max_dimension:
ratio = max_dimension / max(img.size)
new_size = (int(img.size[0] * ratio), int(img.size[1] * ratio))
img = img.resize(new_size, Image.LANCZOS)
# Convert to RGB if necessary (PNG with alpha -> JPEG)
if img.mode in ("RGBA", "P"):
background = Image.new("RGB", img.size, (255, 255, 255))
if img.mode == "RGBA":
background.paste(img, mask=img.split()[3])
else:
background.paste(img)
img = background
# Compress and encode
buffer = io.BytesIO()
img.save(buffer, format=target_format, quality=quality)
return base64.b64encode(buffer.getvalue()).decode("utf-8")
def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
"""Estimate token cost for an image (OpenAI pricing model)."""
if detail == "low":
return 85
# High detail: scale to fit 2048x2048, then count 512x512 tiles
scale = min(2048 / max(width, height), 1.0)
w, h = int(width * scale), int(height * scale)
# Scale shortest side to 768
short_scale = 768 / min(w, h)
if short_scale < 1:
w, h = int(w * short_scale), int(h * short_scale)
# Count tiles
tiles_w = (w + 511) // 512
tiles_h = (h + 511) // 512
total_tiles = tiles_w * tiles_h
return 85 + 170 * total_tiles
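As a sanity check on the formula above, here is the arithmetic for a 1920x1080 screenshot at detail: high, under the pricing model described in the docstring:

```python
import math

# 1. 1920x1080 already fits within 2048x2048 -> no initial downscale.
# 2. Shortest side scaled to 768: factor 768/1080 ≈ 0.711 -> about 1365x768.
# 3. 512x512 tiles: ceil(1365/512) = 3 across, ceil(768/512) = 2 down -> 6 tiles.
# 4. Cost: 85 base + 170 per tile.
tiles = math.ceil(1365 / 512) * math.ceil(768 / 512)
tokens = 85 + 170 * tiles  # 85 + 170 * 6 = 1105 tokens
```

So a full-HD screenshot costs roughly 1.1k tokens at high detail, versus a flat 85 at low detail, which is why the resize step in optimize_image_for_api pays for itself.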
7. Gemini Multimodal Features
7.1 Gemini Long-Context Multimodality
import google.generativeai as genai
genai.configure(api_key="YOUR_API_KEY")
# Gemini 2.0: up to 1M tokens, supports image/audio/video natively
model = genai.GenerativeModel("gemini-2.0-flash")
# Upload video for analysis
video_file = genai.upload_file("presentation.mp4")
# Wait for server-side processing to finish
import time
while video_file.state.name == "PROCESSING":
    time.sleep(5)
    video_file = genai.get_file(video_file.name)
if video_file.state.name == "FAILED":
    raise RuntimeError("Video processing failed")
# Analyze with multimodal prompt
response = model.generate_content([
video_file,
"""Analyze this presentation video:
1. Slide-by-slide summary (with timestamps)
2. Speaker's key arguments
3. Data/charts mentioned (extract key numbers)
4. Audience engagement indicators (if visible)
5. Improvement suggestions
Focus on factual extraction, not interpretation.""",
])
print(response.text)
7.2 Multimodal Safety Boundaries
# Safety considerations for multimodal prompts
SAFETY_GUIDELINES = """
Multimodal Safety Checklist:
1. PII Detection
- Scan images for faces, IDs, documents with personal info
- Never extract or store PII without explicit consent
- Blur/redact before processing when possible
2. Content Verification
- Images can be manipulated (deepfakes, edited screenshots)
- Never treat image content as ground truth for critical decisions
- Cross-reference with other data sources
3. Prompt Injection via Images
- Images can contain text that acts as instructions
- OCR content should be treated as UNTRUSTED DATA
- Never execute instructions found in images
4. Copyright
- Extracted text may be copyrighted
- Generated descriptions of copyrighted images: OK
- Reproducing copyrighted content from images: NOT OK
5. Bias
- Vision models have known biases in face analysis
- Avoid demographic classification tasks
- Be explicit about uncertainty in visual judgments
"""
8. Multimodal Evaluation Metrics
8.1 Evaluation Dimensions
| Dimension | Metric | Measurement |
|---|---|---|
| Visual accuracy | Object recognition accuracy | Compare against labeled dataset |
| Text extraction | OCR character accuracy | Levenshtein distance |
| Spatial reasoning | Positional-relation accuracy | Human annotation |
| Cross-modal alignment | Image-text consistency | LLM-as-Judge |
| Hallucination rate | Descriptions of things absent from the image | Human review |
| Numerical precision | Chart data extraction accuracy | Exact match |
8.2 Evaluation Implementation
from dataclasses import dataclass
@dataclass
class MultimodalEvalResult:
visual_accuracy: float # 0-1
text_extraction_cer: float # Character Error Rate, lower is better
spatial_reasoning: float # 0-1
cross_modal_alignment: float # 0-1
hallucination_rate: float # 0-1, lower is better
numerical_precision: float # 0-1
def evaluate_multimodal_prompt(
    prompt_template: str,
    test_set: list[dict],
    model: str = "gpt-4o",
) -> MultimodalEvalResult:
    """Evaluate a multimodal prompt against a labeled test set.

    Each test case should have:
    - image_path: path to test image
    - ground_truth: expected structured output
    - task_type: "ocr" | "spatial" | "counting" | "numeric" | "comparison"

    Assumes two helpers exist elsewhere: run_multimodal_prompt() runs the
    prompt against one image, and judge_alignment() scores image-text
    alignment with an LLM judge.
    """
    results = {
        "visual_correct": 0,
        "text_cer_sum": 0.0,
        "ocr_cases": 0,
        "spatial_correct": 0,
        "alignment_scores": [],
        "hallucinations": 0,
        "numerical_correct": 0,
        "total": len(test_set),
    }
    for case in test_set:
        output = run_multimodal_prompt(
            prompt_template, case["image_path"], model,
        )
        gt = case["ground_truth"]
        # Score based on task type
        if case["task_type"] == "ocr":
            cer = character_error_rate(output.get("text", ""), gt["text"])
            results["text_cer_sum"] += cer
            results["ocr_cases"] += 1
        elif case["task_type"] == "spatial":
            if output.get("position") == gt["position"]:
                results["spatial_correct"] += 1
        elif case["task_type"] == "counting":
            if output.get("count") == gt["count"]:
                results["visual_correct"] += 1
            if output.get("count", 0) > gt["count"]:
                results["hallucinations"] += 1
        elif case["task_type"] == "numeric":
            if output.get("value") == gt["value"]:
                results["numerical_correct"] += 1
        # Cross-modal alignment via LLM judge
        alignment = judge_alignment(output, gt, case["image_path"])
        results["alignment_scores"].append(alignment)
    n = results["total"]
    return MultimodalEvalResult(
        visual_accuracy=results["visual_correct"] / n,
        # Average CER over OCR cases only, not the whole test set
        text_extraction_cer=(
            results["text_cer_sum"] / results["ocr_cases"]
            if results["ocr_cases"] else 0.0
        ),
        spatial_reasoning=results["spatial_correct"] / n,
        cross_modal_alignment=sum(results["alignment_scores"]) / n,
        hallucination_rate=results["hallucinations"] / n,
        numerical_precision=results["numerical_correct"] / n,
    )
def character_error_rate(predicted: str, reference: str) -> float:
    """Calculate Character Error Rate using edit distance."""
    import Levenshtein  # third-party: pip install Levenshtein
    if not reference:
        return 0.0 if not predicted else 1.0
    return Levenshtein.distance(predicted, reference) / len(reference)
9. Summary
The core of multimodal prompt engineering is not "throw the image at the model" but deliberately designing how the modalities cooperate. Key principles:
- Define each modality's role explicitly: data source, reference material, or constraint
- Step-by-step observation beats direct questioning: Chain-of-Sight pays off clearly on complex images
- Preprocessing sets the ceiling: cropping, enhancing, and annotating images is often more effective than prompt tweaking
- Evaluation must cover hallucination: multimodal hallucinations are subtler than text-only ones and need dedicated detection
- Stay cost-aware: high-resolution images are token-hungry, so choose the detail level per task
Maurice | maurice_wen@proton.me