AI Video Agent Product-Grade White Paper | SOTA Video Agent Blueprint
灵阙 Teaching and Research Team
Updated 2025-12-25
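SOTA Video Agent Blueprint: a replayable production system that takes a one-sentence request to a cinema-grade final cut.
Its core consists of a multi-agent film crew, three bibles with anchor-based consistency, No-Rollback versioning, a QC attribution-and-repair loop, and budget-adaptive scheduling.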
Core: Planner Orchestrator
Pipeline: Brief → Story → Shot → Assets → Edit → QC → Publish
Deterministic: Remotion/Code for Text & UI
Artifacts: Contract-First JSON
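0. Abstract
This white paper proposes a product architecture for an AI video agent (Video Agent) aimed at a "cinema-grade experience". The system does not target "single-model video generation"; its core is controllable production (Production). It turns user intent into an executable film-production pipeline and relies on a multi-agent film crew with divided roles, contract-based tool calls, immutable asset versioning (No-Rollback), and a quality loop (QC → attribution → automatic repair) to deliver a SOTA experience that is highly consistent, highly stable, iterable, extensible, and replayable.
Key conclusion: real SOTA comes from systems engineering, where the "production system" matters more than the "generation model". What you deliver is an explainable, repairable, reproducible pipeline for finished videos.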
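1. Experience Definition and Success Metrics
1.1 Experience definition: users want "delivery", not "generation"
- Behaves like a production team: it can plan, explain, repair, and reproduce
- Stable delivery: failures recover automatically, without a human rewriting prompts over and over
- Controllable and tunable: quality, cost, and speed can be traded off explicitly
- Reusable: characters, brands, and styles become long-lived assets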
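1.2 Core metrics (must be measurable)
| Metric | How it is measured |
|---|---|
| Semantic alignment | Script key-point coverage / intent hit rate |
| Consistency | Drift rate of character / world / color grading / brand |
| Stability | First-pass success rate, average retries, automatic repair rate |
| Controllability | Style / pacing / aspect ratio / subtitle language take effect reliably |
| Cost efficiency | Cost per minute, cache reuse rate |
| Latency | Time to first preview, time to final delivery |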
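2. Product Form and User Journey
2.1 Three modes (one foundation, different levels of exposure)
- One-Click (default): one sentence → a draft plus 3 style candidates within 30 seconds
- Pro (controllable): shows the "director's plan": script / shot list / style bible / budget slider (fast / stable / expensive)
- Studio (orchestratable): editable shotlist, template system, asset-library reuse, brand packs, multilingual voiceover, batch generation
2.2 The "director's plan" is the key to trust
Before every generation the system outputs: Plan (how it will be done) + Trade-off (quality / cost / speed) + Failover (how failures are repaired automatically).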
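3. End-to-End Pipeline: From Brief to Publish
3.1 Overview: production, not generation
Brief → Story → Shot → Assets → Edit → QC → Publish
3.2 Artifact list (Artifacts)
- brief.json: user intent and constraints (duration, aspect ratio, audience, taboos, brand)
- script.md / script.json: narration / dialogue / information density / emotional curve
- style_bible.json: style bible (color, camera language, lighting, materials, fonts)
- character_bible.json: character bible (locked attributes, costumes, action templates)
- world_bible.json: world bible (scene rules, props, atmosphere)
- shotlist.json: shot list (an executable spec for every shot)
- edit_manifest.json: edit manifest (equivalent to an EDL / Remotion props)
- qc_report.json: quality report (scores, attribution, repair suggestions)
- final.mp4: the final cut, plus final_meta.json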
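4. Multi-Agent "Film Crew" Architecture
4.1 Role assignment (every agent outputs an evaluable artifact)
| Agent | Responsibilities |
|---|---|
| Producer | Budget / latency / risk; concurrency and degradation strategy |
| Director | Narrative and camera language; overall style control |
| Scriptwriter | Script, narration, emotional curve, information density |
| Storyboarder | Script → shotlist (structured JSON) |
| Art Director | The three bibles (Style / Character / World) and anchor strategy |
| Runner | Shot-level model calls that generate clips |
| Editor | Pacing, transitions, subtitle templates, beat alignment |
| QC Inspector | Scoring, attribution, triggering automatic repair and retries |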
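4.2 Collaboration principles
- Contract-First: all agents align through JSON contracts instead of guessing
- Deterministic Assembly: subtitles / graphics / UI are rendered by code, which avoids blurry, jittery AI-generated text
- No-Rollback: a failure never overwrites old assets; it only adds a new version, which keeps everything traceable and reproducible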
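5. Consistency System: Anchors + Bibles Set the Ceiling
5.1 The three bibles (Bibles)
- Style Bible: color, contrast, grain, camera language, lighting rules, fonts and typography, safe area
- Character Bible: facial features, hairstyle, outfit sets, posture, expression range, locked (do-not-change) attributes
- World Bible: scene assets, prop list, period style, physical rules, ambient atmosphere (rain / fog / dust)
5.2 Anchor (Anchors) layers
- Character anchors: three-view turnaround sheet + key expression / action keyframes
- Scene anchors: scene look-development image + LUT / color-grading rules
- Prop anchors: locked appearance of key props
- Voice anchors (optional): narration voice and emotional baseline
Consistency first: anchor (Anchor) → then drive (Drive) → then assemble (Assemble).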
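6. Shot List (Shotlist) Specification: The Core Document the System Executes
6.1 Minimal shotlist fields (example)
{
"shot_id": "S03",
"duration_sec": 4.0,
"type": "character|world|vfx|ui",
"intent": "convey a key selling point / emotional turn / information point",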
"prompt": {
"visual": "...",
"motion": "...",
"camera": "35mm, dolly-in, shallow DOF",
"lighting": "soft key, warm rim",
"style_refs": ["style_bible:v1"],
"anchors": ["char_anchor:v2", "scene_anchor:v1"]
},
"audio": {
"voiceover": "narration text",
"sfx": ["whoosh_soft"],
"music_cue": "beat@12.5"
},
"subtitle": {
"text": "subtitle text",
"template": "kinetic_01",
"safe_area": true
},
"quality_target": {
"min_score": 0.82,
"critical": ["identity", "readability"]
}
}
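Because agents exchange shots as JSON contracts, a consumer would usually check a shot before acting on it. The following is a minimal sketch of such a check in Python. The helper `validate_shot`, the choice of which fields count as required, and the 0 to 1 range check on `min_score` are assumptions made for illustration; the specification above only shows an example object.

```python
from typing import Any

# Fields treated as required here, with their expected Python types.
# Which fields are mandatory is an assumption; the source only gives an example.
REQUIRED_FIELDS = {
    "shot_id": str,
    "duration_sec": (int, float),
    "type": str,
    "intent": str,
    "prompt": dict,
    "quality_target": dict,
}
ALLOWED_TYPES = {"character", "world", "vfx", "ui"}


def validate_shot(shot: dict[str, Any]) -> list[str]:
    """Return contract violations; an empty list means the shot is acceptable."""
    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in shot:
            errors.append(f"missing field: {field}")
        elif not isinstance(shot[field], expected):
            errors.append(f"wrong type for field: {field}")
    if isinstance(shot.get("type"), str) and shot["type"] not in ALLOWED_TYPES:
        errors.append(f"unknown shot type: {shot['type']}")
    duration = shot.get("duration_sec")
    if isinstance(duration, (int, float)) and duration <= 0:
        errors.append("duration_sec must be positive")
    target = shot.get("quality_target")
    if isinstance(target, dict):
        min_score = target.get("min_score")
        if not isinstance(min_score, (int, float)) or not 0.0 <= min_score <= 1.0:
            errors.append("quality_target.min_score must be in [0, 1]")
    return errors
```

Returning a list of violations rather than raising on the first one lets the Storyboarder agent see every problem in a single round trip.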
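6.2 Shot grading (cost / quality adaptive)
- S grade: hero shots (product core / character close-ups) → stronger model + more samples + stricter QC
- A grade: shots that advance the narrative → balanced cost
- B grade: transitions / atmosphere / B-roll → cheap model or stock-library reuse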
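7. Post-Production and Assembly: Remotion as the Deterministic Rendering Engine
7.1 Why subtitles and graphics must be rendered by code
- AI-generated text tends to be blurry, jittery, misspelled, and impossible to lay out reliably
- Engineered rendering guarantees sharpness, readability, safe-area compliance, and brand consistency
- Remotion/FFmpeg handle final compositing, loudness normalization, and multi-bitrate packaging
7.2 Edit manifest (Edit Manifest)
Each clip's in/out points, transitions, subtitle timeline, UI overlays, and BGM alignment points are written to edit_manifest.json, from which the render can be reproduced in one step.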
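As a rough illustration of how the timeline part of such a manifest could be derived, the sketch below converts per-clip durations in seconds into `start_frame` / `duration_frames` entries, the two field names used by the edit_manifest.json schema in the appendix. The function `build_timeline` is hypothetical, and it assumes hard cuts (no overlapping transitions) and rounding to the nearest frame.

```python
def build_timeline(clips: list[tuple[str, float]], fps: int = 30) -> list[dict]:
    """Lay clips end to end and express their positions in frames.

    `clips` is a list of (asset_path, duration_sec) pairs.
    Assumes hard cuts: each clip starts where the previous one ends.
    """
    timeline = []
    cursor = 0  # first frame of the next clip
    for asset, duration_sec in clips:
        frames = round(duration_sec * fps)
        timeline.append(
            {"asset": asset, "start_frame": cursor, "duration_frames": frames}
        )
        cursor += frames
    return timeline


timeline = build_timeline(
    [("assets/clips/S01_take2.mp4", 4.0), ("assets/clips/S02_take1.mp4", 2.5)]
)
```

At 30 fps the 4.0 s clip occupies frames 0 to 119 (120 frames), so the second clip starts at frame 120 and runs for 75 frames. Keeping all positions as integers in the manifest is what makes a re-render land on exactly the same cut points.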
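8. Quality Loop (QC): The Source of SOTA Stability
8.1 Quality dimensions (at least 8 recommended)
| Dimension | What is checked |
|---|---|
| Semantic alignment | Whether the shot conveys the script's key points |
| Character consistency | Drift in face / hairstyle / clothing / body shape |
| Visual stability | Flicker, warping, ghosting |
| Motion plausibility | Physics / pose / lip sync |
| Subtitle readability | Occlusion, safe area, line breaks |
| Pacing | Average shot length, pauses, information density |
| Audio-visual consistency | Narration match, beat alignment |
| Brand consistency | Color / font / logo rules |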
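8.2 QC output (qc_report.json)
- Per-shot scores + overall score
- Failure attribution labels: identity_drift / flicker / subtitle_occlusion / off_brief…
- Repair suggestions: automatically generated repair strategies and retry parameters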
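8.3 Failure attribution → automatic repair matrix (core)
| Failure type | Typical symptom | Automatic repair action |
|---|---|---|
| identity_drift | Character's face drifts | Resample from the character anchor; raise anchor weight; constrain outfit / hairstyle |
| flicker/warp | Flicker / warping | Change parameters; shorten the shot; switch to B-roll; deflicker in post |
| off_brief | Does not match intent | Rewrite the shot's intent + prompt; replace the shot type |
| subtitle_occlusion | Subtitles cover the subject | Remotion template repositions automatically + smart subject avoidance |
| pacing_bad | Poor pacing | Trim / reorder / add B-roll; align to music beats |
| audio_mismatch | Audio and picture do not match | Rewrite the narration or replace the shot; re-align to the beat |
Without automatic attribution and repair, there is no stable delivery.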
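9. No-Rollback Immutable Versioning: Replayable and Auditable
9.1 Versioning principles
- All assets are immutable: they are written to artifacts/{job_id}/v{n}/
- A failure only adds a new version; it never overwrites old files
- Every final cut must be replayable (Replayable) in one step from edit_manifest.json + assets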
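These principles can be enforced at the file-system level. The sketch below is one possible approach, not the system's actual implementation: `next_version_dir` and `write_artifact` are hypothetical helpers, and the four-digit `v0001` naming follows the directory example in this section.

```python
import json
import re
from pathlib import Path


def next_version_dir(job_dir: Path) -> Path:
    """Create and return the next vNNNN directory without touching older ones."""
    job_dir.mkdir(parents=True, exist_ok=True)
    existing = []
    for child in job_dir.iterdir():
        match = re.fullmatch(r"v(\d+)", child.name)
        if child.is_dir() and match:
            existing.append(int(match.group(1)))
    version_dir = job_dir / f"v{max(existing, default=0) + 1:04d}"
    version_dir.mkdir(exist_ok=False)  # fail loudly rather than reuse a version
    return version_dir


def write_artifact(version_dir: Path, name: str, payload: dict) -> Path:
    """Write a JSON artifact once; a second write to the same path fails."""
    path = version_dir / name
    # Mode "x" raises FileExistsError if the file already exists,
    # which is what prevents silent overwrites.
    with open(path, "x", encoding="utf-8") as handle:
        json.dump(payload, handle, ensure_ascii=False, indent=2)
    return path
```

A retry then calls `next_version_dir` again and writes its outputs there, so earlier versions stay available for comparison and replay.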
9.2 Suggested directory layout
artifacts/job_001/
  brief.json
  style_bible/v1.json
  character_bible/v2.json
  shotlist/v3.json
  assets/
    anchors/...
    clips/...
    audio/...
  v0001/
    edit_manifest.json
    qc_report.json
    render_log.txt
    final.mp4
  v0002/ ...
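10. Scheduling and Cost Control: Budget-Aware Scheduler
10.1 Scheduling goals
- The first version must be fast: prioritize a previewable draft
- High-value shots must be reliable: resources are weighted by S/A/B grade
- Failures must be bounded: cap the number of retries; degrade explicitly when necessary
10.2 Key strategies
- Concurrency: shot-level parallelism (concurrency limited by budget)
- Caching: reuse character / scene anchors, LUTs, subtitle templates, and music segments
- Degradation: hero shot fails → retry + stronger constraints; secondary shot fails → substitute stock footage / B-roll / animated still frame
- Budget slider: fast/balanced/premium maps to sample count, model choice, QC threshold, and retry cap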
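The budget slider and the degradation rule can be read as two small lookups. In the sketch below every number and tier name in `BUDGET_PRESETS` is an illustrative placeholder, not a value from this white paper, and treating only B-grade shots as "secondary" is likewise an assumption.

```python
# Placeholder presets: sample counts, tier labels, thresholds, and retry
# caps are invented for illustration only.
BUDGET_PRESETS = {
    "fast": {"samples": 1, "model_tier": "cheap", "min_score": 0.75, "max_retries_per_shot": 1},
    "balanced": {"samples": 2, "model_tier": "standard", "min_score": 0.82, "max_retries_per_shot": 2},
    "premium": {"samples": 4, "model_tier": "strong", "min_score": 0.86, "max_retries_per_shot": 3},
}


def on_shot_failure(grade: str, retries_used: int, mode: str) -> str:
    """Decide what the scheduler does after a shot fails QC.

    Assumption: B-grade shots are the "secondary" shots that get substituted;
    S- and A-grade shots are retried until the preset's retry cap is reached.
    """
    preset = BUDGET_PRESETS[mode]
    if grade == "B":
        return "substitute_stock_or_broll"
    if retries_used < preset["max_retries_per_shot"]:
        return "retry_with_stronger_constraints"
    return "degrade_explicitly"
```

For example, a failed S-grade shot in `fast` mode gets one retry and is then degraded explicitly, while a failed B-grade shot is substituted right away in every mode.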
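11. Safety and Compliance
- Copyright: label asset sources and keep them traceable; avoid directly replicating protected works
- Likeness / trademarks: obtain explicit user authorization; brand-pack and logo usage rules are configurable
- Content safety: sensitive-content detection; region- and industry-specific compliance policies
- Audit: key decisions and generation parameters are recorded in the job log (for post-mortems and risk control)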
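12. Rollout Roadmap (MVP → Beta → Studio)
12.1 MVP (2–4 weeks)
- One-Click: one sentence → script → shotlist → 6–10 shots → Remotion compositing
- Basic No-Rollback + artifacts persisted to disk
- Basic QC: three classes (character drift, flicker, subtitle occlusion)
Acceptance: first preview < 60s; first-pass success rate > 60%
12.2 Beta (4–8 weeks)
- Pro mode: visible director's plan + budget slider
- Three-bible system + anchor reuse
- QC extended to 8 dimensions + automatic repair matrix
Acceptance: first-pass success rate > 80%; average retries < 1.5; consistency drift drops significantly
12.3 Studio (8–12 weeks)
- Workbench: script / shot list / bibles / asset library / version comparison
- Team collaboration, batch generation, brand-pack management
- Replayable publishing: any version can be reproduced in one step
Acceptance: consistency holds across 10 consecutive videos for the same character / brand; production at scale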
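13. Appendix: Contracts and Templates
This appendix provides engineering-ready schemas, a template pack, a draft QC algorithm, and a specification for automatic repair patches.
A. Schema specification (core files)
brief.json (user intent and constraints)
{
"job_id": "job_20251224_0001",
"request": {
"prompt": "original text of the one-sentence request",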
"goal": "promo|edu|demo|drama|report|mashup",
"audience": "general|professional|teen|enterprise",
"tone": "premium|fun|serious|warm|energetic",
"language": "zh-CN",
"duration_sec": 45,
"aspect_ratio": "9:16|16:9|1:1|4:3",
"platform": "douyin|bilibili|youtube|xiaohongshu"
},
"constraints": {
"must_have": ["product logo appears", "emphasize selling point A"],
"must_not": ["gore", "specific sensitive words"],
"brand_pack": "brand_x_v3",
"music_style": "cinematic|lofi|upbeat",
"subtitle": { "enabled": true, "style": "kinetic_01" },
"voiceover": { "enabled": true, "speaker": "female_01", "emotion": "confident" }
},
"budget": {
"mode": "fast|balanced|premium",
"max_retries_per_shot": 2,
"max_total_cost": 8.0,
"deadline_sec": 120
}
}
style_bible.json (style bible)
{
"version": "v1",
"look": {
"palette": { "primary": "#8b5cf6", "bg": "#09090b", "text": "#e4e4e7" },
"contrast": "medium-high",
"grain": "subtle",
"lut": "teal_orange_soft"
},
"cinematography": {
"lens": ["35mm", "50mm"],
"camera_moves": ["dolly-in", "slow-pan"],
"do_not": ["handheld_shaky", "fisheye"]
},
"lighting": { "key": "soft", "temperature": "warm", "rim": true },
"typography": {
"font_zh": "NotoSansSC",
"font_en": "Inter",
"subtitle_safe_area": true
},
"composition": { "subject_rule": "center-third", "headroom": "medium" }
}
character_bible.json (character bible)
{
"version": "v2",
"characters": [
{
"id": "char_01",
"name": "Protagonist",
"anchors": {
"sheet": "assets/anchors/char_01_sheet_v2.png",
"expressions": ["assets/anchors/char_01_smile.png"]
},
"lock": { "hair": true, "outfit": true, "face": true, "body": true },
"outfits": ["purple_jacket_v1"],
"do_not": ["change_gender", "change_age_group", "tattoos"]
}
]
}
shotlist.json (shot list)
{
"version": "v3",
"global": { "fps": 30, "style_bible": "style_bible/v1.json", "audio_bpm": 120 },
"shots": [
{
"shot_id": "S01",
"grade": "S",
"duration_sec": 4.0,
"type": "character",
"intent": "open by establishing the protagonist and the scene",
"inputs": {
"char_id": "char_01",
"anchors": ["assets/anchors/char_01_sheet_v2.png"],
"scene_anchor": "assets/anchors/scene_citynight_v1.png"
},
"gen": {
"prompt": {
"visual": "cinematic night city, neon, ...",
"motion": "walk toward camera, confident",
"camera": "35mm, dolly-in, shallow DOF"
},
"model_policy": { "preferred": ["kling_2_6_pro"], "fallback": ["veo_3_1"] },
"sampling": { "n": 2, "seed_policy": "locked_after_pass" }
},
"audio": { "voiceover": "one line of narration", "sfx": ["whoosh_soft"], "music_cue": "beat@0.0" },
"subtitle": { "text": "subtitle", "template": "kinetic_01" },
"quality_target": { "min_score": 0.84, "critical": ["identity", "readability"] }
}
]
}
edit_manifest.json (edit manifest / Remotion props)
{
"fps": 30,
"resolution": { "w": 1080, "h": 1920 },
"timeline": [
{
"asset": "assets/clips/S01_take2.mp4",
"start_frame": 0,
"duration_frames": 120,
"subtitle": { "text": "subtitle", "template": "kinetic_01", "pos": "auto_avoid_subject" },
"overlays": [{ "type": "logo", "asset": "brand/logo.png", "pos": "top_right" }]
}
],
"audio": {
"music": "assets/audio/bgm.mp3",
"voiceover": "assets/audio/vo.wav",
"mix": { "ducking": true, "loudness_target_lufs": -14 }
}
}
qc_report.json (quality report)
{
"overall_score": 0.86,
"gates": { "pass": true, "critical_fail": [] },
"shot_scores": [
{
"shot_id": "S01",
"score": 0.88,
"metrics": {
"on_brief": 0.9,
"identity": 0.92,
"stability": 0.8,
"readability": 0.95,
"audio_match": 0.85
},
"issues": [],
"repair_suggestions": []
}
],
"summary": { "retries_used": 1, "cost_est": 2.1, "latency_sec": 58 }
}
B. Template pack (4 templates, ready to run)
short_edu_9x16_v1 (short educational video)
{
"template_id": "short_edu_9x16_v1",
"defaults": {
"request": { "goal": "edu", "duration_sec": 40, "aspect_ratio": "9:16", "tone": "energetic" },
"constraints": { "subtitle": { "enabled": true, "style": "kinetic_02" }, "voiceover": { "enabled": true, "emotion": "confident" } },
"budget": { "mode": "balanced", "max_retries_per_shot": 1, "deadline_sec": 90 }
},
"shot_pattern": [
{ "id": "Hook", "type": "ui", "sec": 3, "grade": "A" },
{ "id": "Point1", "type": "ui", "sec": 7, "grade": "A" },
{ "id": "Broll1", "type": "world", "sec": 4, "grade": "B" },
{ "id": "Point2", "type": "ui", "sec": 7, "grade": "A" },
{ "id": "Broll2", "type": "world", "sec": 4, "grade": "B" },
{ "id": "Point3", "type": "ui", "sec": 7, "grade": "A" },
{ "id": "CTA", "type": "ui", "sec": 6, "grade": "A" }
],
"qc_gate": { "hard": ["readability", "on_brief", "safe_area"], "soft": ["stability"] }
}
brand_film_16x9_v1 (brand film)
{
"template_id": "brand_film_16x9_v1",
"defaults": {
"request": { "goal": "promo", "duration_sec": 55, "aspect_ratio": "16:9", "tone": "premium" },
"constraints": { "subtitle": { "enabled": false }, "voiceover": { "enabled": true, "emotion": "warm" } },
"budget": { "mode": "premium", "max_retries_per_shot": 2, "deadline_sec": 180 }
},
"shot_pattern": [
{ "id": "MoodOpen", "type": "world", "sec": 8, "grade": "A" },
{ "id": "HeroShot", "type": "character", "sec": 6, "grade": "S" },
{ "id": "Value1", "type": "world", "sec": 7, "grade": "A" },
{ "id": "Value2", "type": "world", "sec": 7, "grade": "A" },
{ "id": "Proof", "type": "ui", "sec": 8, "grade": "A" },
{ "id": "Closing", "type": "ui", "sec": 6, "grade": "A" },
{ "id": "EndCard", "type": "ui", "sec": 3, "grade": "A" }
],
"qc_gate": { "hard": ["brand_consistency", "stability", "on_brief"], "soft": ["audio_match"] }
}
product_demo_ui_v1 (product feature walkthrough)
{
"template_id": "product_demo_ui_v1",
"defaults": {
"request": { "goal": "demo", "duration_sec": 70, "aspect_ratio": "16:9", "tone": "serious" },
"constraints": { "subtitle": { "enabled": true, "style": "clean_lowerthird" }, "voiceover": { "enabled": true, "emotion": "neutral" } },
"budget": { "mode": "balanced", "max_retries_per_shot": 1, "deadline_sec": 150 }
},
"shot_pattern": [
{ "id": "IntroUI", "type": "ui", "sec": 6, "grade": "A" },
{ "id": "Step1", "type": "ui", "sec": 12, "grade": "A" },
{ "id": "Step2", "type": "ui", "sec": 12, "grade": "A" },
{ "id": "Step3", "type": "ui", "sec": 12, "grade": "A" },
{ "id": "Broll", "type": "world", "sec": 6, "grade": "B" },
{ "id": "Summary", "type": "ui", "sec": 10, "grade": "A" },
{ "id": "CTA", "type": "ui", "sec": 12, "grade": "A" }
],
"qc_gate": { "hard": ["readability", "safe_area", "on_brief"], "soft": ["stability"] }
}
micro_drama_character_v1 (micro drama)
{
"template_id": "micro_drama_character_v1",
"defaults": {
"request": { "goal": "drama", "duration_sec": 60, "aspect_ratio": "9:16", "tone": "warm" },
"constraints": { "subtitle": { "enabled": true, "style": "dialogue_bubble" }, "voiceover": { "enabled": false } },
"budget": { "mode": "premium", "max_retries_per_shot": 3, "deadline_sec": 240 }
},
"shot_pattern": [
{ "id": "Setup", "type": "character", "sec": 6, "grade": "S" },
{ "id": "Beat1", "type": "character", "sec": 8, "grade": "S" },
{ "id": "Reaction", "type": "character", "sec": 6, "grade": "S" },
{ "id": "Beat2", "type": "character", "sec": 10, "grade": "S" },
{ "id": "Turn", "type": "vfx", "sec": 6, "grade": "A" },
{ "id": "Resolve", "type": "character", "sec": 12, "grade": "S" },
{ "id": "End", "type": "ui", "sec": 12, "grade": "A" }
],
"qc_gate": { "hard": ["identity", "stability", "readability"], "soft": ["on_brief"] }
}
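Each template carries both a target `duration_sec` and a per-shot `sec`, so a template loader may want to check that the two agree. The sketch below shows such a check; `pattern_seconds` and `duration_gap` are hypothetical helpers. With the values printed above, product_demo_ui_v1 and micro_drama_character_v1 sum exactly to their targets (70 s and 60 s), while short_edu_9x16_v1 sums to 38 s against a 40 s target and brand_film_16x9_v1 to 45 s against 55 s. The source does not say whether that gap is intentional slack, for example for transitions.

```python
def pattern_seconds(template: dict) -> int:
    """Total length of the shot pattern in seconds."""
    return sum(shot["sec"] for shot in template["shot_pattern"])


def duration_gap(template: dict) -> int:
    """Target duration minus pattern length; 0 means they agree exactly."""
    target = template["defaults"]["request"]["duration_sec"]
    return target - pattern_seconds(template)


# Durations copied from the short_edu_9x16_v1 template above
# (other fields omitted for brevity).
short_edu = {
    "template_id": "short_edu_9x16_v1",
    "defaults": {"request": {"duration_sec": 40}},
    "shot_pattern": [
        {"id": "Hook", "sec": 3},
        {"id": "Point1", "sec": 7},
        {"id": "Broll1", "sec": 4},
        {"id": "Point2", "sec": 7},
        {"id": "Broll2", "sec": 4},
        {"id": "Point3", "sec": 7},
        {"id": "CTA", "sec": 6},
    ],
}

gap = duration_gap(short_edu)  # 40 - 38 = 2 seconds unaccounted for
```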
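C. Draft QC scoring algorithm (implementable in the MVP)
Principle: Hard Gate first (block disasters), then Soft Score (ranking / optimization). The MVP does not aim for perfect understanding; it aims to attribute, repair, and deliver reliably.
C1. Metrics and computation (key points)
- readability: subtitle safe area / subject occlusion (IoU between the subtitle box and the subject box)
- identity: similarity between the anchor face embedding and keyframes (min(sim))
- stability: adjacent-frame SSIM/LPIPS or optical-flow consistency (averaged)
- on_brief: similarity between keyframe captions and the intent embedding
- safe_area: rule check (out of bounds is a hard failure)
- audio_match (Beta): ASR text alignment + beat alignment error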
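Two of these metrics reduce to short formulas once the detector and embedding outputs are available. The sketch below uses only the standard library and makes several assumptions: boxes are `(x1, y1, x2, y2)` tuples, embeddings are plain lists of floats, "sim" is cosine similarity, and the readability score is taken as `1 - IoU`. The list above names the ingredients but does not fix these details.

```python
import math

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def readability_score(subtitle_box: Box, subject_box: Box) -> float:
    """Assumed mapping: no overlap scores 1.0, full overlap scores 0.0."""
    return 1.0 - iou(subtitle_box, subject_box)


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0


def identity_score(anchor: list[float], keyframes: list[list[float]]) -> float:
    """min(sim): the worst keyframe decides, so a single drifted frame fails."""
    return min(cosine(anchor, frame) for frame in keyframes)
```

Taking the minimum rather than the mean is what makes identity behave like a gate: one badly drifted keyframe is enough to flag the shot.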
C2. Score aggregation (example)
overall = 0.25*on_brief + 0.25*identity + 0.20*stability + 0.20*readability + 0.10*audio_match
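The weighted sum and the "hard gate first" principle can be combined in a few lines. In the sketch below the weights are the ones from the formula above, while `evaluate`, its `hard_floor` parameter, and the shape of the returned dictionary (modeled on the `gates` object in qc_report.json) are assumptions. It also assumes all five metrics are present, although audio_match is marked as Beta.

```python
WEIGHTS = {
    "on_brief": 0.25,
    "identity": 0.25,
    "stability": 0.20,
    "readability": 0.20,
    "audio_match": 0.10,
}


def overall_score(metrics: dict[str, float]) -> float:
    """Weighted sum of the five soft-score metrics."""
    return sum(weight * metrics[name] for name, weight in WEIGHTS.items())


def evaluate(
    metrics: dict[str, float],
    critical: list[str],
    hard_floor: float,
    min_score: float,
) -> dict:
    """Hard gate first, soft score second.

    A critical metric below `hard_floor` (a hypothetical per-metric floor)
    fails the shot regardless of how high the weighted total is.
    """
    critical_fail = [name for name in critical if metrics[name] < hard_floor]
    score = overall_score(metrics)
    passed = not critical_fail and score >= min_score
    return {"score": score, "gates": {"pass": passed, "critical_fail": critical_fail}}
```

With the sample metrics from qc_report.json (0.9, 0.92, 0.8, 0.95, 0.85) the weighted sum comes to 0.89. The sample report lists 0.88 for that shot, and the source does not say how that figure was derived.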
D. Automatic repair patch specification (executable)
Repair = a local patch applied to the shotlist, which keeps every fix replayable and auditable.
{
"shot_id": "S01",
"reason": "identity_drift",
"actions": [
{ "op": "set", "path": "gen.sampling.n", "value": 3 },
{ "op": "set", "path": "gen.sampling.seed_policy", "value": "unlocked" },
{ "op": "append", "path": "gen.prompt.visual", "value": "keep same face, same outfit, consistent character identity" },
{ "op": "set", "path": "quality_target.min_score", "value": 0.86 }
]
}
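A patch in this format can be applied with a small interpreter over dotted paths. The following sketch is one possible reading of the two operations shown: `set` replaces a value, and `append` extends a list or, for a string, joins the new text with a comma. The string-joining rule and the `apply_patch` helper itself are assumptions. The function returns a patched copy and leaves the input untouched, in keeping with No-Rollback.

```python
import copy


def apply_patch(shot: dict, patch: dict) -> dict:
    """Return a patched copy of `shot`; the original dict is not modified."""
    if patch["shot_id"] != shot["shot_id"]:
        raise ValueError("patch targets a different shot")
    patched = copy.deepcopy(shot)
    for action in patch["actions"]:
        *parents, leaf = action["path"].split(".")
        node = patched
        for key in parents:  # walk down to the parent of the target field
            node = node[key]
        if action["op"] == "set":
            node[leaf] = action["value"]
        elif action["op"] == "append":
            current = node[leaf]
            if isinstance(current, list):
                node[leaf] = current + [action["value"]]
            else:
                # Assumed semantics for strings: comma-join the extra text.
                node[leaf] = f"{current}, {action['value']}"
        else:
            raise ValueError(f"unsupported op: {action['op']}")
    return patched
```

The patched shot would then be written out as a new shotlist version, so the patch itself doubles as the audit record of what changed and why.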
E. End-to-end job example (minimal closed loop)
artifacts/job_20251225_0001/
  brief.json
  template.json
  style_bible/v1.json
  shotlist/v1.json
  assets/
    anchors/char_01_sheet_v1.png
    clips/S01_take1.mp4
    clips/S02_take1.mp4
    audio/vo.wav
    audio/bgm.mp3
  v0001/
    edit_manifest.json
    qc_report.json
    render_log.txt
    final.mp4
MVP execution order: brief + template → generate bibles + shotlist → render shots in parallel → shot-level QC (fail → patch → retry) → edit_manifest → Remotion render → final QC → publish.
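The shot-level step in that sequence (QC, then patch and retry on failure) can be sketched as a bounded loop. Everything below is illustrative: `run_shot` is a hypothetical helper, and `generate`, `score`, and `repair` stand in for the Runner, the QC Inspector, and the patch step, which are injected as plain callables so that the loop can be exercised with stubs.

```python
from typing import Callable


def run_shot(
    shot: dict,
    generate: Callable[[dict, int], str],
    score: Callable[[str], float],
    repair: Callable[[dict], dict],
    max_retries: int,
) -> dict:
    """Generate a shot, score it, and patch-and-retry until it passes.

    If no take passes within the retry cap, the best-scoring take is
    returned with status "degraded" so the caller can decide how to proceed.
    """
    current = shot
    best = None  # (clip, score) of the best take seen so far
    for attempt in range(max_retries + 1):
        clip = generate(current, attempt)
        result = score(clip)
        if best is None or result > best[1]:
            best = (clip, result)
        if result >= current["quality_target"]["min_score"]:
            return {"status": "pass", "clip": clip, "score": result, "attempts": attempt + 1}
        current = repair(current)  # produce a patched shot for the next take
    return {"status": "degraded", "clip": best[0], "score": best[1], "attempts": max_retries + 1}


# Stubbed usage: the first take fails QC, the second passes.
stub_scores = {"S01_take1.mp4": 0.70, "S01_take2.mp4": 0.88}
outcome = run_shot(
    {"shot_id": "S01", "quality_target": {"min_score": 0.84}},
    generate=lambda s, attempt: f"{s['shot_id']}_take{attempt + 1}.mp4",
    score=lambda clip: stub_scores[clip],
    repair=lambda s: s,
    max_retries=2,
)
```

Because each shot is independent, a scheduler could run many such loops in parallel and assemble the edit manifest only after every shot has either passed or been explicitly degraded.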