SOTA Video Agent: Best Engineering Practices Whitepaper (2025 Edition)
SOTA Video Agent: Best Engineering Practices
An enterprise-grade specification for a video generation system built on the Claude Agentic SDK (Python) and the Model Context Protocol (MCP), integrating Kling 2.6, Seedance 1.5, Veo 3.1, and Gemini 3.
1. Neuro-Symbolic Architecture
Opus 4.5 (Orchestrator)
The decision hub, built with the Agentic SDK. It touches no pixels, only logic and scheduling: decomposing the script, routing between models, and handling error retries.
SOTA Tools (MCP)
A toolchain standardized through the MCP protocol:
• Flux/Gemini: visual anchoring (locking character appearance)
• Kling/Seedance: motion generation (Production)
• Gemini 3 Flash: visual quality inspection (QA)
Remotion (Assembler)
The deterministic rendering engine. It assembles the "uncontrollable" AI video footage with "controllable" code-driven subtitles and UI at pixel-level precision.
2. Intelligent Routing & Engineering Protocols
To guarantee industrial-grade consistency and availability, the system must enforce the following protocols:
❌ Forbidden: direct Text-to-Video (it causes frequent character face drift).
✅ Mandatory: Prompt -> Flux/Gemini -> Anchor Image -> Kling I2V -> Video.
• Immutable: the filename is the content hash (e.g., `a1b2c3d4.mp4`); once written to the assets directory, a file is never modified.
• Versioning: only the JSON manifest version is bumped (`v1.json` -> `v2.json`) to point at new asset files.
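The two rules above can be sketched in a few lines of Python. This is a minimal sketch assuming a local `assets/` directory; `store_asset` and `bump_manifest` are illustrative names, not part of any SDK (the `save_asset` helper used later in the skills layer would follow the same convention after downloading the clip).

```python
# Illustrative sketch of the immutable-asset + manifest-versioning protocol.
import hashlib
import json
from pathlib import Path

ASSETS_DIR = Path("assets")

def store_asset(content: bytes, ext: str = "mp4") -> str:
    """Write content under its hash; identical content maps to one file."""
    ASSETS_DIR.mkdir(exist_ok=True)
    digest = hashlib.sha256(content).hexdigest()[:8]
    path = ASSETS_DIR / f"{digest}.{ext}"
    if not path.exists():  # never rewrite an existing asset
        path.write_bytes(content)
    return path.name

def bump_manifest(job_dir: Path, clips: list) -> Path:
    """Write a new vN.json instead of mutating the old manifest."""
    job_dir.mkdir(parents=True, exist_ok=True)
    next_v = len(sorted(job_dir.glob("v*.json"))) + 1
    out = job_dir / f"v{next_v}.json"
    out.write_text(json.dumps({"version": next_v, "clips": clips}))
    return out
```

Because filenames are content-addressed, retries that produce identical bytes deduplicate for free, and rolling back a bad edit is just pointing the renderer at an older manifest.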
3. Python: Agent Main Loop (The Brain)
The main loop, implemented on the native anthropic SDK. This is where Opus 4.5 thinks.
from anthropic import Anthropic
from skills import TOOLS_DEFINITIONS, execute_tool
client = Anthropic()
# SOTA system prompt: injects the architect's mindset
SYSTEM_PROMPT = """
You are the SOTA Video Director.
PROTOCOL:
1. **Routing**:
- Character Action -> Kling 2.6 (Must use Anchor Image).
- Dialogue -> Seedance 1.5 (Native Audio Sync).
- B-Roll/World -> Veo 3.1.
2. **QA Loop**: After generating a clip, call `inspect_quality`. If fail, retry with v2 parameters.
3. **Assembly**: Final output is a `props.json` for Remotion.
"""
def run_agent(user_request, job_id):
    messages = [{"role": "user", "content": user_request}]
    while True:
        # 1. Opus plans the next step
        response = client.messages.create(
            model="claude-3-5-opus-20240620",  # Opus 4.5 placeholder
            max_tokens=4096,
            system=SYSTEM_PROMPT,
            messages=messages,
            tools=TOOLS_DEFINITIONS
        )
        # 2. Tool-execution loop: exit once the model stops requesting tools
        if response.stop_reason != "tool_use":
            return response
        messages.append({"role": "assistant", "content": response.content})
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                print(f"⚙️ Calling Skill: {block.name}")
                result = execute_tool(block.name, block.input, job_id)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": str(result),
                })
        # Feed results back so Opus can plan the next step
        messages.append({"role": "user", "content": tool_results})
4. MCP Skills: Wrapping the SOTA Models
Kling, Seedance, and the other model APIs are wrapped in Python and exposed to the agent as tools.
import fal_client
# Assumption: an official or wrapper SDK for Seedance
import seedance_sdk

def execute_tool(name, args, job_id):
    # === Skill: video clip generation (routing logic) ===
    if name == "generate_video_clip":
        mode = args.get("mode", "action")
        # Route A: character performance (Kling 2.6)
        if mode == "action":
            if not args.get("anchor_url"):
                return "ERROR: Missing anchor_url for character video."
            # subscribe() blocks until the job finishes and returns the result
            res = fal_client.subscribe("kling-ai/kling-v1/i2v", arguments={
                "prompt": args["prompt"],
                "image_url": args["anchor_url"],
                "duration": "5s"
            })
            return save_asset(job_id, res["video"]["url"])
        # Route B: dialogue / lip sync (Seedance 1.5)
        elif mode == "dialogue":
            # Seedance 1.5 supports native audio-visual sync
            res = seedance_sdk.generate(
                prompt=args["prompt"],
                audio_driven=True,
                voice_id="en_us_male_1"
            )
            return save_asset(job_id, res["url"])
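For completeness, the `TOOLS_DEFINITIONS` imported by the main loop might look like the following. This is a sketch in the Anthropic Messages API tool-schema shape (`name` / `description` / `input_schema`); the exact wording is illustrative, but the parameters mirror the args `execute_tool` reads.

```python
# Illustrative tool schema for the two skills, in Anthropic tool format.
TOOLS_DEFINITIONS = [
    {
        "name": "generate_video_clip",
        "description": "Generate a video clip. mode='action' routes to "
                       "Kling 2.6 I2V (anchor_url required); mode='dialogue' "
                       "routes to Seedance 1.5 with native audio sync.",
        "input_schema": {
            "type": "object",
            "properties": {
                "prompt": {"type": "string"},
                "mode": {"type": "string", "enum": ["action", "dialogue"]},
                "anchor_url": {"type": "string"},
            },
            "required": ["prompt", "mode"],
        },
    },
    {
        "name": "inspect_quality",
        "description": "Run VLM QA on a rendered clip; returns pass/fail.",
        "input_schema": {
            "type": "object",
            "properties": {"video_path": {"type": "string"}},
            "required": ["video_path"],
        },
    },
]
```

Putting the routing rules in the tool description (rather than only in the system prompt) gives the model the constraint at the moment it fills in the arguments.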
5. VLM-as-a-Judge (Visual QA)
This is the key to the yield rate: before a clip is handed to Remotion, Gemini 3 inspects it first.
# Assumptions: `extract_frames` returns PIL images, and `gemini_client` is a
# configured google-generativeai model handle.
def inspect_quality(video_path):
    # 1. Extract key frames (first, middle, last)
    frames = extract_frames(video_path, count=3)
    # 2. Fast scoring with Gemini 3 Flash
    prompt = "Do these frames show a distorted human face? Is it a black screen? Reply YES/NO."
    response = gemini_client.generate_content([prompt, *frames])
    # 3. Decision
    if "YES" in response.text:
        return {"passed": False, "reason": "Distortion detected"}
    return {"passed": True}
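The "QA Loop" step from the system prompt (retry with v2 parameters on failure) can be sketched as pure control flow. Here `generate` and `inspect` are injected callables standing in for the skills above, and the seed/cfg_scale tweaks are illustrative assumptions, not documented Kling parameters.

```python
# Illustrative generate-inspect-retry loop around the QA judge.
def generate_with_qa(params: dict, generate, inspect, max_attempts: int = 2):
    for attempt in range(1, max_attempts + 1):
        clip_path = generate(params)
        verdict = inspect(clip_path)
        if verdict["passed"]:
            return clip_path
        # "v2 parameters": change the seed and tighten guidance before retrying
        params = {**params,
                  "seed": params.get("seed", 0) + 1,
                  "cfg_scale": params.get("cfg_scale", 0.5) + 0.1}
    raise RuntimeError(f"QA failed after {max_attempts} attempts: {verdict['reason']}")
```

Capping `max_attempts` matters: each retry is a paid generation call, so a clip that keeps failing should surface as an error for the orchestrator rather than loop forever.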
6. Remotion: Deterministic Rendering (The Body)
A TypeScript component performs the final assembly. It guarantees that subtitles and UI are absolutely crisp and accurate.
import { AbsoluteFill, Sequence, Video } from "remotion";

export const SotaComposition = ({ clips, subtitles }) => {
  return (
    <AbsoluteFill style={{ backgroundColor: "#000" }}>
      {/* Layer 1: AI video layer */}
      {clips.map((clip, i) => (
        <Sequence key={i} from={clip.startFrame} durationInFrames={clip.duration}>
          <Video src={clip.path} />
        </Sequence>
      ))}
      {/* Layer 2: code-driven subtitle layer (hallucination-free) */}
      {subtitles.map((sub, i) => (
        <Sequence key={`s-${i}`} from={sub.startFrame} durationInFrames={sub.duration}>
          <div className="subtitle">{sub.text}</div>
        </Sequence>
      ))}
    </AbsoluteFill>
  );
};
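On the Python side, the hand-off to this component is just a `props.json`. A minimal sketch, assuming 30 fps and seconds-based clip metadata; `build_props` and its input field names are illustrative, but the output keys match the `clips`/`subtitles` props consumed by the component above.

```python
# Illustrative assembly of Remotion props from agent-side clip metadata.
import json

FPS = 30  # assumed composition frame rate

def build_props(clips: list, subtitles: list, fps: int = FPS) -> dict:
    props, cursor = {"clips": [], "subtitles": []}, 0
    for clip in clips:  # lay clips end-to-end on the timeline
        frames = round(clip["seconds"] * fps)
        props["clips"].append({"path": clip["path"],
                               "startFrame": cursor,
                               "duration": frames})
        cursor += frames
    for sub in subtitles:  # subtitles carry absolute timings in seconds
        props["subtitles"].append({"text": sub["text"],
                                   "startFrame": round(sub["start"] * fps),
                                   "duration": round(sub["seconds"] * fps)})
    return props

def write_props(path: str, clips: list, subtitles: list) -> None:
    with open(path, "w", encoding="utf-8") as f:
        json.dump(build_props(clips, subtitles), f, ensure_ascii=False, indent=2)
```

Because all timing is computed deterministically in frames before rendering, a re-render from the same manifest is reproducible bit-for-bit, which is the whole point of the Neuro-Symbolic split.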