An Automated AI Short-Video Production Pipeline

Introduction

From scriptwriting to final publication, a traditionally produced short video requires a scriptwriter, voice actor, storyboard artist, video producer, and editor working together, with turnaround measured in days. Mature AI tooling lets this pipeline be heavily automated, compressing per-video production time to minutes. This article walks through an end-to-end automated AI short-video pipeline: Script -> Voice -> Storyboard -> Render -> Publish.

1. Pipeline Architecture Overview

┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│  Script  │───→│  Voice   │───→│Storyboard│───→│  Render  │───→│ Publish  │
└──────────┘    └──────────┘    └──────────┘    └──────────┘    └──────────┘
     │               │               │               │               │
   LLM API      TTS API        Image Gen       FFmpeg/ML        Platform API
  (GPT/Claude)  (OpenAI/       (DALL-E/        (frame comp/     (YouTube/
                 ElevenLabs)    Flux/Imagen)    transitions/     Douyin/
                                                subtitles)       Bilibili)

1.1 Stage Responsibilities and I/O

Stage       Input                        Processing                       Output
Script      Topic/keywords               LLM generates structured script  JSON (narration + shot descriptions)
Voice       Narration text               TTS synthesis                    WAV/MP3 audio files
Storyboard  Shot descriptions            Image generation                 PNG/JPG storyboard frames
Render      Audio + frames + subtitles   FFmpeg compositing               MP4 video file
Publish     Video + metadata             Platform API upload              Published link
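The stage contracts in the table above can be pinned down as typed records; here is a minimal sketch, with field names taken from the script schema in this article (the record types themselves are my own naming, not from any library):

```python
from typing import TypedDict

class Scene(TypedDict):
    scene_id: int
    narration: str            # voiceover text consumed by the Voice stage
    visual_description: str   # image prompt consumed by the Storyboard stage
    duration: float           # target seconds on screen
    text_overlay: str
    transition: str

class VoiceResult(TypedDict):
    scene_id: int
    audio_path: str           # WAV/MP3 produced by the Voice stage
    actual_duration: float    # measured length of the synthesized audio

class FrameResult(TypedDict):
    scene_id: int
    frame_path: str           # PNG/JPG produced by the Storyboard stage
```

Each downstream stage consumes only the previous stage's record plus the original scene, which keeps stages independently replaceable.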

2. Stage 1: Script Generation

2.1 Structured Script Format

The key to script generation is emitting structured data rather than free-form text:

{
  "title": "5 AI Tools That Will Change Your Life",
  "duration_target": 60,
  "style": "informative",
  "scenes": [
    {
      "scene_id": 1,
      "narration": "Still grinding through repetitive work by hand? Today I'll show you 5 AI tools that will double your productivity.",
      "visual_description": "Modern office desk with laptop, coffee cup, and scattered papers. Camera slowly pushes in. Warm morning light.",
      "duration": 8,
      "text_overlay": "5 AI Tools That Will Change Your Life",
      "transition": "fade_in"
    },
    {
      "scene_id": 2,
      "narration": "Number one: ChatGPT. From drafting emails to building proposals, it works like your personal assistant.",
      "visual_description": "Split screen showing a person typing on left, AI-generated text appearing on right. Clean, minimal design.",
      "duration": 10,
      "text_overlay": "1. ChatGPT",
      "transition": "slide_left"
    }
  ],
  "music_style": "upbeat electronic, 120 BPM",
  "target_platform": "douyin"
}

2.2 LLM Prompt Design

SCRIPT_SYSTEM_PROMPT = """
You are a short-video scriptwriter. Generate scripts in structured JSON format.

Rules:
1. Each scene has narration (voiceover text) and visual_description (image prompt)
2. Total duration should match duration_target (in seconds)
3. Each scene: 5-12 seconds
4. Narration: conversational, use hooks in first 3 seconds
5. Visual descriptions: specific, detailed, suitable for AI image generation
6. Include text_overlay for key points
7. Suggest transitions between scenes

Output format: JSON with scenes array as shown in the schema.
"""

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_script(topic: str, duration: int = 60, style: str = "informative") -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SCRIPT_SYSTEM_PROMPT},
            {"role": "user", "content": f"Topic: {topic}\nDuration: {duration}s\nStyle: {style}"}
        ],
        response_format={"type": "json_object"},
        temperature=0.8
    )
    return json.loads(response.choices[0].message.content)

2.3 Script Quality Checks

def validate_script(script: dict) -> list[str]:
    issues = []
    total_duration = sum(s["duration"] for s in script["scenes"])

    # Duration check
    if abs(total_duration - script["duration_target"]) > 5:
        issues.append(f"Duration mismatch: {total_duration}s vs target {script['duration_target']}s")

    # Opening-hook check
    first_narration = script["scenes"][0]["narration"]
    if len(first_narration) > 50:
        issues.append("Opening narration too long, hook should be under 3 seconds")

    # Shot-description quality
    for scene in script["scenes"]:
        if len(scene["visual_description"]) < 30:
            issues.append(f"Scene {scene['scene_id']}: visual description too brief")

    return issues
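Generation and validation naturally combine into a quality gate that regenerates failing scripts. A minimal sketch with the generator and validator passed in as callables (the retry loop and `max_attempts` default are my own assumptions, not from the article):

```python
def generate_validated(generate_fn, validate_fn, topic: str,
                       max_attempts: int = 3) -> dict:
    """Regenerate the script until it passes validation or attempts run out."""
    last_issues: list[str] = []
    for _ in range(max_attempts):
        script = generate_fn(topic)
        last_issues = validate_fn(script)
        if not last_issues:
            return script
    raise ValueError(
        f"Script failed validation after {max_attempts} attempts: {last_issues}"
    )
```

In practice the issues list can also be fed back into the next prompt so the LLM fixes the specific problems rather than regenerating blindly.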

3. Stage 2: Voice Synthesis

3.1 TTS Provider Comparison

Provider        Quality    Chinese Quality   Cost
OpenAI TTS      Excellent  Good              $15 / 1M characters
ElevenLabs      Superb     Fair              $5-330 / month
Azure TTS       Good       Excellent         $16 / 1M characters
Volcengine TTS  Good       Excellent         Pay-as-you-go
Edge TTS        Good       Good              Free

3.2 Voice Synthesis Implementation

from openai import OpenAI
from pathlib import Path
import edge_tts
import asyncio

class VoiceSynthesizer:
    def __init__(self, provider: str = "openai"):
        self.provider = provider
        if provider == "openai":
            self.client = OpenAI()

    def synthesize(self, text: str, output_path: str,
                   voice: str = "alloy", speed: float = 1.0) -> float:
        """Synthesize speech and return the audio duration in seconds."""
        if self.provider == "openai":
            response = self.client.audio.speech.create(
                model="tts-1-hd",
                voice=voice,  # alloy, echo, fable, onyx, nova, shimmer
                input=text,
                speed=speed
            )
            response.stream_to_file(output_path)

        elif self.provider == "edge":
            asyncio.run(self._edge_tts(text, output_path, voice))

        # Measure the audio duration with mutagen
        from mutagen.mp3 import MP3
        audio = MP3(output_path)
        return audio.info.length

    async def _edge_tts(self, text: str, output_path: str,
                         voice: str = "zh-CN-YunxiNeural"):
        communicate = edge_tts.Communicate(text, voice)
        await communicate.save(output_path)

    def synthesize_scenes(self, scenes: list[dict], output_dir: str) -> list[dict]:
        """Synthesize voiceover for every scene in batch."""
        results = []
        for scene in scenes:
            filename = f"scene_{scene['scene_id']:02d}.mp3"
            filepath = str(Path(output_dir) / filename)
            duration = self.synthesize(scene["narration"], filepath)
            results.append({
                "scene_id": scene["scene_id"],
                "audio_path": filepath,
                "actual_duration": duration
            })
        return results

3.3 Pacing Control

def adjust_narration_pacing(scenes: list[dict], voice_results: list[dict]) -> list[dict]:
    """Adjust scene durations to match the actual voiceover lengths."""
    for scene, voice in zip(scenes, voice_results):
        actual = voice["actual_duration"]
        target = scene["duration"]

        if actual > target + 1:
            # Voiceover runs long: extend the scene
            scene["duration"] = actual + 0.5
            scene["notes"] = f"Extended from {target}s to {scene['duration']}s"
        elif actual < target - 2:
            # Voiceover runs short: add a pause or shorten the scene
            scene["duration"] = max(actual + 1.5, 5)
            scene["notes"] = f"Shortened from {target}s to {scene['duration']}s"

    return scenes

4. Stage 3: Storyboard Generation

4.1 Image Generation Strategy

import requests

class StoryboardGenerator:
    def __init__(self, provider: str = "flux"):
        self.provider = provider

    def generate_frame(self, visual_desc: str, scene_id: int,
                       style_prompt: str = "", quality: str = "2k") -> str:
        """Generate a single storyboard frame."""

        # Assemble the full prompt
        full_prompt = f"{visual_desc}. {style_prompt}"

        if self.provider == "flux":
            return self._generate_flux(full_prompt, scene_id, quality)
        elif self.provider == "imagen":
            return self._generate_imagen(full_prompt, scene_id, quality)

    def _generate_flux(self, prompt: str, scene_id: int, quality: str) -> str:
        # Call the Flux API (SILICON_FLOW_KEY assumed defined elsewhere)
        size = "1920x1080" if quality == "2k" else "3840x2160"
        response = requests.post(
            "https://api.siliconflow.cn/v1/images/generations",
            headers={"Authorization": f"Bearer {SILICON_FLOW_KEY}"},
            json={
                "model": "black-forest-labs/FLUX.1-schnell",
                "prompt": prompt,
                "image_size": size,
                "num_inference_steps": 20
            }
        )
        image_url = response.json()["images"][0]["url"]
        # Download and save (download_image is a small helper, not shown)
        output_path = f"storyboard/frame_{scene_id:02d}.png"
        download_image(image_url, output_path)
        return output_path

    def generate_all_frames(self, scenes: list[dict],
                            style: str = "cinematic, high quality",
                            concurrency: int = 3) -> list[dict]:
        """Generate all storyboard frames concurrently."""
        import concurrent.futures

        results = []
        with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as executor:
            futures = {
                executor.submit(
                    self.generate_frame,
                    scene["visual_description"],
                    scene["scene_id"],
                    style
                ): scene for scene in scenes
            }
            for future in concurrent.futures.as_completed(futures):
                scene = futures[future]
                frame_path = future.result()
                results.append({
                    "scene_id": scene["scene_id"],
                    "frame_path": frame_path
                })

        return sorted(results, key=lambda x: x["scene_id"])

4.2 Ensuring Style Consistency

Keeping a consistent style across frames is the central challenge of the storyboard stage:

STYLE_CONSISTENCY_PROMPT = """
Consistent visual style throughout: same color palette (warm earth tones),
same lighting direction (golden hour from left), same artistic style
(photorealistic cinematic), same camera characteristics (35mm film grain,
shallow depth of field). Maintain visual coherence across all frames.
"""

# Method 1: a global style prefix
def build_consistent_prompt(visual_desc: str, global_style: str) -> str:
    return f"{global_style}. {visual_desc}"

# Method 2: IP-Adapter (image style reference)
# Generate the first frame, then use it as a style reference for later frames

# Method 3: LoRA fine-tuning
# Train a LoRA for the target style and apply the same LoRA to every frame

5. Stage 4: Video Rendering

5.1 FFmpeg Rendering Pipeline

import subprocess
import json

class VideoRenderer:
    def __init__(self, fps: int = 30, resolution: str = "1920x1080"):
        self.fps = fps
        self.resolution = resolution

    def render_scene(self, frame_path: str, audio_path: str,
                     duration: float, output_path: str,
                     transition: str = "fade") -> str:
        """Composite a single still frame + audio into a video segment."""

        # Ken Burns effect (slow zoom/pan so the still doesn't feel static)
        filter_complex = self._ken_burns_filter(duration)

        cmd = [
            "ffmpeg", "-y",
            "-loop", "1", "-i", frame_path,
            "-i", audio_path,
            "-filter_complex", filter_complex,
            "-map", "[outv]", "-map", "1:a",  # map the labeled filter output + audio
            "-c:v", "libx264", "-preset", "medium",
            "-c:a", "aac", "-b:a", "192k",
            "-t", str(duration),
            "-pix_fmt", "yuv420p",
            output_path
        ]
        subprocess.run(cmd, check=True, capture_output=True)
        return output_path

    def _ken_burns_filter(self, duration: float) -> str:
        """Build the FFmpeg filter for a Ken Burns (slow push-in) effect."""
        # Zoom slowly from 100% to 110% with a slight pan
        zoom_speed = 0.001
        return (
            f"[0:v]scale=8000:-1,"
            f"zoompan=z='min(zoom+{zoom_speed},1.1)':"
            f"x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':"
            f"d={int(duration * self.fps)}:s={self.resolution}:fps={self.fps}[v];"
            f"[v]format=yuv420p[outv]"
        )

    def add_subtitles(self, video_path: str, subtitles: list[dict],
                      output_path: str) -> str:
        """Burn subtitles into the video."""
        # Generate an ASS subtitle file
        ass_content = self._generate_ass(subtitles)
        ass_path = video_path.replace(".mp4", ".ass")
        with open(ass_path, "w", encoding="utf-8") as f:
            f.write(ass_content)

        cmd = [
            "ffmpeg", "-y",
            "-i", video_path,
            "-vf", f"ass={ass_path}",
            "-c:a", "copy",
            output_path
        ]
        subprocess.run(cmd, check=True, capture_output=True)
        return output_path

    def concatenate_scenes(self, scene_videos: list[str],
                           output_path: str,
                           transitions: list[str] = None) -> str:
        """Concatenate all scene segments."""
        # Write the concat demuxer list file
        concat_file = "concat_list.txt"
        with open(concat_file, "w") as f:
            for video in scene_videos:
                f.write(f"file '{video}'\n")

        cmd = [
            "ffmpeg", "-y",
            "-f", "concat", "-safe", "0",
            "-i", concat_file,
            "-c:v", "libx264", "-preset", "medium",
            "-c:a", "aac",
            output_path
        ]
        subprocess.run(cmd, check=True, capture_output=True)
        return output_path

    def add_background_music(self, video_path: str, music_path: str,
                             volume: float = 0.15, output_path: str = None) -> str:
        """Mix background music under the voiceover."""
        if output_path is None:
            output_path = video_path.replace(".mp4", "_with_music.mp4")

        cmd = [
            "ffmpeg", "-y",
            "-i", video_path,
            "-i", music_path,
            "-filter_complex",
            f"[1:a]volume={volume}[bg];[0:a][bg]amix=inputs=2:duration=first[out]",
            "-map", "0:v", "-map", "[out]",
            "-c:v", "copy", "-c:a", "aac",
            "-shortest",
            output_path
        ]
        subprocess.run(cmd, check=True, capture_output=True)
        return output_path
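`add_subtitles` above calls a `_generate_ass` helper that is not shown. A minimal standalone sketch, assuming subtitles arrive as `{"text", "start", "end"}` dicts with times in seconds (the style values are placeholders, not tuned for any platform):

```python
def generate_ass(subtitles: list[dict]) -> str:
    """Build a minimal ASS subtitle file from [{'text', 'start', 'end'}]."""
    def ts(sec: float) -> str:
        h, rem = divmod(sec, 3600)
        m, s = divmod(rem, 60)
        return f"{int(h)}:{int(m):02d}:{s:05.2f}"  # ASS timestamps: H:MM:SS.cc

    header = (
        "[Script Info]\nScriptType: v4.00+\nPlayResX: 1920\nPlayResY: 1080\n\n"
        "[V4+ Styles]\n"
        "Format: Name, Fontname, Fontsize, PrimaryColour, Alignment\n"
        "Style: Default,Arial,64,&H00FFFFFF,2\n\n"
        "[Events]\n"
        "Format: Layer, Start, End, Style, Text\n"
    )
    events = "".join(
        f"Dialogue: 0,{ts(s['start'])},{ts(s['end'])},Default,{s['text']}\n"
        for s in subtitles
    )
    return header + events
```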

5.2 Transition Effects

TRANSITIONS = {
    "fade": "fade=t=in:st=0:d=0.5,fade=t=out:st={end}:d=0.5",
    "slide_left": "xfade=transition=slideleft:duration=0.5",
    "slide_right": "xfade=transition=slideright:duration=0.5",
    "dissolve": "xfade=transition=dissolve:duration=0.8",
    "wipe_right": "xfade=transition=wiperight:duration=0.5",
    "zoom_in": "xfade=transition=zoomin:duration=0.5",
}
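Applying the `xfade` entries above needs an offset: each transition starts `t_len` seconds before the first clip ends. A sketch that builds the FFmpeg command for crossfading two clips (clip length is passed in; probing it with `ffprobe` is left out):

```python
def build_xfade_cmd(clip_a: str, clip_b: str, a_duration: float,
                    output_path: str, transition: str = "slideleft",
                    t_len: float = 0.5) -> list[str]:
    """FFmpeg command that crossfades clip_a into clip_b via xfade."""
    offset = a_duration - t_len  # transition begins t_len seconds before clip_a ends
    return [
        "ffmpeg", "-y", "-i", clip_a, "-i", clip_b,
        "-filter_complex",
        f"[0:v][1:v]xfade=transition={transition}:duration={t_len}:offset={offset}[v];"
        f"[0:a][1:a]acrossfade=d={t_len}[a]",  # crossfade audio too
        "-map", "[v]", "-map", "[a]",
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        output_path,
    ]
```

Chaining more than two clips requires nesting `xfade` filters, with each offset measured against the accumulated output duration.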

6. Stage 5: Distribution and Publishing

6.1 Multi-Platform Adaptation

PLATFORM_SPECS = {
    "douyin": {
        "max_duration": 600,
        "aspect_ratios": ["9:16", "16:9", "1:1"],
        "preferred_ratio": "9:16",
        "max_file_size": "4GB",
        "resolution": "1080x1920",
        "format": "mp4"
    },
    "bilibili": {
        "max_duration": 3600,
        "aspect_ratios": ["16:9", "4:3"],
        "preferred_ratio": "16:9",
        "max_file_size": "8GB",
        "resolution": "1920x1080",
        "format": "mp4"
    },
    "youtube": {
        "max_duration": 43200,
        "aspect_ratios": ["16:9", "9:16"],
        "preferred_ratio": "16:9",
        "max_file_size": "256GB",
        "resolution": "3840x2160",
        "format": "mp4"
    }
}

def adapt_for_platform(video_path: str, platform: str) -> str:
    """Adapt the video to a platform's specs."""
    spec = PLATFORM_SPECS[platform]
    output_path = video_path.replace(".mp4", f"_{platform}.mp4")

    # Adjust resolution and aspect ratio (scale, then letterbox-pad)
    cmd = [
        "ffmpeg", "-y", "-i", video_path,
        "-vf", f"scale={spec['resolution'].replace('x', ':')}"
               f":force_original_aspect_ratio=decrease,"
               f"pad={spec['resolution'].replace('x', ':')}:(ow-iw)/2:(oh-ih)/2",
        "-c:a", "copy",
        output_path
    ]
    subprocess.run(cmd, check=True)
    return output_path

7. End-to-End Orchestration

7.1 Pipeline Orchestrator

import os
import time

class VideoPipeline:
    def __init__(self, config: dict):
        self.script_gen = ScriptGenerator(config.get("llm_model", "gpt-4o"))
        self.voice_syn = VoiceSynthesizer(config.get("tts_provider", "openai"))
        self.storyboard = StoryboardGenerator(config.get("image_provider", "flux"))
        self.renderer = VideoRenderer()
        self.output_dir = config.get("output_dir", "./output")

    def run(self, topic: str, duration: int = 60) -> dict:
        """Run the full pipeline end to end."""
        os.makedirs(self.output_dir, exist_ok=True)
        timeline = {}

        # Step 1: script generation
        script = self.script_gen.generate(topic, duration)
        timeline["script"] = time.time()

        # Step 2: voice synthesis
        voice_results = self.voice_syn.synthesize_scenes(
            script["scenes"], f"{self.output_dir}/audio"
        )
        timeline["voice"] = time.time()

        # Step 3: storyboard generation (can be parallelized with voice)
        frame_results = self.storyboard.generate_all_frames(
            script["scenes"], concurrency=3
        )
        timeline["storyboard"] = time.time()

        # Step 4: video rendering
        scene_videos = []
        for scene, voice, frame in zip(script["scenes"], voice_results, frame_results):
            video_path = self.renderer.render_scene(
                frame["frame_path"], voice["audio_path"],
                voice["actual_duration"],
                f"{self.output_dir}/scene_{scene['scene_id']:02d}.mp4"
            )
            scene_videos.append(video_path)

        final_video = self.renderer.concatenate_scenes(
            scene_videos, f"{self.output_dir}/final.mp4"
        )
        timeline["render"] = time.time()

        return {
            "video_path": final_video,
            "script": script,
            "timeline": timeline,
            "total_time": timeline["render"] - timeline["script"]
        }

7.2 Error Recovery and Retries

class ResilientPipeline(VideoPipeline):
    def run_with_recovery(self, topic: str, max_retries: int = 3) -> dict:
        """Run the pipeline with checkpoint/resume support."""
        checkpoint_path = f"{self.output_dir}/checkpoint.json"

        # Load the checkpoint (_load_checkpoint/_save_checkpoint/_execute_stage
        # are assumed implemented on the class)
        checkpoint = self._load_checkpoint(checkpoint_path)

        stages = ["script", "voice", "storyboard", "render"]
        start_stage = checkpoint.get("completed_stage", -1) + 1

        for i in range(start_stage, len(stages)):
            stage = stages[i]
            for attempt in range(max_retries):
                try:
                    self._execute_stage(stage, checkpoint)
                    checkpoint["completed_stage"] = i
                    self._save_checkpoint(checkpoint, checkpoint_path)
                    break
                except Exception as e:
                    if attempt == max_retries - 1:
                        raise RuntimeError(f"Stage {stage} failed after {max_retries} retries: {e}")
                    time.sleep(2 ** attempt)  # exponential backoff

8. Production Optimizations

8.1 Performance Benchmarks

Stage        Per scene   10 scenes total   Parallelism
Script gen   3-8s        3-8s              1x (single call)
Voice        2-5s        5-15s             High (fully parallelizable)
Storyboard   5-30s       20-60s            Medium (API rate limits)
Render       10-30s      100-300s          Medium (CPU/GPU bound)
Total        -           2-6 minutes       -
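The "fully parallelizable" voice stage can be sketched with a thread pool; the synthesis function is injected here so the same helper works for any TTS provider:

```python
from concurrent.futures import ThreadPoolExecutor

def synthesize_parallel(narrations: list[str], synth_fn, workers: int = 8) -> list:
    """Run synth_fn over all narrations concurrently, preserving scene order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # executor.map yields results in input order even though calls overlap
        return list(pool.map(synth_fn, narrations))
```

Provider rate limits cap the useful worker count; for the storyboard stage the same pattern applies but typically with fewer workers.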

8.2 Cost Control

Estimated cost for one 60-second short video (10 scenes):
- Script generation (GPT-4o)    : ~$0.05
- Voice synthesis (OpenAI TTS)  : ~$0.02
- Storyboard (Flux Schnell)     : ~$0.10
- Background music (Suno/Udio)  : ~$0.10
- Compute (FFmpeg)              : ~$0.01
────────────────────────────────
Total: ~$0.28 per video (≈ ¥2)
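The total above is just a sum of per-stage unit costs; keeping them in a table makes re-pricing trivial when providers change (the figures are this article's estimates, not quoted prices):

```python
COST_PER_VIDEO_USD = {
    "script": 0.05,      # GPT-4o
    "voice": 0.02,       # OpenAI TTS
    "storyboard": 0.10,  # Flux Schnell, 10 frames
    "music": 0.10,       # Suno/Udio
    "compute": 0.01,     # FFmpeg rendering
}

def cost_per_video(costs: dict[str, float] = COST_PER_VIDEO_USD) -> float:
    """Sum per-stage unit costs, rounded to cents."""
    return round(sum(costs.values()), 2)
```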

Conclusion

The automated AI short-video pipeline is already commercially viable. The core challenges lie not in the AI capability of any single stage (which is already good enough) but in: (1) aligning data formats between stages, (2) ensuring style consistency, and (3) error recovery and quality control. As video-generation models keep improving, the storyboard stage will gradually be replaced by end-to-end video generation, but the script-voice-render-publish pipeline architecture will remain relevant for a long time.


Maurice | maurice_wen@proton.me