Video Understanding and Analysis: Multimodal AI Applications

Video QA, temporal grounding, action recognition, and video summarization: video intelligence in the era of large models


1. From "Watching" Video to "Understanding" Video

Traditional video processing stays at the pixel level: cropping, filters, transcoding. Video understanding tackles a different problem: getting machines to "comprehend" video content the way people do. Who is doing what? When does something happen? What story is the video telling?

In 2024-2025, multimodal large models (Gemini, GPT-4o, Claude) triggered a paradigm shift in video understanding. Tasks that once required carefully trained special-purpose models can now be handled in one pass through a general-purpose multimodal API.

Core tasks in video understanding

Task | Input | Output | Use cases
Video QA | video + natural-language question | natural-language answer | video search, customer support
Temporal grounding | video + description | timestamp ranges | video editing, locating key clips
Action recognition | video clip | action category | sports analytics, security monitoring
Video summarization | long video | text summary / keyframes | meeting notes, content moderation
Video captioning | video | natural-language description | accessibility, content tagging

2. Video Understanding with Multimodal LLMs

2.1 Gemini video understanding

Gemini currently has the most complete native video support among multimodal models: it processes video files directly, with no manual frame extraction.

# gemini_video_qa.py
import google.generativeai as genai
import time
from pathlib import Path


def upload_video(video_path: str):
    """Upload video to the Gemini Files API and return the file handle."""
    video_file = genai.upload_file(
        path=video_path,
        display_name=Path(video_path).name,
    )

    # Wait for processing
    while video_file.state.name == 'PROCESSING':
        time.sleep(5)
        video_file = genai.get_file(video_file.name)

    if video_file.state.name == 'FAILED':
        raise RuntimeError(f"Video processing failed: {video_file.state}")

    return video_file


def video_qa(
    video_path: str,
    question: str,
    model: str = 'gemini-2.5-flash',
) -> str:
    """Ask a question about a video using Gemini."""
    video_file = upload_video(video_path)

    gen_model = genai.GenerativeModel(model)  # avoid shadowing the `model` string
    response = gen_model.generate_content(
        [video_file, question],
        generation_config=genai.GenerationConfig(
            temperature=0.2,
            max_output_tokens=2048,
        ),
    )

    return response.text


def video_summary(
    video_path: str,
    detail_level: str = 'medium',  # brief/medium/detailed
) -> str:
    """Generate a structured summary of a video."""
    prompts = {
        'brief': (
            "Summarize this video in 2-3 sentences. "
            "Focus on the main topic and key takeaway."
        ),
        'medium': (
            "Provide a structured summary of this video:\n"
            "1. Main topic (1 sentence)\n"
            "2. Key points (3-5 bullet points)\n"
            "3. Notable visual elements\n"
            "4. Overall tone and style"
        ),
        'detailed': (
            "Create a detailed timeline summary:\n"
            "- For each major segment, provide timestamp range and description\n"
            "- List all speakers/characters if any\n"
            "- Describe visual transitions and scene changes\n"
            "- Note any text, logos, or graphics shown\n"
            "- Summarize the audio (speech, music, effects)"
        ),
    }

    return video_qa(video_path, prompts[detail_level])


# Usage
if __name__ == '__main__':
    genai.configure(api_key='YOUR_API_KEY')

    # Video QA
    answer = video_qa(
        'lecture.mp4',
        'What are the three main arguments presented in this lecture?'
    )
    print(f"Answer: {answer}")

    # Video summary
    summary = video_summary('meeting_recording.mp4', detail_level='medium')
    print(f"Summary:\n{summary}")
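The Files API round-trip above is only required for larger files; for short clips the SDK also accepts media inline in the request (requests carrying inline data are capped at roughly 20 MB). A minimal sketch of choosing between the two paths; the `video_part` helper name and the exact threshold constant are my own framing, not part of the SDK:

```python
from pathlib import Path

# Requests with inline media are limited to ~20 MB total; anything
# larger must go through the Files API (upload_video above).
INLINE_LIMIT_BYTES = 20 * 1024 * 1024


def video_part(video_path: str):
    """Return an inline content part for generate_content() when the clip
    is small enough, or None to signal a Files API fallback."""
    size = Path(video_path).stat().st_size
    if size < INLINE_LIMIT_BYTES:
        return {
            "mime_type": "video/mp4",
            "data": Path(video_path).read_bytes(),
        }
    return None  # caller should fall back to upload_video()
```

When `video_part` returns a dict, it can be placed in the `generate_content` content list where the uploaded file handle would otherwise go.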

2.2 GPT-4o video understanding (frame-sequence approach)

GPT-4o does not accept video files directly, but you can approximate video understanding by sending a sequence of sampled frames:

# gpt4o_video.py
import base64
import cv2
from openai import OpenAI


def extract_frames_base64(
    video_path: str,
    max_frames: int = 32,
    resize: tuple[int, int] = (512, 512),
) -> list[str]:
    """
    Extract evenly-spaced frames from video as base64 strings.

    GPT-4o token cost: ~85 tokens per low-res image, ~170 for high-res.
    32 frames at low-res = ~2720 tokens.
    """
    cap = cv2.VideoCapture(video_path)
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    fps = cap.get(cv2.CAP_PROP_FPS)

    # Calculate frame indices to sample
    indices = [int(i * total_frames / max_frames) for i in range(max_frames)]

    frames_b64 = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ret, frame = cap.read()
        if not ret:
            continue

        # Resize for token efficiency
        frame = cv2.resize(frame, resize)

        # Encode to base64
        _, buffer = cv2.imencode('.jpg', frame, [cv2.IMWRITE_JPEG_QUALITY, 85])
        b64 = base64.b64encode(buffer).decode('utf-8')
        frames_b64.append(b64)

    cap.release()
    return frames_b64


def gpt4o_video_qa(
    video_path: str,
    question: str,
    max_frames: int = 32,
) -> str:
    """Video QA using GPT-4o with frame sequence."""
    client = OpenAI()
    frames = extract_frames_base64(video_path, max_frames)

    # Build message with interleaved frames
    content = [
        {
            "type": "text",
            "text": (
                f"I'm showing you {len(frames)} evenly-sampled frames "
                f"from a video. Please analyze the video and answer: {question}"
            ),
        },
    ]

    for frame_b64 in frames:
        content.append({
            "type": "image_url",
            "image_url": {
                "url": f"data:image/jpeg;base64,{frame_b64}",
                "detail": "low",  # Save tokens
            },
        })

    response = client.chat.completions.create(
        model='gpt-4o',
        messages=[{"role": "user", "content": content}],
        max_tokens=2048,
        temperature=0.2,
    )

    return response.choices[0].message.content

2.3 Comparing the two approaches

Aspect | Gemini (native video) | GPT-4o (frame sequence)
Input | upload the video file directly | manual frame extraction to images
Temporal awareness | native understanding of the timeline | inferred across frames
Audio | speech + sound effects supported | not supported (needs a separate Whisper pass)
Token cost | billed by video duration | billed by image count
Max duration | ~1 hour | token-limited (~50 frames)
Precision | more accurate temporal localization | coarse (depends on sampling density)
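Because the frame-sequence route is vision-only, speech has to enter as text. A common workaround is to transcribe the audio separately (e.g. with Whisper) and send the transcript alongside the frames. The sketch below extends the message construction from `gpt4o_video_qa`; `build_av_content` is a hypothetical helper name, and the `transcript` argument is assumed to come from a prior Whisper pass:

```python
def build_av_content(
    frames_b64: list[str],
    transcript: str,
    question: str,
) -> list[dict]:
    """Build a GPT-4o user-message body pairing sampled frames with an
    externally produced speech transcript."""
    content: list[dict] = [{
        "type": "text",
        "text": (
            f"Here are {len(frames_b64)} evenly-sampled frames from a video, "
            f"plus its speech transcript.\n\n"
            f"Transcript:\n{transcript}\n\n"
            f"Question: {question}"
        ),
    }]
    for b64 in frames_b64:
        content.append({
            "type": "image_url",
            "image_url": {
                "url": f"data:image/jpeg;base64,{b64}",
                "detail": "low",  # keep token cost down, as in 2.2
            },
        })
    return content
```

The returned list drops into `client.chat.completions.create` exactly like the `content` built inside `gpt4o_video_qa`.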

3. Temporal Grounding

Temporal grounding answers questions like "what happens between X:Y and M:N in the video?" or, conversely, "at what point in the video does a given event occur?"

3.1 LLM-based temporal grounding

# temporal_grounding.py
import json

import google.generativeai as genai

from gemini_video_qa import upload_video  # uploader from section 2.1

def temporal_grounding(
    video_path: str,
    query: str,
    model: str = 'gemini-2.5-flash',
) -> list[dict]:
    """
    Find time segments in video matching a natural language query.

    Returns list of {start, end, description, confidence}.
    """
    prompt = f"""Analyze this video and find all segments that match: "{query}"

For each matching segment, provide:
- start_time: in seconds (float)
- end_time: in seconds (float)
- description: what happens in this segment
- confidence: how confident you are (0.0-1.0)

Output as JSON array. If no segments match, return empty array [].
Example:
[
  {{"start_time": 12.5, "end_time": 18.0, "description": "...", "confidence": 0.9}}
]"""

    video_file = upload_video(video_path)
    model_instance = genai.GenerativeModel(model)

    response = model_instance.generate_content(
        [video_file, prompt],
        generation_config=genai.GenerationConfig(
            temperature=0.1,
            response_mime_type='application/json',
        ),
    )

    import json
    segments = json.loads(response.text)
    return segments


def create_highlight_reel(
    video_path: str,
    segments: list[dict],
    output_path: str,
) -> None:
    """
    Create a highlight video from temporal grounding results.
    Extracts and concatenates matching segments.
    """
    import subprocess
    from pathlib import Path

    temp_clips = []
    for i, seg in enumerate(segments):
        clip_path = f"/tmp/highlight_{i}.mp4"
        cmd = [
            'ffmpeg', '-y',
            '-ss', str(seg['start_time']),
            '-i', video_path,
            '-t', str(seg['end_time'] - seg['start_time']),
            '-c', 'copy',
            '-avoid_negative_ts', 'make_zero',
            clip_path
        ]
        subprocess.run(cmd, check=True, capture_output=True)
        temp_clips.append(clip_path)

    # Concatenate clips
    list_file = '/tmp/highlight_list.txt'
    with open(list_file, 'w') as f:
        for clip in temp_clips:
            f.write(f"file '{clip}'\n")

    cmd = [
        'ffmpeg', '-y',
        '-f', 'concat', '-safe', '0',
        '-i', list_file,
        '-c', 'copy',
        output_path
    ]
    subprocess.run(cmd, check=True, capture_output=True)

    # Cleanup
    for clip in temp_clips:
        Path(clip).unlink(missing_ok=True)
    Path(list_file).unlink(missing_ok=True)
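One practical wrinkle: segments returned by the model frequently overlap or sit back-to-back, and cutting them verbatim duplicates footage in the reel. A small pre-processing pass (plain Python, my own addition) that merges overlapping or near-adjacent segments before handing them to `create_highlight_reel`:

```python
def merge_segments(segments: list[dict], gap: float = 1.0) -> list[dict]:
    """Merge segments that overlap or are separated by less than `gap`
    seconds; keep the highest confidence among merged members."""
    if not segments:
        return []
    ordered = sorted(segments, key=lambda s: s['start_time'])
    merged = [dict(ordered[0])]
    for seg in ordered[1:]:
        last = merged[-1]
        if seg['start_time'] <= last['end_time'] + gap:
            # Overlapping or nearly adjacent: extend the previous segment
            last['end_time'] = max(last['end_time'], seg['end_time'])
            last['confidence'] = max(last['confidence'], seg['confidence'])
        else:
            merged.append(dict(seg))
    return merged
```

Typical use: `create_highlight_reel(video, merge_segments(temporal_grounding(video, query)), out)`.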

4. Action Recognition

4.1 Specialized models vs. LLM-based methods

Traditional action recognition relies on special-purpose models (e.g. VideoMAE, TimeSformer) trained on a fixed set of action classes. The LLM-based approach instead describes actions in natural language, enabling open-vocabulary recognition.

# action_recognition.py

# Method 1: Specialized model (VideoMAE)
import numpy as np
import torch
import google.generativeai as genai
from transformers import AutoProcessor, VideoMAEForVideoClassification

from gemini_video_qa import upload_video  # used by Method 2 below


class VideoMAERecognizer:
    """Action recognition using VideoMAE (fine-tuned on Kinetics-400)."""

    def __init__(self):
        self.model_name = 'MCG-NJU/videomae-base-finetuned-kinetics'
        self.processor = AutoProcessor.from_pretrained(self.model_name)
        self.model = VideoMAEForVideoClassification.from_pretrained(
            self.model_name
        )

    def recognize(self, frames: list[np.ndarray]) -> list[dict]:
        """
        Classify action from a sequence of frames.
        Input: 16 frames, each as numpy array (H, W, 3).
        """
        inputs = self.processor(list(frames), return_tensors='pt')

        with torch.no_grad():
            outputs = self.model(**inputs)
            logits = outputs.logits
            probs = torch.softmax(logits, dim=-1)[0]

        # Top 5 predictions
        top5_idx = probs.topk(5).indices.tolist()
        results = []
        for idx in top5_idx:
            results.append({
                'action': self.model.config.id2label[idx],
                'confidence': float(probs[idx]),
            })

        return results


# Method 2: LLM-based open-vocabulary recognition
def llm_action_recognition(
    video_path: str,
    context: str = '',
) -> list[dict]:
    """
    Open-vocabulary action recognition using multimodal LLM.
    No predefined categories needed.
    """
    prompt = f"""Analyze the actions/activities happening in this video.

For each distinct action, provide:
- action: descriptive name of the action
- actor: who/what is performing it
- start_time: approximate start in seconds
- end_time: approximate end in seconds
- confidence: your confidence (0.0-1.0)

{f"Context: {context}" if context else ""}

Output as JSON array."""

    video_file = upload_video(video_path)
    model = genai.GenerativeModel('gemini-2.5-flash')

    response = model.generate_content(
        [video_file, prompt],
        generation_config=genai.GenerationConfig(
            temperature=0.1,
            response_mime_type='application/json',
        ),
    )

    import json
    return json.loads(response.text)

4.2 Choosing an approach

Scenario | Recommended approach | Why
Fixed classes (sports analytics) | VideoMAE / specialized model | high accuracy, low latency, runs offline
Open vocabulary (content understanding) | Gemini / GPT-4o | no training needed, highly flexible
Real-time detection (security) | SlowFast + YOLO | low latency, edge-deployable
Fine-grained analysis (athletics) | specialized model + pose estimation | needs joint-level precision

5. Video Summarization

5.1 A hierarchical summarization strategy

A long video (say, a one-hour meeting recording) should not go to an LLM in one piece: even where Gemini accepts long video, output quality degrades when everything is processed at once. A better approach is to summarize in segments and aggregate the results hierarchically.

# video_summarizer.py
# Assumes helpers from earlier chapters: detect_scenes(), probe(), trim()
# (ffmpeg wrappers), plus upload_video() from gemini_video_qa.py.

class HierarchicalSummarizer:
    """
    Multi-level video summarization for long videos.

    Level 1: Scene-level summaries (per scene)
    Level 2: Segment summaries (groups of scenes)
    Level 3: Full video summary (aggregation)
    """

    def __init__(self, model: str = 'gemini-2.5-flash'):
        self.model = genai.GenerativeModel(model)

    async def summarize_long_video(
        self,
        video_path: str,
        max_segment_minutes: int = 5,
    ) -> dict:
        """Full pipeline for long video summarization."""
        # Step 1: Detect scenes
        scenes = detect_scenes(video_path)
        duration_info = probe(video_path)
        total_duration = duration_info.duration

        # Step 2: Group scenes into segments (~5 min each)
        segments = self._group_scenes(
            scenes, max_segment_seconds=max_segment_minutes * 60
        )

        # Step 3: Summarize each segment
        segment_summaries = []
        for seg in segments:
            clip_path = f"/tmp/seg_{seg['index']}.mp4"
            trim(video_path, clip_path, seg['start'], seg['duration'])
            summary = await self._summarize_segment(clip_path, seg)
            segment_summaries.append(summary)

        # Step 4: Aggregate into final summary
        final_summary = await self._aggregate_summaries(
            segment_summaries, total_duration
        )

        return {
            'total_duration': total_duration,
            'segments': segment_summaries,
            'summary': final_summary,
        }

    async def _summarize_segment(
        self, clip_path: str, segment_info: dict
    ) -> dict:
        """Summarize a single video segment."""
        video_file = upload_video(clip_path)

        prompt = (
            "Summarize this video segment concisely:\n"
            "- Main topics discussed/shown\n"
            "- Key information or decisions\n"
            "- Notable visual elements\n"
            "Keep it under 100 words."
        )

        response = self.model.generate_content(
            [video_file, prompt],
            generation_config=genai.GenerationConfig(temperature=0.2),
        )

        return {
            'index': segment_info['index'],
            'time_range': f"{segment_info['start']:.0f}s - "
                          f"{segment_info['start'] + segment_info['duration']:.0f}s",
            'summary': response.text,
        }

    async def _aggregate_summaries(
        self, segment_summaries: list[dict], total_duration: float
    ) -> str:
        """Create final summary from segment summaries."""
        segments_text = '\n\n'.join(
            f"[{s['time_range']}]\n{s['summary']}"
            for s in segment_summaries
        )

        prompt = f"""Based on these segment summaries of a {total_duration/60:.0f}-minute video,
create a comprehensive summary:

{segments_text}

Provide:
1. Executive summary (2-3 sentences)
2. Key points (5-7 bullet points)
3. Timeline of major topics
4. Action items or conclusions (if applicable)"""

        response = self.model.generate_content(
            prompt,
            generation_config=genai.GenerationConfig(temperature=0.3),
        )

        return response.text

    @staticmethod
    def _group_scenes(
        scenes: list, max_segment_seconds: int
    ) -> list[dict]:
        """Group consecutive scenes into segments."""
        segments = []
        current_start = 0.0
        current_duration = 0.0
        current_scenes = []

        for scene in scenes:
            if current_duration + scene.duration > max_segment_seconds:
                if current_scenes:
                    segments.append({
                        'index': len(segments),
                        'start': current_start,
                        'duration': current_duration,
                        'scene_count': len(current_scenes),
                    })
                current_start = scene.start_time
                current_duration = 0.0
                current_scenes = []

            current_scenes.append(scene)
            current_duration += scene.duration

        if current_scenes:
            segments.append({
                'index': len(segments),
                'start': current_start,
                'duration': current_duration,
                'scene_count': len(current_scenes),
            })

        return segments

6. Real-World Applications

6.1 Intelligent analysis of meeting recordings

Input: 1-hour Zoom meeting recording
  |
  v
[Whisper transcription] -> full transcript with timestamps
  |
  v
[Speaker diarization (pyannote)] -> identify individual participants
  |
  v
[LLM summarization] -> topics / decisions / action items
  |
  v
Output:
  - meeting minutes (Markdown)
  - video clips indexed by topic
  - action-item list
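The glue between the transcription and diarization stages above is timeline alignment: each transcript segment gets labeled with the speaker whose turn overlaps it most. A sketch of that step, assuming Whisper-style segments (`start`, `end`, `text`) and pyannote-style turns (`start`, `end`, `speaker`):

```python
def assign_speakers(
    transcript: list[dict],
    turns: list[dict],
) -> list[dict]:
    """Label each transcript segment with the diarization speaker whose
    turn overlaps it the most ('UNKNOWN' when nothing overlaps)."""
    labeled = []
    for seg in transcript:
        best, best_overlap = 'UNKNOWN', 0.0
        for turn in turns:
            # Overlap of [seg.start, seg.end] with [turn.start, turn.end]
            overlap = min(seg['end'], turn['end']) - max(seg['start'], turn['start'])
            if overlap > best_overlap:
                best, best_overlap = turn['speaker'], overlap
        labeled.append({**seg, 'speaker': best})
    return labeled
```

The labeled transcript then feeds the LLM summarization stage, which can attribute decisions and action items to named participants.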

6.2 Knowledge graphs from educational videos

Input: a series of lecture videos
  |
  v
[Video understanding] -> knowledge points extracted per lesson
  |
  v
[Knowledge linking] -> concepts connected across lessons
  |
  v
Output:
  - knowledge graph (concept -> prerequisite -> next_concept)
  - auto-generated chapter index
  - precise clip recommendations for user questions

6.3 Automatic tagging of e-commerce videos

Input: product showcase video
  |
  v
[Object detection] -> identify the product category
  |
  v
[Attribute extraction] -> color / material / size
  |
  v
[Scene understanding] -> usage context / style
  |
  v
Output:
  - auto-generated product descriptions
  - keyframes as product images
  - SEO keywords

7. Cost and Performance Considerations

Token / price estimates

Model | Video input cost | Per minute of video, approx.
Gemini 2.5 Flash | billed by sampled video frames | ~$0.01-0.05
Gemini 2.5 Pro | billed by sampled video frames | ~$0.10-0.50
GPT-4o (32 frames) | ~2720 tokens | ~$0.01
GPT-4o (100 frames) | ~8500 tokens | ~$0.04
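The GPT-4o rows follow directly from the per-image cost noted in section 2.2 (about 85 tokens per low-detail image, plus the text prompt). A back-of-envelope estimator; the per-token price below is an assumption to be checked against current pricing:

```python
TOKENS_PER_LOW_DETAIL_IMAGE = 85   # approx., from section 2.2
USD_PER_1M_INPUT_TOKENS = 2.50     # assumed GPT-4o input price; verify


def estimate_frame_cost(n_frames: int, prompt_tokens: int = 100) -> dict:
    """Rough input-token and dollar estimate for one frame-sequence request."""
    tokens = n_frames * TOKENS_PER_LOW_DETAIL_IMAGE + prompt_tokens
    return {
        'input_tokens': tokens,
        'usd': tokens / 1_000_000 * USD_PER_1M_INPUT_TOKENS,
    }
```

With 32 frames and no prompt text this reproduces the ~2720-token figure in the table.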

Latency optimization strategies

  1. Adaptive sampling density: fewer frames for static scenes, more for dynamic ones
  2. Two-stage pipeline: filter quickly with a Flash-tier model, then send only content needing deep analysis to a Pro-tier model
  3. Caching: cache analysis results for the same video for 24 hours
  4. Preprocessing: start frame extraction and Whisper transcription at upload time, before the user issues a query
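Strategy 1 above boils down to: score each frame by how much it differs from its predecessor and keep a frame only when the scene has changed enough, with a fallback so long static stretches are still represented. A sketch of the selection logic over precomputed difference scores (the decoding and frame differencing themselves would use OpenCV, as in section 2.2; the function name and defaults are my own):

```python
def select_adaptive(diff_scores: list[float], threshold: float,
                    max_gap: int = 30) -> list[int]:
    """Pick frame indices where the inter-frame difference exceeds
    `threshold`; always keep a frame after `max_gap` skips so static
    scenes still get sampled."""
    kept = [0]  # always keep the first frame
    since_last = 0
    for i, score in enumerate(diff_scores[1:], start=1):
        since_last += 1
        if score > threshold or since_last >= max_gap:
            kept.append(i)
            since_last = 0
    return kept
```

A fully static clip thus degrades to uniform sampling every `max_gap` frames, while a cut-heavy clip keeps a frame at every scene change.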

Tips for better quality

  • Provide context about the video in the prompt (source, type, language)
  • Request structured output (JSON) to ease downstream processing
  • Ask the model for confidence scores on key scenes
  • Complement the LLM's visual understanding with a Whisper transcript (the modalities reinforce each other)
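The second and third tips pay off downstream: with structured JSON plus confidence scores, model output can be validated and filtered mechanically instead of being parsed out of prose. A small post-processing sketch using the segment shape from `temporal_grounding`:

```python
REQUIRED_KEYS = {'start_time', 'end_time', 'description', 'confidence'}


def filter_segments(raw: list[dict], min_confidence: float = 0.6) -> list[dict]:
    """Keep only well-formed, sufficiently confident segments from a
    model's JSON output."""
    valid = []
    for seg in raw:
        if not REQUIRED_KEYS <= seg.keys():
            continue  # drop malformed entries rather than crash later
        if seg['end_time'] <= seg['start_time']:
            continue  # drop inverted or zero-length time ranges
        if seg['confidence'] >= min_confidence:
            valid.append(seg)
    return valid
```

Dropping malformed entries here is deliberate: LLM JSON is mostly but not always schema-conformant, and a silent skip beats a KeyError in the middle of a batch job.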

8. Technology Trends

Video understanding is moving from "showing the model frames" toward native video reasoning:

  1. Longer video context: Gemini already handles hour-long video; multi-hour context is next
  2. Real-time video understanding: streaming inference enabling live analysis of broadcast content
  3. Video agents: models that not only understand video but act on that understanding (editing, tagging, responding)
  4. Multimodal retrieval: natural-language search for specific clips across a video library

Video is the most information-dense medium we have. Once AI can genuinely "understand" video, it will unlock applications across education, security, e-commerce, healthcare, and beyond.


Maurice | maurice_wen@proton.me