Video Understanding and Analysis: Multimodal AI Applications
Video QA, temporal grounding, action recognition, and video summarization: video intelligence in the era of large models
1. From "Watching Video" to "Understanding Video"
Traditional video processing stays at the pixel level: cropping, filters, transcoding. Video understanding tackles a different problem: getting machines to "watch" a video the way people do. Who is doing what? When did something happen? What story is the video telling?
In 2024-2025, multimodal large models (Gemini, GPT-4o, Claude) triggered a paradigm shift in video understanding. Tasks that once required carefully trained specialized models can now be handled in one shot by a general-purpose multimodal API.
Core video understanding tasks
| Task | Input | Output | Applications |
|---|---|---|---|
| Video QA | Video + natural-language question | Natural-language answer | Video search, customer support |
| Temporal grounding | Video + description | Timestamp ranges | Video editing, key-clip localization |
| Action recognition | Video clip | Action category | Sports analysis, security monitoring |
| Video summarization | Long video | Text summary / keyframes | Meeting notes, content moderation |
| Video captioning | Video | Natural-language description | Accessibility, content labeling |
2. Video Understanding with Multimodal Large Models
2.1 Gemini video understanding
Gemini currently has the most complete native video input support among multimodal models: it can process video files directly, with no manual frame extraction.
```python
# gemini_video_qa.py
import time
from pathlib import Path

import google.generativeai as genai


def upload_video(video_path: str) -> genai.types.File:
    """Upload a video to the Gemini Files API and wait until it is ready."""
    video_file = genai.upload_file(
        path=video_path,
        display_name=Path(video_path).name,
    )
    # The Files API processes uploads asynchronously; poll until done.
    while video_file.state.name == 'PROCESSING':
        time.sleep(5)
        video_file = genai.get_file(video_file.name)
    if video_file.state.name == 'FAILED':
        raise RuntimeError(f"Video processing failed: {video_file.state}")
    return video_file


def video_qa(
    video_path: str,
    question: str,
    model_name: str = 'gemini-2.5-flash',
) -> str:
    """Ask a question about a video using Gemini."""
    video_file = upload_video(video_path)
    model = genai.GenerativeModel(model_name)
    response = model.generate_content(
        [video_file, question],
        generation_config=genai.GenerationConfig(
            temperature=0.2,
            max_output_tokens=2048,
        ),
    )
    return response.text


def video_summary(
    video_path: str,
    detail_level: str = 'medium',  # brief/medium/detailed
) -> str:
    """Generate a structured summary of a video."""
    prompts = {
        'brief': (
            "Summarize this video in 2-3 sentences. "
            "Focus on the main topic and key takeaway."
        ),
        'medium': (
            "Provide a structured summary of this video:\n"
            "1. Main topic (1 sentence)\n"
            "2. Key points (3-5 bullet points)\n"
            "3. Notable visual elements\n"
            "4. Overall tone and style"
        ),
        'detailed': (
            "Create a detailed timeline summary:\n"
            "- For each major segment, provide timestamp range and description\n"
            "- List all speakers/characters if any\n"
            "- Describe visual transitions and scene changes\n"
            "- Note any text, logos, or graphics shown\n"
            "- Summarize the audio (speech, music, effects)"
        ),
    }
    return video_qa(video_path, prompts[detail_level])


# Usage
if __name__ == '__main__':
    genai.configure(api_key='YOUR_API_KEY')

    # Video QA
    answer = video_qa(
        'lecture.mp4',
        'What are the three main arguments presented in this lecture?'
    )
    print(f"Answer: {answer}")

    # Video summary
    summary = video_summary('meeting_recording.mp4', detail_level='medium')
    print(f"Summary:\n{summary}")
```
2.2 GPT-4o video understanding (frame-sequence approach)
GPT-4o does not accept video files directly, but video understanding can be approximated by passing a sequence of sampled frames:
```python
# gpt4o_video.py
import base64

import cv2
from openai import OpenAI


def extract_frames_base64(
    video_path: str,
    max_frames: int = 32,
    resize: tuple[int, int] = (512, 512),
) -> list[str]:
    """Extract evenly spaced frames from a video as base64 JPEG strings.

    GPT-4o token cost: ~85 tokens per low-detail image, more for high detail.
    32 frames at low detail = ~2720 tokens.
    """
    cap = cv2.VideoCapture(video_path)
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

    # Calculate evenly spaced frame indices to sample
    indices = [int(i * total_frames / max_frames) for i in range(max_frames)]

    frames_b64 = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ret, frame = cap.read()
        if not ret:
            continue
        # Resize for token efficiency
        frame = cv2.resize(frame, resize)
        # Encode to base64 JPEG
        _, buffer = cv2.imencode('.jpg', frame, [cv2.IMWRITE_JPEG_QUALITY, 85])
        frames_b64.append(base64.b64encode(buffer).decode('utf-8'))

    cap.release()
    return frames_b64


def gpt4o_video_qa(
    video_path: str,
    question: str,
    max_frames: int = 32,
) -> str:
    """Video QA using GPT-4o with a frame sequence."""
    client = OpenAI()
    frames = extract_frames_base64(video_path, max_frames)

    # Build a single user message: text prompt followed by the frames
    content: list[dict] = [
        {
            "type": "text",
            "text": (
                f"I'm showing you {len(frames)} evenly-sampled frames "
                f"from a video. Please analyze the video and answer: {question}"
            ),
        },
    ]
    for frame_b64 in frames:
        content.append({
            "type": "image_url",
            "image_url": {
                "url": f"data:image/jpeg;base64,{frame_b64}",
                "detail": "low",  # Save tokens
            },
        })

    response = client.chat.completions.create(
        model='gpt-4o',
        messages=[{"role": "user", "content": content}],
        max_tokens=2048,
        temperature=0.2,
    )
    return response.choices[0].message.content
```
2.3 Comparing the two approaches
| Aspect | Gemini (native video) | GPT-4o (frame sequence) |
|---|---|---|
| Input | Upload the video file directly | Manually sample frames as an image sequence |
| Temporal awareness | Native timeline understanding | Inferred across frames |
| Audio | Speech + sound effects supported | Not supported (needs Whisper separately) |
| Token cost | Billed by video duration | Billed by image count |
| Max length | ~1 hour | Token-limited (~50 frames) |
| Precision | More accurate temporal localization | Coarse (depends on sampling density) |
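The trade-offs in this table can be folded into a simple routing heuristic. This is a sketch: the 5-minute cutoff and the backend labels are illustrative assumptions, not vendor limits.

```python
def choose_backend(
    duration_seconds: float,
    needs_audio: bool,
    needs_precise_timestamps: bool,
) -> str:
    """Pick a video-understanding backend from the trade-offs above.

    Returns 'gemini' or 'gpt-4o-frames'. Thresholds are illustrative.
    """
    # Native audio and timeline understanding require native video input.
    if needs_audio or needs_precise_timestamps:
        return 'gemini'
    # A frame-sequence budget of ~50 frames spreads thin on long videos;
    # beyond a few minutes, native video input is the safer default.
    if duration_seconds > 300:
        return 'gemini'
    # Short, visual-only questions work fine with sampled frames.
    return 'gpt-4o-frames'
```

In practice a router like this sits in front of both clients, so the rest of the pipeline never hard-codes a provider.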
3. Temporal Grounding
Temporal grounding answers questions like: "What happens between minute X and minute M of the video?" Or, in the other direction: "At what point in the video does a given event occur?"
3.1 LLM-based temporal grounding
```python
# temporal_grounding.py
import json
import shutil
import subprocess
import tempfile
from pathlib import Path

import google.generativeai as genai

from gemini_video_qa import upload_video  # defined in section 2.1


def temporal_grounding(
    video_path: str,
    query: str,
    model_name: str = 'gemini-2.5-flash',
) -> list[dict]:
    """Find time segments in a video matching a natural-language query.

    Returns a list of {start_time, end_time, description, confidence}.
    """
    prompt = f"""Analyze this video and find all segments that match: "{query}"

For each matching segment, provide:
- start_time: in seconds (float)
- end_time: in seconds (float)
- description: what happens in this segment
- confidence: how confident you are (0.0-1.0)

Output as JSON array. If no segments match, return empty array [].
Example:
[
  {{"start_time": 12.5, "end_time": 18.0, "description": "...", "confidence": 0.9}}
]"""
    video_file = upload_video(video_path)
    model = genai.GenerativeModel(model_name)
    response = model.generate_content(
        [video_file, prompt],
        generation_config=genai.GenerationConfig(
            temperature=0.1,
            response_mime_type='application/json',
        ),
    )
    return json.loads(response.text)


def create_highlight_reel(
    video_path: str,
    segments: list[dict],
    output_path: str,
) -> None:
    """Create a highlight video by extracting and concatenating segments."""
    tmp_dir = Path(tempfile.mkdtemp())
    temp_clips = []
    for i, seg in enumerate(segments):
        clip_path = tmp_dir / f"highlight_{i}.mp4"
        subprocess.run([
            'ffmpeg', '-y',
            '-ss', str(seg['start_time']),
            '-i', video_path,
            '-t', str(seg['end_time'] - seg['start_time']),
            '-c', 'copy',
            '-avoid_negative_ts', 'make_zero',
            str(clip_path),
        ], check=True, capture_output=True)
        temp_clips.append(clip_path)

    # Concatenate clips with the ffmpeg concat demuxer
    list_file = tmp_dir / 'highlight_list.txt'
    list_file.write_text(''.join(f"file '{clip}'\n" for clip in temp_clips))
    subprocess.run([
        'ffmpeg', '-y',
        '-f', 'concat', '-safe', '0',
        '-i', str(list_file),
        '-c', 'copy',
        output_path,
    ], check=True, capture_output=True)

    # Cleanup temporary clips
    shutil.rmtree(tmp_dir, ignore_errors=True)
```
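Segments returned by the model often overlap or sit only fractions of a second apart, which makes for choppy highlight reels. A small post-processing sketch to run between grounding and clip extraction; the 0.5 confidence floor and 1-second merge gap are assumed defaults to tune:

```python
def merge_segments(
    segments: list[dict],
    min_confidence: float = 0.5,
    merge_gap: float = 1.0,
) -> list[dict]:
    """Drop low-confidence segments and merge ranges closer than merge_gap seconds."""
    kept = sorted(
        (s for s in segments if s.get('confidence', 0.0) >= min_confidence),
        key=lambda s: s['start_time'],
    )
    merged: list[dict] = []
    for seg in kept:
        if merged and seg['start_time'] - merged[-1]['end_time'] <= merge_gap:
            # Extend the previous segment instead of starting a new one
            last = merged[-1]
            last['end_time'] = max(last['end_time'], seg['end_time'])
            last['confidence'] = max(last['confidence'], seg['confidence'])
        else:
            merged.append(dict(seg))
    return merged
```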
4. Action Recognition
4.1 Traditional methods vs. large-model methods
Traditional action recognition uses specialized models (such as VideoMAE or TimeSformer) trained on predefined action categories. The large-model approach instead describes actions in natural language, enabling open-vocabulary recognition.
```python
# action_recognition.py
import json

import numpy as np
import torch
from transformers import AutoImageProcessor, VideoMAEForVideoClassification

import google.generativeai as genai

from gemini_video_qa import upload_video  # defined in section 2.1


# Method 1: Specialized model (VideoMAE)
class VideoMAERecognizer:
    """Action recognition using VideoMAE (fine-tuned on Kinetics-400)."""

    def __init__(self):
        self.model_name = 'MCG-NJU/videomae-base-finetuned-kinetics'
        self.processor = AutoImageProcessor.from_pretrained(self.model_name)
        self.model = VideoMAEForVideoClassification.from_pretrained(
            self.model_name
        )

    def recognize(self, frames: list[np.ndarray]) -> list[dict]:
        """Classify the action in a sequence of frames.

        Input: 16 frames, each a numpy array of shape (H, W, 3).
        """
        inputs = self.processor(list(frames), return_tensors='pt')
        with torch.no_grad():
            outputs = self.model(**inputs)
        probs = torch.softmax(outputs.logits, dim=-1)[0]

        # Top-5 predictions
        return [
            {
                'action': self.model.config.id2label[idx],
                'confidence': float(probs[idx]),
            }
            for idx in probs.topk(5).indices.tolist()
        ]


# Method 2: LLM-based open-vocabulary recognition
def llm_action_recognition(
    video_path: str,
    context: str = '',
) -> list[dict]:
    """Open-vocabulary action recognition using a multimodal LLM.

    No predefined categories needed.
    """
    prompt = f"""Analyze the actions/activities happening in this video.

For each distinct action, provide:
- action: descriptive name of the action
- actor: who/what is performing it
- start_time: approximate start in seconds
- end_time: approximate end in seconds
- confidence: your confidence (0.0-1.0)
{f"Context: {context}" if context else ""}
Output as JSON array."""
    video_file = upload_video(video_path)
    model = genai.GenerativeModel('gemini-2.5-flash')
    response = model.generate_content(
        [video_file, prompt],
        generation_config=genai.GenerationConfig(
            temperature=0.1,
            response_mime_type='application/json',
        ),
    )
    return json.loads(response.text)
```
4.2 Choosing an approach
| Scenario | Recommended method | Rationale |
|---|---|---|
| Fixed categories (sports analysis) | VideoMAE / specialized model | High accuracy, low latency, runs offline |
| Open categories (content understanding) | Gemini / GPT-4o | No training needed, highly flexible |
| Real-time detection (security) | SlowFast + YOLO | Low latency, edge-deployable |
| Fine-grained analysis (athletics) | Specialized model + pose estimation | Requires joint-level precision |
5. Video Summarization
5.1 Multi-level summarization strategy
A long video (say, a one-hour meeting recording) should not be fed to an LLM in one shot: even though Gemini supports long videos, output quality degrades when everything is processed at once. A better approach is to process segments and aggregate hierarchically.
```python
# video_summarizer.py
import google.generativeai as genai

from gemini_video_qa import upload_video  # defined in section 2.1
# detect_scenes / probe / trim are assumed preprocessing helpers
# (e.g., PySceneDetect for scene boundaries plus thin ffprobe/ffmpeg wrappers).


class HierarchicalSummarizer:
    """Multi-level video summarization for long videos.

    Level 1: Scene-level summaries (per scene)
    Level 2: Segment summaries (groups of scenes)
    Level 3: Full video summary (aggregation)
    """

    def __init__(self, model: str = 'gemini-2.5-flash'):
        self.model = genai.GenerativeModel(model)

    async def summarize_long_video(
        self,
        video_path: str,
        max_segment_minutes: int = 5,
    ) -> dict:
        """Full pipeline for long-video summarization."""
        # Step 1: Detect scenes
        scenes = detect_scenes(video_path)
        total_duration = probe(video_path).duration

        # Step 2: Group scenes into segments (~5 min each)
        segments = self._group_scenes(
            scenes, max_segment_seconds=max_segment_minutes * 60
        )

        # Step 3: Summarize each segment
        segment_summaries = []
        for seg in segments:
            clip_path = f"/tmp/seg_{seg['index']}.mp4"
            trim(video_path, clip_path, seg['start'], seg['duration'])
            segment_summaries.append(
                await self._summarize_segment(clip_path, seg)
            )

        # Step 4: Aggregate into the final summary
        final_summary = await self._aggregate_summaries(
            segment_summaries, total_duration
        )
        return {
            'total_duration': total_duration,
            'segments': segment_summaries,
            'summary': final_summary,
        }

    async def _summarize_segment(
        self, clip_path: str, segment_info: dict
    ) -> dict:
        """Summarize a single video segment."""
        video_file = upload_video(clip_path)
        prompt = (
            "Summarize this video segment concisely:\n"
            "- Main topics discussed/shown\n"
            "- Key information or decisions\n"
            "- Notable visual elements\n"
            "Keep it under 100 words."
        )
        response = self.model.generate_content(
            [video_file, prompt],
            generation_config=genai.GenerationConfig(temperature=0.2),
        )
        return {
            'index': segment_info['index'],
            'time_range': f"{segment_info['start']:.0f}s - "
                          f"{segment_info['start'] + segment_info['duration']:.0f}s",
            'summary': response.text,
        }

    async def _aggregate_summaries(
        self, segment_summaries: list[dict], total_duration: float
    ) -> str:
        """Create the final summary from segment summaries."""
        segments_text = '\n\n'.join(
            f"[{s['time_range']}]\n{s['summary']}"
            for s in segment_summaries
        )
        prompt = f"""Based on these segment summaries of a {total_duration / 60:.0f}-minute video,
create a comprehensive summary:

{segments_text}

Provide:
1. Executive summary (2-3 sentences)
2. Key points (5-7 bullet points)
3. Timeline of major topics
4. Action items or conclusions (if applicable)"""
        response = self.model.generate_content(
            prompt,
            generation_config=genai.GenerationConfig(temperature=0.3),
        )
        return response.text

    @staticmethod
    def _group_scenes(
        scenes: list, max_segment_seconds: int
    ) -> list[dict]:
        """Group consecutive scenes into segments of bounded duration."""
        segments = []
        current_start = 0.0
        current_duration = 0.0
        current_scenes = []
        for scene in scenes:
            if current_duration + scene.duration > max_segment_seconds:
                if current_scenes:
                    segments.append({
                        'index': len(segments),
                        'start': current_start,
                        'duration': current_duration,
                        'scene_count': len(current_scenes),
                    })
                current_start = scene.start_time
                current_duration = 0.0
                current_scenes = []
            current_scenes.append(scene)
            current_duration += scene.duration
        if current_scenes:
            segments.append({
                'index': len(segments),
                'start': current_start,
                'duration': current_duration,
                'scene_count': len(current_scenes),
            })
        return segments
```
6. Practical Application Scenarios
6.1 Intelligent meeting-recording analysis
Input: 1-hour Zoom meeting recording
|
v
[Whisper transcription] -> full transcript + timestamps
|
v
[Speaker diarization (pyannote)] -> identify participants
|
v
[LLM summarization] -> topics / decisions / action items
|
v
Output:
- Meeting minutes (Markdown)
- Video clips indexed by topic
- Action-item list
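The LLM-summarization step of this pipeline boils down to formatting the diarized transcript into a single prompt. A minimal sketch, assuming the transcript has already been merged from Whisper timestamps and pyannote speaker turns; the schema and prompt wording are illustrative:

```python
def build_minutes_prompt(transcript: list[dict]) -> str:
    """Format a diarized transcript into a meeting-minutes prompt.

    transcript: list of {'start': seconds, 'speaker': str, 'text': str},
    e.g. the merged output of Whisper timestamps + pyannote speaker turns.
    """
    lines = [
        f"[{int(t['start']) // 60:02d}:{int(t['start']) % 60:02d}] "
        f"{t['speaker']}: {t['text']}"
        for t in transcript
    ]
    return (
        "Below is a timestamped, speaker-labeled meeting transcript.\n"
        "Produce Markdown minutes with: 1) topics discussed, "
        "2) decisions made, 3) action items with owners.\n\n"
        + "\n".join(lines)
    )
```

The resulting string goes to any text-only LLM endpoint; no video input is needed for this step.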
6.2 Knowledge graphs from educational videos
Input: a series of lecture videos
|
v
[Video understanding] -> extract knowledge points per lesson
|
v
[Concept linking] -> connect concepts across courses
|
v
Output:
- Knowledge graph (concept -> prerequisite -> next_concept)
- Auto-generated chapter outlines
- Precise clip recommendations for user questions
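The knowledge graph in this pipeline is essentially a prerequisite DAG, which the standard library can already order into a learning path. A minimal sketch with a hypothetical concept graph:

```python
from graphlib import TopologicalSorter


def learning_path(prereqs: dict[str, set[str]]) -> list[str]:
    """Order concepts so every prerequisite comes before its dependents."""
    # TopologicalSorter takes {node: predecessors} and raises on cycles.
    return list(TopologicalSorter(prereqs).static_order())


# Hypothetical concept graph extracted from a lecture series
graph = {
    'backprop': {'gradients', 'chain rule'},
    'gradients': {'derivatives'},
    'chain rule': {'derivatives'},
    'derivatives': set(),
}
```

Cycle detection comes for free here: if two lessons claim each other as prerequisites, `static_order()` raises `CycleError`, flagging an extraction mistake.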
6.3 Automatic labeling of e-commerce videos
Input: product showcase video
|
v
[Object detection] -> identify the product category
|
v
[Attribute extraction] -> color / material / size
|
v
[Scene understanding] -> usage scenarios / style
|
v
Output:
- Auto-generated product descriptions
- Keyframes as product images
- SEO keywords
7. Cost and Performance Considerations
Token / price estimates
| Model | Video input cost | Per minute of video, approx. |
|---|---|---|
| Gemini 2.5 Flash | Billed per video frame | ~$0.01-0.05 |
| Gemini 2.5 Pro | Billed per video frame | ~$0.10-0.50 |
| GPT-4o (32 frames) | ~2720 tokens | ~$0.01 |
| GPT-4o (100 frames) | ~8500 tokens | ~$0.04 |
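The GPT-4o rows in the table follow directly from the ~85-tokens-per-low-detail-image rule. A small estimator; the per-million-token price below is an assumption that changes over time, so check current pricing before relying on it:

```python
def gpt4o_frame_cost(
    n_frames: int,
    tokens_per_frame: int = 85,  # low-detail image token cost
    usd_per_million_tokens: float = 5.0,  # assumed input price; verify current rates
) -> tuple[int, float]:
    """Estimate input tokens and USD cost for a frame-sequence request."""
    tokens = n_frames * tokens_per_frame
    return tokens, tokens * usd_per_million_tokens / 1_000_000
```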
Latency optimization strategies
- Adaptive sampling density: fewer frames for static scenes, more for dynamic ones
- Two-stage pipeline: filter quickly with a Flash model, then send content needing deep analysis to a Pro model
- Caching: cache analysis results for the same video for 24 hours
- Preprocessing: start frame extraction and Whisper transcription at upload time, before the user ever queries
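The first strategy, adaptive sampling density, can be sketched with a cheap frame-difference heuristic: keep every frame during high-motion runs and only every Nth frame during static runs. The threshold and stride below are assumptions to tune per content type:

```python
import numpy as np


def adaptive_sample(
    frames: list[np.ndarray],
    motion_threshold: float = 10.0,
    static_stride: int = 4,
) -> list[int]:
    """Return frame indices to keep: all frames in high-motion runs,
    every static_stride-th frame in static runs."""
    keep = [0]
    since_kept = 0
    for i in range(1, len(frames)):
        # Mean absolute pixel difference as a cheap motion score
        diff = np.abs(
            frames[i].astype(np.float32) - frames[i - 1].astype(np.float32)
        ).mean()
        since_kept += 1
        if diff >= motion_threshold or since_kept >= static_stride:
            keep.append(i)
            since_kept = 0
    return keep
```

This slots in before `extract_frames_base64`: decode frames once, pick indices adaptively, then encode only the kept frames.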
Quality improvement tips
- Provide video context (source, type, language) in the prompt
- Request structured output (JSON) for easier downstream processing
- Ask the model for confidence scores on key scenes
- Supplement the LLM's visual understanding with Whisper transcripts (the modalities complement each other)
8. Technology Trends
Video understanding is moving from "showing the model frames" toward native video reasoning:
- Longer video context: Gemini already handles 1-hour videos; multi-hour support is the next step
- Real-time video understanding: streaming inference for live-content analysis
- Video agents: models that not only understand video but act on that understanding (editing, labeling, replying)
- Multimodal retrieval: natural-language search for specific clips across a video library
Video is the most information-dense medium we have. Once AI can truly "understand" video, it will unlock countless applications, from education to security and from e-commerce to healthcare.
Maurice | maurice_wen@proton.me