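A Methodology for Building Multimodal Knowledge Graphs

Original | Lingque (灵阙) Teaching and Research Team

S · Featured · Advanced | About a 10-minute read | Updated 2026-02-28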

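Why Multimodal Knowledge Graphs

Traditional knowledge graphs extract structured knowledge mainly from text, but in enterprise settings information lives in far more than text: product images, how-to videos, engineering drawings, meeting recordings, and surveillance footage all carry rich entities and relations. A multimodal knowledge graph (MMKG) encodes text, images, video, audio, and other modalities into a single graph, which enables knowledge retrieval and reasoning across modalities.

Architecture of a Multimodal Knowledge Graph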

┌───────────────────────────────────────────────────────────────────┐
│              Multimodal knowledge graph architecture              │
├───────────────────────────────────────────────────────────────────┤
│                                                                   │
│  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐  │
│  │    Text     │ │    Image    │ │    Video    │ │    Audio    │  │
│  │  modality   │ │  modality   │ │  modality   │ │  modality   │  │
│  └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘  │
│         │               │               │               │         │
│  ┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐  │
│  │ NLP pipeline│ │ CV pipeline │ │    Video    │ │  ASR + NLP  │  │
│  │  NER + RE   │ │detection+OCR│ │understanding│ │  pipeline   │  │
│  └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘  │
│         │               │               │               │         │
│  ┌──────▼───────────────▼───────────────▼───────────────▼──────┐  │
│  │        Cross-modal entity alignment and fusion layer        │  │
│  │                                                             │  │
│  │    Disambiguation + modality linking + confidence fusion    │  │
│  └──────────────────────────────┬──────────────────────────────┘  │
│                                 │                                 │
│  ┌──────────────────────────────▼──────────────────────────────┐  │
│  │            Unified knowledge graph storage layer            │  │
│  │                                                             │  │
│  │  Nodes: entities + attachments (image/video/audio links)    │  │
│  │  Edges: semantic + visual + temporal relations              │  │
│  └─────────────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────────────┘
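
The fusion layer in the diagram combines per-modality confidences, but the article does not say how. One common option is a noisy-OR, which treats each modality as independent evidence for the same entity. The sketch below illustrates that assumption only; `fuse_confidences` is a hypothetical helper, not part of the original pipeline.

```python
from math import prod


def fuse_confidences(confidences: dict[str, float]) -> float:
    """Noisy-OR fusion: 1 - prod(1 - p_m) over all modalities m.

    Assumes each modality provides independent evidence. That is an
    idealization: correlated detectors make the result overconfident.
    """
    for modality, p in confidences.items():
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"confidence for {modality!r} must be in [0, 1], got {p}")
    return 1.0 - prod(1.0 - p for p in confidences.values())


# Confidence values borrowed from the example nodes in the ontology section
fused = fuse_confidences({"text": 0.92, "image": 0.88, "video": 0.85})
print(round(fused, 4))
```

A weighted average or a learned calibrator would be equally valid choices; noisy-OR is shown because it needs no training data.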

Unified Ontology Design

Multimodal Entity Model

// Core entity node (shared by all modalities)
(:Entity {
  id: "entity_001",
  name: "Product A",
  type: "Product",
  description: "An electronics product",
  canonical_name: "Product A, Model X",
  created_at: datetime(),
  source_modalities: ["text", "image", "video"]  // modalities the entity was extracted from
})

// Modality attachment nodes
(:TextMention {
  content: "Product A was released in 2025...",
  document_id: "doc_123",
  offset_start: 45,
  offset_end: 68,
  confidence: 0.92
})

(:ImageRegion {
  image_url: "s3://bucket/img_456.jpg",
  bbox: [120, 80, 340, 260],  // [x1, y1, x2, y2], matching the detector's xyxy output
  detection_model: "yolo-v8",
  confidence: 0.88,
  visual_embedding: [0.12, -0.34, ...]  // 512-dimensional vector
})

(:VideoSegment {
  video_url: "s3://bucket/vid_789.mp4",
  start_time: 12.5,
  end_time: 18.3,
  keyframe_url: "s3://bucket/frame_789_12.jpg",
  action_label: "assembly operation",
  confidence: 0.85
})

(:AudioSegment {
  audio_url: "s3://bucket/aud_101.wav",
  start_time: 0.0,
  end_time: 5.2,
  transcript: "Next, let's look at the features of Product A...",
  speaker_id: "speaker_01",
  confidence: 0.90
})
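
To carry these node types through the Python pipelines, one option is to mirror them as dataclasses. Field names follow the ontology above; the class layout and the `to_properties` helper are assumptions made for illustration.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class ImageRegion:
    image_url: str
    bbox: list[float]
    detection_model: str
    confidence: float
    visual_embedding: list[float] = field(default_factory=list)


@dataclass
class Entity:
    id: str
    name: str
    type: str
    source_modalities: list[str] = field(default_factory=list)

    def add_modality(self, modality: str) -> None:
        """Record a modality once, preserving first-seen order."""
        if modality not in self.source_modalities:
            self.source_modalities.append(modality)


def to_properties(node) -> dict:
    """Flatten a dataclass into a property map, dropping empty values."""
    return {k: v for k, v in asdict(node).items() if v not in (None, [], "")}


entity = Entity(id="entity_001", name="Product A", type="Product")
entity.add_modality("text")
entity.add_modality("image")
entity.add_modality("text")  # duplicate, ignored
region = ImageRegion("s3://bucket/img_456.jpg", [120, 80, 340, 260], "yolo-v8", 0.88)

print(entity.source_modalities)
print(to_properties(region))
```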

Modality Association Relationships

// Links from an entity to its modality attachments
(entity)-[:MENTIONED_IN]->(text_mention)
(entity)-[:DEPICTED_IN]->(image_region)
(entity)-[:APPEARS_IN]->(video_segment)
(entity)-[:REFERRED_IN]->(audio_segment)

// Cross-modal alignment relationships
(text_mention)-[:ALIGNED_WITH {score: 0.91}]->(image_region)
(video_segment)-[:SYNCED_WITH]->(audio_segment)
(image_region)-[:EXTRACTED_FROM]->(video_segment)
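
Because each relationship type connects specific node labels, ingestion code can check an edge against the schema before writing it. A minimal sketch; `EDGE_SCHEMA` and `validate_edge` are illustrative names that simply restate the patterns listed above.

```python
# (source label, relationship type) -> allowed target label,
# transcribed from the relationship patterns listed above
EDGE_SCHEMA = {
    ("Entity", "MENTIONED_IN"): "TextMention",
    ("Entity", "DEPICTED_IN"): "ImageRegion",
    ("Entity", "APPEARS_IN"): "VideoSegment",
    ("Entity", "REFERRED_IN"): "AudioSegment",
    ("TextMention", "ALIGNED_WITH"): "ImageRegion",
    ("VideoSegment", "SYNCED_WITH"): "AudioSegment",
    ("ImageRegion", "EXTRACTED_FROM"): "VideoSegment",
}


def validate_edge(source_label: str, rel_type: str, target_label: str) -> bool:
    """Return True if the edge matches one of the schema patterns."""
    return EDGE_SCHEMA.get((source_label, rel_type)) == target_label


print(validate_edge("Entity", "DEPICTED_IN", "ImageRegion"))
print(validate_edge("Entity", "DEPICTED_IN", "TextMention"))
```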

Image Modality Processing Pipeline

Object Detection and Recognition

from ultralytics import YOLO
from PIL import Image
import torch

class ImageKGExtractor:
    def __init__(self):
        self.detector = YOLO("yolov8x.pt")
        self.ocr_model = None  # loaded on demand

    def extract_entities(self, image_path: str) -> list[dict]:
        """Extract entities from an image."""
        results = self.detector(image_path, conf=0.5)
        entities = []

        for result in results:
            for box in result.boxes:
                entities.append({
                    "type": "visual_entity",
                    "label": result.names[int(box.cls)],
                    "bbox": box.xyxy[0].tolist(),  # [x1, y1, x2, y2] in pixels
                    "confidence": float(box.conf),
                    "image_path": image_path
                })

        return entities

    def extract_relations(self, entities: list[dict]) -> list[dict]:
        """Derive visual relations from the spatial layout of detected boxes."""
        relations = []
        for i, e1 in enumerate(entities):
            for j, e2 in enumerate(entities):
                if i >= j:
                    continue  # visit each unordered pair once
                spatial_rel = self._compute_spatial_relation(e1["bbox"], e2["bbox"])
                if spatial_rel:
                    relations.append({
                        "subject": e1["label"],
                        "predicate": spatial_rel,
                        "object": e2["label"],
                        "source": "visual"
                    })
        return relations

    def _compute_spatial_relation(
        self, bbox1: list, bbox2: list
    ) -> str | None:
        """Relation of bbox1 (subject) to bbox2 (object); boxes are [x1, y1, x2, y2]."""
        x1_center = (bbox1[0] + bbox1[2]) / 2
        y1_center = (bbox1[1] + bbox1[3]) / 2
        x2_center = (bbox2[0] + bbox2[2]) / 2
        y2_center = (bbox2[1] + bbox2[3]) / 2

        dx = x2_center - x1_center
        dy = y2_center - y1_center

        # Containment: bbox1 fully encloses bbox2
        if (bbox1[0] <= bbox2[0] and bbox1[1] <= bbox2[1] and
            bbox1[2] >= bbox2[2] and bbox1[3] >= bbox2[3]):
            return "contains"

        # Direction along the dominant axis. dx > 0 means the object lies to the
        # right, so the subject is left of it. Image y grows downward, so
        # dy > 0 means the object is lower and the subject is above it.
        if abs(dx) > abs(dy):
            return "left_of" if dx > 0 else "right_of"
        else:
            return "above" if dy > 0 else "below"
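
The direction rule assigns a relation to every pair of boxes, including pairs that partly overlap. One possible refinement, which is not part of the original extractor, is to compute intersection over union (IoU) first and treat heavily overlapping pairs separately. The sketch assumes [x1, y1, x2, y2] boxes as produced by `box.xyxy`; the 0.5 threshold is an arbitrary placeholder.

```python
def iou(box_a: list[float], box_b: list[float]) -> float:
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def overlaps(box_a: list[float], box_b: list[float], threshold: float = 0.5) -> bool:
    """True when two boxes overlap at least `threshold` by IoU."""
    return iou(box_a, box_b) >= threshold


print(iou([0, 0, 10, 10], [5, 0, 15, 10]))    # intersection 50, union 150
print(iou([0, 0, 10, 10], [20, 20, 30, 30]))  # disjoint boxes
```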

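Image Understanding with a Multimodal LLM

import base64
import json

from openai import OpenAI

# Assumes the OpenAI Python SDK (v1+); the key is read from OPENAI_API_KEY
client = OpenAI()

def image_to_triples_vlm(image_path: str) -> dict:
    """Extract triples from an image with a multimodal LLM."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": """
Analyze this image and extract every identifiable entity and relation.

Return JSON in this format:
{
  "entities": [{"name": "name", "type": "type", "visual_description": "visual description"}],
  "relations": [{"subject": "subject", "predicate": "relation", "object": "object"}],
  "scene_description": "overall description of the scene"
}
"""
                    },
                    {
                        "type": "image_url",
                        # The MIME type is hard-coded; adjust it for non-JPEG input
                        "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}
                    }
                ]
            }
        ],
        response_format={"type": "json_object"},
        temperature=0.0
    )
    return json.loads(response.choices[0].message.content)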
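
Model output is worth checking before it reaches the graph: a syntactically valid JSON response can still omit keys or mention entities in relations that were never listed. Below is a minimal validation sketch for the schema requested in the prompt; `clean_vlm_output` is a hypothetical helper and the sample payload is invented.

```python
def clean_vlm_output(payload: dict) -> dict:
    """Keep named entities, and relations whose endpoints are both known entities."""
    entities = [
        e for e in payload.get("entities", [])
        if isinstance(e, dict) and isinstance(e.get("name"), str) and e["name"]
    ]
    known = {e["name"] for e in entities}

    def is_known(value) -> bool:
        return isinstance(value, str) and value in known

    relations = [
        r for r in payload.get("relations", [])
        if isinstance(r, dict)
        and r.get("predicate")
        and is_known(r.get("subject"))
        and is_known(r.get("object"))
    ]
    return {
        "entities": entities,
        "relations": relations,
        "scene_description": payload.get("scene_description", ""),
    }


raw = {
    "entities": [{"name": "worker", "type": "Person"}, {"type": "Tool"}],
    "relations": [
        {"subject": "worker", "predicate": "holds", "object": "wrench"},
        {"subject": "worker", "predicate": "stands_near", "object": "worker"},
    ],
}
cleaned = clean_vlm_output(raw)
print(len(cleaned["entities"]), len(cleaned["relations"]))
```

Dropping a relation with an unknown endpoint is the conservative choice; an alternative is to create the missing entity with a low confidence and let the alignment layer decide.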

Video Modality Processing Pipeline

Keyframe Extraction and Analysis

import cv2

class VideoKGExtractor:
    def __init__(self, sample_interval: float = 2.0):
        self.sample_interval = sample_interval  # seconds
        self.image_extractor = ImageKGExtractor()

    def extract_keyframes(self, video_path: str) -> list[dict]:
        """Sample one frame every sample_interval seconds and save it to disk."""
        cap = cv2.VideoCapture(video_path)
        fps = cap.get(cv2.CAP_PROP_FPS)
        if not cap.isOpened() or fps <= 0:
            cap.release()
            raise ValueError(f"Cannot read video or its frame rate: {video_path}")
        # At least 1, so the modulo below never divides by zero
        frame_interval = max(1, int(fps * self.sample_interval))

        keyframes = []
        frame_idx = 0

        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break

            if frame_idx % frame_interval == 0:
                timestamp = frame_idx / fps
                frame_path = f"/tmp/frame_{frame_idx}.jpg"
                cv2.imwrite(frame_path, frame)

                keyframes.append({
                    "frame_idx": frame_idx,
                    "timestamp": timestamp,
                    "path": frame_path
                })

            frame_idx += 1

        cap.release()
        return keyframes

    def extract_temporal_relations(
        self, keyframe_entities: list[list[dict]]
    ) -> list[dict]:
        """Extract temporal relations from a sequence of per-keyframe entities."""
        temporal_relations = []

        for i in range(len(keyframe_entities) - 1):
            current = {e["label"] for e in keyframe_entities[i]}
            next_frame = {e["label"] for e in keyframe_entities[i + 1]}

            # Entities that newly appear in the next keyframe
            appeared = next_frame - current
            for entity in appeared:
                temporal_relations.append({
                    "subject": entity,
                    "predicate": "appears_at",
                    "object": f"t_{i + 1}",
                    "source": "temporal"
                })

            # Entities that are gone in the next keyframe
            disappeared = current - next_frame
            for entity in disappeared:
                temporal_relations.append({
                    "subject": entity,
                    "predicate": "disappears_at",
                    "object": f"t_{i + 1}",
                    "source": "temporal"
                })

        return temporal_relations
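
Appearance and disappearance events can also be folded into the `start_time`/`end_time` intervals that `VideoSegment` nodes store. The sketch below is one way to do that, assuming each keyframe contributes a timestamp and a set of labels. `labels_to_segments` is a hypothetical helper, and a label is treated as continuously present between two consecutive sampled frames that both contain it.

```python
def labels_to_segments(frames: list[tuple[float, set[str]]]) -> list[dict]:
    """Merge consecutive keyframes containing the same label into intervals.

    frames: (timestamp_seconds, labels_detected) pairs sorted by timestamp.
    """
    open_segments: dict[str, float] = {}  # label -> start time of the current run
    last_seen: dict[str, float] = {}      # label -> latest timestamp in the run
    segments = []

    for timestamp, labels in frames:
        # Close runs for labels that are absent from this frame
        for label in list(open_segments):
            if label not in labels:
                segments.append({
                    "label": label,
                    "start_time": open_segments.pop(label),
                    "end_time": last_seen.pop(label),
                })
        # Open or extend runs for labels present in this frame
        for label in labels:
            open_segments.setdefault(label, timestamp)
            last_seen[label] = timestamp

    # Close runs that last until the final frame
    for label, start in open_segments.items():
        segments.append({"label": label, "start_time": start, "end_time": last_seen[label]})

    return sorted(segments, key=lambda s: (s["start_time"], s["label"]))


frames = [(0.0, {"person"}), (2.0, {"person", "part"}), (4.0, {"part"}), (6.0, {"person"})]
for seg in labels_to_segments(frames):
    print(seg)
```

Because labels rather than tracked instances are compared, two different objects of the same class are merged into one run; an object tracker would be needed to tell them apart.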

Building a Video Scene Graph

Time T1                   Time T2                   Time T3
┌──────────────┐          ┌──────────────┐          ┌──────────────┐
│ [Person A]   │          │ [Person A]   │          │ [Person A]   │
│   │operates  │   ──→    │   │assembles │   ──→    │   │inspects  │
│ [Device X]   │          │ [Part Y]     │          │ [Product Z]  │
│   │located at│          │   │fits into │          │   │located on│
│ [Station 1]  │          │ [Device X]   │          │ [Conveyor]   │
└──────────────┘          └──────────────┘          └──────────────┘
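
A sequence of scene graphs like this can be stored as one set of triples per timestamp, after which the change between two moments is a set difference. A small sketch using the relations from the diagram; the `diff_scene_graphs` helper and the predicate spellings are assumptions for illustration.

```python
Triple = tuple[str, str, str]

scene_graphs: dict[str, set[Triple]] = {
    "T1": {("Person A", "operates", "Device X"), ("Device X", "located_at", "Station 1")},
    "T2": {("Person A", "assembles", "Part Y"), ("Part Y", "fits_into", "Device X")},
    "T3": {("Person A", "inspects", "Product Z"), ("Product Z", "located_on", "Conveyor")},
}


def diff_scene_graphs(before: set[Triple], after: set[Triple]) -> dict[str, set[Triple]]:
    """Triples that start or stop holding between two timestamps."""
    return {"added": after - before, "removed": before - after}


def entities_of(graph: set[Triple]) -> set[str]:
    """All subjects and objects mentioned in a scene graph."""
    return {s for s, _, _ in graph} | {o for _, _, o in graph}


change = diff_scene_graphs(scene_graphs["T1"], scene_graphs["T2"])
persistent = entities_of(scene_graphs["T1"]) & entities_of(scene_graphs["T2"])
print(sorted(persistent))
```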

Audio Modality Processing Pipeline

import whisper

class AudioKGExtractor:
    def __init__(self):
        self.asr_model = whisper.load_model("large-v3")

    def transcribe_and_extract(self, audio_path: str) -> dict:
        """Transcribe speech to text, then extract entities and relations."""
        # 1. ASR transcription
        result = self.asr_model.transcribe(
            audio_path,
            language="zh",
            word_timestamps=True
        )

        # 2. Collect the timed segments
        segments = []
        for seg in result["segments"]:
            segments.append({
                "text": seg["text"],
                "start": seg["start"],
                "end": seg["end"]
            })

        # 3. Extract entities and relations from each segment.
        #    joint_extract is not defined in this article; it is assumed to be a
        #    text extractor that returns {"triples": [...]}.
        all_triples = []
        for seg in segments:
            triples = joint_extract(seg["text"])
            for t in triples.get("triples", []):
                t["timestamp_start"] = seg["start"]
                t["timestamp_end"] = seg["end"]
                t["source"] = "audio"
            all_triples.extend(triples.get("triples", []))

        return {
            "transcript": result["text"],
            "segments": segments,
            "triples": all_triples
        }

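Cross-Modal Entity Alignment

Alignment Strategies

Cross-modal entity alignment is the hardest part of an MMKG: the same entity may appear as a name in text, as visual features in an image, and as an action sequence in video.

┌────────────────────────────────────────────────────────────────────┐
│              Cross-modal entity alignment strategies               │
├────────────────────────────────────────────────────────────────────┤
│                                                                    │
│  1. Name matching (Text <-> text from OCR/ASR)                     │
│     "Huawei phone" == OCR("Huawei phone") ──→ exact match          │
│                                                                    │
│  2. Semantic matching (Text <-> Image Caption)                     │
│     "smartwatch" ~ Caption("round silver watch") ──→ approximate   │
│                                                                    │
│  3. Visual matching (Image <-> Image)                              │
│     Same object, different angles ──→ feature cosine > 0.85        │
│                                                                    │
│  4. Spatio-temporal matching (Video <-> Audio)                     │
│     Same timestamp ──→ temporal alignment                          │
│                                                                    │
│  5. Co-occurrence statistics (any modality pair)                   │
│     Frequent co-occurrence ──→ likely the same entity              │
└────────────────────────────────────────────────────────────────────┘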
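
Strategies 1 and 4 need no model at all and can be sketched in a few lines. The normalization rules and the function names below are assumptions; real data will usually need domain-specific alias tables on top.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Normalize a surface form for exact matching (strategy 1)."""
    # NFKC folds full-width characters, casefold removes case differences,
    # and the split/join drops all whitespace
    folded = unicodedata.normalize("NFKC", name).casefold()
    return "".join(folded.split())


def names_match(text_name: str, ocr_or_asr_name: str) -> bool:
    return normalize_name(text_name) == normalize_name(ocr_or_asr_name)


def temporal_iou(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Overlap ratio of two (start, end) intervals in seconds (strategy 4)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    # For overlapping intervals the covered span equals the union;
    # for disjoint ones inter is 0, so the ratio is 0 either way
    span = max(a[1], b[1]) - min(a[0], b[0])
    return inter / span if span > 0 else 0.0


print(names_match("Huawei Phone", "ＨＵＡＷＥＩ  phone"))
print(temporal_iou((12.5, 18.3), (14.0, 18.3)))
```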

Multimodal Embedding Alignment

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

class CrossModalAligner:
    def __init__(self):
        self.clip_model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
        self.clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
        self.clip_model.eval()

    def compute_cross_modal_similarity(
        self,
        text: str,
        image_path: str
    ) -> float:
        """Compute text-image similarity as the cosine of CLIP embeddings."""
        image = Image.open(image_path).convert("RGB")
        inputs = self.clip_processor(
            text=[text],
            images=[image],
            return_tensors="pt",
            padding=True
        )

        with torch.no_grad():  # inference only, no gradients needed
            outputs = self.clip_model(**inputs)
        text_emb = outputs.text_embeds
        image_emb = outputs.image_embeds

        similarity = torch.cosine_similarity(text_emb, image_emb).item()
        return similarity

    def align_entities(
        self,
        text_entities: list[dict],
        visual_entities: list[dict],
        threshold: float = 0.75
    ) -> list[dict]:
        """Align each text entity with its best-scoring visual entity.

        Note: the score compares the text with the whole image at
        ve["image_path"], not with the cropped bbox. Raw CLIP text-image
        cosine scores also tend to sit well below image-image scores, so the
        threshold should be calibrated on labeled pairs.
        """
        alignments = []

        for te in text_entities:
            best_match = None
            best_score = 0.0

            for ve in visual_entities:
                if ve.get("image_path"):
                    score = self.compute_cross_modal_similarity(
                        te["name"],
                        ve["image_path"]
                    )
                    if score > best_score:
                        best_score = score
                        best_match = ve

            if best_match and best_score >= threshold:
                alignments.append({
                    "text_entity": te,
                    "visual_entity": best_match,
                    "alignment_score": best_score,
                    "method": "CLIP_similarity"
                })

        return alignments
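
`align_entities` picks the best image for each text entity independently, so several text entities can end up aligned with the same visual entity. If a one-to-one alignment is wanted, one simple option is a greedy assignment over a precomputed score matrix. This is a sketch of that alternative, not the method used above, and the scores in the example are made up.

```python
def greedy_one_to_one(
    scores: list[list[float]], threshold: float
) -> list[tuple[int, int, float]]:
    """Greedy one-to-one matching.

    scores[i][j] is the similarity between text entity i and visual entity j.
    Pairs are taken in descending score order; each row and column is used once.
    """
    candidates = sorted(
        (
            (score, i, j)
            for i, row in enumerate(scores)
            for j, score in enumerate(row)
            if score >= threshold
        ),
        key=lambda c: (-c[0], c[1], c[2]),
    )
    used_text, used_visual = set(), set()
    matches = []
    for score, i, j in candidates:
        if i in used_text or j in used_visual:
            continue
        used_text.add(i)
        used_visual.add(j)
        matches.append((i, j, score))
    return sorted(matches)


# Illustrative scores only: rows are text entities, columns are visual entities
scores = [
    [0.31, 0.22, 0.10],
    [0.29, 0.27, 0.12],
]
print(greedy_one_to_one(scores, threshold=0.20))
```

Greedy matching is not guaranteed to maximize the total score. When that matters, an optimal assignment solver such as `scipy.optimize.linear_sum_assignment` is a common replacement.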

Unified Storage

Loading the Multimodal Graph into Neo4j

// Create or update a multimodal entity
MERGE (e:Entity {id: $entity_id})
SET e.name = $name,
    e.type = $type,
    e.source_modalities = $modalities

// Attach a text mention
CREATE (tm:TextMention {
  content: $text,
  document_id: $doc_id,
  confidence: $text_conf
})
MERGE (e)-[:MENTIONED_IN]->(tm)

// Attach an image region
CREATE (ir:ImageRegion {
  image_url: $img_url,
  bbox: $bbox,
  confidence: $img_conf
})
MERGE (e)-[:DEPICTED_IN]->(ir)

// Attach a video segment
CREATE (vs:VideoSegment {
  video_url: $vid_url,
  start_time: $start,
  end_time: $end
})
MERGE (e)-[:APPEARS_IN]->(vs)

// Cross-modal alignment edge (run as a separate statement).
// Merging on the pattern alone and then setting the properties avoids
// creating a duplicate edge each time the score changes.
MATCH (tm:TextMention {content: $text_content})
MATCH (ir:ImageRegion {image_url: $img_url})
MERGE (tm)-[a:ALIGNED_WITH]->(ir)
SET a.score = $score, a.method = "CLIP"
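
From Python, statements like these are normally sent with parameters rather than built by string formatting. The sketch below only assembles the query text and the parameter map, so it runs without a database; with the official Neo4j Python driver the pair would typically be passed to `session.run(query, params)`. The function name and the input dictionary layout are assumptions.

```python
UPSERT_ENTITY_WITH_IMAGE = """
MERGE (e:Entity {id: $entity_id})
SET e.name = $name, e.type = $type
CREATE (ir:ImageRegion {image_url: $img_url, bbox: $bbox, confidence: $conf})
MERGE (e)-[:DEPICTED_IN]->(ir)
"""


def build_image_upsert(entity: dict, detection: dict) -> tuple[str, dict]:
    """Return (query, params) linking one detection to one entity."""
    params = {
        "entity_id": entity["id"],
        "name": entity["name"],
        "type": entity["type"],
        "img_url": detection["image_path"],
        # Cast to float so the stored list has a single element type
        "bbox": [float(v) for v in detection["bbox"]],
        "conf": float(detection["confidence"]),
    }
    return UPSERT_ENTITY_WITH_IMAGE, params


query, params = build_image_upsert(
    {"id": "entity_001", "name": "Product A", "type": "Product"},
    {"image_path": "s3://bucket/img_456.jpg", "bbox": [120, 80, 340, 260], "confidence": 0.88},
)
print(sorted(params))
```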

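Vector Index for Cross-Modal Retrieval

// Create a vector index over the visual embeddings of image regions.
// vector.dimensions must equal the embedding model's output size: 512 fits
// CLIP ViT-B/32, while openai/clip-vit-large-patch14 (used above) produces
// 768-dimensional embeddings.
CREATE VECTOR INDEX visual_embedding_idx
FOR (ir:ImageRegion) ON (ir.visual_embedding)
OPTIONS {indexConfig: {`vector.dimensions`: 512, `vector.similarity_function`: 'cosine'}};

// Cross-modal retrieval: find images from a text query
WITH $text_embedding AS query_vec
CALL db.index.vector.queryNodes('visual_embedding_idx', 10, query_vec)
YIELD node, score
RETURN node.image_url, score;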
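
For small collections, or to sanity-check what the index returns, the same ranking can be reproduced in memory. A brute-force sketch in plain Python; the 3-dimensional vectors are toy stand-ins for real embeddings, and `top_k_by_cosine` is a hypothetical helper.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def top_k_by_cosine(
    query_vec: list[float], items: dict[str, list[float]], k: int = 10
) -> list[tuple[str, float]]:
    """Rank stored embeddings by cosine similarity to the query, highest first."""
    scored = [(url, cosine(query_vec, vec)) for url, vec in items.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


image_embeddings = {
    "s3://bucket/a.jpg": [1.0, 0.0, 0.0],
    "s3://bucket/b.jpg": [0.6, 0.8, 0.0],
    "s3://bucket/c.jpg": [0.0, 0.0, 1.0],
}
ranked = top_k_by_cosine([1.0, 0.0, 0.0], image_embeddings, k=2)
print([url for url, _ in ranked])
```

The database may report a rescaled score, so the absolute values can differ from the raw cosine even when the ranking agrees.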

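Application Scenarios

Scenario 1: Quality Inspection in Smart Manufacturing

Text report ──→ "Product X has surface scratches"
                  │
                  ├── align ──→ defect region in the inspection image
                  │
                  ├── align ──→ anomalous moment in the production-line video
                  │
                  └── link ──→ operator / equipment / process-step nodes

Scenario 2: Medical Imaging Knowledge Graph

CT scan        ──→ lesion detection ──→ (lung nodule, located in, right upper lobe)
                                              │
EHR            ──→ NLP extraction   ──→ (Patient A, diagnosed with, lung cancer)
                                              │
Genetic report ──→ structuring      ──→ (Patient A, carries, EGFR mutation)
                                              │
              ┌───────────────────────────────┘
              ▼
    Unified patient knowledge graph ──→ support for diagnosis and treatment decisions

Scenario 3: Multimodal Product Graph for E-commerce

Modality	Extracted content	Graph nodes/edges
Product title	Brand, model, attributes	(Product)-[:BRAND]->(Brand)
Product image	Color, material, scene	(Product)-[:COLOR]->(Red)
User reviews	Sentiment, usage scenarios	(Product)-[:SUITABLE_FOR]->(Scenario)
Usage video	Operating steps, results	(Product)-[:USAGE_STEPS]->(Step sequence)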

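Evaluation Metrics

Metric	Definition	Formula
Modality coverage	How many modalities cover an entity, on average	Mean modalities per entity / total number of modalities
Cross-modal alignment accuracy	Whether the alignments are correct	Correct alignments / total alignments
Multimodal fusion gain	Quality improvement of the fused graph over a single modality	(fused F1 - best single-modality F1) / best single-modality F1
Cross-modal retrieval hit rate	Whether text queries retrieve image entities	Cross-modal hits / total queries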
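
The formulas in the table translate directly into code. A sketch with hypothetical function names and toy inputs; alignment accuracy and retrieval hit rate share the same ratio form.

```python
def modality_coverage(entity_modalities: list[set[str]], total_modalities: int) -> float:
    """Mean number of modalities per entity divided by the total number of modalities."""
    if not entity_modalities or total_modalities <= 0:
        return 0.0
    mean = sum(len(m) for m in entity_modalities) / len(entity_modalities)
    return mean / total_modalities


def ratio(numerator: int, denominator: int) -> float:
    """Shared form of alignment accuracy and cross-modal hit rate."""
    return numerator / denominator if denominator else 0.0


def fusion_gain(fused_f1: float, single_modality_f1: dict[str, float]) -> float:
    """(fused F1 - best single-modality F1) / best single-modality F1."""
    best = max(single_modality_f1.values())
    return (fused_f1 - best) / best


coverage = modality_coverage(
    [{"text", "image"}, {"text"}, {"text", "image", "video", "audio"}], 4
)
accuracy = ratio(45, 50)
gain = fusion_gain(0.88, {"text": 0.80, "image": 0.70})
print(coverage, accuracy, gain)
```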

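Technology Selection

Component	Recommended	Alternative
Image understanding	GPT-4o / Gemini Pro Vision	YOLOv8 + CLIP
Video analysis	Gemini video understanding	VideoMAE + keyframe analysis
Speech transcription	Whisper large-v3	Alibaba FunASR
Cross-modal alignment	CLIP / SigLIP	Chinese-CLIP
Multimodal embedding	BGE-M3 (a text-only model; covers the text side)	JINA-CLIP
Graph database	Neo4j + vector index	NebulaGraph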

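Summary

Key points for building a multimodal knowledge graph:

Unified ontology first: design the entity model and how modality attachments link to it before building any per-modality extraction
Independent per-modality pipelines: develop and evaluate each modality's pipeline on its own, then merge the results in the alignment layer
CLIP as the foundation of cross-modal alignment: CLIP-family models are the first choice for text-image alignment
VLMs simplify the pipeline: multimodal LLMs such as GPT-4o can extract triples directly from images and video, which greatly simplifies the traditional CV pipeline
Incremental fusion: support adding modalities one at a time rather than requiring all of them to be ready at once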

Maurice | maurice_wen@proton.me