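Neo4j in Practice: Graph Databases in AI Applications

An engineering guide to Cypher query basics, graph modeling patterns, GDS algorithms, vector indexes, and LLM integration

Introduction

The role of graph databases in AI applications is shifting from "auxiliary storage" to "core reasoning engine". Neo4j, a leading graph database, offers native vector indexes, the Graph Data Science (GDS) library, and an LLM integration toolchain, which has made it a common choice for knowledge graphs, GraphRAG, and recommendation systems. This article covers the practical path for using Neo4j in AI applications, from Cypher basics through advanced AI integration.

Cypher Query Basics

Core Syntax

Cypher is Neo4j's declarative graph query language. Its design philosophy is to make writing a graph query feel as intuitive as drawing the graph: nodes appear in parentheses, relationships in square brackets, and arrows give the direction.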

// === Creating nodes and relationships ===

// Create nodes
CREATE (p:Person {name: 'Zhang Wei', age: 35, title: 'CTO'})
CREATE (c:Company {name: 'TechCorp', founded: 2020, industry: 'AI'})

// Create a relationship between two existing nodes
MATCH (p:Person {name: 'Zhang Wei'}), (c:Company {name: 'TechCorp'})
CREATE (p)-[:WORKS_AT {since: 2021, role: 'CTO'}]->(c)

// === Query patterns ===

// Basic match
MATCH (p:Person)-[:WORKS_AT]->(c:Company)
WHERE c.industry = 'AI'
RETURN p.name, c.name, p.title

// Multi-hop query (friends of friends, exactly two KNOWS hops)
MATCH (a:Person {name: 'Zhang Wei'})-[:KNOWS*2]->(fof:Person)
WHERE fof <> a
RETURN DISTINCT fof.name

// Variable-length path (1 to 5 hops)
MATCH path = (a:Person)-[:REPORTS_TO*1..5]->(ceo:Person {title: 'CEO'})
RETURN path, length(path) AS depth

// Shortest path (relationship direction ignored)
MATCH path = shortestPath(
    (a:Person {name: 'Zhang Wei'})-[:KNOWS*]-(b:Person {name: 'Li Na'})
)
RETURN path, length(path)

// Aggregation and ordering
MATCH (c:Company)<-[:WORKS_AT]-(p:Person)
RETURN c.name, count(p) AS employee_count, collect(p.name) AS employees
ORDER BY employee_count DESC
LIMIT 10
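
To make the traversal semantics of the multi-hop and shortest-path queries concrete, the following is a minimal pure-Python sketch of what they compute, run against a small in-memory adjacency map. It is not how Neo4j executes these queries. The graph data is hypothetical (only Zhang Wei and Li Na appear in the queries above), and KNOWS is treated as undirected throughout, whereas the `[:KNOWS*2]->` query follows relationship direction.

```python
from collections import deque

# Hypothetical undirected KNOWS graph used only for illustration.
KNOWS = {
    "Zhang Wei": {"Wang Fang", "Chen Jie"},
    "Wang Fang": {"Zhang Wei", "Li Na"},
    "Chen Jie": {"Zhang Wei", "Liu Yang"},
    "Li Na": {"Wang Fang"},
    "Liu Yang": {"Chen Jie"},
}


def friends_of_friends(graph, start):
    """Names reachable in exactly two hops, excluding the start node."""
    result = set()
    for friend in graph.get(start, ()):
        for fof in graph.get(friend, ()):
            if fof != start:
                result.add(fof)
    return result


def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path as a list, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        # Sorting neighbors keeps the result deterministic when paths tie.
        for neighbor in sorted(graph.get(path[-1], ())):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None
```

Breadth-first search visits nodes in order of hop count, so the first path that reaches the goal is a shortest one. The number of relationships in it, `len(path) - 1`, corresponds to what `length(path)` returns in Cypher.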

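Common Query Patterns

// Pattern 1: Recommendation (collaborative filtering)
// Find people who watched the same movies as me, then recommend movies they watched that I have not
MATCH (me:Person {name: 'Zhang Wei'})-[:WATCHED]->(m:Movie)<-[:WATCHED]-(other:Person)
MATCH (other)-[:WATCHED]->(rec:Movie)
WHERE NOT (me)-[:WATCHED]->(rec)
// DISTINCT counts each similar viewer once, however many movies we share
RETURN rec.title, count(DISTINCT other) AS score
ORDER BY score DESC
LIMIT 10

// Pattern 2: Influence ranking
// Follower count is a simple in-degree proxy; PageRank in GDS refines it by weighting who the followers are
MATCH (p:Person)-[:FOLLOWS]->(target:Person)
WITH target, count(p) AS followers
ORDER BY followers DESC
RETURN target.name, followers
LIMIT 20

// Pattern 3: Grouping by shared location
// A coarse, attribute-based stand-in for community detection; GDS Louvain finds communities from graph structure
MATCH (p:Person)-[:WORKS_AT]->(c:Company)-[:LOCATED_IN]->(city:City)
RETURN city.name, collect(DISTINCT c.name) AS companies,
       count(DISTINCT p) AS talent_pool
ORDER BY talent_pool DESC

// Pattern 4: Knowledge graph question answering
MATCH (e:Entity {name: $entity_name})-[r]->(related)
RETURN type(r) AS relation, related.name, labels(related) AS types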
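
The scoring logic of the collaborative-filtering pattern can be sketched in plain Python to show what the query ranks by. This is an illustrative sketch over a hypothetical in-memory `WATCHED` map, not Neo4j's execution model. It counts each overlapping viewer once per candidate movie.

```python
from collections import defaultdict

# Hypothetical viewing history: person -> set of movie titles.
WATCHED = {
    "Zhang Wei": {"A", "B"},
    "Li Na": {"A", "C", "D"},
    "Wang Fang": {"B", "C"},
    "Chen Jie": {"E"},
}


def recommend(watched, me, limit=10):
    """Rank unseen movies by how many viewers with overlapping taste watched them."""
    mine = watched.get(me, set())
    scores = defaultdict(int)
    for other, movies in watched.items():
        # Skip myself and anyone who shares no movie with me.
        if other == me or not (movies & mine):
            continue
        for movie in movies - mine:
            scores[movie] += 1
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]
```

In this sample data, Chen Jie shares no movie with Zhang Wei, so movie "E" is never recommended. That is the same filtering the first `MATCH` in the Cypher pattern performs.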

Graph Modeling Patterns

Typical Graph Models for AI Applications

Knowledge graph model

(Person)─────[:WORKS_AT]────→(Company)
   │                              │
   │[:KNOWS]                      │[:LOCATED_IN]
   │                              │
   ▼                              ▼
(Person)                      (Location)
   │
   │[:AUTHORED]
   │
   ▼
(Paper)─────[:CITES]─────→(Paper)
   │
   │[:ABOUT]
   │
   ▼
(Topic)─────[:SUBTOPIC_OF]──→(Topic)


RAG knowledge base model

(Document)───[:HAS_CHUNK]───→(Chunk)
    │                           │
    │[:IN_COLLECTION]           │[:SIMILAR_TO]
    │                           │
    ▼                           ▼
(Collection)                 (Chunk)
    │                           │
    │                           │[:MENTIONS]
    │                           │
    │                           ▼
    │                       (Entity)
    │                           │
    │                           │[:RELATED_TO]
    │                           ▼
    │                       (Entity)
    └───[:TAGGED_WITH]────→(Tag)

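Modeling Best Practices

| Principle | Description | Examples |
| --- | --- | --- |
| Entity → node | Model independently existing things as nodes | Person, Company, Product |
| Action → relationship | Model dynamic connections between entities as relationships | WORKS_AT, PURCHASED |
| Attribute → property | Attach descriptive information as properties | name, created_date |
| Intermediate entity | Promote an N:M relationship to a node when it carries rich attributes | Order, Transaction |
| Avoid supernodes | As a rule of thumb, keep any single node below roughly 100,000 relationships | Shard by time or region |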
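
As an illustration of the supernode guideline, the sketch below counts relationships per node from an exported edge list and derives a time-bucketed identifier that could be used to split a hub node by month. Both helper names, the `#YYYY-MM` identifier format, and the edge-list input are assumptions made for this example. The 100,000 default simply mirrors the rule of thumb above.

```python
from collections import Counter
from datetime import date


def find_supernodes(edges, threshold=100_000):
    """Return {node: degree} for nodes whose relationship count exceeds threshold."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return {node: count for node, count in degree.items() if count > threshold}


def time_bucket_id(hub_id, day):
    """Hypothetical sharding key: one bucket node per hub per calendar month."""
    return f"{hub_id}#{day.year}-{day.month:02d}"
```

With such a key, events would attach to a bucket node such as `city:Beijing#2024-03` rather than to the hub directly, and each bucket links back to the hub once.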

GDS Graph Algorithms

Common Algorithm Categories

# Neo4j GDS algorithm categories and use cases
gds_algorithms = {
    "centrality": {
        "PageRank": {
            "use_case": "Identify influential nodes",
            "complexity": "O(V + E) per iteration",
            "ai_application": "Entity importance ranking in KG",
        },
        "Betweenness": {
            "use_case": "Find bridge nodes",
            "complexity": "O(V * E)",
            "ai_application": "Key connector identification",
        },
        "Degree": {
            "use_case": "Count connections",
            "complexity": "O(V)",
            "ai_application": "Popularity/activity scoring",
        },
    },
    "community_detection": {
        "Louvain": {
            "use_case": "Find communities/clusters",
            "complexity": "O(V * log V)",
            "ai_application": "Topic clustering, user segmentation",
        },
        "Label Propagation": {
            "use_case": "Fast community detection",
            "complexity": "O(V + E)",
            "ai_application": "Real-time community assignment",
        },
    },
    "similarity": {
        "Node Similarity": {
            "use_case": "Find similar nodes by neighbors",
            "complexity": "O(V^2)",
            "ai_application": "Recommendation, entity matching",
        },
        "KNN": {
            "use_case": "K nearest neighbors",
            "complexity": "O(V * log V)",
            "ai_application": "Vector similarity search",
        },
    },
    "path_finding": {
        "Shortest Path": {
            "use_case": "Find optimal paths",
            "complexity": "O(V + E)",
            "ai_application": "Relationship explanation",
        },
        "All Shortest Paths": {
            "use_case": "Find all optimal paths",
            "complexity": "O(V * E)",
            "ai_application": "Multi-hop reasoning",
        },
    },
}
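
To show what the PageRank entry computes, here is a minimal pure-Python power iteration using the textbook formulation with a damping factor. It is a sketch for intuition only. The GDS implementation is parallelized and may scale scores differently, so absolute values should not be expected to match `gds.pageRank` output.

```python
def pagerank(out_links, damping=0.85, max_iterations=20, tolerance=1e-9):
    """Textbook PageRank over {node: [targets]}; returned scores sum to 1."""
    nodes = sorted(set(out_links) | {t for ts in out_links.values() for t in ts})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iterations):
        # Rank held by nodes without outgoing links is spread over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for node in nodes:
            targets = out_links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
        delta = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if delta < tolerance:
            break
    return rank
```

Each iteration touches every node and every edge once, which is where the per-iteration O(V + E) cost comes from.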

GDS in Practice: Community Summaries for GraphRAG

// Step 1: Create graph projection
CALL gds.graph.project(
    'knowledge-graph',
    ['Entity', 'Chunk'],
    {
        RELATED_TO: {orientation: 'UNDIRECTED'},
        MENTIONS: {orientation: 'UNDIRECTED'}
    }
)

// Step 2: Run Louvain community detection
CALL gds.louvain.write('knowledge-graph', {
    writeProperty: 'community_id',
    maxLevels: 10,
    maxIterations: 20
})
YIELD communityCount, modularity
RETURN communityCount, modularity

// Step 3: Get community members for LLM summarization
MATCH (e:Entity)
WITH e.community_id AS community, collect(e.name) AS members
WHERE size(members) >= 3
RETURN community, members, size(members) AS size
ORDER BY size DESC
LIMIT 50

// Step 4: Run PageRank over the whole projection; combine the written
// pagerank with community_id to rank entities inside each community
CALL gds.pageRank.write('knowledge-graph', {
    writeProperty: 'pagerank',
    maxIterations: 20,
    dampingFactor: 0.85
})
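
Once `community_id` and `pagerank` have been written back, the summarization input for the LLM can be assembled on the client side. The sketch below assumes rows shaped like `{"community": ..., "name": ..., "pagerank": ...}`, for example from a query such as `MATCH (e:Entity) RETURN e.community_id AS community, e.name AS name, e.pagerank AS pagerank`. The function name, the row shape, and the prompt wording are all hypothetical.

```python
def build_community_prompts(rows, min_size=3, max_members=20):
    """Group entity rows by community and build one summarization prompt each."""
    communities = {}
    for row in rows:
        communities.setdefault(row["community"], []).append(row)

    prompts = {}
    for community, members in communities.items():
        # Mirror the size filter used when listing community members.
        if len(members) < min_size:
            continue
        # Most central entities first, truncated to keep the prompt short.
        top = sorted(members, key=lambda m: m["pagerank"], reverse=True)[:max_members]
        names = ", ".join(member["name"] for member in top)
        prompts[community] = (
            "Summarize the common theme of these related entities "
            f"(most central first): {names}"
        )
    return prompts
```

Ordering members by PageRank before truncating means that, for large communities, the prompt keeps the entities most likely to characterize the cluster.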

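Vector Indexes

Native Vector Search in Neo4j

Neo4j has supported native vector indexes since version 5.11, which lets a single database act as both the graph store and the vector store. Early 5.x releases created these indexes through the db.index.vector.createNodeIndex procedure; the CREATE VECTOR INDEX command shown below is available in later 5.x releases, so check the documentation for the version you run.

// Create a vector index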
CREATE VECTOR INDEX chunk_embeddings IF NOT EXISTS
FOR (c:Chunk) ON (c.embedding)
OPTIONS {
    indexConfig: {
        `vector.dimensions`: 1536,
        `vector.similarity_function`: 'cosine'
    }
}

// Write a node together with its embedding
CREATE (c:Chunk {
    text: 'Knowledge graphs enable structured reasoning...',
    source: 'paper_001',
    embedding: $embedding_vector
})

// Vector similarity search
CALL db.index.vector.queryNodes(
    'chunk_embeddings',
    10,                    // top-K
    $query_embedding       // query vector
)
YIELD node, score
RETURN node.text, node.source, score
ORDER BY score DESC

// Hybrid query: vector search + graph traversal
CALL db.index.vector.queryNodes('chunk_embeddings', 20, $query_embedding)
YIELD node AS chunk, score
WHERE score > 0.7
MATCH (chunk)-[:MENTIONS]->(entity:Entity)-[:RELATED_TO]-(related:Entity)
RETURN chunk.text, score,
       collect(DISTINCT entity.name) AS entities,
       collect(DISTINCT related.name) AS related_entities
ORDER BY score DESC
LIMIT 5
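
The score thresholds in these queries are easier to choose once the scoring function is clear. As far as I understand the Neo4j documentation, the `cosine` similarity function reports a score normalized to the range 0 to 1 rather than the raw cosine, which would make a threshold of 0.7 correspond to a raw cosine of about 0.4. The sketch below assumes that normalization and uses an exact brute-force scan; the index itself performs approximate nearest-neighbor search. All function names are illustrative, and inputs are assumed to be non-zero vectors of equal length.

```python
import math


def cosine(a, b):
    """Raw cosine similarity in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def normalized_score(a, b):
    """Map cosine to [0, 1]; assumed to mirror the score the index reports."""
    return (1.0 + cosine(a, b)) / 2.0


def top_k(query, vectors, k):
    """Exact top-k by normalized score over {name: vector}."""
    scored = [(name, normalized_score(query, vec)) for name, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

Under this assumption, orthogonal vectors score 0.5, not 0, so a filter such as `score > 0.6` is considerably more permissive than it looks.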

LLM Integration

Neo4j + LangChain GraphRAG

from langchain_community.graphs import Neo4jGraph
from langchain_community.vectorstores import Neo4jVector
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.chains import GraphCypherQAChain

# Initialize Neo4j connection
graph = Neo4jGraph(
    url="bolt://localhost:7687",
    username="neo4j",
    password="password",
)

# Vector store on top of Neo4j
vector_store = Neo4jVector.from_existing_graph(
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
    url="bolt://localhost:7687",
    username="neo4j",
    password="password",
    index_name="chunk_embeddings",
    node_label="Chunk",
    text_node_properties=["text"],
    embedding_node_property="embedding",
)

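# Text-to-Cypher chain
llm = ChatOpenAI(model="gpt-4o", temperature=0)
cypher_chain = GraphCypherQAChain.from_llm(
    llm=llm,
    graph=graph,
    verbose=True,
    validate_cypher=True,
    return_intermediate_steps=True,
    # Recent LangChain releases refuse to build this chain without an explicit
    # opt-in, because LLM-generated Cypher runs with the connection's privileges.
    # Connect with a read-only database user; drop this flag on older releases
    # that do not recognize it.
    allow_dangerous_requests=True,
)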

# Query: natural language -> Cypher -> answer
result = cypher_chain.invoke({
    "query": "Which companies in Beijing have more than 100 employees working on AI?"
})
print(result["result"])
# Intermediate: MATCH (c:Company)-[:LOCATED_IN]->(:City {name:'Beijing'})
#               MATCH (c)<-[:WORKS_AT]-(p:Person)
#               WHERE c.industry = 'AI'
#               WITH c, count(p) AS emp_count
#               WHERE emp_count > 100
#               RETURN c.name, emp_count
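
Because a text-to-Cypher chain executes whatever the model generates, a cheap pre-execution check is a useful extra layer. The sketch below is a hypothetical helper, not part of LangChain or the Neo4j driver. It rejects statements that contain write clauses after stripping string literals and comments. It is deliberately conservative and does not catch procedures that write (for example GDS `.write` modes), so it complements a read-only database user; it does not replace one.

```python
import re

# String literals and comments are blanked out before keyword matching.
_IGNORED = re.compile(
    r"'(?:\\.|[^'\\])*'"      # single-quoted string
    r'|"(?:\\.|[^"\\])*"'     # double-quoted string
    r"|//[^\n]*"              # line comment
    r"|/\*.*?\*/",            # block comment
    re.DOTALL,
)

_WRITE_CLAUSES = re.compile(
    r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP|FOREACH|LOAD\s+CSV)\b",
    re.IGNORECASE,
)


def is_read_only(cypher):
    """Return True if no write clause appears outside strings and comments."""
    stripped = _IGNORED.sub(" ", cypher)
    return _WRITE_CLAUSES.search(stripped) is None
```

One way to use it is to inspect the generated statement from the chain's intermediate steps and refuse to return an answer when `is_read_only` is false.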

Hybrid Retrieval: Vector Search + Graph Structure

class HybridGraphRetriever:
    """Combine vector similarity with graph traversal for RAG."""

    def __init__(self, driver, embed_fn):
        self.driver = driver
        self.embed_fn = embed_fn

    def retrieve(self, query: str, top_k: int = 5,
                 graph_depth: int = 1) -> list[dict]:
        """
        1. Vector search for relevant chunks
        2. Graph traversal to enrich context
        3. Combine and rank

        Note: graph_depth is not used yet; the query below always
        expands exactly one RELATED_TO hop.
        """
        query_embedding = self.embed_fn(query)

        with self.driver.session() as session:
            result = session.run("""
                // Step 1: Vector search
                CALL db.index.vector.queryNodes(
                    'chunk_embeddings', $top_k * 3, $embedding
                )
                YIELD node AS chunk, score
                WHERE score > 0.6

                // Step 2: Get mentioned entities
                OPTIONAL MATCH (chunk)-[:MENTIONS]->(entity:Entity)

                // Step 3: Get related entities (1-hop)
                OPTIONAL MATCH (entity)-[:RELATED_TO]-(related:Entity)

                // Step 4: Get chunks mentioning related entities
                OPTIONAL MATCH (related_chunk:Chunk)-[:MENTIONS]->(related)
                WHERE related_chunk <> chunk

                RETURN chunk.text AS text,
                       score,
                       collect(DISTINCT entity.name) AS entities,
                       collect(DISTINCT related.name) AS related_entities,
                       collect(DISTINCT related_chunk.text)[0..2] AS related_texts
                ORDER BY score DESC
                LIMIT $top_k
            """, embedding=query_embedding, top_k=top_k)

            return [dict(record) for record in result]
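
The records returned by `retrieve` still need to be turned into a context string for the LLM. The helper below is a minimal sketch of one way to do that. The function name and the character budget are assumptions; it relies only on the `text`, `score`, and `related_texts` keys produced by the query above.

```python
def format_context(records, max_chars=4000):
    """Join primary and graph-neighbor chunk texts, best first, without duplicates."""
    blocks = []
    seen = set()
    used = 0
    for record in sorted(records, key=lambda r: r["score"], reverse=True):
        # The directly matched chunk comes before chunks reached via entities.
        for text in [record["text"], *record.get("related_texts", [])]:
            if not text or text in seen:
                continue
            if used + len(text) > max_chars:
                return "\n\n".join(blocks)
            seen.add(text)
            blocks.append(text)
            used += len(text)
    return "\n\n".join(blocks)
```

Deduplication matters here because the same neighbor chunk is often reachable from several matched chunks through shared entities.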

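Performance Optimization

Indexing Strategy

| Index type | Purpose | Creation syntax |
| --- | --- | --- |
| Range (B-tree in Neo4j 4.x) | Exact match and range queries | CREATE INDEX FOR (n:Label) ON (n.prop) |
| Full-text | Text search | CREATE FULLTEXT INDEX FOR (n:Label) ON EACH [n.text] |
| Vector | Vector similarity | CREATE VECTOR INDEX ... OPTIONS {indexConfig: {`vector.dimensions`: N}} |
| Composite | Lookups on several properties together | CREATE INDEX FOR (n:L) ON (n.a, n.b) |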

Query Optimization Checklist

// 1. Use PROFILE to inspect the query plan
PROFILE
MATCH (p:Person)-[:WORKS_AT]->(c:Company {industry: 'AI'})
RETURN p.name, c.name

// 2. Avoid Cartesian products
// Bad: multiple patterns with no connecting condition
MATCH (a:Person), (b:Company)  // Cartesian product!

// Good: connect the patterns through a relationship
MATCH (a:Person)-[:WORKS_AT]->(b:Company)

// 3. Filter as early as possible
// Bad:
MATCH (p:Person)-[*1..5]->(target)
WHERE p.name = 'Zhang Wei'

// Good:
MATCH (p:Person {name: 'Zhang Wei'})-[*1..5]->(target)

// 4. Use parameterized queries
// Good: the query text stays constant, so the cached plan can be reused
MATCH (p:Person {name: $name}) RETURN p
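
On the client side, parameterization means keeping the query text constant and passing values separately. The helper below is a small illustrative sketch (the function name is hypothetical). With the official Python driver the returned pair would be passed as `session.run(query, parameters)`.

```python
PERSON_LOOKUP = "MATCH (p:Person {name: $name}) RETURN p"


def person_lookup(name):
    """Return (query, parameters); user input never becomes part of the query text."""
    return PERSON_LOOKUP, {"name": name}
```

Because the value travels as a parameter, the query text is identical for every name. That allows plan reuse and also keeps untrusted input from being interpreted as Cypher.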

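Conclusion

Neo4j's value in AI applications is expanding from "storing graph data" to serving as a "graph-native reasoning engine". Native vector indexes let it act as both vector database and graph database, the GDS library provides efficient tools for community detection and entity importance ranking, and integrations with frameworks such as LangChain lower the barrier to building GraphRAG. For AI applications that need to combine structured relationship reasoning with semantic retrieval, Neo4j is among the most mature options currently available.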


Maurice | maurice_wen@proton.me